Deploy Pachyderm on AWS

Learn how to deploy a Pachyderm cluster on AWS.

March 30, 2023

For a quick test installation of Pachyderm on AWS (suitable for development), jump to our Quickstart page.

For deployments in production, refer to the following diagram and follow these step-by-step instructions:

[Diagram: AWS architecture]

💡

Before you start your installation process:

  • Refer to our generic “Helm Install” page for more information on how to install and get started with Helm.
  • Read our infrastructure recommendations. You will find instructions on how to set up an ingress controller, a load balancer, or connect an Identity Provider for access control.
  • Pachyderm comes with a web UI (Console) for visualizing running pipelines and exploring your data. Unless your deployment is LOCAL (that is, on a local machine for development only, such as Minikube or Docker Desktop), deploying Console requires, at a minimum, setting up an Ingress.

⚠️

We are now shipping Pachyderm with an embedded proxy allowing your cluster to expose one single port externally. This deployment setup is optional.

If you choose to deploy Pachyderm with a Proxy, check out our new recommended architecture and deployment instructions as they alter the instructions below.

The following section walks you through deploying a Pachyderm cluster on Amazon Elastic Kubernetes Service (EKS).

In particular, you will:

  1. Install the required client tools.
  2. Deploy Kubernetes.
  3. Create an S3 bucket for your data and grant Pachyderm access.
  4. Enable persistent volume creation.
  5. Create an AWS-managed PostgreSQL instance.
  6. Deploy Pachyderm.
  7. Install pachctl to interact with your cluster.
  8. Check that your cluster is up and running.

1. Prerequisites #

Before you can deploy Pachyderm on an EKS cluster, verify that you have the following prerequisites installed and configured:
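As a quick sanity check, a small script along these lines can confirm that the command-line tools used later on this page are on your PATH. The tool list (kubectl, eksctl, helm, aws) is an assumption drawn from the commands that appear in the following steps, and `missing_tools` is a hypothetical helper name. Adjust both to your setup. pachctl is left out because it is installed in a later step.

```shell
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Assumed tool list, based on the commands used in the steps below.
missing="$(missing_tools kubectl eksctl helm aws)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```

If anything is reported missing, install it before moving on to the cluster deployment.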

2. Deploy Kubernetes by using eksctl #

⚠️

Pachyderm requires Kubernetes 1.19.0 or later.

Use the eksctl tool to deploy an EKS cluster in your Amazon AWS environment. The eksctl create cluster command creates a virtual private cloud (VPC), a security group, and an IAM role for Kubernetes to create resources. For detailed instructions, see Amazon documentation.

To deploy an EKS cluster, complete the following steps:

  1. Deploy an EKS cluster:

    eksctl create cluster --name <name> --version <version> \
    --nodegroup-name <name> --node-type <vm-flavor> \
    --nodes <number-of-nodes> --nodes-min <min-number-nodes> \
    --nodes-max <max-number-nodes> --node-ami auto

    Example

    eksctl create cluster --name pachyderm-cluster --region us-east-2 --profile <your named profile>
  2. Verify the deployment:

    kubectl get all

    System Response:

    NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    service/kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   23h

Once your Kubernetes cluster is up, and your infrastructure is configured, you are ready to prepare for the installation of Pachyderm. Some of the steps below will require you to keep updating the values.yaml started during the setup of the recommended infrastructure.

ℹ️

Secrets Manager. Pachyderm recommends securing and managing your secrets in a secrets manager. Learn how to set up and configure your EKS cluster to retrieve the relevant secrets from AWS Secrets Manager, then resume the following installation steps.

3. Create an S3 bucket #

Create an S3 object store bucket for data #

Pachyderm needs an S3 bucket (Object store) to store your data. You can create the bucket by running the following commands:

⚠️

The S3 bucket name must be globally unique across all AWS accounts and regions.
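A minimal sketch of the bucket creation with the AWS CLI follows. The bucket name and region are placeholder assumptions, and the script only prints the command so that you can review it before running it.

```shell
# Placeholder values (assumptions): replace with your own bucket name and region.
BUCKET_NAME="my-pachyderm-bucket"
AWS_REGION="us-east-2"

# Build the create-bucket command. us-east-1 must NOT be given a
# LocationConstraint; every other region requires one.
create_bucket_cmd="aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION}"
if [ "${AWS_REGION}" != "us-east-1" ]; then
  create_bucket_cmd="${create_bucket_cmd} --create-bucket-configuration LocationConstraint=${AWS_REGION}"
fi

# Print the command for review; run it with: eval "$create_bucket_cmd"
echo "${create_bucket_cmd}"
```

Once the bucket exists, `aws s3 ls` should list it.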

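You now need to give Pachyderm access to your bucket, either by adding an IAM role and policy to your service account (described below) or by passing AWS credentials (an access key ID and secret) in your values.yaml.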

📖

IAM roles provide finer grained user management and security capabilities than access keys. Pachyderm recommends the use of IAM roles for production deployments.

Add An IAM Role And Policy To Your Service Account #

To give the containers in your pods the right permissions to access your S3 bucket, first create an IAM OIDC provider for your cluster.

Then follow the steps detailed in Create an IAM Role And Policy for your Service Account.

In short, you will:

  1. Retrieve your OpenID Connect provider URL:

    1. Go to the AWS Management console.
    2. Select your cluster instance in Amazon EKS.
    3. In the Configuration tab of your EKS cluster, find your OpenID Connect provider URL and save it. You will need it when creating your IAM Role.
  2. Create an IAM policy that gives access to your bucket:

    1. Create a new Policy from your IAM Console.
    2. Select the JSON tab.
    3. Copy and paste the following text into the JSON tab:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::<your-bucket>"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
          ],
          "Resource": [
            "arn:aws:s3:::<your-bucket>/*"
          ]
        }
      ]
    }

    Replace <your-bucket> with the name of your S3 bucket.

  3. Create an IAM role as a Web Identity, using the cluster OIDC provider as the identity provider.

    1. Create a new Role from your IAM Console.
    2. Select the Web identity tab.
    3. In the Identity Provider drop-down, select the OpenID Connect provider URL of your EKS cluster, and select sts.amazonaws.com as the Audience.
    4. Attach the newly created policy to the Role.
    5. Name the Role.
    6. Retrieve the Role ARN. You will need it in your values.yaml annotations when deploying Pachyderm.
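The console steps above can be approximated from the command line. The sketch below writes the bucket-access policy to a file and prints AWS CLI and eksctl commands for review rather than running them. The cluster name, bucket name, file name, and policy name are all illustrative assumptions.

```shell
# Illustrative values (assumptions): replace with your own.
CLUSTER_NAME="pachyderm-cluster"
BUCKET_NAME="my-pachyderm-bucket"
POLICY_FILE="pachyderm-bucket-policy.json"

# Write the bucket-access policy with the bucket name filled in.
cat > "${POLICY_FILE}" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${BUCKET_NAME}"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"]
    }
  ]
}
EOF

# Commands to run once the file looks right (printed, not executed):
#  1. associate an IAM OIDC provider with the cluster
#  2. look up the cluster's OpenID Connect provider URL
#  3. create the IAM policy from the file written above
cat <<EOF
eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} --approve
aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.identity.oidc.issuer" --output text
aws iam create-policy --policy-name pachyderm-bucket-access-policy --policy-document file://${POLICY_FILE}
EOF
```

Creating the Web Identity role itself still requires a trust policy that references your OIDC provider, so the console flow described above remains the reference for that step.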

(Optional) Set Up Bucket Encryption #

Amazon S3 supports two types of bucket encryption: server-side encryption with Amazon S3 managed keys (SSE-S3), and server-side encryption with AWS Key Management Service (AWS KMS), which stores customer master keys. When creating a bucket for your Pachyderm cluster, you can set up either of them. Because Pachyderm's requests to the bucket do not include encryption information, the method that you select for the bucket is applied.

📖

Setting up communication between Pachyderm object storage clients and AWS KMS to append encryption information to Pachyderm requests is not supported and not recommended.

To set up bucket encryption, see Amazon S3 Default Encryption for S3 Buckets.
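As an illustration, default SSE-S3 encryption could be enabled from the AWS CLI roughly as follows. The bucket name and file name are assumptions, and the command is printed for review rather than executed.

```shell
# Illustrative values (assumptions): replace with your own.
BUCKET_NAME="my-pachyderm-bucket"
ENCRYPTION_FILE="bucket-encryption.json"

# Default-encryption rule using S3-managed keys (SSE-S3).
cat > "${ENCRYPTION_FILE}" <<'EOF'
{
  "Rules": [
    { "ApplyServerSideEncryptionByDefault": { "SSEAlgorithm": "AES256" } }
  ]
}
EOF

# Print the command that applies the rule to the bucket.
echo "aws s3api put-bucket-encryption --bucket ${BUCKET_NAME} --server-side-encryption-configuration file://${ENCRYPTION_FILE}"
```

For AWS KMS, the same rule would use "aws:kms" as the SSEAlgorithm together with a KMSMasterKeyID.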

4. Enable Your Persistent Volumes Creation #

etcd and PostgreSQL (metadata storage) each claim a persistent volume (PV).

💡

The metadata services generally require a small persistent volume (for example, 10 GB) but high IOPS (1500).

Note that the out-of-the-box Pachyderm deployment uses gp2 EBS volumes by default. While gp2 might be easier to set up for test or development environments, we highly recommend SSD gp3 volumes in production. A gp3 EBS volume delivers a baseline performance of 3,000 IOPS and 125 MB/s at any volume size. Any other disk choice may require significantly oversizing the volume to ensure enough IOPS.

See volume types.

If you plan on using gp2 EBS volumes:

For gp3 volumes, you will need to deploy an Amazon EBS CSI driver to your cluster as detailed below.

For your EKS cluster to successfully create two Elastic Block Store (EBS) persistent volumes (PV), follow the steps detailed in deploy Amazon EBS CSI driver to your cluster.

In short, you will:

  1. Create an IAM OIDC provider for your cluster. You might already have completed this step if you chose to create an IAM Role and Policy to give your containers permission to access your S3 bucket.
  2. Create a CSI Driver service account whose IAM Role will be granted the permission (policy) to make calls to AWS APIs.
  3. Install Amazon EBS Container Storage Interface (CSI) driver on your cluster configured with your created service account.

If you expect your cluster to be very long running or to scale to thousands of jobs per commit, you might need to add more storage. However, you can easily increase the size of the persistent volume later.
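Once the EBS CSI driver is installed, the cluster also needs a storage class named gp3 for the gp3 values.yaml examples later on this page. The following is a minimal sketch of such a storage class, written to a file for review. The file name is an assumption, and you may want additional parameters (for example, encryption) depending on your environment.

```shell
# Write a StorageClass manifest named "gp3" that is backed by the EBS CSI driver.
cat > gp3-storageclass.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

echo "Review gp3-storageclass.yaml, then run: kubectl apply -f gp3-storageclass.yaml"
```

Setting allowVolumeExpansion to true is what allows the persistent volume claims to be grown later without recreating them.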

5. Create an AWS Managed PostgreSQL Database #

By default, Pachyderm runs with a bundled version of PostgreSQL. For production environments, it is strongly recommended that you disable the bundled version and use an RDS PostgreSQL instance.

⚠️

Note that Aurora Serverless PostgreSQL is not supported and will not work.

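This section provides guidance on the configuration settings you will need to create an RDS instance, create the pachyderm and (when applicable) dex databases, and update your values.yaml so that Pachyderm connects to them.

ℹ️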

It is assumed that you are already familiar with RDS, or will be working with an administrator who is.

Create An RDS Instance #

📖

Find the details of all the steps highlighted below in AWS Documentation: “Getting Started” hands-on tutorial.

In the RDS console, create a database in the region matching your Pachyderm cluster. Choose the PostgreSQL engine and select a PostgreSQL version >= 13.3.

Configure your DB instance as follows.

| Setting | Recommended value |
|---|---|
| DB instance identifier | Fill in with a name that is unique across all of your DB instances in the current region. |
| Master username | Choose your Admin username. |
| Master password | Choose your Admin password. |
| DB instance class | The standard default should work. You can change the instance type later on to optimize performance and costs. |
| Storage type and Allocated storage | If you choose gp2, remember that Pachyderm's metadata services require high IOPS (1500). Oversize the disk accordingly (>= 1TB). If you select io1, keep the 100 GiB default size. Read more information on Storage for RDS on Amazon's website. |
| Storage autoscaling | If your workload is cyclical or unpredictable, enable storage autoscaling to allow RDS to scale up your storage when needed. |
| Standby instance | We highly recommend creating a standby instance for production environments. |
| VPC | Select the VPC of your Kubernetes cluster. Attention: After a database is created, you can't change its VPC. Read more on VPCs and RDS on Amazon documentation. |
| Subnet group | Pick a Subnet group or create a new one. Read more about DB Subnet Groups on Amazon documentation. |
| Public access | Set the Public access to No for production environments. |
| VPC security group | Create a new VPC security group and open the PostgreSQL port, or use an existing one. |
| Password authentication or Password and IAM database authentication | Choose one or the other. |
| Database name | In the Database options section, enter Pachyderm's database name (pachyderm in this example), then click Create database to create your PostgreSQL service. Once created, your instance is running. Warning: If you do not specify a database name, Amazon RDS does not create a database. |

⚠️

One Last Step

Once your instance is created:

  • If you plan to deploy a standalone cluster (i.e., if you do not plan to register your cluster with a separate enterprise server), you will need to create a second database named "dex" in your RDS instance for Pachyderm's authentication service. Note that the database must be named dex. Read more about dex on PostgreSQL in Dex's documentation. This second database is not needed when your cluster is managed by an enterprise server.
  • Additionally, create a new user account and grant it full CRUD permissions to both pachyderm and (when applicable) dex databases. Read about managing PostgreSQL users and roles in this blog. Pachyderm will use the same username to connect to pachyderm as well as to dex.
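The two steps above might look like the following when scripted. This is a sketch under stated assumptions: the user name, password, and file name are hypothetical, and the SQL is written to a file for review instead of being run. On PostgreSQL 15 and later, the new user may also need privileges on the public schema of each database.

```shell
# Hypothetical user name; pick your own and use a strong password.
DB_USER="pachyderm_user"

# SQL that creates the dex database and a user with access to both databases.
cat > pachyderm-db-setup.sql <<EOF
CREATE DATABASE dex;
CREATE USER ${DB_USER} WITH PASSWORD 'change-me';
GRANT ALL PRIVILEGES ON DATABASE pachyderm TO ${DB_USER};
GRANT ALL PRIVILEGES ON DATABASE dex TO ${DB_USER};
EOF

# Print the psql invocation; fill in your RDS endpoint and master username.
echo "psql -h <RDS endpoint> -U <master username> -d postgres -f pachyderm-db-setup.sql"
```

Skip the dex lines if your cluster is managed by an enterprise server.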

Update your values.yaml #

Once your databases have been created, add the following fields to your Helm values:

global:
  postgresql:
    postgresqlUsername: "username"
    postgresqlPassword: "password"
    # Alternatively, reference an existing Kubernetes secret:
    # postgresqlExistingSecretName: "<yoursecretname>"
    # The name of the database should be Pachyderm's ("pachyderm" in the example above), not "dex"
    postgresqlDatabase: "databasename"
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "RDS CNAME"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

6. Deploy Pachyderm #

You have set up your infrastructure, created your S3 bucket and an AWS Managed PostgreSQL instance, and granted your cluster access to both: you can now finalize your values.yaml and deploy Pachyderm.

Update Your Values.yaml #

ℹ️

If you have not created a managed PostgreSQL RDS instance, replace the postgresql section below with postgresql.enabled: true in your values.yaml. This setup is not recommended in production environments.

For gp3 EBS Volumes #

Check out our example of values.yaml for gp3 or use our minimal example below.

Gp3 + Service account annotations #

deployTarget: AMAZON
# This uses GP3 which requires the CSI Driver https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
# And a storageclass configured named gp3
etcd:
  storageClass: gp3
pachd:
  storage:
    amazon:
      bucket: blah
      region: us-east-2
  serviceAccount:
    additionalAnnotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/pachyderm-bucket-access
  worker:
    serviceAccount:
      additionalAnnotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/pachyderm-bucket-access
  externalService:
    enabled: true
global:
  postgresql:
    postgresqlUsername: "username"
    postgresqlPassword: "password" 
    # The name of the database should be Pachyderm's ("pachyderm" in the example above), not "dex" 
    postgresqlDatabase: "databasename"
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "RDS CNAME"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

Gp3 + AWS Credentials #

deployTarget: AMAZON
# This uses GP3 which requires the CSI Driver https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
# And a storageclass configured named gp3
etcd:
  storageClass: gp3
pachd:
  storage:
    amazon:
      bucket: blah
      region: us-east-2
      # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html
      id: AKIAIOSFODNN7EXAMPLE
      # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html
      secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  externalService:
    enabled: true           
global:
  postgresql:
    postgresqlUsername: "username"
    postgresqlPassword: "password" 
    # The name of the database should be Pachyderm's ("pachyderm" in the example above), not "dex" 
    postgresqlDatabase: "databasename"
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "RDS CNAME"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

For gp2 EBS Volumes #

Check out our example of values.yaml for gp2 or use our minimal example below.

Gp2 + Service account annotations #

deployTarget: AMAZON
etcd:
  storageSize: 500Gi
pachd:
  storage:
    amazon:
      bucket: blah
      region: us-east-2
  serviceAccount:
    additionalAnnotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/pachyderm-bucket-access
  worker:
    serviceAccount:
      additionalAnnotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/pachyderm-bucket-access
  externalService:
    enabled: true
global:
  postgresql:
    postgresqlUsername: "username"
    postgresqlPassword: "password" 
    # The name of the database should be Pachyderm's ("pachyderm" in the example above), not "dex" 
    postgresqlDatabase: "databasename"
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "RDS CNAME"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

Gp2 + AWS Credentials #

deployTarget: AMAZON
etcd:
  storageSize: 500Gi
pachd:
  storage:
    amazon:
      bucket: blah
      region: us-east-2
      # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html
      id: AKIAIOSFODNN7EXAMPLE            
      # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html           
      secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  externalService:
    enabled: true
global:
  postgresql:
    postgresqlUsername: "username"
    postgresqlPassword: "password" 
    # The name of the database should be Pachyderm's ("pachyderm" in the example above), not "dex" 
    postgresqlDatabase: "databasename"
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "RDS CNAME"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

Check the list of all available Helm values at your disposal in our reference documentation or on GitHub.

💡

Retain (ideally in version control) a copy of the Helm values used to deploy your cluster. It might be useful if you need to restore a cluster from a backup.

Deploy Pachyderm On The Kubernetes Cluster #
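A typical Helm-based deployment can be sketched as follows. The repository alias (pach) and release name (pachyderm) are conventions rather than requirements, and `run` is a hypothetical dry-run helper. By default the script only prints each command; set DRY_RUN=0 to execute them.

```shell
# Dry-run helper: prints commands unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Add the Pachyderm Helm repository and refresh the local index.
run helm repo add pach https://helm.pachyderm.com
run helm repo update

# Install Pachyderm with the values.yaml prepared in the previous steps.
run helm install pachyderm pach/pachyderm -f values.yaml

# Check that the pods (pachd, etcd, and so on) reach the Running state.
run kubectl get pods
```

Consider pinning the chart with Helm's --version flag so that later upgrades are deliberate.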

7. Have ‘pachctl’ And Your Cluster Communicate #

Assuming your pachd is running as shown above, make sure that pachctl can talk to the cluster.

If you are exposing your cluster publicly:

  1. Retrieve the external IP address of your TCP load balancer or your domain name:

    kubectl get services | grep pachd-lb | awk '{print $4}'

  2. Update the context of your cluster with its direct URL, using the external IP address or domain name above:

    echo '{"pachd_address": "grpc://<external-IP-address-or-domain-name>:30650"}' | pachctl config set context "<your-cluster-context-name>" --overwrite
    pachctl config set active-context "<your-cluster-context-name>"
  3. Check that you are using the right context:

    pachctl config get active-context

    Your cluster context name should show up.

If you’re not exposing pachd publicly, you can run:

# Background this process because it blocks.
pachctl port-forward

8. Check That Your Cluster Is Up And Running #

⚠️

If Authentication is activated (when you deploy with an enterprise key already set, for example), you need to run pachctl auth login, then authenticate to Pachyderm with your User, before you use pachctl.

pachctl version

System Response:

COMPONENT           VERSION
pachctl             2.3.9
pachd               2.3.9