
Google Cloud Platform

Learn how to deploy a Pachyderm cluster on Google's GKE.

March 22, 2023

For a quick test installation of Pachyderm on GCP (suitable for development), jump to our Quickstart page.

💡

Before you start your installation process:

  • Refer to our generic “Helm Install” page for more information on how to install and get started with Helm.
  • Read our infrastructure recommendations. You will find instructions on how to set up an ingress controller, a load balancer, or connect an Identity Provider for access control.
  • Pachyderm comes with a web UI (Console) for visualizing running pipelines and exploring your data. Note that, unless your deployment is LOCAL (i.e., on a local machine for development only, such as Minikube or Docker Desktop), deploying Console requires, at a minimum, the setup of an Ingress.
⚠️

We now ship Pachyderm with an embedded proxy that allows your cluster to expose a single port externally. This deployment setup is optional.

If you choose to deploy Pachyderm with a Proxy, check out our new recommended architecture and deployment instructions as they alter the instructions below.

The following section walks you through deploying a Pachyderm cluster on Google Kubernetes Engine (GKE).

In particular, you will:

  1. Make a few client installations before you start.
  2. Deploy Kubernetes.
⚠️

If you want to use an existing cluster, make sure to enable Workload Identity. If you are creating a new cluster, the script provided in this documentation takes care of it.

  3. Create a GCS bucket for your data and grant Pachyderm access.
  4. Enable the creation of persistent volumes.
  5. Create a GCP-managed PostgreSQL instance.
  6. Deploy Pachyderm.
  7. Install pachctl to interact with your cluster.
  8. Check that your cluster is up and running.
💡

TL;DR:

This script will create a GKE cluster, the Workload Identity service accounts and permissions you need, a static IP, a Cloud SQL instance and databases, and a Cloud Storage bucket. It will also install Pachyderm into the cluster.

  • Before running it, update the global variables at the top of the script and make sure to go through the prerequisites, as we are assuming that you have created a project and enabled the necessary APIs.
    Note that it will also create a file called ${NAME}.values.yaml in the current directory.

  • Once your script has run, configure your context and check that your cluster is up and running.

1. Prerequisites #

Install the following clients:

  • Google Cloud SDK (gcloud)
  • kubectl
  • pachctl

If this is the first time you are using the SDK, follow the Google SDK QuickStart Guide.

💡

You can install kubectl by using the Google Cloud SDK and running the following command:

gcloud components install kubectl

Additionally, before you begin your installation:

  • Create or select a GCP project.
  • Enable the necessary APIs on your project (the TL;DR script above assumes this is done).

2. Deploy Kubernetes #

⚠️

Pachyderm recommends running your cluster on Kubernetes 1.19.0 and above.

To create a new Kubernetes cluster by using GKE, run:

CLUSTER_NAME=<any unique name, e.g. "pach-cluster">
GCP_ZONE=<a GCP availability zone, e.g. "us-west1-a">
PROJECT_ID=<your GCP project ID; used by the --workload-pool flag below>
MACHINE_TYPE=<machine type for the k8s nodes, we recommend "n1-standard-4" or larger>

gcloud config set compute/zone ${GCP_ZONE}
gcloud config set container/cluster ${CLUSTER_NAME}
ℹ️

Adding --scopes storage-rw to the gcloud container clusters create ${CLUSTER_NAME} --machine-type ${MACHINE_TYPE} command below will grant the rw scope to whatever service account is on the cluster, which, if you don’t provide it, is the default compute service account for the project with Editor permissions. While this is not recommended in any production settings, this option can be useful for a quick setup in development. In that scenario, you do not need any service account or additional GCP Bucket permission (see below).

# By default the following command spins up a 3-node cluster. You can change the default with `--num-nodes VAL`.

gcloud container clusters create ${CLUSTER_NAME} \
 --machine-type=${MACHINE_TYPE} \
 --workload-pool=${PROJECT_ID}.svc.id.goog \
 --enable-ip-alias \
 --create-subnetwork="" \
 --enable-stackdriver-kubernetes \
 --enable-dataplane-v2 \
 --enable-shielded-nodes \
 --release-channel="regular" \
 --workload-metadata="GKE_METADATA" \
 --enable-autorepair \
 --enable-autoupgrade \
 --disk-type="pd-ssd" \
 --image-type="COS_CONTAINERD"
# By default, GKE clusters have RBAC enabled. To allow the 'helm install' to give the 'pachyderm' service account
# the requisite privileges via clusterrolebindings, you will need to grant *your user account* the privileges
# needed to create those clusterrolebindings.
#
# Note that this command is simple and concise, but gives your user account more privileges than necessary. See
# https://docs.pachyderm.io/en/latest/deploy-manage/deploy/rbac/ for the complete list of privileges that the
# pachyderm serviceaccount needs.
kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value account)

This might take a few minutes to start up. You can check the status on the GCP Console. A kubeconfig entry is automatically generated and set as the current context. As a sanity check, make sure your cluster is up and running by running the following kubectl command:

# List all pods in the kube-system namespace.
kubectl get pods -n kube-system

System Response:

NAME                                                        READY   STATUS    RESTARTS   AGE
event-exporter-gke-67986489c8-j4jr8                         2/2     Running   0          3m21s
fluentbit-gke-499hn                                         2/2     Running   0          3m6s
fluentbit-gke-7xp2f                                         2/2     Running   0          3m6s
fluentbit-gke-jx7wt                                         2/2     Running   0          3m6s
gke-metrics-agent-jmqsl                                     1/1     Running   0          3m6s
gke-metrics-agent-rd5pr                                     1/1     Running   0          3m6s
gke-metrics-agent-xxl52                                     1/1     Running   0          3m6s
kube-dns-6c7b8dc9f9-ff4bz                                   4/4     Running   0          3m16s
kube-dns-6c7b8dc9f9-mfjrt                                   4/4     Running   0          2m27s
kube-dns-autoscaler-58cbd4f75c-rl2br                        1/1     Running   0          3m16s
kube-proxy-gke-nad-cluster-default-pool-2e5710dd-38wz       1/1     Running   0          105s
kube-proxy-gke-nad-cluster-default-pool-2e5710dd-4b7j       1/1     Running   0          3m6s
kube-proxy-gke-nad-cluster-default-pool-2e5710dd-zmzh       1/1     Running   0          3m5s
l7-default-backend-66579f5d7-2q64d                          1/1     Running   0          3m21s
metrics-server-v0.3.6-6c47ffd7d7-k2hmc                      2/2     Running   0          2m38s
pdcsi-node-7dtbc                                            2/2     Running   0          3m6s
pdcsi-node-bcbcl                                            2/2     Running   0          3m6s
pdcsi-node-jl8hl                                            2/2     Running   0          3m6s
stackdriver-metadata-agent-cluster-level-85d6d797b4-4l457   2/2     Running   0          2m14s

If you don’t see something similar to the above output, you can point kubectl to the new cluster manually by running the following command:

# Update your kubeconfig to point at your newly created cluster.
gcloud container clusters get-credentials ${CLUSTER_NAME}

Once your Kubernetes cluster is up and your infrastructure is configured, you are ready to prepare for the installation of Pachyderm. Some of the steps below will require you to keep updating the values.yaml you started during the setup of the recommended infrastructure.

3. Create a GCS Bucket #

Create a GCS object store bucket for your data #

Pachyderm needs a GCS bucket (object store) to store your data. You can create the bucket by running the following commands:
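The bucket name below is a placeholder of your choosing; gsutil mb is one way to create the bucket:

BUCKET_NAME=<a globally unique bucket name, e.g. "pach-bucket">

gsutil mb gs://${BUCKET_NAME}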

⚠️

The GCS bucket name must be globally unique.

You now need to give Pachyderm access to your GCP resources.

Set Up Your GCP Service Account #

To access your GCP resources, Pachyderm uses a GCP project service account with permissions to access your desired resources.

For more information about creating and managing a service account, see the GCP documentation.
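One way to create it with gcloud (the account name below is a placeholder; the resulting SERVICE_ACCOUNT email is reused in the values.yaml later on this page):

SA_NAME=<service account name, e.g. "pachyderm-sa">

# Create the GCP service account and keep its email address handy.
gcloud iam service-accounts create ${SA_NAME}
SERVICE_ACCOUNT="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"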

Configure Your Service Account Permissions #

For Pachyderm to access your Google Cloud Resources, run the following:
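A sketch, assuming Workload Identity (enabled on the cluster created above) and the Kubernetes service accounts pachyderm and pachyderm-worker in the default namespace, as configured in the values.yaml later on this page; BUCKET_NAME, PROJECT_ID, and SERVICE_ACCOUNT carry over from the previous steps:

# Grant the GCP service account access to your bucket.
gsutil iam ch serviceAccount:${SERVICE_ACCOUNT}:roles/storage.admin gs://${BUCKET_NAME}

# If you use the Cloud SQL Auth proxy below, the same account also needs the Cloud SQL Client role.
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SERVICE_ACCOUNT}" \
  --role="roles/cloudsql.client"

# Let the Kubernetes service accounts impersonate the GCP service account.
gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/pachyderm]"

gcloud iam service-accounts add-iam-policy-binding ${SERVICE_ACCOUNT} \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/pachyderm-worker]"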

For a set of standard roles, read the GCP IAM permissions documentation.

4. Persistent Volumes Creation #

etcd and PostgreSQL (metadata storage) each claim the creation of a persistent disk.

If you plan to deploy Pachyderm with its default bundled PostgreSQL instance, read the warning below, and jump to the deployment section:

📖

When deploying Pachyderm on GCP, your persistent volumes are automatically created and assigned the default disk size of 50 GB. Note that StatefulSets are the default as well.

⚠️

Each persistent disk generally requires a small persistent volume size but high IOPS (1500). If you choose to override the default disk size, depending on your disk choice, you may need to oversize the volume significantly to ensure enough IOPS. For reference, 1GB should work fine for 1000 commits on 1000 files. 10GB is often a sufficient starting size, though we recommend provisioning at least 1500 write IOPS, which requires at least 50GB of space on SSD-based PDs and 1TB of space on Standard PDs.
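If you do override the defaults, the sizes are exposed through Helm values; a sketch (these field names are assumptions based on the chart's bundled etcd and PostgreSQL subcharts — confirm them against the Helm values reference linked later on this page):

etcd:
  # Assumed field: size of the persistent volume claimed by etcd.
  storageSize: 50Gi

postgresql:
  persistence:
    # Assumed field: size of the persistent volume claimed by the bundled PostgreSQL.
    size: 50Gi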

If you plan to deploy a managed PostgreSQL instance (CloudSQL), read the following section. Note that this is the recommended setup in production.

5. Create a GCP Managed PostgreSQL Database #

By default, Pachyderm runs with a bundled version of PostgreSQL. For production environments, it is strongly recommended that you disable the bundled version and use a CloudSQL instance.

This section will provide guidance on the configuration settings you will need to:

  • Create a CloudSQL instance.
  • Create Pachyderm’s databases.
  • Update your values.yaml accordingly.

ℹ️

It is assumed that you are already familiar with CloudSQL, or will be working with an administrator who is.

Create A CloudSQL Instance #

Find the details of the steps and available parameters to create a CloudSQL instance in GCP Documentation: “Create instances: CloudSQL for PostgreSQL”.

INSTANCE_NAME=<any unique name, e.g. "pach-postgres">

gcloud sql instances create ${INSTANCE_NAME} \
--database-version=POSTGRES_13 \
--cpu=2 \
--memory=7680MB \
--zone=${GCP_ZONE} \
--availability-type=ZONAL \
--storage-size=50GB \
--storage-type=SSD \
--storage-auto-increase \
--root-password="<InstanceRootPassword>"

When you create a new Cloud SQL for PostgreSQL instance, a default admin user ("postgres") is created. Pachyderm will later use this user to access its databases. Note that the --root-password flag above sets this user's password.

Check out Google documentation for more information on how to Create and Manage PostgreSQL Users.

Create Your Databases #

After your instance is created, you will need to create Pachyderm’s database(s).

If you plan to deploy a standalone cluster (i.e., if you do not plan to register your cluster with a separate enterprise server), you will need to create a second database named “dex” in your Cloud SQL instance for Pachyderm’s authentication service. Note that the database must be named dex. This second database is not needed when your cluster is managed by an enterprise server.

ℹ️

Run the first command, or both, depending on your use case.

gcloud sql databases create pachyderm -i ${INSTANCE_NAME}
gcloud sql databases create dex -i ${INSTANCE_NAME}

Pachyderm will use the same user “postgres” to connect to pachyderm as well as to dex.

Update your values.yaml #

Once your databases have been created, add the following fields to your Helm values:

ℹ️

To identify a Cloud SQL instance, you can find its connection name on the Overview page for your instance in the Google Cloud Console, or by running gcloud sql instances describe INSTANCE_NAME. The connection name uses the format project:region:instance, for example: myproject:myregion:myinstance.

You will need to retrieve the name of your Cloud SQL connection:

CLOUDSQL_CONNECTION_NAME=$(gcloud sql instances describe ${INSTANCE_NAME} --format=json | jq -r .connectionName)
cloudsqlAuthProxy:
  enabled: true
  connectionName: "<CLOUDSQL_CONNECTION_NAME>"
  serviceAccount: "<SERVICE_ACCOUNT>"
  resources:
    requests:
      memory: "500Mi"
      cpu:    "250m"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

global:
  postgresql:
    # The postgresql database host to connect to. Defaults to postgres service in subchart
    postgresqlHost: "cloudsql-auth-proxy.default.svc.cluster.local."
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"
    postgresqlSSL: "disable"
    postgresqlUsername: "postgres"
    postgresqlPassword: "<InstanceRootPassword>"

6. Deploy Pachyderm #

You have set up your infrastructure, created your GCP bucket and a CloudSQL instance, and granted your cluster access to both: you can now finalize your values.yaml and deploy Pachyderm.

Update Your Values.yaml #

See an example of values.yaml here.

You might want to create a static IP address to access your cluster externally. Refer to our infrastructure documentation for more details or check the example below:

STATIC_IP_NAME=<your address name>
GCP_REGION=<the GCP region of your cluster, e.g. "us-west1">

gcloud compute addresses create ${STATIC_IP_NAME} --region=${GCP_REGION}

STATIC_IP_ADDR=$(gcloud compute addresses describe ${STATIC_IP_NAME} --region=${GCP_REGION} --format=json --flatten=address | jq -r '.[]')
ℹ️

If you have not created a managed CloudSQL instance, replace the postgresql section below with postgresql.enabled: true in your values.yaml and remove the cloudsqlAuthProxy fields. This setup is not recommended in production environments.

Retrieve these additional variables, then fill in their values in the YAML file below:

echo $BUCKET_NAME
echo $SERVICE_ACCOUNT
echo $CLOUDSQL_CONNECTION_NAME
deployTarget: GOOGLE

pachd:
  enabled: true
  externalService:
    enabled: true
    apiGRPCport:    31400
    loadBalancerIP: "<STATIC_IP_ADDR>"
  storage:
    google:
      bucket: "<BUCKET_NAME>"
  serviceAccount:
    additionalAnnotations:
      iam.gke.io/gcp-service-account: "<SERVICE_ACCOUNT>"
    name: "pachyderm"
  worker:
    serviceAccount:
      additionalAnnotations:
        iam.gke.io/gcp-service-account: "<SERVICE_ACCOUNT>"
      name: "pachyderm-worker"

cloudsqlAuthProxy:
  enabled: true
  connectionName: "<CLOUDSQL_CONNECTION_NAME>"
  serviceAccount: "<SERVICE_ACCOUNT>"
  resources:
    requests:
      memory: "500Mi"
      cpu:    "250m" 

postgresql:
  enabled: false

global:
  postgresql:
    postgresqlHost: "cloudsql-auth-proxy.default.svc.cluster.local."
    postgresqlPort: "5432"
    postgresqlSSL: "disable"
    postgresqlUsername: "postgres"
    postgresqlPassword: "<InstanceRootPassword>"
⚠️

Attention Proxy users: replace the pachd.externalService section:

pachd:
  enabled: true
  externalService:
    enabled: true
    apiGRPCport:    31400
    loadBalancerIP: "<STATIC_IP_ADDR>"

with:

proxy:
  enabled: true
  service:
    type: LoadBalancer
    loadBalancerIP:  "<STATIC_IP_ADDR>"
pachd:
  enabled: true
ℹ️

Check the list of all available Helm values at your disposal in our reference documentation or on GitHub.

Deploy Pachyderm on the Kubernetes cluster #

⚠️

If RBAC authorization is a requirement or you run into any RBAC errors see Configure RBAC.
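A typical install adds the public Pachyderm Helm repository and applies the values file assembled above (the release name pachd and the file name my_pachyderm_values.yaml are placeholders):

# Add the Pachyderm Helm repository and install the chart with your values.
helm repo add pach https://helm.pachyderm.com
helm repo update
helm install pachd pach/pachyderm -f my_pachyderm_values.yaml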

It may take a few minutes for the pachd pods to be up and running because Pachyderm pulls containers from DockerHub. You can check the cluster status with kubectl, which should output the following when Pachyderm is up and running:

kubectl get pods

Once the pods are up, you should see a pod for pachd running (alongside etcd, pg-bouncer or postgres, console, depending on your installation).

System Response:

NAME                     READY   STATUS    RESTARTS   AGE
etcd-0                   1/1     Running   0          4m50s
pachd-5db79fb9dd-b2gdq   1/1     Running   2          4m49s
postgres-0               1/1     Running   0          4m50s

If you see a few restarts on the pachd pod, you can safely ignore them. That simply means that Kubernetes tried to bring up those containers before other components were ready, so it restarted them.

7. Have ‘pachctl’ and your Cluster Communicate #

Assuming your pachd is running as shown above, make sure that pachctl can talk to the cluster.
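If you have not installed pachctl yet, do that first; for example, on macOS via Homebrew (the @2.3 tap shown here is an assumption matching the versions at the end of this page — pick the tap for your cluster's minor version):

brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.3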

If you are exposing your cluster publicly:

  1. Retrieve the external IP address of your TCP load balancer or your domain name:

    kubectl get services | grep pachd-lb | awk '{print $4}'

  2. Update the context of your cluster with its direct URL, using the external IP address or domain name above:

    echo '{"pachd_address": "grpc://<external-IP-address-or-domain-name>:30650"}' | pachctl config set context "<your-cluster-context-name>" --overwrite
    pachctl config set active-context "<your-cluster-context-name>"
  3. Check that you are using the right context:

    pachctl config get active-context

    Your cluster context name should show up.

If you’re not exposing pachd publicly, you can run:

# Background this process because it blocks.
pachctl port-forward

8. Check That Your Cluster Is Up And Running #

You are done! You can make sure that your cluster is working by running pachctl version or creating a new repo.

⚠️

If Authentication is activated (when you deploy with an enterprise key, for example), you will need to run pachctl auth login, then authenticate to Pachyderm with your user account, before you use pachctl.

pachctl version

System Response:

COMPONENT           VERSION
pachctl             2.3.9
pachd               2.3.9
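For example, to verify your cluster with a quick test repo (the repo name is arbitrary):

pachctl create repo test
pachctl list repo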