Azure

Learn how to deploy a Pachyderm cluster on Microsoft Azure.

March 30, 2023

For a quick test installation of Pachyderm on Azure (suitable for development), jump to our Quickstart page.

💡

Before you start your installation process:

  • Refer to our generic “Helm Install” page for more information on how to install and get started with Helm.
  • Read our infrastructure recommendations. You will find instructions on how to set up an ingress controller, a load balancer, or connect an Identity Provider for access control.
  • Pachyderm comes with a web UI (Console) for visualizing running pipelines and exploring your data. Note that, unless your deployment is LOCAL (i.e., on a local machine for development only, for example, on Minikube or Docker Desktop), deploying Console requires, at a minimum, setting up an Ingress.
⚠️

We are now shipping Pachyderm with an embedded proxy allowing your cluster to expose one single port externally. This deployment setup is optional.

If you choose to deploy Pachyderm with a Proxy, check out our new recommended architecture and deployment instructions as they alter the instructions below.

The following section walks you through deploying a Pachyderm cluster on the Microsoft® Azure® Kubernetes Service (AKS).

In particular, you will complete the numbered steps below.

1. Install Prerequisites #

Before you start creating your cluster, install the following clients on your machine. If not explicitly specified, use the latest available version of the components listed below.

ℹ️

This page assumes that you have an Azure Subscription.

2. Deploy Kubernetes #

You can deploy Kubernetes on Azure by following the official Azure Kubernetes Service documentation, using the quickstart walkthrough, or following the steps in this section.

⚠️

Pachyderm recommends running your cluster on Kubernetes 1.19.0 and above.

At a minimum, you will need to specify the parameters below:

| Variable | Description |
|---|---|
| RESOURCE_GROUP | A unique name for the resource group where Pachyderm is deployed. For example, pach-resource-group. |
| LOCATION | An Azure region where AKS is available. For example, centralus. |
| NODE_SIZE | The size of the Kubernetes virtual machine (VM) instances. To avoid performance issues, Pachyderm recommends that you set this value to at least Standard_DS4_v2, which gives you 8 vCPUs, 28 GiB of memory, and a 56 GiB SSD. In any case, use VMs that support premium storage. See Azure VM sizes for details on which sizes support Premium storage. |
| CLUSTER_NAME | A unique name for the Pachyderm cluster. For example, pach-aks-cluster. |
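
The commands in this section read these parameters from shell variables. As a minimal sketch, assuming a Bash-compatible shell and reusing the example values listed above (replace them with your own), you could set them as follows:

```shell
# Example values taken from the parameter descriptions above; replace them with your own.
RESOURCE_GROUP="pach-resource-group"
LOCATION="centralus"
NODE_SIZE="Standard_DS4_v2"
CLUSTER_NAME="pach-aks-cluster"

# Print the values so you can review them before provisioning anything.
printf 'Deploying %s in %s (%s, %s nodes)\n' \
  "${CLUSTER_NAME}" "${RESOURCE_GROUP}" "${LOCATION}" "${NODE_SIZE}"
```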

You can either follow the guided steps under Kubernetes Services in the Azure portal or use the Azure CLI.

  1. Log in to Azure:

    az login

    This command opens a browser window. Log in with your Azure credentials. Resources can now be provisioned on the Azure subscription linked to your account.

  2. Create an Azure resource group or retrieve an existing group.

    az group create --name ${RESOURCE_GROUP} --location ${LOCATION}

    Example:

    az group create --name test-group --location centralus

    System Response:

    {
      "id": "/subscriptions/6c9f2e1e-0eba-4421-b4cc-172f959ee110/resourceGroups/test-group",
      "location": "centralus",
      "managedBy": null,
      "name": "test-group",
      "properties": {
        "provisioningState": "Succeeded"
      },
      "tags": null,
      "type": null
    }
  3. Create an AKS cluster in the resource group/location:

    For more configuration options, see the list of all available flags of the az aks create command.

    az aks create --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME} --node-vm-size ${NODE_SIZE} --node-count <node_pool_count> --location ${LOCATION}

    Example:

    az aks create --resource-group test-group --name test-cluster --generate-ssh-keys --node-vm-size Standard_DS4_v2 --location centralus
  4. Confirm the version of the Kubernetes server by running kubectl version.
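
The version check in step 4 can also be scripted. The sketch below is illustrative only: the helper names version_ge and check_server_version are hypothetical, and it assumes that jq is installed and that your sort supports the -V (version sort) option, as GNU coreutils does.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort), which is an assumption about your platform.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_server_version MIN: reads the server version from kubectl and compares it.
# Assumes kubectl and jq are installed and kubectl points at your AKS cluster.
check_server_version() {
  local min="$1" server
  server="$(kubectl version --output json | jq -r '.serverVersion.gitVersion')" || return 1
  server="${server#v}"   # strip the leading "v", for example v1.25.6 becomes 1.25.6
  if version_ge "${server}" "${min}"; then
    echo "Kubernetes ${server} meets the recommended minimum ${min}"
  else
    echo "Kubernetes ${server} is older than the recommended minimum ${min}" >&2
    return 1
  fi
}
```

Once kubectl is configured for your cluster, run check_server_version 1.19.0 to compare the server against the recommended minimum.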

ℹ️

See also: Azure Virtual Machine sizes

Once your Kubernetes cluster is up and your infrastructure is configured, you are ready to prepare for the installation of Pachyderm. Some of the steps below require you to keep updating the values.yaml file that you started during the setup of the recommended infrastructure.

3. Create an Azure Storage Container For Your Data #

Pachyderm needs an Azure Storage Container (Object store) to store your data.

To access your data, Pachyderm uses a Storage Account with permissioned access to your desired container. You can either use an existing account or create a new one in your default subscription, then pass the access key associated with the account on to Pachyderm.

To create a new storage account, follow the steps below:

⚠️

The storage account name must be unique within Azure.

ℹ️

Find the generated key in the Storage accounts > Access keys section of the Azure Portal (https://portal.azure.com/) or by running the following command: az storage account keys list --account-name=${STORAGE_ACCOUNT}.
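
As a rough sketch of how a storage account and container could be created with the Azure CLI, the hypothetical helper below wraps three az commands. The function name, the Premium_LRS SKU, and the BlockBlobStorage kind are assumptions chosen for illustration; adjust them to your requirements.

```shell
# create_storage RESOURCE_GROUP LOCATION STORAGE_ACCOUNT CONTAINER_NAME
# Creates a storage account, looks up its first access key, and creates a container.
# The SKU and kind below are assumptions, not requirements stated on this page.
create_storage() {
  local resource_group="$1" location="$2" storage_account="$3" container_name="$4"
  local storage_key

  az storage account create \
    --resource-group "${resource_group}" \
    --location "${location}" \
    --name "${storage_account}" \
    --sku Premium_LRS \
    --kind BlockBlobStorage || return 1

  # Keep the first access key of the account; it is needed to create the container.
  storage_key="$(az storage account keys list \
    --resource-group "${resource_group}" \
    --account-name "${storage_account}" \
    --query '[0].value' --output tsv)" || return 1

  az storage container create \
    --name "${container_name}" \
    --account-name "${storage_account}" \
    --account-key "${storage_key}"
}
```

For example, after az login: create_storage "${RESOURCE_GROUP}" "${LOCATION}" "${STORAGE_ACCOUNT}" "${CONTAINER_NAME}". Keep the container name, account name, and key at hand; they go into your values.yaml later.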

4. Persistent Volumes Creation #

etcd and PostgreSQL (metadata storage) each claim a persistent volume (PV).

If you plan to deploy Pachyderm with its default bundled PostgreSQL instance, read the warning below and jump to the deployment section:

⚠️

The metadata service (Persistent disk) generally requires only a small persistent volume (for example, 10 GB) but high IOPS (1500). Depending on your disk choice, you may therefore need to oversize the volume significantly to ensure enough IOPS.
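
To illustrate the oversizing, the sketch below picks the smallest disk tier that reaches a required IOPS figure. The tier table is an assumption based on Azure's published Premium SSD sizes and may be out of date, so check the current Azure documentation before relying on it. The function name is hypothetical.

```shell
# smallest_tier_for_iops REQUIRED_IOPS
# Prints the first (smallest) tier whose IOPS meets the requirement.
# Table columns: tier name, size in GiB, provisioned IOPS (assumed values).
smallest_tier_for_iops() {
  local required="$1" name size iops
  while read -r name size iops; do
    if [ "${iops}" -ge "${required}" ]; then
      echo "${name} ${size}GiB ${iops}IOPS"
      return 0
    fi
  done <<'EOF'
P10 128 500
P15 256 1100
P20 512 2300
P30 1024 5000
EOF
  return 1
}

smallest_tier_for_iops 1500   # prints: P20 512GiB 2300IOPS
```

Under these assumed figures, a volume holding only about 10 GB of metadata would still need a 512 GiB disk to reach 1500 IOPS.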

If you plan to deploy a managed PostgreSQL instance (Recommended in production), read the following section.

5. Create an Azure Managed PostgreSQL Server Database #

By default, Pachyderm runs with a bundled version of PostgreSQL. For production environments, we strongly recommend that you disable the bundled version and use a PostgreSQL Server instance.

This section provides guidance on the configuration settings you need to create a PostgreSQL Server instance, create your databases, and update your Helm values.

ℹ️

It is assumed that you are already familiar with PostgreSQL Server, or will be working with an administrator who is.

Create A PostgreSQL Server Instance #

📖

Find the details of the steps and available parameters to create a PostgreSQL Server instance with Azure Console in Azure Documentation “Create an Azure Database for PostgreSQL server by using the Azure portal”.

Alternatively, you can use the CLI and run az postgres server create with your relevant parameters.

In the Azure console, choose the Azure Database for PostgreSQL servers service. You will be asked to pick your server type: Single Server or Hyperscale (for multi-tenant applications), then configure your DB instance as follows.

| Setting | Recommended value |
|---|---|
| subscription and resource group | Pick your existing resource group. Important: Your Cluster and your Database must be deployed in the same resource group. |
| server name | Name your instance. |
| location | Create a database in the region matching your Pachyderm cluster. |
| compute + storage | The standard instance size (GP_Gen5_4 = Gen5 VMs with 4 cores) should work. Remember that Pachyderm’s metadata services require high IOPS (1500). Oversize the disk accordingly. |
| Master username | Choose your Admin username. (“postgres”) |
| Master password | Choose your Admin password. |

You are ready to create your instance.

Example #

az postgres server create \
    --resource-group <your_resource_group> \
    --name <your_server_name>  \
    --location westus \
    --sku-name GP_Gen5_2 \
    --admin-user <server_admin_username> \
    --admin-password <server_admin_password> \
    --ssl-enforcement Disabled \
    --version 11
⚠️
  • Make sure that your PostgreSQL version is >= 11
  • Keep the SSL setting Disabled.

Once the server is created, go back to it and configure access to your instance:

ℹ️

Azure provides two options for pods running on AKS worker nodes to access a PostgreSQL DB instance. Pick the one that fits you best:

  • Create a firewall rule on the Azure DB Server with a range of IP addresses that encompasses all IPs of the AKS Cluster nodes (this can be a very large range if using node auto-scaling).
  • Create a VNet Rule on the Azure DB Server that allows access from the subnet the AKS nodes are in. This is used in conjunction with the Microsoft.Sql VNet Service Endpoint enabled on the cluster subnet.

You can also choose the more secure option: deny public access to your PostgreSQL instance, then create a private endpoint in the Kubernetes VNet. Azure’s documentation describes how to configure a private link using the CLI.

Alternatively, in the Connection Security settings of your newly created server, select Allow access to Azure services. This is equivalent to running az postgres server firewall-rule create --server-name <your_server_name> --resource-group <your_resource_group> --name AllowAllAzureIps --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0.

Create Your Databases #

After your instance is created, you will need to create Pachyderm’s database(s).

Create a database named pachyderm; this is the name referenced by postgresqlDatabase in the Helm values. If you plan to deploy a standalone cluster (i.e., if you do not plan to register your cluster with a separate enterprise server), you also need to create a second database named “dex” in your PostgreSQL Server instance for Pachyderm’s authentication service. Note that this database must be named dex. This second database is not needed when your cluster is managed by an enterprise server.

ℹ️

Pachyderm will use the same user to connect to pachyderm as well as to dex.
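
One way to create the databases is with the az postgres db create command. The helper below is a sketch; its name is hypothetical, and it assumes you are logged in with the Azure CLI and that the server was created as a Single Server instance.

```shell
# create_databases RESOURCE_GROUP SERVER_NAME DB_NAME...
# Creates each named database on the given Azure PostgreSQL server.
create_databases() {
  local resource_group="$1" server_name="$2" db
  shift 2
  for db in "$@"; do
    az postgres db create \
      --resource-group "${resource_group}" \
      --server-name "${server_name}" \
      --name "${db}" || return 1
  done
}
```

For a standalone cluster, you might run create_databases <your_resource_group> <your_server_name> pachyderm dex. Omit dex when your cluster is managed by an enterprise server.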

Update your yaml values #

Once your databases have been created, add the following fields to your Helm values:

global:
  postgresql:
    postgresqlUsername: "see admin username above"
    postgresqlPassword: "password"
    # The name of the database Pachyderm connects to
    postgresqlDatabase: "pachyderm"
    # The postgresql database host to connect to. 
    postgresqlHost: "see server name above"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"

postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false

6. Deploy Pachyderm #

You have set up your infrastructure, created your data container and a Managed PostgreSQL instance, and granted your cluster access to both: you can now finalize your values.yaml and deploy Pachyderm.

Update Your Values.yaml #

ℹ️

If you have not created a Managed PostgreSQL Server instance, replace the PostgreSQL section of the values.yaml below with postgresql.enabled: true. This setup is not recommended in production environments.

If you have previously tried to run Pachyderm locally, make sure that you are using the right Kubernetes context first.

  1. Verify cluster context:

    kubectl config current-context

    This command should return the name of your Kubernetes cluster that runs on Azure.

    If you have a different context displayed, configure kubectl to use your Azure configuration:

    az aks get-credentials --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME}

    System Response:

    Merged "${CLUSTER_NAME}" as current context in /Users/test-user/.kube/config
  2. Update your values.yaml

    Update your values.yaml with your container name (see example of values.yaml here) or use our minimal example below.

    deployTarget: "MICROSOFT"
    pachd:
      storage:
        microsoft:
          # storage container name
          container: "container_name"
          # storage account name
          id: "AKIAIOSFODNN7EXAMPLE"
          # storage account key
          secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      externalService:
        enabled: true
    global:
      postgresql:
        postgresqlUsername: "see admin username above"
        postgresqlPassword: "password"
        # The name of the database Pachyderm connects to
        postgresqlDatabase: "pachyderm"
        # The postgresql database host to connect to. 
        postgresqlHost: "see server name above"
        # The postgresql database port to connect to. Defaults to postgres server in subchart
        postgresqlPort: "5432"
    postgresql:
      # turns off the install of the bundled postgres.
      # If not using the built in Postgres, you must specify a Postgresql
      # database server to connect to in global.postgresql
      enabled: false

    Check the list of all available helm values at your disposal in our reference documentation or on Github.
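
If you keep your storage settings in shell variables, the storage portion of the minimal example can be generated rather than typed by hand. The function below is a sketch with a hypothetical name. It assumes CONTAINER_NAME, STORAGE_ACCOUNT, and STORAGE_KEY are set in your shell, and it writes only the storage-related keys; add the PostgreSQL settings afterwards.

```shell
# write_storage_values [FILE]: writes the storage part of a values.yaml (default: values.yaml).
# Reads CONTAINER_NAME, STORAGE_ACCOUNT, and STORAGE_KEY from the environment.
write_storage_values() {
  local output_file="${1:-values.yaml}"
  if [ -z "${CONTAINER_NAME:-}" ] || [ -z "${STORAGE_ACCOUNT:-}" ] || [ -z "${STORAGE_KEY:-}" ]; then
    echo "Set CONTAINER_NAME, STORAGE_ACCOUNT, and STORAGE_KEY first" >&2
    return 1
  fi
  # The keys below mirror the minimal example above.
  cat > "${output_file}" <<EOF
deployTarget: "MICROSOFT"
pachd:
  storage:
    microsoft:
      container: "${CONTAINER_NAME}"
      id: "${STORAGE_ACCOUNT}"
      secret: "${STORAGE_KEY}"
  externalService:
    enabled: true
EOF
}
```

The resulting file contains your storage account key, so treat it as a secret.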

Deploy Pachyderm On The Kubernetes Cluster #
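
A typical Helm-based deployment could look like the sketch below. It assumes Helm 3 is installed, that values.yaml is in the current directory, and that you want the release to be named pachyderm; the function name is hypothetical. Refer to the “Helm Install” page for the authoritative steps.

```shell
# deploy_pachyderm [VALUES_FILE]: installs the Pachyderm chart using the given values file.
deploy_pachyderm() {
  local values_file="${1:-values.yaml}"

  # Register Pachyderm's public Helm repository and refresh the local index.
  helm repo add pachyderm https://helm.pachyderm.com || return 1
  helm repo update || return 1

  # Install the chart under the release name "pachyderm" (an assumed name).
  helm install pachyderm pachyderm/pachyderm -f "${values_file}" || return 1

  # List the pods; pachd should eventually report a Running status.
  kubectl get pods
}
```

Run deploy_pachyderm values.yaml, then repeat kubectl get pods until the pachd pod is running.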

7. Have ‘pachctl’ And Your Cluster Communicate #

Assuming your pachd is running as shown above, make sure that pachctl can talk to the cluster.

If you are exposing your cluster publicly:

  1. Retrieve the external IP address of your TCP load balancer or your domain name:

    kubectl get services | grep pachd-lb | awk '{print $4}'
  2. Update the context of your cluster with its direct URL, using the external IP address or domain name above:

    echo '{"pachd_address": "grpc://<external-IP-address-or-domain-name>:30650"}' | pachctl config set context "<your-cluster-context-name>" --overwrite
    pachctl config set active-context "<your-cluster-context-name>"
  3. Check that you are using the right context:

    pachctl config get active-context

    Your cluster context name should show up.

If you’re not exposing pachd publicly, you can run:

# Background this process because it blocks.
pachctl port-forward &

8. Check That Your Cluster Is Up And Running #

⚠️

If authentication is activated (for example, when you deploy with an enterprise key already set), you need to run pachctl auth login and authenticate to Pachyderm with your user before you use pachctl.

pachctl version

System Response:

COMPONENT           VERSION
pachctl             2.3.9
pachd               2.3.9