Azure
Learn how to deploy a Pachyderm cluster on Microsoft Azure.
March 22, 2023
This article walks you through deploying a Pachyderm cluster on a Microsoft® Azure® Kubernetes Service (AKS) environment.
Before You Start #
Before you can deploy Pachyderm on an AKS cluster, verify that you have the following prerequisites installed and configured:
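The commands in this article rely on the az, kubectl, helm, jq, and pachctl command-line tools. As a quick check, the sketch below reports which of them are missing from your PATH. The tool list is inferred from the commands used later in this article, not taken from an official prerequisite list, and missing_tools is a hypothetical helper name.

```shell
#!/bin/sh
# Print the name of each listed tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool list inferred from the commands in this article (an assumption).
missing="$(missing_tools az kubectl helm jq pachctl)"
if [ -n "$missing" ]; then
  # The unquoted expansion joins the names onto one line.
  echo "Missing tools:" $missing
else
  echo "All tools found."
fi
```

Install anything the script reports as missing before you continue.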
1. Deploy Kubernetes #
You can deploy Kubernetes on Azure by following the official Azure Kubernetes Service documentation, using the quickstart walkthrough, or following the steps in this section.
At a minimum, you will need to specify the parameters below:
Variable | Description |
---|---|
RESOURCE_GROUP | A unique name for the resource group where Pachyderm is deployed. For example, pach-resource-group. |
LOCATION | An Azure region where AKS is available. For example, centralus. |
NODE_SIZE | The size of the Kubernetes virtual machine (VM) instances. To avoid performance issues, Pachyderm recommends that you set this value to at least Standard_DS4_v2, which gives you 8 CPUs, 28 GiB of memory, and 56 GiB of SSD storage. In any case, use VMs that support premium storage. See Azure VM sizes for details about which sizes support premium storage. |
CLUSTER_NAME | A unique name for the Pachyderm cluster. For example, pach-aks-cluster. |
You can either follow the guided steps under Kubernetes services in the Azure portal or use the Azure CLI.
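If you use the Azure CLI, it can help to set the variables from the table above in your shell first. The sketch below uses the example values from the table; replace them with your own. The checks at the end stop the script if a variable is empty.

```shell
#!/bin/sh
# Example values taken from the table above; replace them with your own.
RESOURCE_GROUP="pach-resource-group"
LOCATION="centralus"
NODE_SIZE="Standard_DS4_v2"
CLUSTER_NAME="pach-aks-cluster"
export RESOURCE_GROUP LOCATION NODE_SIZE CLUSTER_NAME

# Abort with an error message if any variable is unset or empty.
: "${RESOURCE_GROUP:?must be set}"
: "${LOCATION:?must be set}"
: "${NODE_SIZE:?must be set}"
: "${CLUSTER_NAME:?must be set}"

echo "Deploying ${CLUSTER_NAME} (${NODE_SIZE}) to ${RESOURCE_GROUP} in ${LOCATION}"
```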
-
Log in to Azure:
az login
This command opens a browser window. Log in with your Azure credentials. Resources can now be provisioned on the Azure subscription linked to your account.
-
Create an Azure resource group or retrieve an existing group.
az group create --name ${RESOURCE_GROUP} --location ${LOCATION}
Example:
az group create --name test-group --location centralus
System Response:
{
  "id": "/subscriptions/6c9f2e1e-0eba-4421-b4cc-172f959ee110/resourceGroups/test-group",
  "location": "centralus",
  "managedBy": null,
  "name": "test-group",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": null
}
-
Create an AKS cluster in the resource group and location:
For more configuration options, see the list of all available flags of the az aks create command.
az aks create --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME} --node-vm-size ${NODE_SIZE} --node-count <node_pool_count> --location ${LOCATION}
Example:
az aks create --resource-group test-group --name test-cluster --generate-ssh-keys --node-vm-size Standard_DS4_v2 --location centralus
-
Confirm the version of the Kubernetes server by running kubectl version.
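Beyond the server version, you may also want to confirm that the worker nodes have joined the cluster. The sketch below counts nodes in the Ready state from the output of kubectl get nodes. count_ready_nodes is a hypothetical helper, and it assumes the default column layout, in which STATUS is the second column.

```shell
#!/bin/sh
# Count the nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output from standard input.
# Nodes reported as "Ready,SchedulingDisabled" are deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { ready++ } END { print ready + 0 }'
}

# Usage, once kubectl points at the AKS cluster:
#   kubectl get nodes --no-headers | count_ready_nodes
```

If kubectl does not point at your AKS cluster yet, run az aks get-credentials --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME} first.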
2. Create an Azure Storage Container For Your Data #
-
Set up the following variables:
STORAGE_ACCOUNT: The name of the storage account where you store your data.
CONTAINER_NAME: The name of the Azure blob container where you store your data.
-
Create an Azure storage account:
az storage account create \
  --resource-group="${RESOURCE_GROUP}" \
  --location="${LOCATION}" \
  --sku=Premium_LRS \
  --name="${STORAGE_ACCOUNT}" \
  --kind=BlockBlobStorage
System response:
{ "accessTier": null, "creationTime": "2019-06-20T16:05:55.616832+00:00", "customDomain": null, "enableAzureFilesAadIntegration": null, "enableHttpsTrafficOnly": false, "encryption": { "keySource": "Microsoft.Storage", "keyVaultProperties": null, "services": { "blob": { "enabled": true, ...
Make sure that you set the Stock Keeping Unit (SKU) to Premium_LRS and the kind parameter to BlockBlobStorage. This configuration results in storage that uses SSDs rather than standard Hard Disk Drives (HDD). If you choose an HDD-based storage option, your Pachyderm cluster will be too slow and might malfunction.
-
Verify that your storage account has been successfully created:
az storage account list
-
Obtain the key for the storage account (STORAGE_ACCOUNT) in the resource group used to deploy Pachyderm:
STORAGE_KEY="$(az storage account keys list \
  --account-name="${STORAGE_ACCOUNT}" \
  --resource-group="${RESOURCE_GROUP}" \
  --output=json \
  | jq '.[0].value' -r)"
You can also find the generated key in the Storage accounts > Access keys section of the Azure Portal, or by running az storage account keys list --account-name=${STORAGE_ACCOUNT}.
-
Create a new storage container within your storage account:
az storage container create --name ${CONTAINER_NAME} \
  --account-name ${STORAGE_ACCOUNT} \
  --account-key "${STORAGE_KEY}"
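Azure rejects storage account and container names that break its naming rules, so it can save a round trip to check STORAGE_ACCOUNT and CONTAINER_NAME before running the commands above. The sketch below encodes those rules: account names are 3 to 24 lowercase letters and digits; container names are 3 to 63 lowercase letters, digits, and single hyphens, starting and ending with a letter or digit. The function names and sample names are hypothetical.

```shell
#!/bin/sh
# Return 0 if $1 is a valid storage account name:
# 3-24 characters, lowercase letters and digits only.
valid_storage_account_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

# Return 0 if $1 is a valid blob container name:
# 3-63 characters, lowercase letters, digits, and single hyphens,
# where every hyphen sits between two letters or digits.
valid_container_name() {
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 63 ] &&
    printf '%s' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'
}

# Sample names for illustration only.
for name in pachstorage01 Pach_Storage; do
  if valid_storage_account_name "$name"; then
    echo "$name: valid storage account name"
  else
    echo "$name: invalid storage account name"
  fi
done
```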
3. Persistent Volumes Creation #
etcd and PostgreSQL (metadata storage) each claim a persistent volume (PV).
If you plan to deploy Pachyderm with its default bundled PostgreSQL instance, read the warning below and jump to the deployment section:
The metadata service (persistent disk) generally requires a small persistent volume (for example, 10 GB) but high IOPS (1500). Depending on your disk choice, you may therefore need to oversize the volume significantly to provide enough IOPS.
If you plan to deploy a managed PostgreSQL instance (recommended in production), read the following section.
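To illustrate the oversizing, the sketch below computes the smallest volume that reaches a target IOPS figure, assuming a disk type whose IOPS scale linearly with provisioned size. The ratio of 3 IOPS per GB is an illustrative assumption, not a figure from this article; check the documentation of your disk or database tier for the real value. min_disk_gb is a hypothetical helper.

```shell
#!/bin/sh
# Smallest whole number of GB that yields at least $1 IOPS
# when the disk provides $2 IOPS per provisioned GB (ceiling division).
min_disk_gb() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 1500 IOPS is the target from this article; 3 IOPS per GB is an assumed ratio.
min_disk_gb 1500 3   # prints 500
```

Under that assumption, a volume that only needs about 10 GB of capacity would have to be provisioned at 500 GB to meet the IOPS target.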
4. Create an Azure Managed PostgreSQL Server Database #
By default, Pachyderm runs with a bundled version of PostgreSQL. For production environments, we strongly recommend that you disable the bundled version and use a PostgreSQL Server instance.
Create A PostgreSQL Server Instance #
In the Azure console, choose the Azure Database for PostgreSQL servers service. You will be asked to pick your server type, Single Server or Hyperscale (for multi-tenant applications). Then configure your DB instance as follows.
SETTING | Recommended value |
---|---|
subscription and resource group | Pick your existing resource group. Important: Your cluster and your database must be deployed in the same resource group. |
server name | Name your instance. |
location | Create a database in the region matching your Pachyderm cluster. |
compute + storage | The standard instance size (GP_Gen5_4 = Gen5 VMs with 4 cores) should work. Remember that Pachyderm’s metadata services require high IOPS (1500). Oversize the disk accordingly. |
Master username | Choose your admin username (for example, “postgres”). |
Master password | Choose your Admin password. |
You are ready to create your instance.
Example #
az postgres server create \
--resource-group <your_resource_group> \
--name <your_server_name> \
--location westus \
--sku-name GP_Gen5_2 \
--admin-user <server_admin_username> \
--admin-password <server_admin_password> \
--ssl-enforcement Disabled \
--version 11
For detailed steps, see the official Azure documentation.
- Make sure that your PostgreSQL version is >= 11.
- Keep the SSL setting Disabled.
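The version requirement above can be checked in a script. The sketch below compares the major version of a PostgreSQL version string against 11; pg_version_ok is a hypothetical helper. The az query in the usage comment assumes that the server resource exposes a version field, so verify it against your CLI output.

```shell
#!/bin/sh
# Return 0 if the PostgreSQL version string in $1 has a major version >= 11.
pg_version_ok() {
  major="${1%%.*}"                    # "14.7" -> "14", "11" -> "11"
  [ "$major" -ge 11 ] 2>/dev/null     # non-numeric input counts as a failure
}

# Usage (assumes the server resource exposes a "version" field):
#   pg_version_ok "$(az postgres server show \
#     --resource-group <your_resource_group> \
#     --name <your_server_name> \
#     --query version --output tsv)" && echo "version ok"
```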
Once created, go back to your newly created database, and:
- Open the access to your instance:
Azure provides two options for pods running on AKS worker nodes to access a PostgreSQL DB instance. Pick the one that fits you best:
- Create a firewall rule on the Azure DB Server with a range of IP addresses that encompasses all IPs of the AKS Cluster nodes (this can be a very large range if using node auto-scaling).
- Create a VNet Rule on the Azure DB Server that allows access from the subnet the AKS nodes are in. This is used in conjunction with the Microsoft.Sql VNet Service Endpoint enabled on the cluster subnet.
You can also choose the more secure option: deny public access to your PostgreSQL instance, then create a private endpoint in the Kubernetes VNet. Read more about how to configure a private link using the CLI in Azure’s documentation.
Alternatively, in the Connection Security settings of your newly created server, select Allow access to Azure services. This is equivalent to running az postgres server firewall-rule create --server-name <your_server_name> --resource-group <your_resource_group> --name AllowAllAzureIps --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0.
- In the Essentials page of your instance, find the full server name and admin username that will be required in your values.yaml.
Create Your Databases #
After your instance is created, you will need to create Pachyderm’s database(s).
If you plan to deploy a standalone cluster (i.e., if you do not plan to register your cluster with a separate enterprise server), you will need to create a second database named “dex” in your PostgreSQL Server instance for Pachyderm’s authentication service. Note that the database must be named dex. This second database is not needed when your cluster is managed by an enterprise server.
Read more about dex on PostgreSQL in Dex’s documentation.
Pachyderm will use the same user to connect to pachyderm as well as to dex.
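One way to create the databases is with the Azure CLI. The sketch below prints, rather than runs, one az postgres db create command per required database, so you can review the commands first. The standalone and enterprise labels and the databases_to_create helper are hypothetical; the angle-bracket placeholders are the ones used in the server creation example above.

```shell
#!/bin/sh
# Print the databases Pachyderm needs, one per line.
# $1 is "standalone" or "enterprise" (hypothetical labels for the two cases above).
databases_to_create() {
  echo "pachyderm"
  if [ "$1" = "standalone" ]; then
    echo "dex"   # only needed without a separate enterprise server
  fi
}

# Print (do not run) one creation command per database.
for db in $(databases_to_create standalone); do
  echo az postgres db create \
    --resource-group "<your_resource_group>" \
    --server-name "<your_server_name>" \
    --name "$db"
done
```

Remove the echo in front of az once the printed commands look right.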
Update your yaml values #
Once your databases have been created, add the following fields to your Helm values:
global:
  postgresql:
    postgresqlUsername: "see admin username above"
    postgresqlPassword: "password"
    # The name of the database
    postgresqlDatabase: "pachyderm"
    # The postgresql database host to connect to.
    postgresqlHost: "see server name above"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"
postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false
5. Deploy Pachyderm #
You have set up your infrastructure, created your data container and a Managed PostgreSQL instance, and granted your cluster access to both: you can now finalize your values.yaml and deploy Pachyderm.
Update Your Values.yaml #
If you have not created a Managed PostgreSQL Server instance, replace the Postgresql section below with postgresql:enabled: true in your values.yaml. This setup is not recommended in production environments.
If you have previously tried to run Pachyderm locally, make sure that you are using the right Kubernetes context first.
-
Verify cluster context:
kubectl config current-context
This command should return the name of your Kubernetes cluster that runs on Azure.
If you have a different context displayed, configure kubectl to use your Azure configuration:
az aks get-credentials --resource-group ${RESOURCE_GROUP} --name ${CLUSTER_NAME}
System Response:
Merged "${CLUSTER_NAME}" as current context in /Users/test-user/.kube/config
-
Update your values.yaml
Update your values.yaml with your container name (see example of values.yaml here) or use our minimal example below.
deployTarget: "MICROSOFT"
pachd:
  storage:
    microsoft:
      # storage container name
      container: "container_name"
      # storage account name
      id: "AKIAIOSFODNN7EXAMPLE"
      # storage account key
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  externalService:
    enabled: true
global:
  postgresql:
    postgresqlUsername: "see admin username above"
    postgresqlPassword: "password"
    # The name of the database
    postgresqlDatabase: "pachyderm"
    # The postgresql database host to connect to.
    postgresqlHost: "see server name above"
    # The postgresql database port to connect to. Defaults to postgres server in subchart
    postgresqlPort: "5432"
postgresql:
  # turns off the install of the bundled postgres.
  # If not using the built in Postgres, you must specify a Postgresql
  # database server to connect to in global.postgresql
  enabled: false
Check the list of all available Helm values at your disposal in our reference documentation or on GitHub.
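If CONTAINER_NAME, STORAGE_ACCOUNT, and STORAGE_KEY are still set in your shell from step 2, you can generate the storage portion of values.yaml instead of copying the values by hand. The sketch below does that with a here-document. storage_values is a hypothetical helper, and the fallback values are placeholders used only when the variables are unset.

```shell
#!/bin/sh
# Placeholder fallbacks; in practice, reuse the variables set in step 2.
CONTAINER_NAME="${CONTAINER_NAME:-pach-container}"
STORAGE_ACCOUNT="${STORAGE_ACCOUNT:-pachstorage01}"
STORAGE_KEY="${STORAGE_KEY:-replace-with-your-storage-key}"

# Print the storage portion of values.yaml using the current variable values.
storage_values() {
  cat <<EOF
deployTarget: "MICROSOFT"
pachd:
  storage:
    microsoft:
      container: "${CONTAINER_NAME}"
      id: "${STORAGE_ACCOUNT}"
      secret: "${STORAGE_KEY}"
  externalService:
    enabled: true
EOF
}

storage_values
```

Redirect the output to a file (for example, storage_values > values.yaml) and then add the global.postgresql and postgresql sections shown above. The resulting file contains your storage key, so keep it out of version control.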
Deploy Pachyderm On The Kubernetes Cluster #
-
Now you can deploy a Pachyderm cluster by running this command:
helm repo add pach https://helm.pachyderm.com
helm repo update
helm install pachd -f values.yaml pach/pachyderm --version <version-of-the-chart>
System Response:
NAME: pachd
LAST DEPLOYED: Mon Jul 12 18:28:59 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
Refer to our generic Helm documentation for more information on how to select your chart version.
Pachyderm pulls containers from DockerHub. It might take some time before the pachd pods start. You can check the status of the deployment by periodically running kubectl get all.
When Pachyderm is up and running, get the information about the pods:
kubectl get pods
Once the pods are up, you should see a pod for pachd running (alongside etcd, pg-bouncer, postgres, or console, depending on your installation).
System Response:
NAME                     READY   STATUS    RESTARTS   AGE
pachd-1971105989-mjn61   1/1     Running   0          54m
...
Note: Sometimes Kubernetes tries to start the pachd pods before the etcd pods are ready, which might result in the pachd pods restarting. You can safely ignore those restarts.
-
Finally, make sure that pachctl talks with your cluster.
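Rather than rerunning kubectl get pods by hand, you can poll until pachd reports Running. The sketch below checks kubectl get pods output for a pachd pod in the Running state. pachd_running is a hypothetical helper, and it assumes the default column layout shown in the system response above.

```shell
#!/bin/sh
# Return 0 if the `kubectl get pods --no-headers` output on standard input
# lists at least one pod named pachd-* whose STATUS column is "Running".
pachd_running() {
  awk '$1 ~ /^pachd-/ && $3 == "Running" { found = 1 } END { exit !found }'
}

# Poll every 10 seconds until pachd is running (uncomment to use):
#   until kubectl get pods --no-headers | pachd_running; do sleep 10; done
```

The check only looks at the STATUS column, so a pod that keeps restarting can still pass it briefly; glance at the RESTARTS column if the deployment seems unstable.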
6. Have ‘pachctl’ And Your Cluster Communicate #
Assuming your pachd is running as shown above, make sure that pachctl can talk to the cluster.
If you are exposing your cluster publicly:
-
Retrieve the external IP address of your TCP load balancer or your domain name:
kubectl get services | grep pachd-lb | awk '{print $4}'
-
Update the context of your cluster with its direct URL, using the external IP address or domain name above:
pachctl connect grpc://<external-IP-address-or-domain-name>:80
-
Check that you are using the right context:
pachctl config get active-context
Your cluster context name should show up.
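The first two steps above can be combined in a script. The sketch below builds the gRPC address from the load balancer's external IP and refuses to continue while Kubernetes still reports the address as pending. pachd_address is a hypothetical helper, and port 80 is taken from the connect example above; adjust it if your service listens elsewhere.

```shell
#!/bin/sh
# Build a pachd address from an external IP (or domain name) in $1 and a port in $2.
# Fails if the load balancer has not been assigned an address yet.
pachd_address() {
  case "$1" in
    ""|"<pending>"|"<none>")
      echo "no external address assigned yet" >&2
      return 1
      ;;
  esac
  echo "grpc://$1:$2"
}

# Usage, with the lookup command from the first step:
#   ip="$(kubectl get services | grep pachd-lb | awk '{print $4}')"
#   addr="$(pachd_address "$ip" 80)" && pachctl connect "$addr"
```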
If you’re not exposing pachd publicly, you can run:
# Background this process because it blocks.
pachctl port-forward
7. Check That Your Cluster Is Up And Running #
If authentication is activated (for example, when you deploy with an enterprise key already set), you need to run pachctl auth login, then authenticate to Pachyderm with your user, before you use pachctl.
pachctl version
System Response:
COMPONENT VERSION
pachctl 2.5.2
pachd 2.5.2
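In the response above, pachctl and pachd report the same version. If you want to check that in a script, the sketch below compares the two rows of the pachctl version output. versions_match is a hypothetical helper, and it assumes the two-column layout shown above.

```shell
#!/bin/sh
# Return 0 if the `pachctl version` output on standard input reports
# the same non-empty version for both pachctl and pachd.
versions_match() {
  awk '$1 == "pachctl" { client = $2 }
       $1 == "pachd"   { server = $2 }
       END { exit !(client != "" && client == server) }'
}

# Usage:
#   pachctl version | versions_match && echo "pachctl and pachd versions match"
```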