On Premises

This document is broken down into the following sections, available at the links below.

Need information on a particular flavor of Kubernetes or object store? Check out the See Also section.

Troubleshooting a deployment? Check out Troubleshooting Deployments.

Introduction

Deploying Pachyderm successfully on-premises requires a few prerequisites and some planning. Pachyderm is built on Kubernetes. Before you can deploy Pachyderm, you or your Kubernetes administrator will need to perform the following actions:

  1. Deploy Kubernetes on-premises.
  2. Deploy a Kubernetes persistent volume that Pachyderm will use to store administrative data.
  3. Deploy an on-premises object store using a storage provider like MinIO, EMC’s ECS, or SwiftStack to provide S3-compatible access to your on-premises storage.
  4. Generate a Kubernetes manifest for the Pachyderm deployment by running the pachctl deploy custom command with appropriate arguments and the --dry-run flag (see the sketch after this list).
  5. Edit the Pachyderm manifest for your particular Kubernetes deployment.
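
The overall shape of steps 4 and 5 looks something like the sketch below. It is illustrative only: the deployment arguments (shown here as a placeholder variable) depend on the persistent volume and object store details covered later in this document, and the manifest file name is arbitrary.

    # Placeholder for the storage and object store arguments described later in
    # this document; it is not a real pachctl flag, just a stand-in for this sketch.
    DEPLOY_ARGS="..."

    # --dry-run writes the generated Kubernetes manifest to stdout instead of deploying.
    pachctl deploy custom $DEPLOY_ARGS --dry-run > pachyderm-manifest.json

    # Edit pachyderm-manifest.json for your cluster, then apply it with kubectl.
    kubectl apply -f pachyderm-manifest.json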

In this series of documents, we’ll take you through the steps unique to Pachyderm. We assume you have some Kubernetes knowledge. We will point you to external resources for the general Kubernetes steps to give you background.

Best practices

Infrastructure as code

We highly encourage you to apply the best practices of software development to managing the deployment process.

  1. Create scripts that automate as much of your process as possible, and keep them under version control.
  2. Keep copies of all artifacts those scripts produce, such as manifests, under version control as well (see the sketch below).
  3. Document your practices, both in the code and outside it.
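
For example, a minimal, hypothetical version-control habit for deployment artifacts might look like this; the script name and paths are placeholders.

    # Regenerate the deployment manifest from a script you maintain,
    # then commit both the script and its output.
    ./generate-pachyderm-manifest.sh > manifests/pachyderm-manifest.json
    git add generate-pachyderm-manifest.sh manifests/pachyderm-manifest.json
    git commit -m "Regenerate Pachyderm deployment manifest"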

Infrastructure in general

Be sure that you design your Kubernetes infrastructure in accordance with recommended guidelines. Don’t mix on-premises Kubernetes and cloud-based storage. It’s important that bandwidth to your storage deployment meet the guidelines of your storage provider.

Prerequisites

Software you will need

  1. kubectl
  2. pachctl

Setting up to deploy on-premises

Deploying Kubernetes

The Kubernetes docs have instructions for deploying Kubernetes in a variety of on-premises scenarios. We recommend following one of these guides to get Kubernetes running on-premises.

Deploying a persistent volume

Persistent volumes: how do they work?

A Kubernetes persistent volume is used by Pachyderm’s etcd to store system metadata. In Kubernetes, persistent volumes are a mechanism for providing storage for consumption by the users of the cluster, and they are provisioned by the cluster administrators. In a typical enterprise Kubernetes deployment, the administrators have configured persistent volumes that your Pachyderm deployment will consume by means of a persistent volume claim in the Pachyderm manifest you generate.

You can configure Pachyderm to use PVs via command-line arguments in three ways: using a static PV, with StatefulSets, or with StatefulSets using a StorageClass.

If your administrators are using selectors, or you want to use StorageClasses in a different way, you’ll need to edit the Pachyderm manifest appropriately before applying it.
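
Whichever of these options applies, it helps to look at what storage your cluster already exposes before you generate the manifest. These read-only commands only require that kubectl is configured for your cluster:

    # List any StorageClasses the administrators have defined.
    kubectl get storageclass

    # List existing persistent volumes, with capacity, access modes, and status.
    kubectl get pv

    # Show the full configuration of the persistent volumes.
    kubectl describe pv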

Static PV

In this case, etcd will be deployed in Pachyderm as a ReplicationController with one (1) pod that uses a static PV. This is a common deployment for testing.

StatefulSets

StatefulSets are a mechanism, available in Kubernetes 1.9 and newer, for managing the deployment and scaling of applications. They use either Persistent Volume Provisioning or pre-provisioned PVs.

If you’re using StatefulSets in your Kubernetes cluster, you will need to find out the particulars of your cluster’s PV configuration and use appropriate flags to pachctl deploy custom.

StorageClasses

If your administrators require specification of classes to consume persistent volumes, you will need to find out the particulars of your cluster’s PV configuration and use appropriate flags to pachctl deploy custom.

Tasks common to all types of PV deployments

Sizing the PV

You’ll need to use a PV with enough space for the metadata associated with the data you plan to store in Pachyderm. We’re currently developing good rules of thumb for scaling this storage as your Pachyderm deployment grows, but 10 GB of disk space appears to be sufficient for most purposes.

Creating the PV

In the case of cloud-based deployments, the pachctl deploy command for AWS, GCP, and Azure creates persistent volumes for you when you follow the instructions for those infrastructures.

In the case of on-premises deployments, the kind of PV you provision depends on what kind of storage your Kubernetes administrators have attached to your cluster and configured, and on whether you are expected to consume that storage as a static PV, with Persistent Volume Provisioning, or as a StorageClass.

For example, many on-premises deployments use Network File System (NFS) to access some kind of enterprise storage. Like everything else in Kubernetes, persistent volumes are provisioned by means of a manifest. You can learn about creating volumes and persistent volumes in the Kubernetes documentation.
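
As a rough illustration, a static, NFS-backed PV manifest might look like the sketch below. The names, server, path, and size are placeholders; your administrators’ actual manifests will reflect your site’s storage. You would save this to a file and create it with kubectl apply -f <file>.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: etcd-volume          # placeholder name; see PVC_STORAGE_NAME below
    spec:
      capacity:
        storage: 10Gi            # see "Sizing the PV" above
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      nfs:
        server: nfs.example.com  # placeholder NFS server
        path: /exports/pachyderm-etcd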

You or your Kubernetes administrators will be responsible for configuring the PVs you create to be consumable as static PVs, with Persistent Volume Provisioning, or as a StorageClass.

What you’ll need for Pachyderm configuration of PV storage

Keep the information below at hand for when you run pachctl deploy custom further on.

Configuring with static volumes

You’ll need the name of the PV and the amount of space you can use, in gigabytes. We’ll refer to those, respectively, as PVC_STORAGE_NAME and PVC_STORAGE_SIZE further on. With this kind of PV, you’ll use the flag --static-etcd-volume with PVC_STORAGE_NAME as its argument in your deployment.

Note: This will override any attempt to configure with StorageClasses, described below.

Configuring with StatefulSets

If you’re deploying using StatefulSets, you’ll just need the amount of space you can use, in gigabytes, which we’ll refer to as PVC_STORAGE_SIZE further on.

Note: The --etcd-storage-class flag and argument will be ignored if you use the flag --static-etcd-volume along with it.

Configuring with StatefulSets using StorageClasses

If you’re deploying using StatefulSets with StorageClasses, you’ll need the name of the storage class and the amount of space you can use, in gigabytes. We’ll refer to those, respectively, as PVC_STORAGECLASS and PVC_STORAGE_SIZE further on. With this kind of PV, you’ll use the flag --etcd-storage-class with PVC_STORAGECLASS as its argument in your deployment.

Note: The --etcd-storage-class flag and argument will be ignored if you use the flag --static-etcd-volume along with it.
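
Whichever variant applies, it can help to capture the values in shell variables so the deploy command further on stays readable. The variable names mirror the placeholders used in this document; the values are examples only.

    # Static PV variant:
    PVC_STORAGE_NAME=etcd-volume      # name of the pre-created PV
    PVC_STORAGE_SIZE=10               # space you can use, in gigabytes

    # StatefulSets-with-StorageClass variant (instead of PVC_STORAGE_NAME):
    # PVC_STORAGECLASS=standard       # name of the StorageClass to request

    # Plain StatefulSets only need PVC_STORAGE_SIZE.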

Deploying an object store

Object store: what’s it for?

An object store is used by Pachyderm’s pachd for storing all your data. The object store you use must be accessible via a low-latency, high-bandwidth connection like Gigabit or 10G Ethernet.

For an on-premises deployment, it’s not advisable to use a cloud-based storage mechanism. Don’t deploy an on-premises Pachyderm cluster against cloud-based object stores such as S3 from AWS, GCS from Google Cloud Platform, or Azure Blob Storage from Azure.

Object store prerequisites

Object stores are accessible using the S3 protocol, created by Amazon. Storage providers like MinIO, EMC’s ECS, or SwiftStack provide S3-compatible access to enterprise storage for on-premises deployment. You can find links to instructions for providers of particular object stores in the See also section.

Sizing the object store

Size your object store generously. Once you start using Pachyderm, you’ll start versioning all your data. We’re currently developing good rules of thumb for scaling your object store as your Pachyderm deployment grows, but it’s a good idea to start with a large multiple of your current data set size.

What you’ll need for Pachyderm configuration of the object store

You’ll need four items to configure the object store. We’re prefixing each item with how we’ll refer to it further on.

  1. OS_ENDPOINT: The access endpoint. For example, MinIO’s endpoints are usually something like minio-server:9000. Don’t begin it with the protocol; it’s an endpoint, not a URL.
  2. OS_BUCKET_NAME: The bucket name you’re dedicating to Pachyderm. Pachyderm will need exclusive access to this bucket.
  3. OS_ACCESS_KEY_ID: The access key ID for the object store. This is like a user name for logging into the object store.
  4. OS_SECRET_KEY: The secret key for the object store. This is like the above user’s password.

Keep this information handy.
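
As with the PV settings, you can capture these values in shell variables and, if you like, sanity-check them with any S3-compatible client before deploying. The values below and the use of the AWS CLI are illustrative assumptions; MinIO’s mc client or another S3 tool works just as well.

    # Example placeholder values for the four items above.
    OS_ENDPOINT=minio-server:9000
    OS_BUCKET_NAME=pachyderm-bucket
    OS_ACCESS_KEY_ID=AKIAEXAMPLE
    OS_SECRET_KEY=secret-example

    # Optional sanity check: list the bucket through the S3-compatible endpoint.
    AWS_ACCESS_KEY_ID=$OS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=$OS_SECRET_KEY \
        aws --endpoint-url http://$OS_ENDPOINT s3 ls s3://$OS_BUCKET_NAME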

Next step: creating a custom deploy manifest for Pachyderm

Once you have Kubernetes deployed, your persistent volume created, and your object store configured, it’s time to create the Pachyderm manifest for deploying to Kubernetes.
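
A hedged sketch of what generating that manifest can look like, for an on-premises deployment with a static etcd PV and an S3-compatible object store, is shown below. The --persistent-disk and --object-store backend selectors and the positional argument order are assumptions based on typical pachctl deploy custom usage; confirm them against pachctl deploy custom --help and the custom deployment documentation before running anything.

    # Assumes the PVC_* and OS_* variables from the previous sections are set.
    # The flags --static-etcd-volume and --dry-run are described in this document;
    # the rest of the invocation is an assumption to verify for your pachctl version.
    pachctl deploy custom --persistent-disk aws --object-store s3 \
        "$PVC_STORAGE_NAME" "$PVC_STORAGE_SIZE" \
        "$OS_BUCKET_NAME" "$OS_ACCESS_KEY_ID" "$OS_SECRET_KEY" "$OS_ENDPOINT" \
        --static-etcd-volume="$PVC_STORAGE_NAME" \
        --dry-run > pachyderm-manifest.json

    # Review and edit the manifest for your cluster, then apply it.
    kubectl apply -f pachyderm-manifest.json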

See Also

Kubernetes variants

Object storage variants