Backup & Restore Your Cluster
Learn how to back up and restore the state of a production cluster.
May 26, 2023
This page will walk you through the main steps required to manually back up and restore the state of a Pachyderm cluster in production.
Details on how to perform those steps might vary depending on your infrastructure and cloud provider / on-premises setup.
Refer to your provider’s documentation.
Overview #
Pachyderm state is stored in two main places (see our high-level architecture diagram):
- an object-store holding Pachyderm’s data.
- a PostgreSQL instance made up of one or two databases:
  - `pachyderm`, holding Pachyderm’s metadata
  - `dex`, holding authentication data
Backing up a Pachyderm cluster involves snapshotting both the object store and the PostgreSQL database(s) (see above), in a consistent state, at a given point in time.
Restoring it involves re-populating the database(s) and the object store using those backups, then recreating a Pachyderm cluster.
- Make sure that you have a bucket for backup use, separate from the object store used by your cluster.
- Depending on the reasons behind your cluster recovery, you might choose to use an existing or a new instance of PostgreSQL and/or the object store.
Manual Back Up Of A Pachyderm Cluster #
Before any manual backup:
- Make sure to retain a copy of the Helm values used to deploy your cluster.
- Then, suspend any state-mutating operations.
- Backups incur downtime until operations are resumed.
- Operational best practices include notifying Pachyderm users of the outage and providing an estimated time when downtime will cease.
- Downtime duration is a function of the size of the data to be backed up and the networks involved. Testing before going into production and monitoring backup times on an ongoing basis might help make accurate predictions.
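If you no longer have the original values file at hand, the values supplied at install time can usually be read back from the Helm release itself. The following is a minimal sketch, assuming a release named `pachyderm` in the `default` namespace; both names are assumptions, so adjust them to your deployment.

```shell
# Hypothetical defaults: adjust the release name and namespace to your deployment.
RELEASE="${RELEASE:-pachyderm}"
NAMESPACE="${NAMESPACE:-default}"

# Write the user-supplied values of the release to the file given as first argument.
save_helm_values() {
  helm get values "$RELEASE" --namespace "$NAMESPACE" --output yaml > "$1"
}
```

For example, `save_helm_values pachyderm-values.yaml` writes the values to a local file that you can store alongside your backups.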
Suspend Operations #
Pause any external automated process ingressing data to Pachyderm input repos, or queue/divert those as they will fail to connect to the cluster while the backup occurs.
Suspend all mutation of state by scaling `pachd` and the worker pods down.
Before starting, make sure that your context points to the server you want to pause by running `pachctl config get active-context`. Find more information on how to set your context in our deployment section.
To pause Pachyderm:
- If you are an Enterprise user, run the `pachctl enterprise pause` command.
- Alternatively, you can use `kubectl`. Before starting, make sure that `kubectl` points to the right cluster. Run `kubectl config get-contexts` to list all available clusters and contexts (the current context is marked with a `*`), then `kubectl config use-context <your-context-name>` to set the proper active context. Then scale down `pachd` and the workers:
    kubectl scale deployment pachd --replicas 0
    kubectl scale rc --replicas 0 -l suite=pachyderm,component=worker
Note that it takes some time for scaling down to take effect. Run the `watch` command to monitor the state of `pachd` and worker pods terminating:
    watch -n 5 kubectl get pods
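The `watch` command needs someone to look at the terminal. When the backup is scripted, a polling loop can block until the worker pods are gone. This is only a sketch: the function names are hypothetical, and the label selector is the one used by the scale command above.

```shell
# Count the worker pods that still exist (same label selector as the scale command).
worker_pod_count() {
  kubectl get pods -l suite=pachyderm,component=worker --no-headers 2>/dev/null |
    wc -l | tr -d ' '
}

# Poll every 5 seconds until no worker pod is left.
# Gives up and returns 1 after the number of seconds given as first argument (default 300).
wait_for_workers_gone() {
  timeout="${1:-300}"
  elapsed=0
  while [ "$(worker_pod_count)" -gt 0 ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "worker pods still present after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
}
```

Calling `wait_for_workers_gone 600` right after the scale commands lets a script proceed to the backup only once the workers have terminated.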
Back Up The Databases And The Object Store #
This step is specific to your database and object store hosting.
If your PostgreSQL instance is solely dedicated to Pachyderm, you can use PostgreSQL’s tools, like `pg_dumpall`, to dump your entire PostgreSQL state. Alternatively, you can use targeted `pg_dump` commands to dump the `pachyderm` and `dex` databases, or use your cloud provider’s backup product.
In any case, make sure to use TLS.
A production setting of Pachyderm implies that you are running a managed PostgreSQL instance.
For on-premises Kubernetes deployments, check the vendor documentation for your on-premises PostgreSQL for details on backing up and restoring your databases.
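As an illustration of the targeted `pg_dump` route, the sketch below dumps the `pachyderm` and `dex` databases over a TLS connection. The host, user, and function names are assumptions made for the example; the password is expected to come from libpq's usual mechanisms (for instance the `PGPASSWORD` environment variable or a `~/.pgpass` file).

```shell
# Hypothetical connection settings: replace with your instance's values.
DB_HOST="${DB_HOST:-postgres.example.internal}"
DB_USER="${DB_USER:-pachyderm}"

# Dump one database ($1) into a custom-format archive inside directory $2.
# sslmode=require enforces an encrypted connection; verify-full is stricter
# because it also validates the server certificate.
dump_database() {
  db="$1"
  out_dir="$2"
  pg_dump --dbname="host=$DB_HOST user=$DB_USER dbname=$db sslmode=require" \
    --format=custom --file="$out_dir/$db.dump"
}

# Dump both Pachyderm databases into the directory given as first argument.
dump_pachyderm_databases() {
  dump_database pachyderm "$1" && dump_database dex "$1"
}
```

For example, `dump_pachyderm_databases /backups/2023-05-26` produces `pachyderm.dump` and `dex.dump` in that directory. Skip the `dex` dump if your deployment has no such database.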
- To back up the object store, you can either download all objects or use the object store provider’s backup method.
The latter is preferable since it will typically not incur egress costs.
For on-premises Kubernetes deployments, check the vendor documentation for your on-premises object store for details on backing up and restoring a bucket.
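What a provider-side copy looks like depends on the provider. As one possible example, assuming an S3 bucket and the AWS CLI, a bucket-to-bucket `aws s3 sync` copies objects without downloading them locally. The bucket names and the function name below are hypothetical.

```shell
# Hypothetical bucket names: the backup bucket must be separate from the cluster's bucket.
SOURCE_BUCKET="${SOURCE_BUCKET:-s3://my-pachyderm-bucket}"
BACKUP_BUCKET="${BACKUP_BUCKET:-s3://my-pachyderm-backup}"

# Copy every object of the cluster's bucket under a prefix ($1) of the backup bucket.
backup_object_store() {
  aws s3 sync "$SOURCE_BUCKET" "$BACKUP_BUCKET/$1"
}
```

For example, `backup_object_store "$(date +%Y-%m-%d)"` keeps each backup under its own dated prefix. Other providers offer comparable tools; refer to their documentation.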
Resume Operations #
Once your backup is completed, resume your normal operations by scaling `pachd` back up. It will take care of restoring the worker pods:
- Enterprise users: run `pachctl enterprise unpause`.
- Alternatively, if you used `kubectl`:
    kubectl scale deployment pachd --replicas 1
Restore Pachyderm #
There are two primary use cases for restoring a cluster:
- Your data have been corrupted, preventing your cluster from functioning correctly. You want the same version of Pachyderm re-installed on the latest uncorrupted data set.
- You have upgraded a cluster and are encountering problems. You decide to uninstall the current version and restore the latest backup of a previous version of Pachyderm.
Depending on your scenario, pick all or a subset of the following steps:
- Populate new `pachyderm` and `dex` (if required) databases on your PostgreSQL instance
- Populate a new bucket or use the backed-up object-store (note that, in that case, it will no longer be a backup)
- Create a new empty Kubernetes cluster and give it access to your databases and bucket
- Deploy Pachyderm into your new cluster
Find the detailed installation instructions for your PostgreSQL instance, bucket, Kubernetes cluster, permissions setup, and Pachyderm deployment for each Cloud Provider in the Deploy section of our Documentation.
Restore The Databases And Objects #
- Restore PostgreSQL backups into your new databases using the appropriate method (this is most straightforward when using a cloud provider).
- Copy the objects from the backed-up object store to your new bucket or re-use your backup.
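If the backups were taken with `pg_dump` in custom format, restoring them could look like the sketch below. It assumes the target databases already exist and are empty; the host, user, and function names are illustrative, not values from your deployment.

```shell
# Hypothetical connection settings for the NEW PostgreSQL instance.
DB_HOST="${DB_HOST:-postgres-new.example.internal}"
DB_USER="${DB_USER:-pachyderm}"

# Restore the archive $2/$1.dump into the existing, empty database $1 over TLS.
# --exit-on-error stops at the first failure instead of continuing with a partial restore.
restore_database() {
  db="$1"
  dump_dir="$2"
  pg_restore --dbname="host=$DB_HOST user=$DB_USER dbname=$db sslmode=require" \
    --exit-on-error "$dump_dir/$db.dump"
}
```

For example, run `restore_database pachyderm /backups/2023-05-26`, then `restore_database dex /backups/2023-05-26` if your deployment uses a `dex` database.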
Deploy Pachyderm Into The New Cluster #
Finally, update the copy of your original Helm values to point Pachyderm to the new databases and the new object store, then use Helm to install Pachyderm into the new cluster.
The values needing an update and the deployment instructions are detailed in Chapter 6 of each of our cloud installation pages. For example, in the case of GCP, check the Deploy Pachyderm chapter.
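The install itself can be sketched as follows. The chart repository URL, release name, and namespace are assumptions made for the example, so check them against the installation page for your provider. Pinning the chart version to the one you backed up matters when you want the same version of Pachyderm re-installed.

```shell
# Assumed values: verify the repository URL, release name, and namespace for your setup.
CHART_REPO_URL="${CHART_REPO_URL:-https://helm.pachyderm.com}"
RELEASE="${RELEASE:-pachyderm}"
NAMESPACE="${NAMESPACE:-default}"

# Install Pachyderm from the updated values file ($1), pinned to chart version $2.
deploy_pachyderm() {
  values_file="$1"
  chart_version="$2"
  helm repo add pachyderm "$CHART_REPO_URL" &&
    helm repo update &&
    helm install "$RELEASE" pachyderm/pachyderm \
      --namespace "$NAMESPACE" \
      --values "$values_file" \
      --version "$chart_version"
}
```

For example, `deploy_pachyderm pachyderm-values.yaml <chart-version>`, where `<chart-version>` is the version your backed-up cluster was running.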
Connect ‘pachctl’ To Your Restored Cluster #
…and check that your cluster is up and running.
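A quick smoke test, once `pachctl` points to the restored cluster, might look like the sketch below. The function name is hypothetical, and listing repos is only one possible sanity check; you may prefer to inspect the data that matters most to you.

```shell
# Print the active context, then query the cluster; stop at the first failing command.
check_restored_cluster() {
  pachctl config get active-context &&
    pachctl version &&
    pachctl list repo
}
```

If `pachctl version` reports both a client and a server version, `pachctl` can reach `pachd`. The repo list should then match what existed at backup time.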
Backup/Restore A Stand-Alone Enterprise Server #
Backing up / restoring an Enterprise Server is similar to the back up / restore of a regular cluster (see above), with three slight variations:
- The name of its Kubernetes deployment is `pach-enterprise` versus `pachd` in the case of a regular cluster.
- The Enterprise Server does not use an Object Store.
- An Enterprise server only requires a `dex` database.
Backup A Standalone Enterprise Server #
Make sure that `pachctl` and `kubectl` are pointing to the right cluster. Check your Enterprise Server context with `pachctl config get active-enterprise-context`, or run `pachctl config set active-enterprise-context <my-enterprise-context-name> --overwrite` to set it.
- Pause the Enterprise Server like you would pause a regular cluster by running `pachctl enterprise pause` (Enterprise users), or using `kubectl`.
kubectl users:
There is a difference with the pause of a regular cluster. The deployment of the enterprise server is named `pach-enterprise`; therefore, the first command should be:
    kubectl scale deployment pach-enterprise --replicas 0
There is no need to pause all the Pachyderm clusters registered to the Enterprise Server to back up the Enterprise Server; however, pausing the Enterprise Server will result in your clusters becoming unavailable.
As a reminder, the Enterprise Server does not use any object-store. Therefore, the backup of the Enterprise Server only consists in backing up the database `dex`.
- Resume the operations on your Enterprise Server by running `pachctl enterprise unpause` (Enterprise users) to scale the `pach-enterprise` deployment back up. Alternatively, if you used `kubectl`, run:
    kubectl scale deployment pach-enterprise --replicas 1
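Put together, the `kubectl` variant of an Enterprise Server backup can be sketched as a single function: pause, dump `dex`, resume. The connection settings and the function name are assumptions. The sketch scales the deployment back up even when the dump fails, so the outage does not outlast the attempt.

```shell
# Hypothetical connection settings: replace with your instance's values.
DB_HOST="${DB_HOST:-postgres.example.internal}"
DB_USER="${DB_USER:-pachyderm}"

# Pause the Enterprise Server, dump the dex database into directory $1, then resume.
backup_enterprise_server() {
  out_dir="$1"
  kubectl scale deployment pach-enterprise --replicas 0 || return 1
  # NOTE: in a real run, wait here until the pod has terminated
  # (for example by checking `kubectl get pods`) before dumping.
  pg_dump --dbname="host=$DB_HOST user=$DB_USER dbname=dex sslmode=require" \
    --format=custom --file="$out_dir/dex.dump"
  dump_status=$?
  # Resume regardless of the dump result, then report the dump's exit status.
  kubectl scale deployment pach-enterprise --replicas 1 || return 1
  return "$dump_status"
}
```

For example, `backup_enterprise_server /backups/enterprise` returns a non-zero status if the dump failed, which a calling script can use to raise an alert.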
Restore An Enterprise Server #
Follow the steps above while skipping all tasks related to creating and populating a new object-store.
Once your cluster is up and running, check that all your clusters are automatically registered with your new Enterprise Server.
Additional Info #
If you have additional questions about backup / restore, post them in the community #help channel on Slack, or reach out to your TAM if you are an Enterprise customer.