Backups
This page walks you through the main steps for manually backing up the state of a production Pachyderm cluster. The details of each step vary with your infrastructure and setup; refer to your provider’s documentation where applicable.
Before You Start
- Make sure to retain a copy of the Helm values used to deploy your cluster
- Suspend any state-mutating operations
- Make sure that you have a bucket for backup use, separate from the object store used by your cluster
Downtime Considerations
- Backups incur downtime until operations are resumed
- Operational best practices include notifying Pachyderm users of the outage and providing an estimated time when downtime will cease
- Downtime duration is dependent on the size of the data to be backed up and the networks involved
- Testing before going into production and monitoring backup times on an ongoing basis might help make accurate predictions
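One way to monitor backup times is to wrap each backup step in a small timing helper that appends its duration to a log. The sketch below is only an illustration: the `timed` function, the `TIMING_LOG` path, and the stand-in `sleep 1` command are hypothetical and not part of Pachyderm.

```shell
#!/bin/sh
# Sketch: time a backup step and append its duration to a log for trend tracking.
set -eu

TIMING_LOG="${TIMING_LOG:-./backup-timings.log}"  # hypothetical log path

# Usage: timed <label> <command> [args...]
# Runs the command and records "<UTC timestamp> <label> <seconds>s".
# With `set -e`, a failing command aborts the script before anything is logged.
timed() {
  label="$1"
  shift
  start="$(date +%s)"
  "$@"
  end="$(date +%s)"
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $label $((end - start))s" >> "$TIMING_LOG"
}

# Example with a stand-in command; replace `sleep 1` with a real backup step.
timed demo-step sleep 1
```

Comparing the logged durations across test runs and production backups gives a rough basis for the downtime estimate you communicate to users.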
How to Create a Backup
Pachyderm state is stored in two main places:
- An object store holding Pachyderm’s data.
- A PostgreSQL instance made up of one or two databases:
  - `pachyderm`, holding Pachyderm’s metadata
  - `dex`, holding authentication data
Backing up a Pachyderm cluster involves snapshotting both the object store and the PostgreSQL database(s), in a consistent state, at a given point in time. Restoring a cluster involves re-populating the database(s) and the object store using those backups, then recreating a Pachyderm cluster.
- Review any cloud-specific backup and restore procedures for your PostgreSQL instance.
- Retain a copy of the Helm values file used to deploy your cluster:
    helm get values <release-name> > /path/to/values.yaml
- Pause, queue, or divert any external automated process that ingests data into Pachyderm input repos.
- Suspend all mutation of state by scaling `pachd` and the worker pods down.
- Dump your PostgreSQL state using `pg_dumpall` if the instance is used solely by Pachyderm, or `pg_dump` if it is shared with other applications.
- Back up your object store. Refer to your cloud provider’s documentation for details.
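The steps above could be sketched as a script along the following lines, here for a CE cluster. This is a minimal sketch under stated assumptions, not an official procedure: the release name, backup directory, PostgreSQL host and user, and the `PG_SHARED` switch are hypothetical placeholders, and the worker label selector is an assumption you should verify against your own cluster. The script defaults to a dry run that only prints each step. Pausing external ingestion and backing up the object store are not covered because they depend on your environment.

```shell
#!/bin/sh
# Sketch of the manual backup steps for a CE cluster.
# Defaults to a dry run that only prints each step; set DRY_RUN=0 to execute.
set -eu

# All values below are hypothetical placeholders; adjust them to your setup.
RELEASE="${RELEASE:-pachyderm}"            # Helm release name
BACKUP_DIR="${BACKUP_DIR:-./pach-backup}"  # local staging directory
PG_HOST="${PG_HOST:-localhost}"            # PostgreSQL host
PG_USER="${PG_USER:-postgres}"             # PostgreSQL user
PG_SHARED="${PG_SHARED:-0}"                # 1 if other applications share the instance
DRY_RUN="${DRY_RUN:-1}"

# Print the step in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Helm prints the values to stdout, so the redirect lives in its own function.
save_helm_values() {
  helm get values "$RELEASE" > "$BACKUP_DIR/values.yaml"
}

backup_cluster() {
  run mkdir -p "$BACKUP_DIR"
  run save_helm_values

  # Stop state mutation: scale pachd and the workers to zero.
  # The worker label selector is an assumption; check `kubectl get rc --show-labels`.
  run kubectl scale deployment pachd --replicas 0
  run kubectl scale rc --replicas 0 -l suite=pachyderm,component=worker

  # Dump PostgreSQL: everything if the instance is dedicated to Pachyderm,
  # otherwise only the pachyderm and dex databases.
  if [ "$PG_SHARED" = "1" ]; then
    for db in pachyderm dex; do
      run pg_dump -h "$PG_HOST" -U "$PG_USER" -d "$db" -f "$BACKUP_DIR/$db.sql"
    done
  else
    run pg_dumpall -h "$PG_HOST" -U "$PG_USER" -f "$BACKUP_DIR/all.sql"
  fi

  # Object store backup is provider-specific and intentionally left out.
}

backup_cluster
```

Reviewing the dry-run output before setting `DRY_RUN=0` makes it easier to catch a wrong release name or database host before any pods are scaled down.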
How to Resume Operations
Once your backup is completed, resume your normal operations by scaling `pachd` back up. It will take care of restoring the worker pods:
- Enterprise: `pachctl enterprise unpause`
- CE: `kubectl scale deployment pachd --replicas 1`
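For a CE cluster, the resume step might be followed by a readiness check, so that you announce the end of the downtime only once `pachd` is serving again. The sketch below assumes the deployment is named `pachd` in the current namespace; the `ROLLOUT_TIMEOUT` value and the `run` helper are hypothetical, and the script defaults to a dry run. On an Enterprise cluster, `pachctl enterprise unpause` would replace the scale command.

```shell
#!/bin/sh
# Sketch: bring a CE cluster back up and wait until pachd is ready.
# Defaults to a dry run that only prints each step; set DRY_RUN=0 to execute.
set -eu

DRY_RUN="${DRY_RUN:-1}"
ROLLOUT_TIMEOUT="${ROLLOUT_TIMEOUT:-300s}"  # hypothetical wait limit

# Print the step in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

resume_cluster() {
  # Scale pachd back up; pachd then restores the worker pods.
  run kubectl scale deployment pachd --replicas 1
  # Block until the pachd deployment reports its replicas as ready.
  run kubectl rollout status deployment/pachd --timeout="$ROLLOUT_TIMEOUT"
  # Confirm that the client can reach pachd again.
  run pachctl version
}

resume_cluster
```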