General Troubleshooting

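Here are some common issues, listed by symptom, along with steps to resolve them. They are organized into the following categories: deploying a Pachyderm cluster, connecting to a Pachyderm cluster, and problems running pipelines.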

Deploying a Pachyderm Cluster

Pod stuck in CrashLoopBackOff


The pachd pod keeps crashing/restarting:

$ kubectl get all
NAME                        READY     STATUS             RESTARTS   AGE
po/etcd-281005231-qlkzw     1/1       Running            0          7m
po/pachd-1333950811-0sm1p   0/1       CrashLoopBackOff   6          7m

NAME             CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
svc/etcd    <nodes>       2379:30938/TCP                7m
svc/kubernetes       <none>        443/TCP                       9m
svc/pachd   <nodes>       650:30650/TCP,651:30651/TCP   7m

deploy/etcd    1         1         1            1           7m
deploy/pachd   1         1         1            0           7m

NAME                  DESIRED   CURRENT   READY     AGE
rs/etcd-281005231     1         1         1         7m
rs/pachd-1333950811   1         1         0         7m


First describe the pod:

$ kubectl describe po/pachd-1333950811-0sm1p

If you see an error that includes Error attaching EBS volume or similar, see the section on errors attaching a volume below. If you don’t see that error, but do see something like:

  1m    3s    9    {kubelet}                Warning    FailedSync    Error syncing pod, skipping: failed to "StartContainer" for "pachd" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=pachd pod=pachd-1333950811-0sm1p_default(a92b6665-506a-11e7-8e07-02e3d74c49ac)"

That means Kubernetes tried running pachd, but pachd generated an internal error. To see the specifics of this internal error, check the logs for the pachd pod:

$ kubectl logs po/pachd-1333950811-0sm1p

Note: If you’re using a log aggregator service (e.g. the default in GKE), kubectl logs ... won’t show any logs when used in this way. You will need to look at your logs UI instead (e.g. in GKE’s case, the Stackdriver console).

These logs will likely reveal a misconfiguration in your deployment. For example, you might see BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region. In that case, you’ve deployed your bucket in a different region than your cluster.

If the error or the fix isn’t obvious from the error message, share the contents of the pachd logs when asking for help in our Slack channel or when opening a GitHub Issue. Please include these logs either way, as they are extremely helpful in resolving the issue.
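
When asking for help, it can save a round trip to capture the pod description and the logs in one file. The following is a minimal sketch, not an official Pachyderm tool: the function name collect_pachd_diagnostics and the default output file name are made up for illustration, and it assumes kubectl is already pointed at the right cluster. The --previous flag asks for the logs of the last terminated container, which is often the useful one for a pod that keeps restarting.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pachyderm): gather the output of
# "kubectl describe" and "kubectl logs" for one pod into a single file.
collect_pachd_diagnostics() {
    pod="$1"                             # e.g. pachd-1333950811-0sm1p
    out="${2:-pachd-diagnostics.txt}"    # default file name is an assumption
    {
        echo "=== kubectl describe po/$pod ==="
        kubectl describe "po/$pod"
        echo "=== kubectl logs po/$pod ==="
        kubectl logs "po/$pod"
        echo "=== kubectl logs po/$pod --previous ==="
        # --previous fails if the container has never restarted; ignore that.
        kubectl logs "po/$pod" --previous || true
    } > "$out" 2>&1
    # Print the path of the file that was written.
    echo "$out"
}

# Example:
#   collect_pachd_diagnostics pachd-1333950811-0sm1p
```

The resulting file can be attached to a Slack message or a GitHub Issue. Review it first, since logs may contain bucket names or other details you don’t want to share publicly.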

Pod stuck in CrashLoopBackOff - with error attaching volume


A pod (either the pachd pod or a worker pod) fails to start up and is stuck in CrashLoopBackOff. If you execute kubectl describe po/pachd-xxxx, you’ll see an error message like the following at the bottom of the output:

  30s        30s        1    {attachdetach }                Warning        FailedMount    Failed to attach volume "etcd-volume" on node "" with: Error attaching EBS volume "vol-0c1d403ac05096dfe" to instance "i-0a12e00c0f3fb047d": VolumeInUse: vol-0c1d403ac05096dfe is already attached to an instance


Your best bet is to manually detach the volume and restart the pod.

For example, to resolve this issue when Pachyderm is deployed to AWS, first find the node on which the pod is scheduled. In the output of the kubectl describe po/pachd-xxx command above, you should see the name of the node on which the pod is running. In the AWS web console, find that node. Once you have the right node, look in the bottom pane for the attached volume. Follow the link to the attached volume, and detach the volume. You may need to “Force Detach” it.

Once it’s detached (and marked as available), restart the pod by killing it, e.g.:

$ kubectl delete po/pachd-xxx

It will take a moment for a new pod to get scheduled.
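
If you prefer the command line to the AWS web console, the same steps can be sketched with the AWS CLI. This is an illustration under assumptions: it assumes the aws CLI is installed and configured for the account and region that own the volume, and both helper names are hypothetical. The first helper pulls the volume ID out of a FailedMount message like the one shown above; the second force-detaches the volume and waits until AWS reports it as available.

```shell
#!/bin/sh
# Hypothetical helper: print the first EBS volume ID (vol-...) found in
# the text read from stdin, e.g. the output of "kubectl describe".
extract_volume_id() {
    grep -o 'vol-[0-9a-f]*' | head -n 1
}

# Hypothetical helper: force-detach a volume, then block until it is
# marked as available. Assumes a configured AWS CLI.
force_detach_volume() {
    vol="$1"
    aws ec2 detach-volume --volume-id "$vol" --force
    aws ec2 wait volume-available --volume-ids "$vol"
}

# Example:
#   vol=$(kubectl describe po/pachd-xxx | extract_volume_id)
#   force_detach_volume "$vol"
#   kubectl delete po/pachd-xxx
```

Force-detaching a volume that is still in use can corrupt data, so confirm that the volume is attached to the wrong or a stale instance before running it.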

Connecting to a Pachyderm Cluster

Cannot connect via pachctl - context deadline exceeded


You may be using the environment variable PACHD_ADDRESS to specify how pachctl talks to your Pachyderm cluster, or you may be forwarding the pachyderm port. In any event, you might see something similar to:

$ pachctl version
COMPONENT           VERSION                                          
pachctl             1.4.8   
context deadline exceeded


It’s possible that the connection is just taking a while. Occasionally this can happen if your cluster is far away (deployed in a region across the country). Check your internet connection.

It’s also possible that you haven’t poked a hole in the firewall to access the node on this port. Usually to do that you adjust a security rule (in AWS parlance a security group). For example, on AWS, if you find your node in the web console and click on it, you should see a link to the associated security group. Inspect that group. There should be a way to “add a rule” to the group. You’ll want to enable TCP access (ingress) on port 30650. You’ll usually be asked which incoming IPs should be whitelisted. You can choose to use your own, or enable it for everyone (

Certificate Error When Using Kubectl


This can happen on any request using kubectl (e.g. kubectl get all). In particular you’ll see:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T20:41:24Z", GoVersion:"go1.8.1", Compiler:"gc", Platform:"darwin/amd64"}
Unable to connect to the server: x509: certificate signed by unknown authority


Check if you’re on any sort of VPN or other egress proxy that would break SSL. Also, there is a possibility that your credentials have expired. In the case where you’re using GKE and gcloud, renew your credentials via:

$ kubectl get all
Unable to connect to the server: x509: certificate signed by unknown authority
$ gcloud container clusters get-credentials my-cluster-name-dev
Fetching cluster endpoint and auth data.
kubeconfig entry generated for my-cluster-name-dev.
$ kubectl config current-context

Uploads/Downloads are Slow


Any pachctl put-file or pachctl get-file commands are slow.


If you do not explicitly set the PACHD_ADDRESS environment variable, pachctl will default to using port forwarding, which throttles traffic to ~1MB/s. If you need to do large downloads/uploads you should consider using the PACHD_ADDRESS variable instead to connect directly to your k8s master node. You’ll also want to make sure you’ve allowed ingress access through any firewalls to your k8s cluster.
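
As a rough sketch of what setting PACHD_ADDRESS looks like: the value is a host:port pair, and 30650 is the pachd node port that appears in the kubectl get all output earlier in this guide. The helper name and the example IP below are placeholders, and which address is reachable from your machine depends on your cluster’s networking.

```shell
#!/bin/sh
# Hypothetical helper: build a PACHD_ADDRESS value from a host or IP.
# 30650 is the pachd node port shown earlier in this guide; pass a second
# argument if your deployment exposes a different port.
pachd_address() {
    host="$1"
    port="${2:-30650}"
    echo "${host}:${port}"
}

# Example (203.0.113.10 is a placeholder, replace it with your node's IP):
#   export PACHD_ADDRESS="$(pachd_address 203.0.113.10)"
#   pachctl version
```

If pachctl version still reports context deadline exceeded after this, the port is most likely blocked by a firewall or security group.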

Problems Running Pipelines

All your pods / jobs get evicted



$ kubectl get all

shows a bunch of pods that are marked Evicted. If you kubectl describe ... one of those evicted pods, you see an error saying that it was evicted due to disk pressure.
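
To pick out the evicted pods without scanning the whole listing, you can filter on the STATUS column. This is a small sketch with a made-up helper name; it assumes the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/bin/sh
# Hypothetical helper: read "kubectl get pods --no-headers" output on
# stdin and print the names of pods whose STATUS column is Evicted.
evicted_pods() {
    awk '$3 == "Evicted" { print $1 }'
}

# Example:
#   kubectl get pods --no-headers | evicted_pods
#   kubectl describe pod "$(kubectl get pods --no-headers | evicted_pods | head -n 1)"
```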


Your nodes are not configured with a big enough root volume size. You need to make sure that each node’s root volume is big enough to store the biggest datum you expect to process anywhere on your DAG plus the size of the output files that will be written for that datum.

Let’s say you have a repo with 100 folders. You have a single pipeline with this repo as an input, and the glob pattern is /*. That means each folder will be processed as a single datum. If the biggest folder is 50GB and your pipeline’s output is about 3 times as big, then your root volume size needs to be bigger than:

50 GB (to accommodate the input) + 50 GB x 3 (to accommodate the output) = 200GB

In this case we would recommend 250GB to be safe. If your root volume size is less than 50GB (many defaults are 20GB), this pipeline will fail when downloading the input. The pod may get evicted and rescheduled to a different node, where the same thing will happen.
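
The arithmetic above can be wrapped in a small helper so you can plug in your own numbers. This is only a sketch of the rule of thumb from this section: the function name is made up, and it assumes the output size is a whole-number multiple of the input size.

```shell
#!/bin/sh
# Hypothetical helper: minimum root volume size (in GB) needed to process
# one datum, following the rule of thumb above.
#   $1 = size of the largest input datum, in GB
#   $2 = output size as an integer multiple of the input size
required_root_volume_gb() {
    input_gb="$1"
    output_ratio="$2"
    echo $(( input_gb + input_gb * output_ratio ))
}

# Example: a 50 GB datum whose output is about 3 times as big.
#   required_root_volume_gb 50 3    # prints 200
```

Treat the result as a lower bound and add headroom on top, as in the 250GB recommendation above.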

Pipeline Exists But Never Runs


You can see the pipeline via:

$ pachctl list-pipeline

But if you look at the job via:

$ pachctl list-job

It’s marked as running with 0/0 datums having been processed. If you inspect the job via:

$ pachctl inspect-job

You don’t see any worker set. E.g:

Worker Status:
WORKER              JOB                 DATUM               STARTED             

If you do kubectl get pod you see the worker pod for your pipeline, e.g:


But its state is Pending or CrashLoopBackOff.


First make sure that there is no parent job still running. Do pachctl list-job | grep yourPipelineName to see if there are pending jobs on this pipeline that were kicked off prior to your job. A parent job is the job that corresponds to the parent output commit of this pipeline. A job will block until all parent jobs complete.

If there are no parent jobs that are still running, then continue debugging:

Describe the pod via:

$ kubectl describe po/pipeline-foo-5-v1-273zc

If the state is CrashLoopBackOff, you’re looking for a descriptive error message. One possible cause is that you specified an image for your pipeline that does not exist.

If the state is Pending, it’s likely the cluster doesn’t have enough resources. In this case, you’ll see a could not schedule type of error message, which should describe which resource you’re low on. This is more likely to happen if you’ve set resource requests (cpu/mem/gpu) for your pipelines. In this case, you’ll need to scale up your resources. If you deployed using kops, you’ll want to edit the instance group, e.g. kops edit ig nodes ..., and increase the number of nodes. If you didn’t use kops to deploy, you can use your cloud provider’s auto scaling groups to increase the size of your instance group. Either way, it can take up to 10 minutes for the changes to go into effect.

You can read more about autoscaling here