Environment Variables

Learn how to configure environment variables.

You can define environment variables that handle required configuration. In Pachyderm, you can define the following types of environment variables:

  • pachd variables: Used for your Pachyderm daemon container.

  • Pachyderm worker variables: Used by the Kubernetes pods that run your pipeline code.

Tip: You can reference environment variables in your code. For example, if your code writes data to an external system and you want to know the current job ID, you can read it from the PACH_JOB_ID environment variable.
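As a minimal sketch of that idea, the following Python snippet tags an outgoing record with the current job ID. The function name `tag_with_job_id` and the `"unknown"` fallback are illustrative assumptions; only the PACH_JOB_ID variable itself comes from Pachyderm.

```python
import os


def tag_with_job_id(record, env=os.environ):
    """Return a copy of `record` annotated with the current Pachyderm job ID.

    PACH_JOB_ID is set by Pachyderm inside worker containers. The "unknown"
    fallback is an assumption that keeps this sketch runnable elsewhere.
    """
    tagged = dict(record)
    tagged["job_id"] = env.get("PACH_JOB_ID", "unknown")
    return tagged


if __name__ == "__main__":
    print(tag_with_job_id({"status": "done"}))
```

Passing `env` explicitly makes the function easy to exercise outside a pipeline; inside a worker you would normally rely on the default.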

pachd Environment Variables

You can find the list of pachd environment variables in the pachd manifest by running the following command:

kubectl get deploy pachd -o yaml
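If you prefer to inspect the values programmatically, a sketch like the following can pull the variables out of the manifest. It assumes you export the manifest as JSON (`kubectl get deploy pachd -o json`) rather than YAML, so that only the Python standard library is needed. `pachd_env_from_manifest` is a hypothetical helper name, not part of Pachyderm.

```python
import json
import sys


def pachd_env_from_manifest(manifest_json):
    """Map each container environment variable name to its literal value.

    Variables populated through `valueFrom` (for example, from a secret)
    have no literal value in the manifest, so they map to None.
    """
    manifest = json.loads(manifest_json)
    containers = manifest["spec"]["template"]["spec"]["containers"]
    env = {}
    for container in containers:
        for var in container.get("env", []):
            env[var["name"]] = var.get("value")
    return env


if __name__ == "__main__":
    # Usage (assumed script name):
    #   kubectl get deploy pachd -o json | python pachd_env.py
    for name, value in sorted(pachd_env_from_manifest(sys.stdin.read()).items()):
        print(f"{name}={value}")
```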

The following tables list all the pachd environment variables.

Global Configuration

| Environment Variable | Default Value | Description |
| --- | --- | --- |
| ETCD_SERVICE_HOST | N/A | The host on which the etcd service runs. |
| ETCD_SERVICE_PORT | N/A | The etcd port number. |
| PPS_WORKER_GRPC_PORT | 80 | The gRPC port number. |
| PORT | 650 | The pachd port number. |
| HTTP_PORT | 652 | The HTTP port number. |
| PEER_PORT | 653 | The port for pachd-to-pachd communication. |
| NAMESPACE | default | The namespace in which Pachyderm is deployed. |

pachd Configuration

| Environment Variable | Default Value | Description |
| --- | --- | --- |
| NUM_SHARDS | 32 | The maximum number of pachd pods that can run in a single cluster. |
| STORAGE_BACKEND | "" | The storage backend defined for the Pachyderm cluster. |
| STORAGE_HOST_PATH | "" | The host path to storage. |
| KUBERNETES_PORT_443_TCP_ADDR | none | An IP address that Kubernetes exports automatically for your code to communicate with the Kubernetes API. Read access only. Most variables that use the PORT_ADDRESS_TCP_ADDR pattern are Kubernetes environment variables. For more information, see Kubernetes environment variables. |
| METRICS | true | Defines whether anonymous Pachyderm metrics are collected. |
| BLOCK_CACHE_BYTES | 1G | The size of the block cache in pachd. |
| WORKER_IMAGE | "" | The base Docker image that is used to run your pipeline. |
| WORKER_SIDECAR_IMAGE | "" | The pachd image that is used as a worker sidecar. |
| WORKER_IMAGE_PULL_POLICY | IfNotPresent | The pull policy that defines how Docker images are pulled. You can set a Kubernetes image pull policy as needed. |
| LOG_LEVEL | info | Verbosity of the log output. If you want to disable logging, set this variable to 0. Valid options: debug, info, error. For more information, see Go logrus log levels. |
| IAM_ROLE | "" | The role that defines permissions for Pachyderm in AWS. |
| IMAGE_PULL_SECRET | "" | The Kubernetes secret for image pull credentials. |
| EXPOSE_OBJECT_API | false | Controls access to the internal Pachyderm API. |
| WORKER_USES_ROOT | true | Controls root access in the worker container. |
| S3GATEWAY_PORT | 600 | The S3 gateway port number. |
| DISABLE_COMMIT_PROGRESS_COUNTER | false | A feature flag that disables the commit propagation progress counter. If you have a large DAG, setting this parameter to true might help improve etcd performance. You only need to set this parameter on the pachd pod. Pachyderm passes this parameter to worker containers automatically. |

Storage Configuration

| Environment Variable | Default Value | Description |
| --- | --- | --- |
| STORAGE_MEMORY_THRESHOLD | N/A | Defines the storage memory threshold. |
| STORAGE_SHARD_THRESHOLD | N/A | Defines the storage shard threshold. |

Pipeline Worker Environment Variables

Pachyderm defines many environment variables for each Pachyderm worker that runs your pipeline code. You can print the list of environment variables to your Pachyderm logs by including the env command in your pipeline specification. For example, if you have an images repository, you can configure your pipeline specification like this:

{
    "pipeline": {
        "name": "env"
    },
    "input": {
        "pfs": {
            "glob": "/",
            "repo": "images"
        }
    },
    "transform": {
        "cmd": ["sh"],
        "stdin": ["env"],
        "image": "ubuntu:14.04"
    }
}

Run this pipeline. When it completes, you can view the logged variables by running the following command:

pachctl logs --pipeline=env
PPS_WORKER_IP=172.17.0.7
DASH_PORT_8081_TCP_PROTO=tcp
PACHD_PORT_600_TCP_PORT=600
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT=tcp://10.96.0.1:443
...

You should see a lengthy list of variables. Many of them define internal networking parameters that you will most likely not need to use.
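To cut through the noise, you could filter the environment from within your pipeline code instead of printing everything. The sketch below keeps only variables whose names start with PACH_ or PPS_. That prefix list is an assumption based on the variable names shown on this page, not an exhaustive rule, and `pachyderm_vars` is a hypothetical helper.

```python
import os

# Assumed prefixes, taken from the variable names listed on this page.
PACHYDERM_PREFIXES = ("PACH_", "PPS_")


def pachyderm_vars(env=os.environ):
    """Return the variables that look Pachyderm-specific, sorted by name."""
    return {
        name: value
        for name, value in sorted(env.items())
        if name.startswith(PACHYDERM_PREFIXES)
    }


if __name__ == "__main__":
    for name, value in pachyderm_vars().items():
        print(f"{name}={value}")
```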

Most users find the following environment variables particularly useful:

| Environment Variable | Description |
| --- | --- |
| AWS_ACCESS_KEY_ID | The ID that contains your AWS access key; requires pfs.s3: true or s3Out: true in your pipeline spec. |
| AWS_SECRET_ACCESS_KEY | The name of the secret which contains your AWS access key; requires pfs.s3: true or s3Out: true in your pipeline spec. |
| PACH_JOB_ID | The ID of the current job. For example, PACH_JOB_ID=8991d6e811554b2a8eccaff10ebfb341. |
| PACH_DATUM_ID | The ID of the current datum. |
| PACH_DATUM_<input.name>_JOIN_ON | Exposes the join_on match to the pipeline’s job. |
| PACH_DATUM_<input.name>_GROUP_BY | Exposes the group_by match to the pipeline’s job. |
| PACH_OUTPUT_COMMIT_ID | The ID of the commit in the output repo for the current job. For example, PACH_OUTPUT_COMMIT_ID=a974991ad44d4d37ba5cf33b9ff77394. |
| PPS_NAMESPACE | The PPS namespace. For example, PPS_NAMESPACE=default. |
| PPS_SPEC_COMMIT | The hash of the pipeline specification commit. This value is tied to the pipeline version. Therefore, jobs that use the same version of the same pipeline have the same spec commit. For example, PPS_SPEC_COMMIT=3596627865b24c4caea9565fcde29e7d. |
| PPS_POD_NAME | The name of the pipeline pod. For example, pipeline-env-v1-zbwm2. |
| PPS_PIPELINE_NAME | The name of the pipeline that this pod runs. For example, env. |
| PIPELINE_SERVICE_PORT_PROMETHEUS_METRICS | The port that you can use to expose metrics to Prometheus from within your pipeline. The default value is 9090. |
| HOME | The path to the home directory. The default value is /root. |
| <input-repo>=<path/to/input/repo> | The path to the filesystem that is defined in the input in your pipeline specification. Pachyderm defines such a variable for each input. The path is defined by the glob pattern in the spec. For example, if you have an input images and a glob pattern of /, Pachyderm defines the images=/pfs/images variable. If you have a glob pattern of /*, Pachyderm matches the files in the images repository and, therefore, the path is images=/pfs/images/liberty.png. |
| input_COMMIT | The ID of the commit that is used for the input. For example, images_COMMIT=fa765b5454e3475f902eadebf83eac34. |
| S3_ENDPOINT | A Pachyderm S3 gateway sidecar container endpoint. If you have an S3-enabled pipeline, this parameter specifies a URL that you can use to access the state of the pipeline’s repositories at the time a particular job was run. The URL has the following format: http://<job-ID>-s3:600. An example of accessing the data by using AWS CLI looks like this: `echo foo_data |
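A minimal sketch of reading several of these variables in pipeline code might look like the following. It assumes an input named images, as in the example earlier on this page. The `describe_datum` helper and the shape of its return value are hypothetical.

```python
import os


def describe_datum(input_name, env=os.environ):
    """Collect the job, datum, and input details for one pipeline input.

    Missing variables come back as None so that the function also runs
    outside a Pachyderm worker.
    """
    return {
        "job_id": env.get("PACH_JOB_ID"),
        "datum_id": env.get("PACH_DATUM_ID"),
        "output_commit": env.get("PACH_OUTPUT_COMMIT_ID"),
        # Pachyderm sets a variable named after the input to the path
        # matched by the glob pattern, for example images=/pfs/images.
        "input_path": env.get(input_name),
        # Commit ID of the input, for example images_COMMIT.
        "input_commit": env.get(f"{input_name}_COMMIT"),
    }


if __name__ == "__main__":
    print(describe_datum("images"))
```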

In addition to these environment variables, Kubernetes injects others for Services that run inside the cluster. These variables enable your code to connect to those services, which can be powerful. However, because Pachyderm might retry processing multiple times, any side effect that your code has on an outside service might be repeated.

For example, if your code writes a row to a database, that row might be written multiple times because of retries. Interaction with outside services must be idempotent to prevent unexpected behavior. Furthermore, one of the running services that your code can connect to is Pachyderm itself. This is generally not recommended as very little of the Pachyderm API is idempotent, but in some specific cases it can be a viable approach.
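One way to make such a write idempotent is to key each row on the job and datum IDs so that a retry adds nothing new. The sketch below uses SQLite from the Python standard library as a stand-in for the external database. The table name, schema, and `record_result` helper are all assumptions for illustration, not a Pachyderm API.

```python
import os
import sqlite3


def record_result(conn, result, env=os.environ):
    """Insert one result row per (job, datum); a retried write is a no-op.

    Raises KeyError when PACH_JOB_ID or PACH_DATUM_ID is not set, because
    the idempotency key would otherwise be meaningless.
    """
    key = (env["PACH_JOB_ID"], env["PACH_DATUM_ID"])
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "job_id TEXT NOT NULL, datum_id TEXT NOT NULL, result TEXT, "
        "PRIMARY KEY (job_id, datum_id))"
    )
    # INSERT OR IGNORE skips the row when the primary key already exists.
    conn.execute(
        "INSERT OR IGNORE INTO results VALUES (?, ?, ?)",
        (key[0], key[1], result),
    )
    conn.commit()
```

Calling `record_result` twice with the same job and datum IDs leaves a single row in the table, which is the behavior you want when a datum is retried.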