Pipeline
Learn about the types of pipelines, including spout, cron, and service pipelines.
May 26, 2023
A pipeline is a Pachyderm primitive that is responsible for reading data from a specified source, such as a Pachyderm repo, transforming it according to the pipeline configuration, and writing the result to an output repo.
A pipeline subscribes to a branch in one or more input repositories. Every time the branch receives a new commit, the pipeline executes a job that runs your code to completion and writes the results to a commit in the output repository. Every pipeline automatically creates an output repository with the same name as the pipeline. For example, a pipeline named model writes all results to the model output repo.
In Pachyderm, a pipeline is an individual execution step. You can chain multiple pipelines together to create a directed acyclic graph (DAG).
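For instance, a two-step DAG can be built from two pipeline specifications, where the downstream pipeline's input repo is simply the upstream pipeline's name. The spec below is a sketch; the pipeline, image, and script names (wordsort, wordsort-image, sort_counts.py) are hypothetical, and it assumes an upstream pipeline named wordcount exists:

```json
{
  "pipeline": {
    "name": "wordsort"
  },
  "transform": {
    "image": "wordsort-image",
    "cmd": ["python3", "/sort_counts.py"]
  },
  "input": {
    "pfs": {
      "repo": "wordcount",
      "glob": "/"
    }
  }
}
```

Here wordsort subscribes to the output repo of the wordcount pipeline, so every new commit in wordcount triggers a wordsort job.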
You define a pipeline declaratively, using a JSON or YAML file that follows Pachyderm's pipeline specification reference.
A minimum pipeline specification must include the following parameters:

name — The name of your data pipeline. Set a meaningful name, such as the name of the transformation that the pipeline performs; for example, split or edges. Pachyderm automatically creates an output repository with the same name. A pipeline name must be an alphanumeric string of fewer than 63 characters and can include dashes and underscores; no other special characters are allowed.

input — The location of the data that you want to process, such as a Pachyderm repository. You can specify multiple input repositories and set up the data to be combined in various ways. For more information, see Cross and Union, Join, and Group. One very important property defined in the input field is the glob pattern, which specifies how Pachyderm breaks the data into individual processing units, called datums. For more information, see Datum.

transform — The code that you want to run against your data. The transform section must include an image field that defines the Docker image that you want to run, as well as a cmd field for the specific code within the container that you want to execute, such as a Python script.
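The glob pattern deserves particular attention because it determines how work is parallelized. As a sketch based on Pachyderm's glob semantics, an input fragment might look like this:

```json
{
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
```

With a glob of "/*", each top-level file or directory in the data repo becomes its own datum, so Pachyderm can process datums in parallel and skip unchanged ones on later commits; with "/", the entire repo is a single datum, and any change reprocesses everything.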
Example
{
  "pipeline": {
    "name": "wordcount"
  },
  "transform": {
    "image": "wordcount-image",
    "cmd": ["python3", "/my_python_code.py"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
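Because Pachyderm also accepts YAML, the same specification can be written as a mechanical translation of the JSON above:

```yaml
pipeline:
  name: wordcount
transform:
  image: wordcount-image
  cmd: ["python3", "/my_python_code.py"]
input:
  pfs:
    repo: data
    glob: "/*"
```

Both formats are equivalent; choose whichever fits your tooling.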