Glossary

Ancestry Syntax

About #

Ancestry syntax in Pachyderm is used to reference the history of commits and branches in a Pachyderm input repository. Ancestry syntax is similar to Git syntax, and it allows users to navigate the history of commits and branches using two special characters: the caret (^) and the dot (.).

  • The ^ character is used to reference a commit or branch parent, where commit^ refers to the parent of the commit, and branch^ refers to the parent of the branch head. Multiple ^ characters can be used to reference earlier ancestors; for example, commit^^ refers to the grandparent of the commit.

  • The . character is used to reference a specific commit in the history of a branch. For example, branch.1 refers to the first commit on the branch, branch.2 refers to the second commit, and so on.

Ancestry syntax allows users to access historical versions of data stored in Pachyderm, which can be useful for tasks like debugging, testing, and auditing. However, it’s important to note that resolving ancestry syntax can be computationally intensive, especially for long chains of commits, so it’s best to use this feature judiciously.
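
A minimal sketch of ancestry syntax in practice, assuming a repository named images with a master branch (both names are placeholders):

pachctl list file images@master     # files at the head commit of the branch
pachctl list file images@master^    # files at the parent of the head commit
pachctl list file images@master^^   # files at the grandparent of the head commit
pachctl list file images@master.1   # files at the first commit on the branch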

Branch

About #

A Pachyderm branch is a pointer to a commit that moves along with new commits. By default, Pachyderm does not create any branches when you create a repository. Most users create a master branch to initiate the first commit.

Branches allow collaboration between teams of data scientists. However, the master branch is sufficient for most users.

Each branch stores information about provenance, including input and output branches. Pachyderm pipelines trigger a job when changes are detected in the HEAD of a branch.

You can create additional branches with pachctl create branch and view branches with pachctl list branch. Deleting a branch doesn’t delete its commits, and all branches require a head commit.

Example #

pachctl list branch images

# BRANCH HEAD
# master c32879ae0e6f4b629a43429b7ec10ccc
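
You can also point a new branch at an existing commit, for example to experiment without moving master. A sketch, assuming the same images repository:

pachctl create branch images@staging --head master
pachctl list branch images

# BRANCH  HEAD
# staging c32879ae0e6f4b629a43429b7ec10ccc
# master  c32879ae0e6f4b629a43429b7ec10ccc
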
Commit

About #

In Pachyderm, commits snapshot and preserve the state of files and directories in a repository at a point in time. Unlike Git commits, Pachyderm commits are centralized and transactional. You can create a commit with pachctl start commit and save it with pachctl finish commit. Once a commit is closed, its contents are immutable. Commits may be chained together to represent a sequence of states.

All commits have an alphanumeric ID, and you can reference a commit with <repo>@<commitID>. Each commit has an origin that indicates why it was produced (USER or AUTO).
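
A minimal sketch of the commit lifecycle, assuming a repository named images and a local file data.csv (both are placeholders):

pachctl start commit images@master              # open a new commit on master
pachctl put file images@master:/data.csv -f data.csv
pachctl finish commit images@master             # close the commit; its contents are now immutable
pachctl list commit images@master               # view the chain of commits on the branch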

Global Commits #

A commit with global scope (global commit) represents the set of all provenance-dependent commits sharing the same ID.

Sub-Commits #

A commit with a more focused scope (sub-commit) represents the “Git-like” record of one commit in a single branch of a repository’s file system.

Commit Set

About #

A Commit Set is an immutable set of all the commits that resulted from a single modification to the system. The commits within a commit set share a name (i.e., a Global ID). This naming scheme enables you to reference data related to a commit anywhere in its provenance graph simply by naming it.

Cron

About #

A cron, named after the Greek word chronos (time), is a time-based job scheduler that allows users to schedule and automate the execution of recurring tasks or commands at specific intervals. These tasks, referred to as cron jobs, are typically scripts or commands that perform specific actions.

Pachyderm uses the concept of crons in two ways: a cron pipeline spec, and a cron trigger.

Cron Pipelines vs Cron Triggers #

Cron Pipelines trigger on their specified cron interval, plus each time new input data is added. This enables you to create pipelines that trigger jobs at least once on a regular schedule. This could be useful if you periodically make changes to your user code but have no reason to commit more data. When you do commit more data, the Cron Pipeline still triggers as a normal pipeline would.
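
A Cron Pipeline is configured with a cron block in the input section of the pipeline specification. A sketch, where the input name and interval are placeholders:

"input": {
  "cron": {
    "name": "tick",
    "spec": "@every 1h"
  }
}

The spec field accepts standard cron expressions as well as @every intervals.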

Cron Triggers enable you to set up a scheduled recurring event on a repo branch that evaluates and fires the trigger. When a Cron Trigger fires but no new data has been added, there are no new downstream commits or jobs.

DAG

About #

In Pachyderm, a Directed Acyclic Graph (DAG) is a collection of pipelines connected by data dependencies. The DAG defines the order in which pipelines are executed and how data flows between them.

Each pipeline in a DAG processes data from its input repositories and produces output data that can be used as input by downstream pipelines. The input repositories of a pipeline can be the output repositories of other pipelines, allowing data to flow through the DAG.

To create a DAG in Pachyderm, you create multiple pipeline specifications and define the dependencies between them. You can define dependencies between pipelines using the input parameter in the pipeline specification. For example, if you have two pipelines named A and B, and B depends on the output of A, you would set the input parameter of B to the name of the output repository of A.
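
A sketch of pipeline B's specification under that scenario; the image and command are placeholders:

{
  "pipeline": {"name": "B"},
  "input": {
    "pfs": {"repo": "A", "glob": "/*"}
  },
  "transform": {
    "image": "example/b-image:1.0",
    "cmd": ["python3", "/code/transform.py"]
  }
}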

Data Parallelism

About #

Data parallelism refers to a parallel computing technique where a large dataset is partitioned and processed in parallel across multiple computing resources within a directed acyclic graph (DAG) or pipeline. In data parallelism, each task in the DAG/pipeline operates on a different subset of the dataset in parallel, allowing for efficient processing of large amounts of data. The results of each task are then combined to produce the final output. Data parallelism is often used in machine learning and deep learning pipelines where large datasets need to be processed in parallel using multiple computing resources. By distributing the data across different nodes, data parallelism can help reduce the overall processing time and improve the performance of the pipeline.

Datum

About #

A datum is the smallest indivisible unit of computation within a job. Datums are used to:

  • Divide your input data
  • Distribute processing workloads

A datum’s scope can be as large as all of your data at once, a directory, a file, or a combination of multiple inputs. The shape and quantity of your datums are determined by a glob pattern defined in your pipeline specification.

A job can have one, many, or no datums. Each datum is processed independently with a single execution of the user code on one of the pipeline worker pods. The individual output files produced by all of your datums are combined to create the final output commit.

If a job is successfully executed but has no matching files to transform, it is considered a zero-datum job.
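
To see how a job was divided into datums, you can list them for a given job; the pipeline name and job ID below are hypothetical:

pachctl list datum edges@5f93d03b65fa421996185e53f7f8b1e4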

Datum Processing States #

When a pipeline runs, it processes your datums. Some are processed successfully, while others may be skipped or even fail. Generally, processed datums fall into either a success or a failure state category.

Successful States #

| State   | Description |
|---------|-------------|
| Success | The datum has been successfully processed in this job. |
| Skipped | The datum has been successfully processed in a previous job, has not changed since then, and therefore was skipped in the current job. |

Failure States #

| State     | Description |
|-----------|-------------|
| Failed    | The datum failed to be processed. Any failed datum in a job fails the whole job. |
| Recovered | The datum failed but was recovered by the user's error handling code. Although the datum is marked as recovered, Pachyderm does not process it in downstream pipelines. A recovered datum does not fail the whole job. Just like failed datums, recovered datums are retried on the next run of the pipeline. |

You can view the information about datum processing states in the output of the pachctl list job <jobID> command:

[Image: datums in progress]

Note: Datums that failed are still included in the total but are not shown in the progress indicator.

Restarts #

A job restarts when an internal error (not the user code) occurs while processing a job. These occurrences are counted in the RESTART column.

Deferred Processing
Distributed Computing

About #

Distributed computing is a technique that allows you to split your jobs across multiple Pachyderm workers via the Parallelism PPS attribute. Leveraging distributed computing enables you to build production-scale pipelines with adjustable resources to optimize throughput.

For each job, all the datums are queued up and then distributed across the available workers. When a worker finishes processing its datum, it grabs a new datum from the queue until all the datums complete processing. If a worker pod crashes, its datums are redistributed to other workers for maximum fault tolerance.

[Image: Distributed computing basics]
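
The number of workers is controlled by the parallelism_spec attribute of the pipeline specification. A sketch of a fragment that requests four worker pods:

"parallelism_spec": {
  "constant": 4
}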

File

About #

A file is a Unix filesystem object, either a regular file or a directory, that stores data. Unlike source-code version-control systems, which are best suited to storing plain text files, Pachyderm can store any type of file, including binary files.

Often, data scientists operate with comma-separated values (CSV), JavaScript Object Notation (JSON), images, and other plain text and binary file formats. Pachyderm supports all file sizes and formats and applies storage optimization techniques, such as deduplication, in the background.

Glob Pattern

About #

A glob pattern is a string of characters that specifies a set of filenames or paths in a file system. The term “glob” is short for “global,” and refers to the fact that a glob pattern can match multiple filenames or paths at once. In Pachyderm, you can use glob patterns to define the shape of your datums against your inputs, which are spread across Pachyderm workers for distributed computing.

Examples #

| Glob Pattern | Datum created |
|--------------|---------------|
| /    | Pachyderm denotes the whole repository as a single datum and sends all input data to a single worker node to be processed together. |
| /*   | Pachyderm defines each top-level file or directory in the input repo as a separate datum. For example, if you have a repository with ten files and no directory structure, Pachyderm identifies each file as a single datum and processes them independently. |
| /*/* | Pachyderm processes each file or directory in each subdirectory as a separate datum. |
| /**  | Pachyderm processes each file in all directories and subdirectories as a separate datum. |

Glob patterns can also use other special characters, such as the question mark (?) to match a single character, or brackets ([...]) to match a set of characters.
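
You can preview which paths a glob pattern matches, and therefore how your datums will be shaped, with pachctl glob file; the repo and branch names are assumptions:

pachctl glob file "images@master:/*"    # one match (one datum) per top-level file or directory
pachctl glob file "images@master:/**"   # one match per file anywhere in the repo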

Global Identifier

About #

Global Identifiers provide a simple way to follow the provenance of a DAG. Commits and jobs sharing the same Global ID represent a logically-related set of objects.

When a new commit is made, Pachyderm creates an associated commit ID; all resulting downstream commits and jobs in your DAG will then share that same ID (the Global Identifier).

The following diagram illustrates the global commit and its various components:

[Image: global_commit_after_putfile]

Actions #

  1. List all global commits & jobs
  2. List all sub-commits associated with a global ID
  3. Track provenance downstream, live
  4. Delete a Branch Head
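
For example, given a hypothetical global ID, you can list every sub-commit and job that shares it:

pachctl list commit 1035715e796f45caae7a1d3ffd1f93ca    # all sub-commits sharing this global ID
pachctl list job 1035715e796f45caae7a1d3ffd1f93ca       # all jobs sharing this global ID
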
History

About #

History in Pachyderm is a record of the changes made to data over time, stored as a series of immutable snapshots (commits) that can be accessed using ancestry syntax and branch pointers. Each commit has a parentage structure, where new commits inherit content from their parents, creating a chain of commits that represents the full history of changes to the data.

Input Repository

About #

In Pachyderm, an input repository is a repository whose data is used as input for a Pachyderm pipeline. To define an input repository, you fill out the input attribute in the pipeline’s specification file.

There are several ways to structure the content of your input repos.

Once you have defined an input repository, you can use it as the input source for a Pachyderm pipeline. The pipeline will automatically subscribe to the branch of the input repository and process any new data that is added to the branch according to the pipeline configuration.
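
A minimal sketch of creating an input repository and adding data to it, with placeholder names:

pachctl create repo images
pachctl put file images@master:/data.csv -f ./data.csv
pachctl list file images@master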

Job

About #

A job is an execution of a pipeline triggered by new data detected in an input repository.

When a commit is made to the input repository of a pipeline, jobs are created for all downstream pipelines in a directed acyclic graph (DAG), but they do not run until the prior pipelines they depend on produce their output. Each job runs the user’s code against the current commit in a repository at a specified branch and then submits the results to the output repository of the pipeline as a single output commit.

Each job has a unique alphanumeric identifier (ID) that users can reference in the <pipeline>@<jobID> format. Jobs have the following states:

| State | Description |
|-------|-------------|
| CREATED | An input commit exists, but the job has not been started by a worker yet. |
| STARTING | The worker has allocated resources for the job (that is, the job counts towards parallelism), but it is still waiting on the inputs to be ready. |
| UNRUNNABLE | The job could not be run because one or more of its inputs is the result of a failed or unrunnable job. For example, say that pipelines Y and Z both depend on the output of pipeline X. If pipeline X fails, the jobs in both Y and Z pass from STARTING to UNRUNNABLE to signify that they were cancelled because of an upstream failure. |
| RUNNING | The worker is processing datums. |
| EGRESS | The worker has completed all the datums and is uploading the output to the egress endpoint. |
| FINISHING | After all datum processing and egress (if any) is done, the job transitions to a finishing state where post-processing tasks, such as compaction, are performed. |
| FAILURE | The worker encountered too many errors while processing a datum. |
| KILLED | The job timed out, or a user called StopJob. |
| SUCCESS | The job completed without any failures. |
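
To check job states, you can list jobs and inspect a specific one; the pipeline name and job ID below are hypothetical:

pachctl list job
pachctl inspect job edges@5f93d03b65fa421996185e53f7f8b1e4
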
NLP

NLP (Natural Language Processing) is a subfield of machine learning that focuses on teaching machines to understand and generate human language. It involves developing algorithms and models that can process, analyze, and generate natural language data, such as text, speech, and other forms of communication.

NLP has numerous applications in various industries, such as chatbots, voice assistants, machine translation, sentiment analysis, and text classification, among others. Some common techniques used in NLP include text preprocessing, feature extraction, language modeling, sequence-to-sequence models, attention-based models, and transformer-based models like BERT and GPT.

NLP models and algorithms often require large amounts of labeled data for training, and they can be computationally intensive. However, recent advancements in deep learning have led to significant improvements in NLP models, allowing them to achieve state-of-the-art performance on various natural language tasks.

Output Repository

About #

In Pachyderm, an output repo is a repository where the results of a pipeline’s processing are stored after being transformed by the provided user code. Every pipeline automatically creates an output repository with the same name as the pipeline.

When a pipeline runs, it creates a new commit in the output repository with the results of the processing. The commit contains a set of files that represent the output of the pipeline. Each commit in the output repository corresponds to a job that was run to generate that output.

An output repository can be created or deleted using a pachctl CLI command or the Pachyderm API.
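
Because the output repository shares the pipeline's name, you can browse and download its results directly; the pipeline name and file path are placeholders:

pachctl list file edges@master                  # list the pipeline's output files
pachctl get file edges@master:/out.png > out.png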

Pachyderm Worker

About #

Pachyderm workers are Kubernetes pods that run the Docker image (your user code) specified in the pipeline specification. When you create a pipeline, Pachyderm spins up workers that run continuously in the cluster, waiting for new data to process.

Each datum goes through the following processing phases inside a Pachyderm worker pod:

| Phase | Description |
|-------|-------------|
| Downloading | The Pachyderm worker pod downloads the datum contents from Pachyderm. |
| Processing  | The Pachyderm worker pod runs the contents of the datum against your code. |
| Uploading   | The Pachyderm worker pod uploads the results of processing into an output repository. |

[Image: Distributed processing internals]

Pipeline

About #

A pipeline is a Pachyderm primitive responsible for reading data from a specified source, such as a Pachyderm repo, transforming it according to the pipeline specification, and writing the result to an output repo.

Pipelines subscribe to a branch in one or more input repositories, and every time the branch has a new commit, the pipeline executes a job that runs user code to completion and writes the results to a commit in the output repository.

Pipelines are defined declaratively using a JSON or YAML file (the pipeline specification), which must include the name, input, and transform parameters at a minimum. Pipelines can be chained together to create a directed acyclic graph (DAG).

Pipeline Inputs

About #

In Pachyderm, pipeline inputs are defined as the source of the data that the pipeline reads and processes. The input for a pipeline can be a Pachyderm repository (input repo) or an external data source, such as a file in a cloud storage service.

To define a pipeline input, you need to specify the source of the data and how the data is organized. This is done in the input section of the pipeline specification file, which is a YAML or JSON file that defines the configuration of the pipeline.

Input Types #

The input section can contain one or more input sources, each specified as a separate block.
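
For instance, a cross input pairs every datum from one repository with every datum from another. A sketch, where the repo names are assumptions:

"input": {
  "cross": [
    {"pfs": {"repo": "images", "glob": "/*"}},
    {"pfs": {"repo": "models", "glob": "/"}}
  ]
}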

Pipeline Specification

About #

A pipeline specification is a declarative configuration file used to define the behavior of a Pachyderm pipeline. It is typically written in YAML or JSON format and contains information about the pipeline’s input sources, output destinations, Docker image (user code), command, and other metadata.

In addition to simply transforming your data, you can also implement more advanced techniques through the pipeline specification.
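
A minimal sketch of a specification with the required fields, written in YAML and using placeholder values:

pipeline:
  name: edges
input:
  pfs:
    repo: images
    glob: "/*"
transform:
  image: example/edges:1.0
  cmd: ["python3", "/edges.py"]

The pipeline is then created from this file with pachctl create pipeline -f <spec-file>.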

Project
Provenance

About #

Provenance in Pachyderm refers to the tracking of the dependencies and relationships between datasets, as well as the ability to go back in time and see the state of a dataset or repository at a particular moment. Pachyderm models both commit provenance and branch provenance to represent the dependencies between data in the pipeline.

Commit Provenance #

Commit provenance refers to the relationship between commits in different repositories. If a commit in a repository is derived from a commit in another repository, the derived commit is provenant on the source commit. Capturing this relationship supports queries regarding how data in a commit was derived.

Branch Provenance #

Branch provenance represents a more general relationship between data. It asserts that future commits in the downstream branch will be derived from the head commit of the upstream branch.

Traversing Provenance #

Pachyderm automatically maintains a complete audit trail, allowing all results to be fully reproducible. To track the direct provenance of commits and learn where the data in a repository originates, you can use the pachctl inspect commit command to view provenance information, including the origin kind, direct provenance, and size of the data.

Pachyderm’s DAG structure makes it easy to traverse the provenance and subvenance in any commit. All related steps in a DAG share the same global identifier, making it possible to run pachctl list commit <commitID> to get the full list of all the branches with commits created due to provenance relationships.
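
A sketch of inspecting a commit's provenance, assuming a repository named images:

pachctl inspect commit images@master    # shows the origin kind, direct provenance, and size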

Task Parallelism

About #

Task parallelism refers to a parallel computing technique where multiple tasks within a directed acyclic graph (DAG) or pipeline are executed simultaneously on different computing resources. In task parallelism, the focus is on executing different tasks in parallel rather than parallelizing a single task. This means that each task in the DAG/pipeline is executed independently of other tasks, allowing for efficient use of resources and faster completion of the overall DAG/pipeline. Task parallelism is often used in data processing pipelines or workflows where tasks can be executed in parallel without any dependency on each other.

User Code

About #

In Pachyderm, user code refers to the custom code that users write to process their data in pipelines. User code can be written in any language and can use any libraries or frameworks.

Pachyderm allows users to define their user code as a Docker image, which can be pushed to a registry and referenced using the Transform attribute of the pipeline’s specification. The user code image contains the necessary dependencies and configuration for the code to run in Pachyderm’s distributed computing environment.

User code can be defined for each pipeline stage in Pachyderm, allowing users to chain together multiple processing steps and build complex data pipelines. Pachyderm also provides a Python library for building pipelines, which simplifies the process of defining user code and specifying pipeline stages.
