Pachyderm Pipeline System (PPS) is the computational component of the Pachyderm platform that enables you to perform various transformations on your data. Pachyderm pipelines have the following main concepts:
A pipeline is a job-spawner that waits for certain conditions to be met. Most commonly, this means watching one or more Pachyderm repositories for new commits of data. When new data arrives, a pipeline executes a user-defined piece of code to perform an operation and process the data. Each of these executions is called a job.
Pachyderm has the following special types of pipelines:
- A cron input enables you to trigger the pipeline code at a specific interval. This type of pipeline is useful for such tasks as web scraping, querying a database, and other similar operations where you do not want to wait for new data, but instead trigger the pipeline periodically.
- A service is a special type of pipeline that instead of executing jobs and then waiting, permanently runs a serving data through an endpoint. For example, you can be serving an ML model or a REST API that can be queried. A service reads data from Pachyderm but does not have an output repo.
- A spout is a special type of pipeline for ingesting data from a data stream. A spout can subscribe to a message stream, such as Kafka or Amazon SQS, and ingest data when it receives a message. A spout does not have an input repo.
- A job is an individual execution of a pipeline. A job can succeed or fail. Within a job, data and processing can be broken up into individual units of work called datums.
- A datum is the smallest indivisible unit of work within a job. Different datums can be processed in parallel within a job.
Read the sections below to learn more about these concepts: