Cron Pipeline

Learn about the concept of a cron pipeline in Pachyderm.

March 22, 2023

A Cron pipeline is triggered on a set time interval rather than whenever new data is committed to an input repository.

About Cron Pipelines #

Use Cases #

Cron pipelines are great for tasks like:

Behavior #

When you create a Cron pipeline, Pachyderm creates a new input data repository that corresponds to the cron input. Pachyderm then automatically commits a timestamp file to the cron input repository at the interval you specify, which triggers the pipeline.

By default, each cron trigger adds a new tick file to the cron input repository, accumulating more datums over time. Optionally, you can set the overwrite flag to true to overwrite the timestamp file on each tick. To learn more about overwriting commits in Pachyderm, see Datum processing.
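The difference between the default behavior and overwrite mode can be sketched with a purely local simulation. This is not Pachyderm itself: a temporary directory stands in for the cron input repository, and simulate_ticks and the tick-N file names are hypothetical names invented for this illustration.

```shell
#!/bin/sh
# Local simulation only: a plain directory stands in for the cron input repo.
# simulate_ticks is a hypothetical helper, not a pachctl command.
simulate_ticks() {
  overwrite="$1"   # "true" or "false"
  ticks="$2"       # number of ticks to simulate
  repo=$(mktemp -d)
  i=1
  while [ "$i" -le "$ticks" ]; do
    if [ "$overwrite" = "true" ]; then
      # Overwrite mode: the previous tick file is replaced, not kept.
      rm -f "$repo"/*
    fi
    # Each tick adds one file (named tick-N here instead of a real timestamp).
    : > "$repo/tick-$i"
    i=$((i + 1))
  done
  # Report how many tick files remain in the simulated repo.
  count=$(ls "$repo" | wc -l | tr -d ' ')
  rm -rf "$repo"
  echo "$count"
}

echo "default (accumulate): $(simulate_ticks false 3) tick file(s)"
echo "overwrite enabled:    $(simulate_ticks true 3) tick file(s)"
```

After three simulated ticks, the default run is left with three files and the overwrite run with one, which mirrors how the number of datums grows over time unless overwrite is set.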

Required Parameters #

At minimum, a Cron pipeline must include all of the following parameters:

Parameter    Description
"name"       A descriptive name of the cron pipeline.
"spec"       The interval between scheduled cron jobs; accepts RFC 3339 inputs, Predefined Schedules (@daily), and Intervals (@every 1h30m20s).
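To show where these parameters sit in a complete pipeline spec, here is a minimal sketch that writes one to a file. The pipeline name (cron-example), the image, and the command are placeholders chosen for this illustration, not values from the Pachyderm documentation; the sketch assumes the cron input is mounted at /pfs/tick, following the usual /pfs/<input name> convention.

```shell
#!/bin/sh
# Write a minimal cron pipeline spec to the current directory.
# The pipeline name, image, and command are placeholders for this sketch.
cat > cron-example.json <<'EOF'
{
  "pipeline": { "name": "cron-example" },
  "transform": {
    "image": "alpine:3.18",
    "cmd": ["sh", "-c", "ls /pfs/tick > /pfs/out/ticks.txt"]
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 60s"
    }
  }
}
EOF

# On a machine connected to a Pachyderm cluster, the spec could then be
# submitted with:
#   pachctl create pipeline -f cron-example.json
```

The submit command is left as a comment so that the script only writes the file and can be run without a cluster.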

Callouts #

Examples #

Every 60 Seconds #

  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 60s"
    }
  }

Daily with Overwrites #

  "input": {
    "cron": {
      "name": "tick",
      "spec": "@daily",
      "overwrite": true
    }
  }

SQL Ingest with Jsonnet #

pachctl update pipeline --jsonnet \
  --arg name=myingest \
  --arg url="mysql://root@mysql:3306/test_db" \
  --arg query="SELECT * FROM test_data" \
  --arg hasHeader=false \
  --arg cronSpec="@every 60s" \
  --arg secretName="mysql-creds" \
  --arg format=json