Triggering Pipelines Periodically (cron)

Pachyderm pipelines are triggered by changes to their input data repositories (as further discussed in What Happens When You Create a Pipeline). However, if a pipeline consumes data from sources outside of Pachyderm, it can’t use Pachyderm’s triggering mechanism to process updates from those sources. For example, you might need to:

  • Scrape websites
  • Make API calls
  • Query a database
  • Retrieve a file from S3 or FTP

You can schedule pipelines like these to run regularly with Pachyderm’s built-in cron input type. You can find an example pipeline that queries MongoDB periodically here.

Cron Example

Let’s say that we want to query a database every 10 seconds and update our dataset with the new data every time the pipeline is triggered. We could do this with cron input as follows:

  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 10s"
    }
  }

When we create this pipeline, Pachyderm will create a new input data repository corresponding to the cron input. It will then automatically commit a timestamp file every 10 seconds to the cron input repository, which will automatically trigger our pipeline.

The pipeline will run every 10 seconds, querying our database and updating its output.
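As a rough illustration of what the pipeline's own code might do on each tick, here is a minimal sketch. It assumes the cron input named tick is mounted at /pfs/tick and that results are written to /pfs/out, following Pachyderm's usual mount convention, and that the timestamp files have names that sort chronologically. The fetch_rows function is a hypothetical stand-in for the real database query.

```python
import json
from pathlib import Path


def fetch_rows():
    """Hypothetical stand-in for the real database query."""
    return [{"id": 1, "value": "example"}]


def run_tick(tick_dir="/pfs/tick", out_dir="/pfs/out"):
    """Query the data source once and name the result after the newest tick file."""
    tick_path = Path(tick_dir)
    if not tick_path.is_dir():
        return None  # not running inside a pipeline container
    # Assumption: tick file names are timestamps that sort chronologically.
    tick_files = sorted(p.name for p in tick_path.iterdir() if p.is_file())
    if not tick_files:
        return None  # no tick has been committed yet
    out_path = Path(out_dir) / f"{tick_files[-1]}.json"
    out_path.write_text(json.dumps(fetch_rows()))
    return out_path


if __name__ == "__main__":
    run_tick()
```

The script never needs to parse the tick file itself. The file's arrival is only the trigger, and the real work is the query.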

We have used the @every 10s cron spec here, but you can use any standard cron expression. For example, */10 * * * * indicates that the pipeline should run every 10 minutes. These formats should be familiar to anyone who has used cron before, and you can find more examples here.
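To make the */10 notation concrete, the sketch below expands a single cron field into the values it matches. It is a simplified illustration, not Pachyderm's parser: expand_field is a hypothetical helper that handles only *, */n, single numbers, ranges such as 1-5, and comma-separated lists.

```python
def expand_field(field, lo, hi):
    """Return the sorted values in [lo, hi] matched by one cron field."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


# Minute field of "*/10 * * * *": fires at these minutes of every hour.
print(expand_field("*/10", 0, 59))  # [0, 10, 20, 30, 40, 50]
```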

By default, Pachyderm will run the pipeline on input data that has come in since the last tick. If instead we would like the pipeline to reprocess all the data, we can set the overwrite flag to true:

  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 10s",
      "overwrite": true
    }
  }

With this setting, Pachyderm overwrites the timestamp file on each tick. Because the previously processed data is associated with the old file, that file's absence tells Pachyderm that the data needs to be reprocessed.
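For context, the cron input shown above is only one part of a complete pipeline spec. The sketch below assembles such a spec in Python and prints it as JSON. The pipeline name, image, and command are hypothetical placeholders. Only the input block comes from the examples above.

```python
import json


def cron_pipeline_spec(name, image, cmd, spec="@every 10s", overwrite=False):
    """Build a pipeline spec dict whose only input is a cron tick."""
    cron = {"name": "tick", "spec": spec}
    if overwrite:
        cron["overwrite"] = True  # reprocess all data on every tick
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"cron": cron},
    }


# Hypothetical name, image, and command, for illustration only.
spec = cron_pipeline_spec(
    "db-query", "example/db-query:latest", ["python3", "/query.py"], overwrite=True
)
print(json.dumps(spec, indent=2))
```

The printed JSON could be saved to a file and supplied to pachctl when creating the pipeline.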