Run Commands

Datum Batching

Learn how to batch datums to optimize performance.

By default, Pachyderm processes each datum independently. This means that your user code is called once for each datum. This can be inefficient and costly if you have a large number of small datums or if your user code is slow to start.

When you have a large number of datums, you can batch them to optimize performance. Pachyderm provides a next datum command that you can use to batch datums.

Flow Diagram #

flowchart LR
    user_code(User Code)
    process_datum{process datum}

    cmd_err(Run cmd_err)
    kill[Kill User Code]  
    datum?{datum exists?}
    cmd_err?{cmd_err defined}

    user_code ==>ndsuccess
    ndsuccess =====> datum?
    datum? ==>|yes| process_datum
    process_datum ==>|success| response
    response ==> user_code

    datum? -->|no| kill
    process_datum -->|fail| nderror
    nderror --> cmd_err?
    cmd_err? -->|yes| cmd_err
    cmd_err? -->|no|kill
    cmd_err --> retry?
    retry? -->|yes| response
    retry? -->|no| kill

How to Batch Datums #

Via PachCTL #

  1. Define your user code and build a docker image. Your user code must call pachctl next datum to get the next datum to process.

  2. Create a repo (e.g., pachctl create repo repoName).

  3. Define a pipeline spec in YAML or JSON that references your Docker image and repo.

  4. Add the following to the transform section of your pipeline spec:

    • datum_batching: true
      name: p_datum_batching_example
        repo: repoName
        glob: "/*"
      datum_batching: true
      image: user/docker-image:tag
  5. Create the pipeline (e.g., pachctl update pipeline -f pipeline.yaml).

  6. Monitor the pipeline’s state either via Console or via pachctl list pipeline.


You can view the printed confirmation of “Next datum called” in the logs your pipeline’s job.