Standard ML Pipeline

Learn how to build a basic machine learning pipeline.

In this tutorial, we’ll build a simple machine learning pipeline in Pachyderm that trains a regression model on housing market data to predict the value of homes in Boston.

Before You Start #

Tutorial #

The user code for this tutorial runs in a Docker image built on top of the civisanalytics/datascience-python base image, which includes the necessary dependencies. The code uses pandas to import the structured dataset and the scikit-learn library to train the model.

1. Create a Project & Input Repo #

  1. Create a project named standard-ml-tutorial.
    pachctl create project standard-ml-tutorial
  2. Set the project as current.
    pachctl config update context --project standard-ml-tutorial
  3. Create a repo named housing_data.
    pachctl create repo housing_data

2. Create a Regression Pipeline #

  1. Create a file named regression.json with the following contents:

    # regression.json
    {
        "pipeline": {
            "name": "regression"
        },
        "description": "A pipeline that trains a regression model for housing prices.",
        "input": {
            "pfs": {
                "glob": "/*",
                "repo": "housing_data"
            }
        },
        "transform": {
            "cmd": [
                "python", "",
                "--input", "/pfs/housing_data/",
                "--target-col", "MEDV",
                "--output", "/pfs/out/"
            ],
            "image": "pachyderm/housing-prices:1.11.0"
        }
    }
  2. Save the file.

  3. Run the following command to create the pipeline:

    pachctl create pipeline -f regression.json

The pipeline writes its output to /pfs/out/, which Pachyderm stores in an output repo with the same name as the pipeline (regression).
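If you would rather generate the spec than hand-edit JSON, the following minimal Python sketch builds the same structure as a dictionary and writes it to disk. The helper names build_regression_spec and write_spec are hypothetical; the field values are copied from the spec above, including the empty script argument, which is left exactly as it appears there.

```python
import json
from pathlib import Path


def build_regression_spec():
    """Return this tutorial's pipeline spec as a plain dictionary."""
    return {
        "pipeline": {"name": "regression"},
        "description": "A pipeline that trains a regression model for housing prices.",
        "input": {"pfs": {"glob": "/*", "repo": "housing_data"}},
        "transform": {
            "cmd": [
                "python", "",  # script argument is empty in the tutorial's spec; left as-is
                "--input", "/pfs/housing_data/",
                "--target-col", "MEDV",
                "--output", "/pfs/out/",
            ],
            "image": "pachyderm/housing-prices:1.11.0",
        },
    }


def write_spec(path="regression.json"):
    """Serialize the spec as indented JSON and return the path written."""
    target = Path(path)
    target.write_text(json.dumps(build_regression_spec(), indent=4) + "\n")
    return target
```

Calling write_spec() produces a regression.json that can be passed to pachctl create pipeline -f as in step 3. Building the spec in code makes missing braces or commas impossible, since json.dumps always emits well-formed JSON.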

3. Upload the Housing Dataset #

  1. Download our first example dataset, housing-simplified-1.csv.

  2. Add the data to your repo. Processing begins automatically: any time you add new data, the pipeline re-runs.

    pachctl put file housing_data@master:housing-simplified.csv -f /path/to/housing-simplified-1.csv
  3. Verify that the data is in the repository.

    pachctl list file housing_data@master
    # NAME                    TYPE SIZE     
    # /housing-simplified.csv file 2.482KiB
  4. Verify that the pipeline is running by looking at the status of the job(s).

    pachctl list job
    # ID                               SUBJOBS PROGRESS CREATED            MODIFIED
    # e7dd14d201a64edc8bf61beed6085ae0 1       ▇▇▇▇▇▇▇▇ 48 seconds ago     48 seconds ago     
    # df117068124643299d46530859851a4b 1       ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago 

4. Download Output Files #

Once the job has finished, we can download the files it created.

  1. View a list of the files in the output repo.
    pachctl list file regression@master
    # NAME                                  TYPE SIZE     
    # /housing-simplified_corr_matrix.png   file 18.66KiB 
    # /housing-simplified_cv_reg_output.png file 86.07KiB 
    # /housing-simplified_model.sav         file 798.5KiB 
    # /housing-simplified_pairplot.png      file 100.8KiB 
  2. Download the files.
    pachctl get file regression@master:/ --recursive --output .
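The transform command and the output listing hint at how the user code is organized: it accepts --input, --target-col, and --output, and it appears to name each artifact after the input file's stem. The sketch below illustrates that contract under those assumptions; it is not the tutorial's actual script. The names parse_args and planned_outputs are hypothetical, and the suffixes are copied from the listing above.

```python
import argparse
from pathlib import Path

# Suffixes copied from the output listing in this tutorial.
ARTIFACT_SUFFIXES = (
    "_corr_matrix.png",
    "_cv_reg_output.png",
    "_model.sav",
    "_pairplot.png",
)


def parse_args(argv=None):
    """Parse the three flags passed in the pipeline's transform cmd."""
    parser = argparse.ArgumentParser(description="Train a regression model (sketch).")
    parser.add_argument("--input", required=True, help="Directory of input CSV files")
    parser.add_argument("--target-col", required=True, help="Name of the target column")
    parser.add_argument("--output", required=True, help="Directory for output artifacts")
    return parser.parse_args(argv)


def planned_outputs(input_csv, output_dir):
    """List the artifact paths that would be written for one input CSV."""
    stem = Path(input_csv).stem  # "housing-simplified.csv" -> "housing-simplified"
    return [Path(output_dir) / f"{stem}{suffix}" for suffix in ARTIFACT_SUFFIXES]
```

With the arguments from regression.json, planned_outputs("/pfs/housing_data/housing-simplified.csv", "/pfs/out/") yields the four file names shown in the listing.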

When we inspect the learning curve, we can see that there is a large gap between the training score and the validation score. This typically indicates that our model could benefit from the addition of more data.
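To make "a large gap" concrete, the sketch below compares training and validation scores at each training-set size and checks the gap at the largest size. The function names, the 0.1 threshold, and the scores in the usage lines are made up for illustration; they are not results from this tutorial's model.

```python
def learning_curve_gaps(train_scores, validation_scores):
    """Return the per-size gap between training and validation scores."""
    if not train_scores or len(train_scores) != len(validation_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    return [t - v for t, v in zip(train_scores, validation_scores)]


def needs_more_data(train_scores, validation_scores, threshold=0.1):
    """Heuristic: flag the model if the final gap still exceeds the threshold.

    The 0.1 default is an arbitrary illustrative cutoff, not a standard value.
    """
    gaps = learning_curve_gaps(train_scores, validation_scores)
    return gaps[-1] > threshold


# Illustrative numbers only (not measured on the housing data).
train = [0.99, 0.97, 0.95]
validation = [0.40, 0.55, 0.65]
print(needs_more_data(train, validation))  # True: the final gap is about 0.30
```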

Now let’s update our dataset with additional examples.

5. Update the Dataset #

  1. Download our second example dataset, housing-simplified-2.csv.

  2. Add the data to your repo.

    pachctl put file housing_data@master:housing-simplified.csv -f /path/to/housing-simplified-2.csv

We could also append new examples to the existing file, but in this tutorial we’re overwriting our previous file with one that contains more data.
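If you prefer to append, one option is to merge the two CSV files locally and upload the result to the same path. The sketch below is a standard-library illustration; append_csv is a hypothetical helper, and it assumes both files share the same header row.

```python
import csv


def append_csv(base_path, extra_path, merged_path):
    """Write the rows of base_path followed by the rows of extra_path.

    Assumes both files have identical header rows; raises ValueError otherwise.
    Returns the number of data rows written (header excluded).
    """
    with open(base_path, newline="") as base, open(extra_path, newline="") as extra:
        base_rows = list(csv.reader(base))
        extra_rows = list(csv.reader(extra))
    if not base_rows or not extra_rows:
        raise ValueError("both files must contain a header row")
    if base_rows[0] != extra_rows[0]:
        raise ValueError("header rows differ; refusing to merge")
    data_rows = base_rows[1:] + extra_rows[1:]
    with open(merged_path, "w", newline="") as merged:
        writer = csv.writer(merged)
        writer.writerow(base_rows[0])  # keep a single header
        writer.writerows(data_rows)
    return len(data_rows)
```

The merged file can then be uploaded with the same pachctl put file command used above, pointing -f at the merged path.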

This is where Pachyderm truly starts to shine. The new commit of data to the housing_data repository automatically kicks off a job on the regression pipeline without us having to do anything.

When the job is complete, we can download the new files and see from the new learning curve that our model has improved.

6. Inspect the Pipeline Lineage #

Since the pipeline versions all of our input and output data automatically, we can continue to iterate on our data and code while Pachyderm tracks all of our experiments.

For any given output commit, Pachyderm can tell us exactly which input commit produced it. In this tutorial, we have run only two experiments so far, but this lineage becomes increasingly valuable as your experiments evolve and scale.

  1. Inspect the commits to your repo.

    pachctl list commit
    # ID                               SUBCOMMITS PROGRESS CREATED            MODIFIED
    # 3037785cc56c4387bbb897f1887b4a68 4          ▇▇▇▇▇▇▇▇ 11 seconds ago     11 seconds ago     
    # e7dd14d201a64edc8bf61beed6085ae0 4          ▇▇▇▇▇▇▇▇ About a minute ago About a minute ago 
    # df117068124643299d46530859851a4b 4          ▇▇▇▇▇▇▇▇ 2 minutes ago      2 minutes ago      
  2. Use the commit ID to check which dataset was used to create the model.

    pachctl list file housing_data@3037785cc56c4387bbb897f1887b4a68
    # NAME                    TYPE SIZE     
    # /housing-simplified.csv file 12.14KiB 
  3. Use the commit ID to check the commit’s details, such as its parent commit, branch, and size.

    pachctl inspect commit housing_data@3037785cc56c4387bbb897f1887b4a68
    # Commit: housing_data@3037785cc56c4387bbb897f1887b4a68
    # Original Branch: master
    # Parent: e7dd14d201a64edc8bf61beed6085ae0
    # Started: 2 minutes ago
    # Finished: 2 minutes ago
    # Size: 12.14KiB
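When tracking many experiments, it can help to capture this provenance in a structured form. The sketch below parses the text output shown above into a dictionary. parse_inspect_output is a hypothetical helper that assumes the "Key: value" layout of the sample, which may differ across pachctl versions.

```python
def parse_inspect_output(text):
    """Parse 'Key: value' lines (optionally prefixed with '# ') into a dict."""
    record = {}
    for raw_line in text.splitlines():
        line = raw_line.strip().lstrip("#").strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not a key/value pair
        key, value = line.split(":", 1)  # split on the first colon only
        record[key.strip()] = value.strip()
    return record


# Sample copied from the output above.
sample = """\
Commit: housing_data@3037785cc56c4387bbb897f1887b4a68
Original Branch: master
Parent: e7dd14d201a64edc8bf61beed6085ae0
Started: 2 minutes ago
Finished: 2 minutes ago
Size: 12.14KiB
"""

info = parse_inspect_output(sample)
print(info["Parent"])  # e7dd14d201a64edc8bf61beed6085ae0
```

The Parent value can be fed back into pachctl inspect commit to walk the history one commit at a time.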

User Code Assets #

The Docker image used in this tutorial was built with the following assets: