You can use Pachyderm to build an automated machine learning pipeline that trains a model on a CSV file.

Before You Start

  • You must have Pachyderm installed and running on your cluster
  • You should have already completed the Standard ML Pipeline tutorial
  • You must be familiar with Jsonnet
  • This tutorial assumes your active context is localhost:80

Tutorial

The user code for this tutorial runs in a Docker image built on the python:3.7-slim-buster base image. It uses the mljar-supervised package to automate feature engineering, model selection, and hyperparameter tuning, which makes it straightforward to train machine learning models on structured data.

1. Create a Project & Input Repo

  1. Create a project named automl-tutorial.
    pachctl create project automl-tutorial
  2. Set the project as current.
    pachctl config update context --project automl-tutorial
  3. Create a new csv_data repo.
    pachctl create repo csv_data
  4. Upload the housing-simplified-1.csv file to the repo.
    pachctl put file csv_data@master:housing-simplified.csv -f /path/to/housing-simplified-1.csv

To create the project and repo in Console instead:

  1. Navigate to Console.
  2. Select Create Project.
  3. Provide a project Name and Description.
    • Name: automl-tutorial
    • Description: My second project tutorial.
  4. Select Create.
  5. Scroll to the project’s row and select View Project.
  6. Select Create Your First Repo.
  7. Provide a repo Name and Description.
    • Name: csv_data
    • Description: Repo for initial housing data
  8. Select Create.

2. Create a Jsonnet Pipeline

  1. Download or save our automl.jsonnet template.

    ////
    // Template arguments:
    //
    // name : The name of this pipeline, for disambiguation when 
    //          multiple instances are created.
    // input : the repo from which this pipeline will read the csv file to which
    //       it applies automl.
    // target_col : the column of the csv to be used as the target
    // args : additional parameters to pass to the automl regressor (e.g. "--random_state 42")
    ////
    function(name='regression', input, target_col, args='')
    {
      pipeline: { name: name},
      input: {
        pfs: {
          glob: "/",
          repo: input
        }
      },
      transform: {
        cmd: [ "python","/workdir/automl.py","--input","/pfs/"+input+"/", "--target-col", target_col, "--output","/pfs/out/"] + (if args == '' then [] else std.split(args, ' ')),
        image: "jimmywhitaker/automl:dev0.02"
      }
    }
  2. Create the AutoML pipeline by referencing and filling out the template’s arguments:

    pachctl update pipeline --jsonnet /path/to/automl.jsonnet  \
        --arg name="regression" \
        --arg input="csv_data" \
        --arg target_col="MEDV" \
        --arg args="--mode Explain --random_state 42"

This step must be done through the CLI because the pipeline is defined with a Jsonnet template.

Training starts automatically as soon as the pipeline is created. Once it completes, the trained model and evaluation metrics are written to the pipeline's output repo, regression.
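
To see what command the template produces, the following Python sketch mirrors the cmd expression from automl.jsonnet and feeds the result to a parser with the same options that automl.py declares. The function name build_cmd and the guard that skips an empty args string are assumptions made for this illustration; neither is part of the template or of Pachyderm.

```python
import argparse


def build_cmd(input_repo, target_col, args=""):
    """Mirror the template's cmd list (hypothetical helper, not part of Pachyderm)."""
    base = [
        "python", "/workdir/automl.py",
        "--input", "/pfs/" + input_repo + "/",
        "--target-col", target_col,
        "--output", "/pfs/out/",
    ]
    # Assumption: skip splitting when args is empty, so no empty token is passed.
    extra = args.split(" ") if args else []
    return base + extra


# Parser with the same options that automl.py declares.
parser = argparse.ArgumentParser(description="Structured data regression")
parser.add_argument("--input", type=str)
parser.add_argument("--target-col", type=str)
parser.add_argument("--mode", type=str, default="Explain")
parser.add_argument("--random_state", type=int, default=42)
parser.add_argument("--output", metavar="DIR", default="./output")

cmd = build_cmd("csv_data", "MEDV", "--mode Explain --random_state 42")
parsed = parser.parse_args(cmd[2:])  # drop "python" and the script path
print(parsed.input, parsed.target_col, parsed.mode, parsed.random_state)
```

Because the extra arguments are split on single spaces, a value that itself contains a space cannot be passed through the args template argument.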

3. Upload the Dataset

Update the dataset using housing-simplified-2.csv; Pachyderm retrains the model automatically.

pachctl put file csv_data@master:housing-simplified.csv -f /path/to/housing-simplified-2.csv

To upload the file through Console instead:

  1. Download the dataset, housing-simplified-2.csv.
  2. Select the csv_data repo > Upload Files.
  3. Select Browse Files.
  4. Choose the housing-simplified-2.csv file.
  5. Select Upload.

Repeat the previous step as many times as you want. Each time, Pachyderm automatically retrains the model and outputs the new model and evaluation metrics to the AutoML output repo.
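
Because every new commit to the input repo triggers another training run, it can be worth checking a file locally before uploading it. The sketch below uses only the Python standard library to confirm that a CSV has a header row containing the target column and at least one data row. The function name check_csv is hypothetical; MEDV is the target column passed to the pipeline in step 2.

```python
import csv


def check_csv(path, target_col):
    """Return the number of data rows, or raise ValueError if the file is unusable.

    Hypothetical pre-upload check; not part of Pachyderm or the tutorial image.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} is empty")
        if target_col not in header:
            raise ValueError(f"{path} has no column named {target_col!r}")
        # Count non-empty data rows after the header.
        rows = sum(1 for row in reader if row)
    if rows == 0:
        raise ValueError(f"{path} has a header but no data rows")
    return rows


# Example call: check_csv("/path/to/housing-simplified-2.csv", "MEDV")
```

A ValueError here means the pipeline's --target-col argument would not match the file, so the upload is better fixed before it reaches the repo.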


User Code Assets

The Docker image used in this tutorial was built with the following assets:

Dockerfile:

FROM python:3.7-slim-buster
RUN apt-get update && apt-get install -y build-essential python3-pip python3-dev
RUN pip3 -q install pip --upgrade

WORKDIR /workdir/

COPY requirements.txt /workdir/
RUN pip3 install -r requirements.txt

COPY *.py /workdir/

automl.py:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML

import argparse
import os

parser = argparse.ArgumentParser(description="Structured data regression")
parser.add_argument("--input",
                    type=str,
                    help="csv file with all examples")
parser.add_argument("--target-col",
                    type=str,
                    help="column with target values")
parser.add_argument("--mode",
                    type=str,
                    default='Explain',
                    help="mode")
parser.add_argument("--random_state",
                    type=int,
                    default=42,
                    help="random seed")
parser.add_argument("--output",
                    metavar="DIR",
                    default='./output',
                    help="output directory")

def load_data(input_csv, target_col):
    # Load the data
    data = pd.read_csv(input_csv, header=0)
    targets = data[target_col]
    features = data.drop(target_col, axis = 1)
    
    # Create data splits
    X_train, X_test, y_train, y_test = train_test_split(
        features,
        targets,
        test_size=0.25,
        random_state=123,
    )
    return X_train, X_test, y_train, y_test


def main():
    args = parser.parse_args()
    if os.path.isfile(args.input):
        input_files = [args.input]
    else:  # Directory: collect every CSV file found beneath it
        input_files = []
        for dirpath, dirs, files in os.walk(args.input):
            input_files.extend(
                os.path.join(dirpath, filename)
                for filename in files
                if filename.endswith('.csv')
            )
    print("Datasets: {}".format(input_files))
    os.makedirs(args.output, exist_ok=True)

    for filename in input_files:

        experiment_name = os.path.basename(os.path.splitext(filename)[0])
        # Data loading and Exploration
        X_train, X_test, y_train, y_test = load_data(filename, args.target_col)
       
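        # Fit model; each dataset gets its own results directory
        automl = AutoML(
            mode=args.mode,
            random_state=args.random_state,
            total_time_limit=60*60,  # 1 hour
            results_path=os.path.join(args.output, experiment_name),
        )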
        automl.fit(X_train, y_train)
        
        # compute the MSE on test data
        predictions = automl.predict_all(X_test)
        print("Test MSE:", mean_squared_error(y_test, predictions))


if __name__ == "__main__":
    main()

requirements.txt:

mljar-supervised
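
The automl.py script prints the mean squared error (MSE) on the held-out 25% of rows. As a reminder of what that number measures, here is a minimal pure-Python version of the metric. The function name mse and the sample values are illustrative only; they are not results from this tutorial.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only: errors of 1 and 2 give (1 + 4) / 2.
print(mse([3.0, 5.0], [2.0, 7.0]))  # 2.5
```

MSE is expressed in squared units of the target column, so a lower value after retraining on a new dataset version indicates a closer fit on that version's test split.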