In this tutorial, we’ll build a scalable inference pipeline for breast cancer detection using data parallelism.

Before You Start

This tutorial assumes you have a deployed, running MLDM (Pachyderm) cluster and an active pachctl connection to it.

Tutorial

The user code for this tutorial is packaged in a Docker image built on top of the pytorch/pytorch base image, which includes the necessary dependencies. The underlying code and pre-trained breast cancer detection model come from this repo, developed by the Center for Data Science and Department of Radiology at NYU. Their original paper can be found here.

1. Create a Project & Input Repos
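
The tutorial’s two input repos, sample_data and models, are populated in step 3. Assuming a project name of breast-cancer-detection (an illustrative name, not necessarily the tutorial’s), the pachctl commands look roughly like this:

    # Create a project and make it the active context (project name is an assumption)
    pachctl create project breast-cancer-detection
    pachctl config update context --project breast-cancer-detection

    # Create the two input repos the pipeline will read from
    pachctl create repo sample_data
    pachctl create repo models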

2. Create a Classification Pipeline

We first need to build a pipeline that classifies the breast cancer images, using a cross input to combine the sample data and the models. A sketch of what this pipeline spec might look like follows.
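
The exact spec ships with the tutorial’s assets; as a rough sketch (the pipeline name, image, and command below are assumptions, not the tutorial’s actual values), a cross input over the two repos could look like this:

    {
      "pipeline": {
        "name": "classify"
      },
      "input": {
        "cross": [
          {
            "pfs": {
              "repo": "sample_data",
              "glob": "/*"
            }
          },
          {
            "pfs": {
              "repo": "models",
              "glob": "/"
            }
          }
        ]
      },
      "transform": {
        "image": "<your-registry>/breast-cancer-classifier:latest",
        "cmd": ["python", "classify.py"]
      }
    }

Saved as classify.json, this would be created with pachctl create pipeline -f classify.json. The /* glob on sample_data makes each exam directory its own datum, while the / glob on models presents the entire models repo to every datum (see the tip below).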

Tip

Datum Shape

When you define a glob pattern in your pipeline, you are defining how Pachyderm should split the data into datums, so that your code can process them in parallel without any changes to the underlying implementation.

In this case, we are treating each exam (4 images and a list file) as a single datum. Each datum is processed individually, so computation is parallelized across exams as they are added. The file structure of our sample_data repo is organized as follows:

sample_data/
├── <unique_exam_id>
│   ├── L_CC.png
│   ├── L_MLO.png
│   ├── R_CC.png
│   ├── R_MLO.png
│   └── gen_exam_list_before_cropping.pkl
├── <unique_exam_id>
│   ├── L_CC.png
│   ├── L_MLO.png
│   ├── R_CC.png
│   ├── R_MLO.png
│   └── gen_exam_list_before_cropping.pkl
...

The gen_exam_list_before_cropping.pkl file is a pickled version of the image list, which the underlying library requires.
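
For reference, a file like this can be inspected with Python’s standard pickle module (a minimal sketch; the structure of the list depends on the NYU library):

    import pickle

    # Load the pickled image list from one exam directory
    # (the relative path is an assumption based on the structure above)
    with open("gen_exam_list_before_cropping.pkl", "rb") as f:
        exam_list = pickle.load(f)

    print(type(exam_list))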

3. Upload Dataset

  1. Clone or download this GitHub repo.

    gh repo clone pachyderm/docs-content
  2. Navigate to this tutorial’s directory.

    cd content/products/mldm/latest/build-dags/tutorials/data-parallelism
  3. Upload the sample_data and models folders to your repos, as shown below.
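
With pachctl, the uploads might look like this (uploading to the master branch is an assumption):

    # Recursively upload each local folder into its matching repo
    pachctl put file sample_data@master -r -f sample_data
    pachctl put file models@master -r -f models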


User Code Assets

The Docker image used in this tutorial was built with the following assets: