1 Local Installation
💡

Looking to upgrade to pachd 2.3.0+ from an older version? Remember to also upgrade pachctl for the best experience.

This guide covers how you can quickly get started using Pachyderm locally on macOS®, Linux®, or Microsoft® Windows®. To install Pachyderm on Windows, first look at Deploy Pachyderm on Windows.

Pachyderm is a data-centric pipeline and data versioning application written in Go that runs on top of a Kubernetes cluster. A common way to interact with Pachyderm is by using Pachyderm’s command-line tool, pachctl, from a terminal window. To check the state of your deployment, you will also need to install kubectl, the Kubernetes command-line tool.

Additionally, we will show you how to deploy and access Pachyderm’s UIs, the JupyterLab Mount Extension and Console, on your local cluster.

Note that each web UI addresses different use cases:

  • JupyterLab Mount Extension allows you to experiment and explore your data, then build your pipelines’ code from your familiar Notebooks.
  • Console helps you visualize your DAGs (Directed Acyclic Graphs), monitor your pipeline executions, access your logs, and troubleshoot while your pipelines are running.
⚠️
  • A local installation is not designed to be a production environment. It is meant to help you learn and experiment quickly with Pachyderm.
  • A local installation is designed for a single-node cluster.
    This cluster uses local storage on disk and does not create
    Persistent Volumes (PVs). If you want to deploy a production multi-node
    cluster, follow the instructions for your cloud provider or on-prem
    installation as described in Deploy Pachyderm.
    New Kubernetes nodes cannot be added to this single-node cluster.

Pachyderm uses Helm for all deployments.

⚠️

We are now shipping Pachyderm with an optional embedded proxy allowing your cluster to expose one single port externally. This deployment setup is optional.

If you choose to deploy Pachyderm with a Proxy, check out our new recommended architecture and deployment instructions.

Prerequisites #

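For a successful local deployment of Pachyderm, you will need:

  • A local Kubernetes cluster (see Setup A Local Kubernetes Cluster)
  • pachctl (see Install pachctl)
  • Helm (see Install Helm)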

Setup A Local Kubernetes Cluster #

Pick the virtual machine of your choice.

Using Minikube #

On your local machine, you can run Pachyderm in a minikube virtual machine.
Minikube is a tool that creates a single-node Kubernetes cluster. This limited
installation is sufficient to try basic Pachyderm functionality and complete
the Beginner Tutorial.

To configure Minikube, follow these steps:

  1. Install minikube and VirtualBox in your operating system as described in
    the Kubernetes documentation.

  2. Start minikube:

    minikube start  

    Linux users, add this --driver flag:

    minikube start --driver=kvm2
ℹ️

Any time you want to stop and restart Pachyderm, run minikube delete and minikube start. Minikube is not meant to be a production environment and does not handle being restarted well without a full wipe.

Using Kubernetes on Docker Desktop #

You can use Kubernetes on Docker Desktop instead of Minikube on macOS or Linux by following these steps:

  1. In the Docker Desktop Preferences, enable Kubernetes:

    Docker Desktop Enable K8s

  2. From the command prompt, confirm that Kubernetes is running:

    kubectl get all  
    NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE  
    service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   5d  

    To reset your Kubernetes cluster that runs on Docker Desktop, click the Reset Kubernetes cluster button. See image above.

Using Kind #

  1. Install Kind according to its documentation, then create a cluster (for example, with kind create cluster).

  2. From the command prompt, confirm that Kubernetes is running:

    kubectl get all  
    NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE  
    service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   5d  

Install pachctl #

ℹ️

pachctl is a command-line tool that you can use to interact with a Pachyderm cluster in your terminal.

⚠️

Pachyderm now offers universal Multi-Arch docker images that can serve both ARM and AMD users.

  • Brew users: The download of the package matching your architecture is automatic; there is nothing specific to do.
  • Users of Debian-based and other Linux flavors not relying on Homebrew:

Run uname -m to identify your architecture, then choose the command in the AMD section below if the output is x86_64, or ARM if it is aarch64.

  1. Run the corresponding steps for your operating system:

    • For macOS or Brew users, run:

      brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@2.3 
    • For a Debian-based Linux 64-bit or Windows 10 or later running on WSL (Choose the command matching your architecture):

      • AMD Architectures (amd64):

        curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v2.3.9/pachctl_2.3.9_amd64.deb && sudo dpkg -i /tmp/pachctl.deb  
      • ARM Architectures (arm64):

        curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v2.3.9/pachctl_2.3.9_arm64.deb && sudo dpkg -i /tmp/pachctl.deb  
    • For all other Linux flavors (Choose the command matching your architecture):

      • AMD Architectures (amd64):

        curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v2.3.9/pachctl_2.3.9_linux_amd64.tar.gz && tar -xvf /tmp/pachctl.tar.gz -C /tmp && sudo cp /tmp/pachctl_2.3.9_linux_amd64/pachctl /usr/local/bin 
      • ARM Architectures (arm64):

        curl -o /tmp/pachctl.tar.gz -L https://github.com/pachyderm/pachyderm/releases/download/v2.3.9/pachctl_2.3.9_linux_arm64.tar.gz && tar -xvf /tmp/pachctl.tar.gz -C /tmp && sudo cp /tmp/pachctl_2.3.9_linux_arm64/pachctl /usr/local/bin 
  2. Verify that installation was successful by running pachctl version --client-only:

    pachctl version --client-only  

    System Response:

    COMPONENT           VERSION  
    pachctl             2.3.9  

    If you run pachctl version without the flag --client-only, the command times
    out. This is expected behavior because Pachyderm has not been deployed yet (pachd is not yet running).
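The architecture check described in this section can also be scripted. The following is a minimal sketch, not part of Pachyderm's tooling: the function name `pachctl_deb_url` and the alias table are assumptions made for illustration, while the URL pattern is copied from the Debian commands above.

```python
import platform
from typing import Optional

# Map `uname -m` style machine names to the architecture suffix used in the
# release file names shown above. Accepting "amd64" and "arm64" directly is
# an assumption added for convenience.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def pachctl_deb_url(version: str = "2.3.9", machine: Optional[str] = None) -> str:
    """Return the .deb download URL matching the given machine type."""
    machine = (machine or platform.machine()).lower()
    if machine not in ARCH_ALIASES:
        raise ValueError(f"unsupported architecture: {machine}")
    arch = ARCH_ALIASES[machine]
    return (
        "https://github.com/pachyderm/pachyderm/releases/download/"
        f"v{version}/pachctl_{version}_{arch}.deb"
    )


if __name__ == "__main__":
    # Uses the architecture of the machine running the script.
    print(pachctl_deb_url())
```

Passing `machine` explicitly is useful when you prepare the command for a different host than the one running the script.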

💡

If you are new to Pachyderm, try Pachyderm Shell. This add-on tool suggests pachctl commands as you type. It will help you learn Pachyderm’s main commands faster.

ℹ️

A look at Pachyderm’s high-level architecture diagram will help you build a mental image of Pachyderm’s various architectural components.

For information, you can also check what a production setup looks like in this infrastructure diagram.

Install Helm #

Follow Helm’s installation guide.

Deploy Pachyderm #

When done with the Prerequisites, deploy Pachyderm on your local cluster by following these steps. Your default installation comes with Console (Pachyderm’s Web UI).

Additionally, for JupyterLab users, you can install Pachyderm JupyterLab Mount Extension on your local Pachyderm cluster to experience Pachyderm from your familiar notebooks.

Note that you can run both Console and JupyterLab on your local installation.

  • Get the Repo Info:

    helm repo add pach https://helm.pachyderm.com  
    helm repo update 
  • Install Pachyderm:

⚠️

To request a FREE trial enterprise license key, click here.

Pachyderm Community Edition (Includes Console) #

This command will install Pachyderm’s latest available GA version with Console Community Edition.

helm install --wait --timeout 10m pachd pach/pachyderm --set deployTarget=LOCAL  

Add --set console.enabled=false to the command above to install without Console.

Enterprise #

This command will unlock your enterprise features and install Console Enterprise. Note that Console Enterprise requires authentication. By default, we create a default mock user (username:admin, password: password) to authenticate to Console without having to connect your Identity Provider.

  • Create a license.txt file in which you paste your Enterprise Key.
  • Then, run the following helm command to install Pachyderm’s latest Enterprise Edition:
helm install --wait --timeout 10m pachd pach/pachyderm --set deployTarget=LOCAL  --set pachd.enterpriseLicenseKey=$(cat license.txt) --set console.enabled=true  
ℹ️

This installation can take several minutes. Run a quick helm list --all in a separate tab to witness the installation happening in the background.
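If you script your installs, the helm install variants above can be assembled from one helper. This is a minimal sketch under stated assumptions: `helm_install_args` is a hypothetical name, and the only flags used are the ones shown in this section.

```python
import shlex
from typing import List, Optional


def helm_install_args(console: bool = True, license_key: Optional[str] = None) -> List[str]:
    """Build the argument list for the local helm install shown above."""
    args = [
        "helm", "install", "--wait", "--timeout", "10m",
        "pachd", "pach/pachyderm",
        "--set", "deployTarget=LOCAL",
    ]
    if license_key is not None:
        # Enterprise: pass the key, for example the content of license.txt.
        args += ["--set", f"pachd.enterpriseLicenseKey={license_key.strip()}"]
        if console:
            args += ["--set", "console.enabled=true"]
    if not console:
        args += ["--set", "console.enabled=false"]
    return args


if __name__ == "__main__":
    # Print the command instead of running it. To execute it, pass the
    # list to subprocess.run(args, check=True).
    print(shlex.join(helm_install_args(console=False)))
```

Returning a list rather than a string avoids shell quoting problems when the license key contains special characters.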

💡

To uninstall Pachyderm fully:

Running helm uninstall pachd leaves persistent volume claims behind. To wipe your instance clean, run:

helm uninstall pachd 
kubectl delete pvc -l suite=pachyderm 

Check Your Install #

Check the status of the Pachyderm pods by periodically running kubectl get pods. When Pachyderm is ready for use, all Pachyderm pods must be in the Running status.

Because Pachyderm needs to pull the Pachyderm Docker images from DockerHub, it might take a few minutes for the Pachyderm pods status to change to Running.

kubectl get pods

System Response: At a very minimum, you should see the following pods (console depends on your choice above):

NAME                                           READY   STATUS      RESTARTS   AGE
pod/console-5b67678df6-s4d8c                   1/1     Running     0          2m8s
pod/etcd-0                                     1/1     Running     0          2m8s
pod/pachd-c5848b5c7-zwb8p                      1/1     Running     0          2m8s
pod/pg-bouncer-7b855cb797-jqqpx                1/1     Running     0          2m8s
pod/postgres-0                                 1/1     Running     0          2m8s

If you see a few restarts on the pachd pods, it means that Kubernetes tried to bring up those pods before etcd or postgres were ready and therefore restarted them. Re-run kubectl get pods until all pods are in the Running status.
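Rather than reading the table by eye, you can check the pod statuses programmatically. The sketch below only parses text in the format shown above; `pods_not_running` is a hypothetical helper, and the sample with a `ContainerCreating` pod is made up for illustration.

```python
from typing import List, Tuple


def pods_not_running(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (name, status) for every pod whose STATUS is not Running."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    pending = []
    for line in lines[1:]:
        fields = line.split()
        if fields[status_col] != "Running":
            pending.append((fields[name_col], fields[status_col]))
    return pending


# Made-up sample: one pod is still starting.
SAMPLE = """\
NAME                          READY   STATUS              RESTARTS   AGE
pod/etcd-0                    1/1     Running             0          2m8s
pod/pachd-c5848b5c7-zwb8p     0/1     ContainerCreating   0          2m8s
"""

print(pods_not_running(SAMPLE))
```

To use it against a live cluster, feed it the standard output of `kubectl get pods`, for example captured with `subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True).stdout`.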

Connect pachctl to Your Cluster #

Assuming your pachd is running as shown above, the easiest way to connect pachctl to your local cluster is to use the port-forward command.

  • To connect to your new Pachyderm instance, run:

    pachctl config import-kube local --overwrite
    pachctl config set active-context local
  • Then:

    pachctl port-forward

Verify that pachctl and your cluster are connected. #

pachctl version  

System Response:

COMPONENT           VERSION  
pachctl             2.3.9  
pachd               2.3.9  
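Because pachctl and pachd are best kept on the same version, it can help to compare the two rows automatically. A minimal sketch, assuming the two-column COMPONENT/VERSION layout shown above; `parse_versions` and `versions_match` are hypothetical helper names.

```python
from typing import Dict


def parse_versions(output: str) -> Dict[str, str]:
    """Parse the COMPONENT/VERSION table printed by `pachctl version`."""
    versions = {}
    for line in output.splitlines():
        fields = line.split()
        # Keep two-column data rows and skip the header row.
        if len(fields) == 2 and fields[0] != "COMPONENT":
            versions[fields[0]] = fields[1]
    return versions


def versions_match(output: str) -> bool:
    """True when pachctl and pachd are both listed with the same version."""
    versions = parse_versions(output)
    return "pachctl" in versions and versions.get("pachctl") == versions.get("pachd")
```

If pachd is missing from the output (for example, when only the client row is printed), `versions_match` returns False.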

You are all set!

If You Have Deployed Pachyderm Community Edition #

You are ready! To connect to your Console (Pachyderm UI), point your browser to localhost:4000.

If You Have Deployed Pachyderm Enterprise #

  • To connect to your Console (Pachyderm UI), point your browser to localhost:4000 and authenticate using the mock User (username: admin, password: password).

  • Alternatively, you can connect to your Console (Pachyderm UI) directly by pointing your browser to port 4000 on your minikube IP (run minikube ip to retrieve minikube’s external IP) or Docker Desktop IP, http://<dockerDesktopIPaddress-or-minikube>:4000/, then authenticate using the mock User (username: admin, password: password).

  • To use pachctl, you need to run pachctl auth login then authenticate again (to Pachyderm this time) with the mock User (username: admin, password: password).

NOTEBOOKS USERS: Install Pachyderm JupyterLab Mount Extension #

ℹ️

You do not need a local Pachyderm cluster already running to install Pachyderm JupyterLab Mount Extension. However, you need a running cluster to connect your Mount Extension to; therefore, we recommend that you install Pachyderm locally first.

  • To install JupyterHub and the Mount Extension on your local cluster, run the following commands. You will be using our default jupyterhub-ext-values.yaml:

    helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
    helm repo update
    helm upgrade --cleanup-on-fail \
    --install jupyter jupyterhub/jupyterhub \
    --values https://raw.githubusercontent.com/pachyderm/pachyderm/2.3.9/etc/helm/examples/jupyterhub-ext-values.yaml
  • Check the state of your pods with kubectl get all. Look for the pods hub-xx and proxy-xx; their state should be Running. Run the command a couple of times if necessary, as the image takes some time to pull. See the example below:

    pod/hub-6fb9bb5847-ndfwc                       1/1     Running     0             22h
    pod/proxy-57db95fd89-l5pd5                     1/1     Running     0             22h
  • Once your pods are up, in your terminal, run:

    kubectl port-forward svc/proxy-public 8888:80

    Then

    kubectl get services | grep -w "pachd " | awk '{print $3}'

    Note the returned IP address. You will need this cluster IP in a later step.

  • Point your browser to http://localhost:8888, and authenticate using any mock User (for example, username: admin, password: password).

  • Now that you are in, click on Pachyderm’s Mount Extension icon on the left of your JupyterLab to connect your JupyterLab to your Pachyderm cluster. Enter grpc://<your-pachd-cluster-ip-from-the-previous-step>:30650 to log in.

  • If Pachyderm was deployed with Enterprise, you will be prompted to log in again. Use the same mock User (username: admin, password: password).

  • Verify that your JupyterLab Extension is connected to your cluster. From the cell of a notebook, run:

    !pachctl version
    COMPONENT           VERSION  
    pachctl             2.3.9  
    pachd               2.3.9  
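The grep and awk lookup in the steps above selects the third column (CLUSTER-IP) of the pachd service row. A rough Python equivalent is sketched below; `pachd_grpc_address` is a hypothetical helper, and every value in the sample table is made up for illustration.

```python
from typing import Optional

PACHD_GRPC_PORT = 30650  # port used in the grpc:// address above


def pachd_grpc_address(services_output: str) -> Optional[str]:
    """Build the grpc:// address from `kubectl get services` output."""
    for line in services_output.splitlines():
        fields = line.split()
        # Column 1 is NAME, column 3 is CLUSTER-IP (what awk's $3 selects).
        if len(fields) >= 3 and fields[0] == "pachd":
            return f"grpc://{fields[2]}:{PACHD_GRPC_PORT}"
    return None


# Made-up sample output; your IPs, types, and ports will differ.
SAMPLE_SERVICES = """\
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP   5d
pachd        ClusterIP   10.106.29.244   <none>        <ports>   2m
"""

print(pachd_grpc_address(SAMPLE_SERVICES))
```

Matching on the exact service name mirrors the trailing space in `grep -w "pachd "`, which keeps other services whose names merely start with pachd out of the result.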
⚠️

Try our Notebook examples!

Make sure to check our data science notebook examples running on Pachyderm, from a market sentiment NLP implementation using a FinBERT model to pipelines training a regression model on the Boston Housing Dataset.

Next Steps #

Complete the Beginner Tutorial to learn the basics of Pachyderm, such as adding data to a repository and building analysis pipelines.

2 Beginner Tutorial

Welcome to the beginner tutorial for Pachyderm! This tutorial should take about 15 minutes to complete and introduces you to Pachyderm’s fundamental concepts.

Prerequisites #

This guide assumes that you have Pachyderm running.

  • For an easy and quick start, install Pachyderm on your local machine as described in our Local Installation page and start experimenting.

  • Or check out our Quick Install page to deploy on your favorite cloud.

💡

If you are new to Pachyderm, try Pachyderm Shell. This handy tool suggests pachctl commands as you type and helps you learn Pachyderm faster.

For this tutorial, you will use pachctl to interact with your Pachyderm cluster from your terminal window and Console (Pachyderm Web UI) to interactively visualize and explore your pipelines, your data, debug jobs, read logs, etc…

If you deployed Pachyderm locally using the default local installation instructions, you have also deployed Pachyderm Web UI. Point your browser to localhost:4000 to connect to Console. You should land on this page:

Console Landing Page

Click on your View Project (We are working on allowing you to organize your pipelines by Projects) to get started. You are all set to have a follow-along visual experience of the coming steps.

Image processing with OpenCV #

This tutorial walks you through the deployment of a Pachyderm pipeline that performs edge detection on a few images. Thanks to Pachyderm’s built-in processing primitives, we can keep our code simple but still run the pipeline in a distributed, streaming fashion. Moreover, as new data is added, the pipeline automatically processes it and outputs the results.

If you hit any errors not covered in this guide, get help in our public community Slack, submit an issue on GitHub, or email us at support@pachyderm.io. We are here to help!

Create a Repo #

A repo is the highest-level data primitive in Pachyderm. Like many things in Pachyderm, it shares its name with a primitive in Git and is designed to behave analogously. Generally, repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data for an ML model. Repos are easy to create and do not take much space when empty, so do not worry about making tons of them.

📖

More about the concepts of Repository and Branch in Pachyderm.

For this demo, we create a repo called images to hold the data we want to process:

pachctl create repo images

Verify that the repository was created:

pachctl list repo

System response:

NAME   CREATED       SIZE (MASTER) ACCESS LEVEL
images 4 seconds ago ≤ 0B          [repoOwner]

This output shows that the repo has been successfully created. Because we have not added anything to it yet, the size of the repository HEAD commit on the master branch is 0B.

Check your Console and notice the creation of your repository.

ℹ️

Note the “plus” icon in your images repository. It indicates that this repository is an input repository instead of an output repository where the product of your pipeline transformation will be committed. Users can write data to input repositories only.

Console images repo

Adding Data to Pachyderm #

Now that we have created a repo, it is time to add some data. In Pachyderm, you write data to an explicit commit. Commits are immutable snapshots of your data which give Pachyderm its version control properties. You can add, remove, or update files in a given commit.

📖

More about the concept of Commit in Pachyderm.

Let’s start by adding a file, in this case an image, to a new commit. We have provided some sample images for you that we host on Imgur.

Use the pachctl put file command along with the -f flag. The -f flag can take either a local file, a URL, or an object storage bucket which it scrapes automatically. In this case, we simply pass the URL.

Unlike Git, commits in Pachyderm must be explicitly started and finished as they can contain huge amounts of data and we do not want that much dirty data hanging around in an unpersisted state. pachctl put file automatically starts and finishes a commit for you so you can add files more easily. If you want to add many files over a period of time, you can do pachctl start commit and pachctl finish commit yourself.
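When you add many files over time, the explicit start and finish calls bracket the put file calls. The sketch below only builds the command lines so you can inspect them before running anything; `explicit_commit_commands` is a hypothetical helper, and any file names you pass in are your own.

```python
from typing import Dict, List


def explicit_commit_commands(repo: str, branch: str, files: Dict[str, str]) -> List[List[str]]:
    """Commands that add several files inside one explicitly managed commit.

    `files` maps a destination path in the repo to a source (local path or URL).
    """
    target = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", target]]
    for dest, source in files.items():
        commands.append(["pachctl", "put", "file", f"{target}:{dest}", "-f", source])
    commands.append(["pachctl", "finish", "commit", target])
    return commands


if __name__ == "__main__":
    for command in explicit_commit_commands(
        "images", "master", {"liberty.png": "http://imgur.com/46Q8nDz.png"}
    ):
        print(" ".join(command))
```

To execute the commands, pass each list to `subprocess.run(command, check=True)`; stopping at the first failure leaves the commit open so that you can inspect it and finish it yourself.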

To commit the file liberty.png to the master branch of the images repo, run:

pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png

To make sure that the data we just added is in Pachyderm:

  • Use the pachctl list repo command to check that data has been added:

    pachctl list repo

    System response:

    NAME   CREATED       SIZE (MASTER) ACCESS LEVEL
    images 2 minutes ago ≤ 57.27KiB    [repoOwner]
  • View the commit that was just created:

    pachctl list commit images

    System response:

    REPO   BRANCH COMMIT                           FINISHED       SIZE     ORIGIN DESCRIPTION
    images master 89a5ab3a23c949949f763943dd7a8aac 55 seconds ago 57.27KiB USER
  • View the file in that commit:

    pachctl list file images@master

    System response:

    NAME         TYPE SIZE
    /liberty.png file 57.27KiB

In your Console, click on the images repo to visualize its commit and inspect its file:

Console images liberty

Alternatively, you can view the file by retrieving it from Pachyderm. Because this is an image, you cannot just print it out in the terminal, but the following command will let you view it:

  • On macOS, run:
pachctl get file images@master:liberty.png | open -f -a Preview.app
  • On Linux 64-bit, run:
pachctl get file images@master:liberty.png | display

Create a Pipeline #

Now that you have some data in your repo, it is time to do something with it. Pipelines are the core processing primitive in Pachyderm. Pipelines are defined with a simple JSON file called a pipeline specification or pipeline spec for short. For this example, we already created the pipeline spec for you.

When you want to create your own pipeline specification later, you can refer to the full Pipeline Specification to use more advanced options. Options include building your own code into a container. In this tutorial, your pipeline will use a pre-built Docker image.

📖

More about the concept of Pipeline in Pachyderm.

For now, we are going to create a single pipeline that takes in images and does some simple edge detection.


Below is the edges.json pipeline spec. Let’s walk through the details.

{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv:1.0"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}

The pipeline spec contains a few simple sections. The pipeline section contains a name, which is how you will identify your pipeline. Your pipeline will also automatically create an output repo with the same name. The transform section allows you to specify the docker image you want to use. In this case, pachyderm/opencv:1.0 is the docker image (defaults to DockerHub as the registry), and the entry point is edges.py. The input section specifies repos visible to the running pipeline, and how to process the data from the repos. Commits to these repos will automatically trigger the pipeline to create new jobs to process them. In this case, images is the repo, and /* is the glob pattern.
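If you prefer to generate specs rather than write them by hand, the sections described above map directly onto a dictionary. This is a minimal sketch: `pipeline_spec` is a hypothetical helper that covers only the fields used in edges.json, not the full Pipeline Specification.

```python
import json
from typing import Any, Dict, List


def pipeline_spec(
    name: str,
    image: str,
    cmd: List[str],
    repo: str,
    glob: str = "/*",
    description: str = "",
) -> Dict[str, Any]:
    """Build a single-input pipeline spec with the sections described above."""
    spec: Dict[str, Any] = {
        "pipeline": {"name": name},
        "transform": {"cmd": cmd, "image": image},
        "input": {"pfs": {"repo": repo, "glob": glob}},
    }
    if description:
        spec["description"] = description
    return spec


edges = pipeline_spec("edges", "pachyderm/opencv:1.0", ["python3", "/edges.py"], "images")
print(json.dumps(edges, indent=2))
```

You can write the printed JSON to a file and pass that file to pachctl create pipeline -f.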

The glob pattern defines how the input data will be transformed into datums if you want to distribute computation. /* means that each file can be processed individually, which makes sense for images. Glob patterns are one of the most powerful features in Pachyderm.

📖

More about the concept of Glob Pattern in Pachyderm and the fundamental notion of Datums.
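To build intuition for how a glob pattern splits a repo into datums, here is a toy model of the two patterns used in this tutorial. It is an illustration only, not Pachyderm's implementation, and `split_into_datums` is a hypothetical name.

```python
from typing import Dict, List


def split_into_datums(paths: List[str], glob: str) -> List[List[str]]:
    """Toy model of two glob patterns.

    "/"  -> the whole repo is a single datum.
    "/*" -> each top-level file or directory is its own datum.
    """
    if glob == "/":
        return [sorted(paths)]
    if glob == "/*":
        groups: Dict[str, List[str]] = {}
        for path in sorted(paths):
            top = path.lstrip("/").split("/")[0]
            groups.setdefault(top, []).append(path)
        return list(groups.values())
    raise ValueError("this sketch only models '/' and '/*'")


files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
print(len(split_into_datums(files, "/*")), "datums with /*")
print(len(split_into_datums(files, "/")), "datum with /")
```

With /*, each image is an independent unit of work, so adding one image does not require reprocessing the others; with /, any change means the whole repo is processed again as one unit.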

The following extract is the Python code run in this pipeline:

import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
   img = cv2.imread(image)
   tail = os.path.split(image)[1]
   edges = cv2.Canny(img,100,200)
   plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
   for file in files:
       make_edges(os.path.join(dirpath, file))

The code simply walks over all the images in /pfs/images, performs edge detection, and writes the result to /pfs/out.

/pfs/images and /pfs/out are special local directories that Pachyderm creates within the container automatically. All the input data for a pipeline is stored in /pfs/<input_repo_name> and your code should always write out to /pfs/out (see the function make_edges(image) above). Pachyderm automatically gathers everything you write to /pfs/out, versions it as this pipeline output, and maps it to the appropriate output repo of your pipeline.
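Because the /pfs directories exist only inside the pipeline container, it can be handy to parameterize the input and output roots so that the same walk can be tried on your laptop. The sketch below is an assumption-laden illustration: `process_all` and `copy_as_png` are hypothetical helpers, and the copy step merely stands in for the OpenCV edge detection.

```python
import os
import shutil
import tempfile
from typing import Callable


def process_all(input_dir: str, output_dir: str, transform: Callable[[str, str], None]) -> int:
    """Walk input_dir and call transform(src, output_dir) for every file."""
    count = 0
    for dirpath, _dirs, files in os.walk(input_dir):
        for name in files:
            transform(os.path.join(dirpath, name), output_dir)
            count += 1
    return count


def copy_as_png(src: str, out_dir: str) -> None:
    """Stand-in for make_edges: copies the file under a .png name."""
    base = os.path.splitext(os.path.basename(src))[0]
    shutil.copyfile(src, os.path.join(out_dir, base + ".png"))


if __name__ == "__main__":
    # Temporary directories stand in for /pfs/images and /pfs/out.
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as out:
        with open(os.path.join(src, "liberty.png"), "wb") as handle:
            handle.write(b"placeholder bytes, not a real image")
        processed = process_all(src, out, copy_as_png)
        print(processed, os.listdir(out))
```

Inside a pipeline you would call `process_all("/pfs/images", "/pfs/out", ...)` with your real transform, which keeps the container code and the local test on the same path.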

Now, let’s create the pipeline in Pachyderm:

pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.3.x/examples/opencv/edges.json

Again, check the end result in your Console:

Console edges pipeline

What Happens When You Create a Pipeline #

Creating a pipeline tells Pachyderm to run your code on the data in your input repo (the HEAD commit) as well as all future commits that occur after the pipeline is created. Our repo already had a commit, so Pachyderm automatically launched a job to process that data.

📖

More about the concept of Job in Pachyderm.

The first time Pachyderm runs a pipeline job, it needs to download the Docker image (specified in the pipeline spec) from the specified Docker registry (DockerHub in this case). This first run might take a minute or so because of the image download, depending on your Internet connection. Subsequent runs will be much faster.

  • You can view the job with:

    pachctl list job

    System response:

    ID                               SUBJOBS PROGRESS CREATED       MODIFIED
    23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 5 seconds ago 5 seconds ago
  • You can check the state of your pipeline:

    pachctl list pipeline

    System response:

    NAME  VERSION INPUT     CREATED       STATE / LAST JOB  DESCRIPTION
    edges 1       images:/* 2 minutes ago running / success A pipeline that performs image edge detection by using the OpenCV library.

    Yay! Your pipeline succeeded! Pachyderm creates a corresponding output repo for every pipeline. This output repo will have the same name as the pipeline, and all the results of that pipeline will be versioned in this output repo. In our example, the edges pipeline created an output repo called edges to store the results written to /pfs/out.

  • List your repositories:

    pachctl list repo

    System response:

    NAME   CREATED        SIZE (MASTER) ACCESS LEVEL
    edges  10 minutes ago ≤ 22.22KiB    [repoOwner]  Output repo for pipeline edges.
    images 3 hours ago    ≤ 57.27KiB    [repoOwner]

Note that all of that information and more are available in your Console.

Reading the Output #

We can view the output data from the edges repo in the same fashion that we viewed the input data.

  • On macOS, run:
pachctl get file edges@master:liberty.png | open -f -a Preview.app
  • On Linux 64-bit, run:
pachctl get file edges@master:liberty.png | display

The output should look like this:

Console edges liberty

Processing More Data #

Pipelines will automatically process the data from new commits as they are created. In a way, pipelines “subscribe” to their input repo(s), ready to process any new incoming commits. Also similar to Git, commits have a parental structure that tracks which files have changed. In this case we are going to be adding more images.

Let’s create two new commits in a parental structure. To do this, we simply run two more put file commands. Because both specify master as the branch, Pachyderm automatically parents each new commit onto the previous one. Branch names are just references to a particular HEAD commit.

pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
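The parent relationship between these commits can be pictured with a small toy model. This is a conceptual sketch only: the `Commit` and `ToyRepo` classes and the `c1`, `c2`, `c3` identifiers are invented for illustration and do not reflect how Pachyderm stores commits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    id: str
    files: List[str]
    parent: Optional["Commit"] = None


class ToyRepo:
    """Toy model: a branch name maps to its current HEAD commit."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {}

    def put_file(self, branch: str, path: str) -> Commit:
        head = self.branches.get(branch)
        files = (head.files if head else []) + [path]
        commit = Commit(id=f"c{len(files)}", files=files, parent=head)
        self.branches[branch] = commit  # the branch now points at the new HEAD
        return commit

    def history(self, branch: str) -> List[str]:
        """Commit ids from HEAD back to the first commit."""
        ids = []
        commit = self.branches.get(branch)
        while commit is not None:
            ids.append(commit.id)
            commit = commit.parent
        return ids


repo = ToyRepo()
for name in ["liberty.png", "AT-AT.png", "kitten.png"]:
    repo.put_file("master", name)
print(repo.history("master"))
```

Each call moves the master reference to a new commit whose parent is the previous HEAD, which is the structure the two put file commands above create.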

Adding a new commit of data will automatically trigger the pipeline to run on the new data we have added. We will see corresponding jobs get started and commits to the output edges repo. Let’s also view our new outputs.

  • View the list of jobs that have started:

    pachctl list job

    System response:

    ID                               SUBJOBS PROGRESS CREATED        MODIFIED
    1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 31 seconds ago 31 seconds ago
    fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 47 seconds ago 47 seconds ago
    23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
  • View the output data:

    • On macOS, run:

      pachctl get file edges@master:AT-AT.png | open -f -a Preview.app
      pachctl get file edges@master:kitten.png | open -f -a Preview.app
    • On Linux, run:

      pachctl get file edges@master:AT-AT.png | display
      pachctl get file edges@master:kitten.png | display

Adding Another Pipeline #

We have successfully deployed and used a single-stage Pachyderm pipeline. Now, let’s add a processing stage to illustrate a multi-stage Pachyderm pipeline (also referred to as a Directed Acyclic Graph or DAG in this documentation). Specifically, let’s add a montage pipeline that takes our original and edge-detected images and arranges them into a single montage of images:


Below is the pipeline spec for this new pipeline:

{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}

This montage pipeline spec is similar to our edges pipeline except for the following differences:

  1. We are using a different Docker image that has imagemagick installed.
  2. We are executing a sh command with stdin instead of a python script in the pipeline’s transform section.
  3. We have multiple input data repositories (images and edges).

In the montage pipeline we are combining our multiple input data repositories using a cross pattern. This cross pattern creates a single pairing of our input images with our edge detected images. There are several interesting ways to combine data in Pachyderm, which are discussed in pipelines’ concepts and our pipeline specification page.
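The single pairing can be pictured as a Cartesian product over the datums of each input. The sketch below is a toy model rather than Pachyderm's implementation; `cross` is a hypothetical helper, and each datum is represented as a tuple of file paths.

```python
from itertools import product
from typing import List, Tuple

Datum = Tuple[str, ...]


def cross(*inputs: List[Datum]) -> List[Tuple[Datum, ...]]:
    """Every combination of one datum from each input."""
    return list(product(*inputs))


names = ("/AT-AT.png", "/kitten.png", "/liberty.png")

# Glob "/" puts a whole repo into one datum, so the cross has one pairing.
whole_images = [names]
whole_edges = [names]
print(len(cross(whole_images, whole_edges)))

# Glob "/*" would give one datum per file, so the cross has 3 x 3 pairings.
per_file_images = [(n,) for n in names]
per_file_edges = [(n,) for n in names]
print(len(cross(per_file_images, per_file_edges)))
```

This is why the montage spec uses / for both inputs: one datum per repo means one job input containing every image, which is what a single montage.png needs.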

  • To create the montage pipeline, run:

    pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.3.x/examples/opencv/montage.json

    See your new DAG in Console:

    Console opencv DAG

  • The pipeline creation triggers a job that generates a montage for all the current HEAD commits of the input repos:

    pachctl list job

    System response:

    ID                               SUBJOBS PROGRESS CREATED        MODIFIED
    01e0c8040e18429daf7f67ce34c3a5d7 1       ▇▇▇▇▇▇▇▇ 11 seconds ago 11 seconds ago
    1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
    fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
    23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 24 minutes ago 24 minutes ago
  • View the generated montage image in Console or by running one of the following commands:

    • In Console:

    Console opencv montage

    • On macOS, run:
    pachctl get file montage@master:montage.png | open -f -a Preview.app
    • On Linux 64-bit, run:
    pachctl get file montage@master:montage.png | display

Next Steps #

You can use what you have learned to build on or change these pipelines. You can also dig in and learn more details about:

Again, we would love to help and see what you come up with! Submit any questions, comments, or contributions on GitHub or Slack, or email us at support@pachyderm.io if you want to show off anything nifty you have created!

Local Getting Started Guides

What is a Local Installation? #

A local installation means that you will allocate resources from your local machine (e.g., your laptop) to spin up a Kubernetes cluster to run Pachyderm. This installation method is not for a production setup, but is great for personal use, testing, and product exploration.

Which Guide Should I Use? #

Both the Docker Desktop and Minikube installation guides support macOS, Windows, and Linux. If this is your first time using Kubernetes, try Docker Desktop. If you are experienced with Kubernetes, you can deploy using a variety of solutions not listed here (KinD, Rancher Desktop, Podman, etc.).

💡

Binary Files (Advanced Users)

You can download the latest binary files from GitHub for a direct installation of pachctl and the mount-server.

Beginner Tutorial

Welcome to the beginner tutorial for Pachyderm! This tutorial should take about 15 minutes to complete and introduce you to Pachyderm’s fundamental concepts.

Prerequisites #

This guide assumes that you have Pachyderm running.

  • For an easy and quick start, install Pachyderm on your local machine as described in our Local Installation page and start experimenting.

  • Or check out our Quick Install page to deploy on your favorite cloud.

๐Ÿ’ก

If you are new to Pachyderm, try Pachyderm Shell. This handy tool suggests pachctl commands as you type and helps you learn Pachyderm faster.

For this tutorial, you will use pachctl to interact with your Pachyderm cluster from your terminal window and Console (Pachyderm Web UI) to interactively visualize and explore your pipelines, your data, debug jobs, read logs, etc…

If you deployed Pachyderm locally using the default local installation instructions, you have also deployed Pachyderm Web UI. Point your browser to localhost:4000 to connect to Console. You should land on this page:

Console Landing Page

Click View Project to get started. (We are working on allowing you to organize your pipelines by Projects.) You are all set to follow the coming steps visually.

Image processing with OpenCV #

This tutorial walks you through the deployment of a Pachyderm pipeline that performs edge detection on a few images. Thanks to Pachyderm’s built-in processing primitives, we can keep our code simple but still run the pipeline in a distributed, streaming fashion. Moreover, as new data is added, the pipeline automatically processes it and outputs the results.

If you hit any errors not covered in this guide, get help in our public community Slack, submit an issue on GitHub, or email us at support@pachyderm.io. We are here to help!

Create a Repo #

A repo is the highest level data primitive in Pachyderm. Like many things in Pachyderm, it shares its name with a primitive in Git and is designed to behave analogously. Generally, repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data for an ML model. Repos are easy to create and do not take much space when empty so do not worry about making tons of them.

📖

More about the concepts of Repository and Branch in Pachyderm.

For this demo, we create a repo called images to hold the data we want to process:

pachctl create repo images

Verify that the repository was created:

pachctl list repo

System response:

NAME   CREATED       SIZE (MASTER) ACCESS LEVEL
images 4 seconds ago ≤ 0B          [repoOwner]

This output shows that the repo has been successfully created. Because we have not added anything to it yet, the size of the repository HEAD commit on the master branch is 0B.

Check your Console and notice the creation of your repository.

ℹ️

Note the “plus” icon in your images repository. It indicates that this repository is an input repository instead of an output repository where the product of your pipeline transformation will be committed. Users can write data to input repositories only.

Console images repo

Adding Data to Pachyderm #

Now that we have created a repo, it is time to add some data. In Pachyderm, you write data to an explicit commit. Commits are immutable snapshots of your data which give Pachyderm its version control properties. You can add, remove, or update files in a given commit.

📖

More about the concept of Commit in Pachyderm.

Let’s start by adding a file, in this case an image, to a new commit. We have provided some sample images for you that we host on Imgur.

Use the pachctl put file command along with the -f flag. The -f flag can take a local file, a URL, or an object storage bucket, which it scrapes automatically. In this case, we simply pass the URL.

Unlike Git, commits in Pachyderm must be explicitly started and finished as they can contain huge amounts of data and we do not want that much dirty data hanging around in an unpersisted state. pachctl put file automatically starts and finishes a commit for you so you can add files more easily. If you want to add many files over a period of time, you can do pachctl start commit and pachctl finish commit yourself.
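The explicit start/finish sequence described above can be sketched as a small Python helper that only builds the command lines. This is an illustrative sketch, not part of the tutorial: the helper name manual_commit_commands is hypothetical, and it assumes that files put to a branch between start commit and finish commit land in that open commit.

```python
import shlex


def manual_commit_commands(repo, branch, files):
    """Build the pachctl invocations for one explicitly managed commit.

    `files` maps a destination path in the repo to a local path or URL.
    Nothing is executed here; the caller decides how to run the commands.
    """
    target = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", target]]
    for dest, source in files.items():
        commands.append(["pachctl", "put", "file", f"{target}:{dest}", "-f", source])
    commands.append(["pachctl", "finish", "commit", target])
    return commands


commands = manual_commit_commands(
    "images",
    "master",
    {"liberty.png": "http://imgur.com/46Q8nDz.png"},
)
for argv in commands:
    # Print each command as it would be typed in a shell.
    print(shlex.join(argv))
```

Each printed line can be pasted into a terminal, or each argument list can be passed to subprocess.run if you prefer to drive pachctl from a script.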

To commit the file liberty.png to the master branch of the images repo, run:

pachctl put file images@master:liberty.png -f http://imgur.com/46Q8nDz.png

To make sure that the data we just added is in Pachyderm:

  • Use the pachctl list repo command to check that data has been added:

    pachctl list repo

    System response:

    NAME   CREATED       SIZE (MASTER) ACCESS LEVEL
    images 2 minutes ago ≤ 57.27KiB    [repoOwner]
  • View the commit that was just created:

    pachctl list commit images

    System response:

    REPO   BRANCH COMMIT                           FINISHED       SIZE     ORIGIN DESCRIPTION
    images master 89a5ab3a23c949949f763943dd7a8aac 55 seconds ago 57.27KiB USER
  • View the file in that commit:

    pachctl list file images@master

    System response:

    NAME         TYPE SIZE
    /liberty.png file 57.27KiB

In your Console, click on the images repo to visualize its commit and inspect its file:

Console images liberty

Alternatively, you can view the file by retrieving it from Pachyderm. Because this is an image, you cannot just print it out in the terminal, but the following command will let you view it:

  • On macOS, run:
pachctl get file images@master:liberty.png | open -f -a Preview.app
  • On Linux 64-bit, run:
pachctl get file images@master:liberty.png | display

Create a Pipeline #

Now that you have some data in your repo, it is time to do something with it. Pipelines are the core processing primitive in Pachyderm. Pipelines are defined with a simple JSON file called a pipeline specification or pipeline spec for short. For this example, we already created the pipeline spec for you.

When you want to create your own pipeline specification later, you can refer to the full Pipeline Specification to use more advanced options. Options include building your own code into a container. In this tutorial, your pipeline will use a pre-built Docker image.

📖

More about the concept of Pipeline in Pachyderm.

For now, we are going to create a single pipeline that takes in images and does some simple edge detection.

image

Below is the edges.json pipeline spec. Let’s walk through the details.

{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv:1.0"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}

The pipeline spec contains a few simple sections. The pipeline section contains a name, which is how you will identify your pipeline. Your pipeline will also automatically create an output repo with the same name. The transform section allows you to specify the Docker image you want to use. In this case, pachyderm/opencv:1.0 is the Docker image (defaults to DockerHub as the registry), and the entry point is edges.py. The input section specifies the repos visible to the running pipeline and how to process the data from those repos. Commits to these repos will automatically trigger the pipeline to create new jobs to process them. In this case, images is the repo, and /* is the glob pattern.
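If you prefer to generate specs from a script, the same sections can be assembled as a Python dictionary and serialized to JSON. This is only a sketch: the summarize helper is hypothetical and handles just the single pfs input shape used in edges.json.

```python
import json

# The same fields as edges.json, expressed as a Python dictionary.
edges_spec = {
    "pipeline": {"name": "edges"},
    "description": "A pipeline that performs image edge detection by using the OpenCV library.",
    "transform": {
        "cmd": ["python3", "/edges.py"],
        "image": "pachyderm/opencv:1.0",
    },
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
}


def summarize(spec):
    """Return (pipeline name, input repo, glob, image) for a single-pfs-input spec."""
    pfs = spec["input"]["pfs"]
    return (spec["pipeline"]["name"], pfs["repo"], pfs["glob"], spec["transform"]["image"])


# Serialize to JSON text; this text could be saved as edges.json.
spec_text = json.dumps(edges_spec, indent=2)
print(summarize(json.loads(spec_text)))
```

Saving spec_text to a local file gives you something you can pass to pachctl create pipeline -f in place of the hosted URL.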

The glob pattern defines how the input data will be transformed into datums if you want to distribute computation. /* means that each file can be processed individually, which makes sense for images. Glob patterns are one of the most powerful features in Pachyderm.

📖

More about the concept of Glob Pattern in Pachyderm and the fundamental notion of Datums.
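To build intuition for how a glob pattern splits a repo into datums, here is a deliberately simplified model in Python. It is not Pachyderm's implementation: the function datums_for_glob is hypothetical, it matches one path component at a time with fnmatch, and it ignores more advanced glob syntax.

```python
from fnmatch import fnmatchcase


def datums_for_glob(paths, glob):
    """Group file paths into datums, one per distinct path prefix matching `glob`.

    Simplified model: "/" yields a single datum holding every file, while a
    pattern such as "/*" yields one datum per matching top-level entry.
    """
    if glob == "/":
        return {"/": sorted(paths)}
    pattern_parts = glob.strip("/").split("/")
    datums = {}
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) < len(pattern_parts):
            continue  # path is shallower than the pattern, so it cannot match
        prefix = parts[: len(pattern_parts)]
        if all(fnmatchcase(p, pat) for p, pat in zip(prefix, pattern_parts)):
            datums.setdefault("/" + "/".join(prefix), []).append(path)
    return {key: sorted(files) for key, files in datums.items()}


image_files = ["/liberty.png", "/AT-AT.png", "/kitten.png"]
per_file = datums_for_glob(image_files, "/*")  # each image is its own datum
whole = datums_for_glob(image_files, "/")      # the whole repo is one datum
print(len(per_file), len(whole))
```

With /*, each image can be handed to a worker independently; with /, a single worker sees all three files at once.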

The following extract is the Python code run in this pipeline:

import cv2
import numpy as np
from matplotlib import pyplot as plt
import os

# make_edges reads an image from /pfs/images and outputs the result of running
# edge detection on that image to /pfs/out. Note that /pfs/images and
# /pfs/out are special directories that Pachyderm injects into the container.
def make_edges(image):
   img = cv2.imread(image)
   tail = os.path.split(image)[1]
   edges = cv2.Canny(img,100,200)
   plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0]+'.png'), edges, cmap = 'gray')

# walk /pfs/images and call make_edges on every file found
for dirpath, dirs, files in os.walk("/pfs/images"):
   for file in files:
       make_edges(os.path.join(dirpath, file))

The code simply walks over all the images in /pfs/images, performs edge detection, and writes the result to /pfs/out.

/pfs/images and /pfs/out are special local directories that Pachyderm creates within the container automatically. All the input data for a pipeline is stored in /pfs/<input_repo_name> and your code should always write out to /pfs/out (see the function make_edges(image) above). Pachyderm automatically gathers everything you write to /pfs/out, versions it as this pipeline output, and maps it to the appropriate output repo of your pipeline.
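One way to try this input/output contract without a cluster is to parameterize the two directories and point them at temporary folders. The sketch below is an assumption-laden stand-in, not the tutorial's code: process_all and copy_as_png are hypothetical names, and the copy step replaces the OpenCV edge detection so that only the standard library is needed.

```python
import os
import shutil
import tempfile


def process_all(input_dir, output_dir, transform):
    """Walk input_dir and call transform(src_path, output_dir) for every file."""
    processed = []
    for dirpath, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            transform(os.path.join(dirpath, name), output_dir)
            processed.append(name)
    return processed


def copy_as_png(src, output_dir):
    """Stand-in for make_edges: copy the file using the same '<stem>.png' naming rule."""
    stem = os.path.splitext(os.path.basename(src))[0]
    shutil.copyfile(src, os.path.join(output_dir, stem + ".png"))


# Local dry run: temporary directories play the roles of /pfs/images and /pfs/out.
with tempfile.TemporaryDirectory() as in_dir, tempfile.TemporaryDirectory() as out_dir:
    with open(os.path.join(in_dir, "liberty.jpg"), "wb") as f:
        f.write(b"placeholder bytes, not a real image")
    processed = process_all(in_dir, out_dir, copy_as_png)
    outputs = sorted(os.listdir(out_dir))

print(processed, outputs)
```

Inside a pipeline container, the same function would be called with "/pfs/images" and "/pfs/out" and a real transform such as make_edges.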

Now, let’s create the pipeline in Pachyderm:

pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.4.x/examples/opencv/edges.json

Again, check the end result in your Console:

Console edges pipeline

What Happens When You Create a Pipeline #

Creating a pipeline tells Pachyderm to run your code on the data in your input repo (the HEAD commit) as well as all future commits that occur after the pipeline is created. Our repo already had a commit, so Pachyderm automatically launched a job to process that data.

📖

More about the concept of Job in Pachyderm.
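The triggering behavior described above can be pictured with a toy model. The classes ToyRepo and ToyPipeline are hypothetical and only mimic the rule "process the current HEAD commit at creation, then every later commit"; they say nothing about how Pachyderm schedules real jobs.

```python
class ToyRepo:
    """A repo that records commits and notifies subscribed pipelines."""

    def __init__(self, name):
        self.name = name
        self.commits = []
        self.subscribers = []

    def commit(self, files):
        self.commits.append(list(files))
        for pipeline in self.subscribers:
            pipeline.run_job(self.commits[-1])


class ToyPipeline:
    """A pipeline that runs one job per input commit it sees."""

    def __init__(self, name, input_repo):
        self.name = name
        self.jobs = []
        input_repo.subscribers.append(self)
        # An existing HEAD commit is processed as soon as the pipeline is created.
        if input_repo.commits:
            self.run_job(input_repo.commits[-1])

    def run_job(self, commit):
        self.jobs.append(list(commit))


images = ToyRepo("images")
images.commit(["liberty.png"])                # commit made before the pipeline exists
edges = ToyPipeline("edges", images)          # creation triggers a job on HEAD
images.commit(["liberty.png", "AT-AT.png"])   # a later commit triggers another job
print(len(edges.jobs))
```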

The first time Pachyderm runs a pipeline job, it needs to download the Docker image (specified in the pipeline spec) from the specified Docker registry (DockerHub in this case). This first run might take a minute or so because of the image download, depending on your Internet connection. Subsequent runs will be much faster.

  • You can view the job with:

    pachctl list job

    System response:

    ID                               SUBJOBS PROGRESS CREATED       MODIFIED
    23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 5 seconds ago 5 seconds ago
  • You can check the state of your pipeline:

    pachctl list pipeline

    System response:

    NAME  VERSION INPUT     CREATED       STATE / LAST JOB  DESCRIPTION
    edges 1       images:/* 2 minutes ago running / success A pipeline that performs image edge detection by using the OpenCV library.

    Yay! Your pipeline succeeded! Pachyderm creates a corresponding output repo for every pipeline. This output repo will have the same name as the pipeline, and all the results of that pipeline will be versioned in this output repo. In our example, the edges pipeline created an output repo called edges to store the results written to /pfs/out.

  • List your repositories:

    pachctl list repo

    System response:

    NAME   CREATED        SIZE (MASTER) ACCESS LEVEL
    edges  10 minutes ago ≤ 22.22KiB    [repoOwner]  Output repo for pipeline edges.
    images 3 hours ago    ≤ 57.27KiB    [repoOwner]

Note that all of that information and more are available in your Console.

Reading the Output #

We can view the output data from the edges repo in the same fashion that we viewed the input data.

  • On macOS, run:
pachctl get file edges@master:liberty.png | open -f -a Preview.app
  • On Linux 64-bit, run:
pachctl get file edges@master:liberty.png | display

The output should look like this:

Console edges liberty

Processing More Data #

Pipelines will automatically process the data from new commits as they are created. In a way, pipelines “subscribe” to their input repo(s), ready to process any new incoming commits. Also similar to Git, commits have a parental structure that tracks which files have changed. In this case we are going to be adding more images.

Let’s create two new commits in a parental structure. To do this, we simply run two more put file commands. Because both specify master as the branch, Pachyderm automatically parents each new commit onto the previous one. Branch names are just references to a particular HEAD commit.

pachctl put file images@master:AT-AT.png -f http://imgur.com/8MN9Kg0.png
pachctl put file images@master:kitten.png -f http://imgur.com/g2QnNqa.png
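The parent chain that results from these commands can be modeled in a few lines of Python. This is a toy illustration with hypothetical Commit and Repo classes; it assumes each put file adds one file on top of the parent's contents and that a branch is simply a name pointing at its HEAD commit.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot that remembers its parent commit."""
    id: int
    parent: Optional["Commit"]
    files: frozenset


class Repo:
    def __init__(self):
        self.branches: Dict[str, Commit] = {}  # branch name -> HEAD commit
        self._next_id = 1

    def put_file(self, branch, path):
        """Create a new commit on `branch` whose parent is the current HEAD."""
        parent = self.branches.get(branch)
        files = (parent.files if parent else frozenset()) | {path}
        commit = Commit(self._next_id, parent, files)
        self._next_id += 1
        self.branches[branch] = commit  # the branch now references the new HEAD
        return commit


images = Repo()
for name in ("liberty.png", "AT-AT.png", "kitten.png"):
    images.put_file("master", name)

# Walk from HEAD back to the first commit.
history = []
commit = images.branches["master"]
while commit is not None:
    history.append(commit.id)
    commit = commit.parent

print(history, sorted(images.branches["master"].files))
```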

Adding a new commit of data will automatically trigger the pipeline to run on the new data we have added. We will see corresponding jobs get started and commits to the output edges repo. Let’s also view our new outputs.

  • View the list of jobs that have started:

    pachctl list job

    System response:

    ID                               SUBJOBS PROGRESS CREATED        MODIFIED
    1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 31 seconds ago 31 seconds ago
    fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 47 seconds ago 47 seconds ago
    23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
  • View the output data:

    • On macOS, run:

      pachctl get file edges@master:AT-AT.png | open -f -a Preview.app
      pachctl get file edges@master:kitten.png | open -f -a Preview.app
    • On Linux, run:

      pachctl get file edges@master:AT-AT.png | display
      pachctl get file edges@master:kitten.png | display

Adding Another Pipeline #

We have successfully deployed and used a single-stage Pachyderm pipeline. Now, let’s add a processing stage to illustrate a multi-stage Pachyderm pipeline (also referred to as a Directed Acyclic Graph or DAG in this documentation). Specifically, let’s add a montage pipeline that takes our original and edge-detected images and arranges them into a single montage of images:

image

Below is the pipeline spec for this new pipeline:

{
  "pipeline": {
    "name": "montage"
  },
  "description": "A pipeline that combines images from the `images` and `edges` repositories into a montage.",
  "input": {
    "cross": [ {
      "pfs": {
        "glob": "/",
        "repo": "images"
      }
    },
    {
      "pfs": {
        "glob": "/",
        "repo": "edges"
      }
    } ]
  },
  "transform": {
    "cmd": [ "sh" ],
    "image": "v4tech/imagemagick",
    "stdin": [ "montage -shadow -background SkyBlue -geometry 300x300+2+2 $(find /pfs -type f | sort) /pfs/out/montage.png" ]
  }
}

This montage pipeline spec is similar to our edges pipeline except for the following differences:

  1. We are using a different Docker image that has imagemagick installed.
  2. We are executing a sh command with stdin instead of a python script in the pipeline’s transform section.
  3. We have multiple input data repositories (images and edges).

In the montage pipeline, we combine our multiple input data repositories using a cross pattern. Because each input uses the glob pattern /, this cross creates a single pairing of our input images with our edge-detected images. There are several interesting ways to combine data in Pachyderm, which are discussed in pipelines’ concepts and our pipeline specification page.
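A cross can be thought of as a Cartesian product over the datums of each input. The sketch below uses itertools.product to show why two inputs with the / glob yield exactly one pairing; the cross helper and the datum labels are illustrative assumptions, not Pachyderm internals.

```python
from itertools import product


def cross(*inputs):
    """Return every combination that takes one datum from each input."""
    return [tuple(combo) for combo in product(*inputs)]


# With glob "/", each repo contributes a single datum: the whole repo.
whole_repo_pairs = cross(["images:/"], ["edges:/"])

# For comparison, a per-file glob on three files per repo gives 3 x 3 pairings.
names = ["liberty.png", "AT-AT.png", "kitten.png"]
per_file_pairs = cross(
    [f"images:/{n}" for n in names],
    [f"edges:/{n}" for n in names],
)

print(len(whole_repo_pairs), len(per_file_pairs))
```

The single pairing is what lets the montage command see every file under /pfs in one run and write one montage.png.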

  • To create the montage pipeline, run:

    pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.4.x/examples/opencv/montage.json

    See your new DAG in Console:

    Console opencv DAG

  • The pipeline creation triggers a job that generates a montage for all the current HEAD commits of the input repos:

    pachctl list job

    System response:

    ID                               SUBJOBS PROGRESS CREATED        MODIFIED
    01e0c8040e18429daf7f67ce34c3a5d7 1       ▇▇▇▇▇▇▇▇ 11 seconds ago 11 seconds ago
    1c1a9d7d36944eabb4f6f14ebca25bf1 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
    fe5c4f70ac4347fd9c5934f0a9c44651 1       ▇▇▇▇▇▇▇▇ 12 minutes ago 12 minutes ago
    23378d899d3d45738f55df3809841145 1       ▇▇▇▇▇▇▇▇ 24 minutes ago 24 minutes ago
  • View the generated montage image in Console or by running one of the following commands:

    • In Console:

    Console opencv montage

    • On macOS, run:
    pachctl get file montage@master:montage.png | open -f -a Preview.app
    • On Linux 64-bit, run:
    pachctl get file montage@master:montage.png | display

Next Steps #

You can use what you have learned to build on or change these pipelines. You can also dig in and learn more details about:

Again, we would love to help and see what you come up with! Submit any questions, comments, or contributions on GitHub or Slack, or email support@pachyderm.io if you want to show off anything nifty you have created!