Welcome to the beginner tutorial for Pachyderm. If you’ve already got Pachyderm installed, this guide should take about 15 minutes and you’ll be introduced to the basic concepts of Pachyderm.
Analyzing Log Lines from a Fruit Stand
In this guide you’re going to create a Pachyderm pipeline to process
transaction logs from a fruit stand. We’ll use two standard Unix tools,
grep and awk, to do our processing. Thanks to Pachyderm’s processing system we’ll
be able to run the pipeline in a distributed, streaming fashion. As new data is
added, the pipeline will automatically process it and materialize the results.
If you hit any errors not covered in this guide, check our Troubleshooting docs for common errors, submit an issue on GitHub, join our users channel on Slack, or email us at email@example.com and we can help you right away.
This guide assumes that you already have Pachyderm running locally. Check out our Local Installation instructions if you haven’t done that yet and then come back here to continue.
Create a Repo
A repo is the highest-level primitive in the Pachyderm file system (pfs). Like all primitives in pfs, it shares its name with a primitive in Git and is designed to behave analogously. Generally, repos should be dedicated to a single source of data such as log messages from a particular service, a users table, or training data for an ML model. Repos are dirt cheap so don’t be shy about making tons of them.
For this demo, we’ll simply create a repo called “data” to hold the data we want to process:
$ pachctl create-repo data

# See the repo we just created
$ pachctl list-repo
data
Adding Data to Pachyderm
Now that we’ve created a repo it’s time to add some data. In Pachyderm, you write data to an explicit
commit (again, similar to Git). Commits are immutable snapshots of your data which give Pachyderm its version control properties.
Files can be added, removed, or updated in a given commit and then you can view a diff of those changes compared to a previous commit.
Let’s start by just adding a file to a new commit. We’ve provided a sample data file for you to use in our GitHub repo – it’s a list of purchases from a fruit stand.
We’ll use the put-file command along with two flags, -c and -f. The -f flag can take either a local file or a URL; in our case, it is the sample data on GitHub. We also specify the repo name “data”, the branch name “master”, and the file name “sales”.
$ pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.2.1/doc/examples/fruit_stand/set1.txt
Unlike Git though, commits in Pachyderm must be explicitly started and finished as they can contain huge amounts of data and we don’t want that much “dirty” data hanging around in an unpersisted state. The
-c flag we used above specifies that we want to start a new commit, add data, and finish the commit in a convenient one-liner.
Finally, we can see the data we just added to Pachyderm.
# If we list the repos, we can see that there is now data
$ pachctl list-repo
NAME   CREATED          SIZE
data   12 minutes ago   874 B

# We can view the commit we just created
$ pachctl list-commit data
BRANCH   REPO/ID         PARENT   STARTED         FINISHED        SIZE
master   data/master/0   <none>   6 minutes ago   6 minutes ago   874 B

# We can also view the contents of the file that we just added
$ pachctl get-file data master sales
orange 4
banana 2
banana 9
orange 9
...
Create a Pipeline
Now that we’ve got some data in our repo, it’s time to do something with it.
Pipelines are the core primitive for Pachyderm’s processing system (pps) and
they’re specified with a JSON encoding. For this example, we’ve already created the pipeline for you and it can be found at examples/fruit_stand/pipeline.json on GitHub. Please open a new tab to view the pipeline while we talk through it.
When you want to create your own pipelines later, you can refer to the full Pipeline Specification to use more advanced options. This includes building your own code into a container instead of just using simple shell commands as we’re doing here.
For now, we’re going to create a pipeline with 2 transformations in it. The first transformation filters the sales logs into separate records for apples, oranges and bananas. The second step sums these sales numbers into a final sales count.
+----------+     +---------------+     +------------+
|input data| --> |filter pipeline| --> |sum pipeline|
+----------+     +---------------+     +------------+
In the first step of this pipeline, we are grepping for the terms “apple”, “orange”, and “banana” and writing each matching line to the corresponding file. Notice we read data from /pfs/[input_repo_name] (/pfs/data in this example) and write data to /pfs/out/. These are special local directories that Pachyderm creates within the container for you. All the input data will be found in /pfs/[input_repo_name] and your code should always write to /pfs/out.
The second step of this pipeline takes each file, removes the fruit name, and sums up the purchases. The output of our complete pipeline is three files, one for each type of fruit with a single number showing the total quantity sold.
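As a rough local sketch of these two steps (not the contents of pipeline.json itself), the functions below filter a sales file into one file per fruit and then reduce each file to a single total. The helper names filter_sales and sum_sales are hypothetical, ordinary local directories stand in for /pfs/data and /pfs/out, and the records are assumed to be whitespace-separated "fruit quantity" lines like the ones shown earlier.

```shell
# Hypothetical helpers for illustration only; this is not Pachyderm's pipeline.json.
# IN_DIR stands in for /pfs/[input_repo_name] and OUT_DIR for /pfs/out.

# filter_sales IN_DIR OUT_DIR: write the lines mentioning each fruit to OUT_DIR/<fruit>.
filter_sales() {
    mkdir -p "$2"
    for fruit in apple orange banana; do
        # -r searches every file under IN_DIR, -h drops the file name prefix.
        # grep exits non-zero when nothing matches, so "|| true" keeps going.
        grep -rh "$fruit" "$1" > "$2/$fruit" || true
    done
}

# sum_sales IN_DIR OUT_DIR: reduce each per-fruit file to a single total.
sum_sales() {
    mkdir -p "$2"
    for fruit in apple orange banana; do
        # Add up the second column; "+ 0" prints 0 for an empty file.
        awk '{ total += $2 } END { print total + 0 }' "$1/$fruit" > "$2/$fruit"
    done
}

# Usage with a throwaway copy of the first four records shown above.
tmp=$(mktemp -d)
mkdir -p "$tmp/data"
printf 'orange 4\nbanana 2\nbanana 9\norange 9\n' > "$tmp/data/sales"
filter_sales "$tmp/data" "$tmp/filter"
sum_sales "$tmp/filter" "$tmp/sum"
cat "$tmp/sum/banana"
```

Keeping the filter and sum stages as separate functions mirrors the way the output of one pipeline becomes the input of the next.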
Now let’s create the pipeline in Pachyderm:
$ pachctl create-pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.2.1/doc/examples/fruit_stand/pipeline.json
What Happens When You Create a Pipeline
Creating a pipeline tells Pachyderm to run your code on every finished commit in a repo as well as all future commits that happen after the pipeline is created. Our repo already had a commit, so Pachyderm automatically launched a job to process that data.
You can view the job with:
$ pachctl list-job
ID                                 OUTPUT                                      STARTED         DURATION             STATE
90c74896fd227f319c3c19459aa7a22b   sum/e4060e15948c4b7b89947a02eace5dca/0      2 minutes ago   Less than a second   success
67c30d70ba9d2179aa133255f5dc81db   filter/d737e9b7cfae40d4aa8a8871cdb9f783/0   3 minutes ago   2 seconds            success
Every pipeline creates a corresponding repo with the same name where it stores its output results. In our example, the “filter” transformation created a repo called “filter” which was the input to the “sum” transformation. The “sum” repo contains the final output files.
$ pachctl list-repo
NAME     CREATED          SIZE
sum      2 minutes ago    12 B
filter   2 minutes ago    200 B
data     19 minutes ago   874 B
Reading the Output
We can read the output data from the “sum” repo in the same fashion that we read the input data (except now we need to use an explicit commit ID because the “sum” repo doesn’t have a “master” branch):
$ pachctl get-file sum e4060e15948c4b7b89947a02eace5dca/0 apple
133
Processing More Data
Pipelines will also automatically process the data from new commits as they are created. Think of pipelines as being subscribed to any new commits that are finished on their input repo(s). Also similar to Git, commits have a parental structure that tracks how files change over time. In this case we’re going to be adding more data to the same file, “sales”.
In our fruit stand example, this could be making a commit every hour with all the new purchases that happened in that timeframe.
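One way to picture that hourly workflow is a small loop that adds each batch file in its own commit. The sketch below is a dry run by default and only prints the commands it would run. The helper name commit_batches, the DRY_RUN variable, and the one-file-per-hour directory layout are assumptions made for illustration; only the put-file invocation itself is taken from this guide.

```shell
# commit_batches DIR
# Hypothetical helper: adds every .txt file in DIR to the "sales" file in the
# "data" repo, one commit per file. With DRY_RUN=1 (the default) it only
# prints the commands instead of running them.
commit_batches() {
    for batch in "$1"/*.txt; do
        # Skip the unexpanded pattern when DIR contains no .txt files.
        [ -e "$batch" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo pachctl put-file data master sales -c -f "$batch"
        else
            pachctl put-file data master sales -c -f "$batch"
        fi
    done
}

# Usage: preview the commands for two (empty) hourly batch files.
tmp=$(mktemp -d)
touch "$tmp/hour-01.txt" "$tmp/hour-02.txt"
commit_batches "$tmp"
```

Setting DRY_RUN=0 before calling the function would run the commands against your local Pachyderm instead of printing them.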
Let’s create a new commit with our previous commit as the parent and add more sample data (set2.txt) to “sales”:
$ pachctl put-file data master sales -c -f https://raw.githubusercontent.com/pachyderm/pachyderm/v1.2.1/doc/examples/fruit_stand/set2.txt
Adding a new commit of data will automatically trigger the pipeline to run on the new data we’ve added. We’ll see a corresponding commit to the output “sum” repo with files “apple”, “orange” and “banana” each containing the cumulative total of purchases. Let’s read the “apple” file again and see the new total number of apples sold.
$ pachctl get-file sum 4092f4675650476ab0a3fde5b7780316/0 apple
324
One thing that’s interesting to note is that our pipeline is completely incremental. Since grep is a map operation, Pachyderm will only grep the new data from set2.txt instead of re-filtering all the data. If you look back at the “sum” pipeline, you’ll notice the method and that our code uses /prev to compute the sum incrementally based upon our previous commit. You can learn more about incrementality in our advanced Incrementality docs.
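The idea behind an incremental sum can be sketched locally: instead of re-adding every record, take the previous total and add only the new records to it. This is an illustration with ordinary files, not Pachyderm’s actual mechanism. The helper name incremental_sum and both file arguments are hypothetical, and how the previous output is really exposed to your code is covered in the Incrementality docs.

```shell
# incremental_sum PREV_TOTAL_FILE NEW_RECORDS_FILE
# Hypothetical helper: adds the quantities (second column) in NEW_RECORDS_FILE
# to the total stored in PREV_TOTAL_FILE. A missing PREV_TOTAL_FILE counts as 0.
incremental_sum() {
    prev=0
    if [ -f "$1" ]; then
        prev=$(cat "$1")
    fi
    # Only the new records are read; the old ones are represented by prev.
    awk -v prev="$prev" '{ total += $2 } END { print prev + total }' "$2"
}

# Usage: a previous banana total of 11 plus two new purchases.
tmp=$(mktemp -d)
echo 11 > "$tmp/prev_banana"
printf 'banana 4\nbanana 1\n' > "$tmp/new_banana"
incremental_sum "$tmp/prev_banana" "$tmp/new_banana"
```

The work done is proportional to the size of the new data rather than the full history, which is the property the paragraph above describes.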
We can view the parental structure of the commits we just created.
$ pachctl list-commit data
BRANCH   REPO/ID         PARENT     STARTED          FINISHED         SIZE
master   data/master/0   <none>     19 minutes ago   19 minutes ago   874 B
master   data/master/1   master/0   2 minutes ago    2 minutes ago    874 B
Exploring the File System
Another nifty feature of Pachyderm is that you can mount the file system locally to poke around and explore your data using FUSE. FUSE comes pre-installed on most Linux distributions. For OS X, you’ll need to install OSX FUSE.
The first thing we need to do is mount Pachyderm’s filesystem (pfs).
First create the mount point:
$ mkdir ~/pfs
And then mount it:
# We background this process because it blocks.
$ pachctl mount ~/pfs &
This will mount pfs on ~/pfs. You can inspect the filesystem like you would any other local filesystem, for example by using ls or pointing your browser at it.
# We can see our repos
$ ls ~/pfs
data filter sum

# And commits
$ ls ~/pfs/sum
4092f4675650476ab0a3fde5b7780316/1 4092f4675650476ab0a3fde5b7780316/0
Use pachctl unmount ~/pfs to unmount the filesystem. You can also use the -a flag to remove all Pachyderm FUSE mounts.
You’ve now got Pachyderm running locally with data and a pipeline! If you want to keep playing with Pachyderm locally, here are some ideas to expand on your working setup.
- Write a script to stream more data into Pachyderm. We already have one in Golang for you on GitHub if you want to use it.
- Add a new pipeline that does something interesting with the “sum” repo as an input.
- Add your own data set and grep for different terms. This example can be generalized to generic word count.
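For the word-count idea in the last bullet, a minimal local sketch might look like the following. It uses awk on a plain file rather than a Pachyderm pipeline, and the helper name word_count is hypothetical.

```shell
# word_count FILE
# Hypothetical helper: prints one "word count" pair per distinct
# whitespace-separated word in FILE, sorted by word.
word_count() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w, count[w] }' "$1" | LC_ALL=C sort
}

# Usage on a tiny sample file.
tmp=$(mktemp)
printf 'apple banana\napple\n' > "$tmp"
word_count "$tmp"
```

Counting every word in one pass plays the role that the fixed list of fruit names plays in the tutorial's filter and sum steps.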
You can also start learning some of the more advanced topics to develop analysis in Pachyderm:
- Deploying on the Cloud
- Getting Your Data into Pachyderm from other sources
- Creating Analysis Pipelines using your own code