Cross & Union Inputs
Learn about the concept of cross and union inputs in Pachyderm.
March 22, 2023
Pachyderm enables you to combine multiple PFS inputs by using the union and cross operators in the pipeline specification.
You can think of union as a disjoint union binary operator and cross as a Cartesian product binary operator.
This section describes how to use cross and union in your pipelines and how you can optimize your code when you work with them.
Union Input
The union input combines each of the datums in the input repos as one set of datums. The number of datums that are processed is the sum of all the datums in each repo.
For example, you have two input repos, A and B. Each of these repositories contains three files with the following names.
Repository A has the following structure:
A
├── 1.txt
├── 2.txt
└── 3.txt
Repository B has the following structure:
B
├── 4.txt
├── 5.txt
└── 6.txt
If you want your pipeline to process each file independently as a separate datum, use a glob pattern of /*. Each glob is applied to each input independently. The input section in the pipeline spec might have the following structure:
"input": {
  "union": [
    {
      "pfs": {
        "glob": "/*",
        "repo": "A"
      }
    },
    {
      "pfs": {
        "glob": "/*",
        "repo": "B"
      }
    }
  ]
}
In this example, each repository has three files in its root directory, so the /* glob yields three datums from each input. Therefore, the union of A and B has six datums in total.
Your pipeline processes the following datums without any specific order:
/pfs/A/1.txt
/pfs/A/2.txt
/pfs/A/3.txt
/pfs/B/4.txt
/pfs/B/5.txt
/pfs/B/6.txt
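The datum count above can be reproduced with a small Python sketch. It only models the example in memory: REPO_FILES and union_datums are hypothetical names introduced here for illustration, not part of Pachyderm, and the listing order is arbitrary because the pipeline does not process datums in any specific order.

```python
# Hypothetical stand-ins for the files that the "/*" glob matches in each repo.
REPO_FILES = {
    "A": ["1.txt", "2.txt", "3.txt"],
    "B": ["4.txt", "5.txt", "6.txt"],
}


def union_datums(repo_files):
    """Model a union: one datum per matched file, taken from every input."""
    return [
        f"/pfs/{repo}/{name}"
        for repo, names in repo_files.items()
        for name in names
    ]


datums = union_datums(REPO_FILES)
for path in datums:
    print(path)

# The union size is the sum of the per-repo datum counts: 3 + 3.
print(len(datums), "datums")
```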
Each datum in a pipeline is processed independently by a single execution of your code. In this example, your code runs six times, and each datum is available to it one at a time. For example, your code processes /pfs/A/1.txt in one of the runs and /pfs/B/5.txt in a different run, and so on. In a union, two or more datums are never available to your code at the same time. You can simplify your union code by using the name property as described below.
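As a rough illustration of what this means for your code when the inputs keep their default directories, the Python sketch below looks in both /pfs/A and /pfs/B and uses whichever one exists in the current run. The function name find_input_files and its parameters are assumptions made for this example; the pfs_root argument exists only so that the function can be tried outside a pipeline.

```python
import os


def find_input_files(pfs_root="/pfs", repos=("A", "B")):
    """Return the files of whichever input directory is present in this run.

    In a union, only one of the input directories holds the current datum,
    so the code has to probe each candidate directory.
    """
    found = []
    for repo in repos:
        repo_dir = os.path.join(pfs_root, repo)
        if os.path.isdir(repo_dir):
            for name in sorted(os.listdir(repo_dir)):
                found.append(os.path.join(repo_dir, name))
    return found
```

Inside a pipeline, calling find_input_files() would be expected to return a single path such as /pfs/A/1.txt or /pfs/B/5.txt, depending on which datum the run received.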
Simplifying the Union Pipelines Code
In the example above, your code needs to read from either the /pfs/A or the /pfs/B directory because only one of them is present in any given datum. To simplify your code, you can add the name field to the pfs object and give the same name to each of the input repos. For example, you can add the name field with the value C to the input repositories A and B:
"input": {
  "union": [
    {
      "pfs": {
        "name": "C",
        "glob": "/*",
        "repo": "A"
      }
    },
    {
      "pfs": {
        "name": "C",
        "glob": "/*",
        "repo": "B"
      }
    }
  ]
}
Then, in the pipeline, every datum appears under the same directory, /pfs/C, still one datum per run:
/pfs/C/1.txt # from A
/pfs/C/2.txt # from A
/pfs/C/3.txt # from A
/pfs/C/4.txt # from B
/pfs/C/5.txt # from B
/pfs/C/6.txt # from B
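With the shared name in place, the user code no longer has to probe several directories. The Python sketch below reads whatever file the current run finds under /pfs/C. The line-counting task and the name count_lines are hypothetical, chosen only to give the function something to do.

```python
import os


def count_lines(input_dir="/pfs/C"):
    """Count lines in each file of the single shared input directory."""
    counts = {}
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                counts[name] = sum(1 for _ in handle)
    return counts
```

The same function works for datums that originate from A and from B, because both are exposed under the one name.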
Cross Input
In a cross input, Pachyderm builds every combination of datums (the cross product) across your input repositories and exposes one combination to your code in each run.
In other words, a cross input pairs every datum in one repository with each datum in another, creating sets of datums. Your transformation code is given one of these sets at a time to process.
For example, you have repositories A and B with three datums each, and the glob pattern for both inputs is set to /*.
Repository A has three files at the top level:
A
├── 1.txt
├── 2.txt
└── 3.txt
Repository B has three files at the top level:
B
├── 4.txt
├── 5.txt
└── 6.txt
Because you have three datums in each repo, Pachyderm exposes a total of nine combinations of datums to your code.
In cross pipelines, both the /pfs/A and /pfs/B directories are visible during each code run:
Run 1: /pfs/A/1.txt
       /pfs/B/4.txt
Run 2: /pfs/A/1.txt
       /pfs/B/5.txt
...
Run 9: /pfs/A/3.txt
       /pfs/B/6.txt
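The nine runs listed above correspond to the Cartesian product of the two datum sets. The Python sketch below enumerates that product with itertools.product from the standard library. It is only a model of the pairing: the run numbering follows the listing above and is not a guarantee about the order in which Pachyderm schedules datums.

```python
from itertools import product

# Datums matched by "/*" in each repo, as in the example above.
a_datums = ["/pfs/A/1.txt", "/pfs/A/2.txt", "/pfs/A/3.txt"]
b_datums = ["/pfs/B/4.txt", "/pfs/B/5.txt", "/pfs/B/6.txt"]

# Every datum of A paired with every datum of B: 3 * 3 combinations.
runs = list(product(a_datums, b_datums))

for number, (a_path, b_path) in enumerate(runs, start=1):
    print(f"Run {number}: {a_path} {b_path}")
```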
In cross inputs, if you use the name field, your two inputs cannot have the same name. Because both inputs are mounted in the same run, identical names could cause file system collisions.
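One way to catch this before creating a pipeline is a small check over the input section, sketched below in Python. The helper duplicate_cross_names is hypothetical and not a Pachyderm API. It also assumes that an input without a name field is exposed under its repo name, as in the /pfs/A and /pfs/B examples above.

```python
from collections import Counter


def duplicate_cross_names(input_spec):
    """Return the names that more than one input of a cross would share."""
    names = [
        # Assumption: without "name", the input directory is named after the repo.
        entry["pfs"].get("name", entry["pfs"]["repo"])
        for entry in input_spec.get("cross", [])
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


distinct = {
    "cross": [
        {"pfs": {"glob": "/*", "repo": "A"}},
        {"pfs": {"glob": "/*", "repo": "B"}},
    ]
}
colliding = {
    "cross": [
        {"pfs": {"name": "C", "glob": "/*", "repo": "A"}},
        {"pfs": {"name": "C", "glob": "/*", "repo": "B"}},
    ]
}

print(duplicate_cross_names(distinct))
print(duplicate_cross_names(colliding))
```

The first call reports no shared names, while the second reports C, which is the configuration to avoid in a cross.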