Deploy Pachyderm via Proxy (One Port)

Learn how to deploy Pachyderm using an embedded proxy, exposing only one external port.

March 23, 2023

Pachyderm now ships with an optional embedded proxy that lets it expose a single external port, whether you access pachd over gRPC with pachctl or Console over HTTP.

See Pachyderm's new high-level architecture diagram:

High level architecture

This page supplements the existing installation instructions for deployments that use the embedded proxy. The steps below replace all or part of the existing installation documentation; each section states when to use them and which section they replace.

ℹ️
  • When the proxy option is activated, Pachyderm is reachable through one TCP port for all incoming gRPC (gRPCs), Console (HTTP/HTTPS), S3 gateway, OIDC, and Dex traffic, and routes each call to the appropriate backend microservice without any additional configuration.

  • Enable the proxy as follows:

    proxy:
      enabled: true
      service:
        type: LoadBalancer
⚠️

Deploying Pachyderm with a proxy is currently optional and will become permanent in the next minor release of Pachyderm.

The diagram below gives a quick overview of the layout of services and pods when using a proxy. In particular, it details how Pachyderm listens to all inbound traffic on one port, then routes each call to the appropriate backend:

Infrastructure Recommendation

ℹ️

See our reference values.yaml for all available configurable fields of the proxy.

Before any deployment in production, we recommend reading the following section to set up your production infrastructure.

Alternatively, you can skip those infrastructure prerequisites and make a quick cloud installation or jump to our local deployment section for a first encounter with Pachyderm.

Pachyderm General Infrastructure Recommendations #

For production deployments, we recommend that you:

ℹ️
  • You can optionally attach additional load balancer configuration to the metadata of your service by adding the appropriate annotations under proxy.service in your values.yaml.
  • You can pre-create a static IP (for example, in GCP: gcloud compute addresses create ADDRESS_NAME --global --ip-version IPV4), then pass this external IP to loadBalancerIP under proxy.service in your values.yaml.
proxy:
  enabled: true
  service:
    type: LoadBalancer
    annotations: {<add-optional-annotations-here>}
    loadBalancerIP: <insert-your-proxy-external-IP-address-here>
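
As a minimal sketch of the static IP step, the helper below prints the proxy section of a values.yaml for a given address. The function name `render_proxy_values`, the file name `proxy-values.yaml`, and the sample address are hypothetical and used only for illustration; annotations are left out.

```shell
# Hypothetical helper: print the proxy section of values.yaml
# for a pre-created static IP passed as the first argument.
render_proxy_values() {
  lb_ip="$1"
  cat <<EOF
proxy:
  enabled: true
  service:
    type: LoadBalancer
    loadBalancerIP: ${lb_ip}
EOF
}

# Example (left commented out; 203.0.113.10 is a documentation-only address):
#   render_proxy_values "203.0.113.10" > proxy-values.yaml
```

The generated file could then be passed to Helm with its standard `-f` flag alongside the rest of your values.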

Deploy Pachyderm in Production With a Proxy #

Once your networking infrastructure is set up, go to the deployment page that matches your cloud provider and follow the installation steps in sections 1 through 6. Make sure that you have enabled the proxy by adding the following lines to your values.yaml:

proxy:
  enabled: true
  service:
    type: LoadBalancer
    annotations: {see examples below}

Once your cluster is provisioned and Pachyderm is installed, replace the instructions in section 7 (Have ‘pachctl’ And Your Cluster Communicate) with the instructions below.

⚠️

If you plan to deploy Console in Production, read the following and adjust your values.yaml accordingly.

Deploying Pachyderm with a proxy simplifies the setup of Console: a dedicated DNS name and ingress in front of Console are no longer needed. In a production environment, you will need to:

  • Activate Authentication. If you are a Helm user, setting your License Key in your values.yaml activates Authentication by default, so this step applies only to users who activate auth with pachctl.
  • Update the values in the highlighted fields below.
  • Configure your Identity Provider (oidc.upstreamIDPs). See the examples for the oidc.upstreamIDPs value in the Helm chart values specification, and read our IDP Configuration page for a better understanding of each field.

deployTarget: "<pick-your-cloud-provider>"

# enable the proxy
proxy:
  enabled: true
  service:
    type: LoadBalancer
    annotations: {...}

ingress:
  host: <insert-external-ip-address-or-dns-name>

pachd:
  storage:
    amazon:
      bucket: "<bucket-name>"
      ...
      region: "<us-east-2>"
  # pachyderm enterprise key
  enterpriseLicenseKey: "<your-enterprise-token>"

oidc:
  # populate oidc.upstreamIDPs with an array of Dex Connector configurations.
  upstreamIDPs: []

To connect your pachctl client to your cluster #

The grpc address provided when pointing your pachctl CLI at your cluster changes now that a proxy allows a single entry point. Run the following commands:

  1. Retrieve the external IP address of your TCP load balancer (or use your domain name):

    kubectl get services | grep pachyderm-proxy | awk '{print $4}'
  2. Update the context of your cluster using the external IP address or domain name captured above:

    echo '{"pachd_address": "grpc://<external-IP-address-or-domain-name>:80"}' | pachctl config set context "<your-cluster-context-name>" --overwrite
    pachctl config set active-context "<your-cluster-context-name>"
  3. Check that you are using the right context:

    pachctl config get active-context

Your cluster context name should show up. Your pachctl client now points to your cluster.
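
The steps above can be sketched as two small shell helpers. This is an illustration under stated assumptions, not an official script: the function names `proxy_external_ip` and `pachd_context_json` and the context name `my-cluster` are hypothetical, and the port defaults to 80 as in the command above.

```shell
# Read `kubectl get services` output on stdin and print the EXTERNAL-IP
# column (4th field) of the pachyderm-proxy service.
proxy_external_ip() {
  awk '$1 == "pachyderm-proxy" { print $4 }'
}

# Build the JSON that `pachctl config set context` reads on stdin.
# $1: external IP or domain name; $2: port (defaults to 80).
pachd_context_json() {
  printf '{"pachd_address": "grpc://%s:%s"}\n' "$1" "${2:-80}"
}

# Example wiring (requires a live cluster, so it is left commented out):
#   ip="$(kubectl get services | proxy_external_ip)"
#   pachd_context_json "$ip" | pachctl config set context "my-cluster" --overwrite
#   pachctl config set active-context "my-cluster"
```

Matching on the exact service name in the first column avoids picking up other services whose names merely contain "pachyderm-proxy".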

If you have deployed Console #

Point your browser to http://<external-IP-address-or-domain-name>. No port number is needed. You will be prompted to log in to your Console.

Quick Cloud Deployment With a Proxy #

Follow your regular Quick Cloud Deploy documentation, except for the following steps:

AWS #

Deploy Pachyderm without Console #

deployTarget: "AMAZON"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    amazon:
      bucket: "bucket_name"      
      # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html (AWS Credentials)
      id: "AKIAIOSFODNN7EXAMPLE"                
      # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html  (AWS Credentials)          
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-2"          

Deploy Pachyderm with Console and Enterprise #

deployTarget: "AMAZON"

proxy:
  enabled: true
  service:
    type: LoadBalancer

ingress:
  host: <insert-external-ip-address-or-dns-name>

pachd:
  storage:
    amazon:
      bucket: "<bucket-name>"                
      # this is an example access key ID taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html (AWS Credentials)
      id: "AKIAIOSFODNN7EXAMPLE"                
      # this is an example secret access key taken from https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html  (AWS Credentials)          
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "<us-east-2>"
  # pachyderm enterprise key 
  enterpriseLicenseKey: "<your-enterprise-token>"
  localhostIssuer: "true"

Google #

Deploy Pachyderm without Console #

deployTarget: "GOOGLE"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    google:
      bucket: "<bucket-name>"
      cred: |
                INSERT JSON CONTENT HERE
  externalService:
    enabled: true

Deploy Pachyderm with Console and Enterprise #

deployTarget: "GOOGLE"

proxy:
  enabled: true
  service:
    type: LoadBalancer

ingress:
  host: <insert-external-ip-address-or-dns-name>

pachd:
  storage:
    google:
      bucket: "<bucket-name>"
      cred: |
                INSERT JSON CONTENT HERE
  # pachyderm enterprise key
  enterpriseLicenseKey: "<your-enterprise-token>"
  localhostIssuer: "true"

Azure #

Deploy Pachyderm without Console #

deployTarget: "MICROSOFT"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    microsoft:
      # storage container name
      container: "<your-container-name>"
      # storage account name
      id: "AKIAIOSFODNN7EXAMPLE"
      # storage account key
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

Deploy Pachyderm with Console and Enterprise #

deployTarget: "MICROSOFT"

proxy:
  enabled: true
  service:
    type: LoadBalancer

ingress:
  host: <insert-external-ip-address-or-dns-name>


pachd:
  storage:
    microsoft:
      # storage container name
      container: "<your-container-name>"
      # storage account name
      id: "AKIAIOSFODNN7EXAMPLE"
      # storage account key
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  # pachyderm enterprise key
  enterpriseLicenseKey: "<your-enterprise-token>"
  localhostIssuer: "true"

Deploy Pachyderm Locally With a Proxy #

This section is an alternative to the default local deployment instructions. It uses a variant of the original one-line command that enables the proxy.

Follow the Prerequisites before deploying Pachyderm (with or without Console) on your local cluster, then Connect ‘pachctl’ To Your Cluster.

JupyterLab users can also install the Pachyderm JupyterLab Mount Extension on their local Pachyderm cluster to experience Pachyderm from their familiar notebooks.

Note that you can run both Console and JupyterLab on your local installation.

Prerequisites #

Then start your Kubernetes environment.

Minikube (OS X / Windows) #

minikube start

Later, we will use minikube tunnel to make the proxy available on localhost.

Check Minikube’s documentation for details.

Kind (Linux) #

cat <<EOF | kind create cluster --name=kind --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 30080
        hostPort: 80
        protocol: TCP
      - containerPort: 30443
        hostPort: 443
        protocol: TCP
EOF

The extraPortMappings will make NodePorts in the cluster available on localhost; NodePort 30080 becomes localhost:80. This will make Pachyderm available at localhost:80 as long as this kind cluster is running.

Check Kind’s documentation for details.

Deploy Pachyderm Community Edition Or Enterprise #

⚠️

Attention Kind users

Set your Service type to NodePort rather than LoadBalancer in the commands below.

--set proxy.service.type=NodePort
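
One way to avoid picking the wrong Service type by hand is to derive it from the current kubectl context. The sketch below assumes that Kind names its contexts `kind-<cluster-name>`; the function name `proxy_service_type` is hypothetical.

```shell
# Pick the proxy Service type from a kubectl context name.
# Assumption: Kind contexts are named "kind-<cluster-name>";
# every other context (for example "minikube") gets LoadBalancer.
proxy_service_type() {
  case "$1" in
    kind-*) echo "NodePort" ;;
    *)      echo "LoadBalancer" ;;
  esac
}

# Example (requires kubectl and helm, so it is left commented out):
#   svc_type="$(proxy_service_type "$(kubectl config current-context)")"
#   helm install pachyderm pach/pachyderm \
#     --set deployTarget=LOCAL \
#     --set proxy.enabled=true \
#     --set proxy.service.type="$svc_type"
```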

Community Edition With Console #

helm install pachyderm pach/pachyderm --set deployTarget=LOCAL --set proxy.enabled=true --set proxy.service.type=LoadBalancer 

Enterprise With Console #

This command will unlock your enterprise features and install Console Enterprise. Note that Console Enterprise requires authentication. By default, we create a mock user (username: admin, password: password) so that you can authenticate to Console without connecting your Identity Provider.

Check the status of the Pachyderm pods by periodically running kubectl get pods. When Pachyderm is ready for use, all Pachyderm pods must be in the Running status.

kubectl get pods

System Response: At a very minimum, you should see the following pods (console depends on your choice above):

NAME                                  READY   STATUS    RESTARTS   AGE
pod/console-55bc9f679-w4xrk           1/1     Running   0          71m
pod/etcd-0                            1/1     Running   0          70m
pod/pachd-84487d6675-cf68x            1/1     Running   0          71m
pod/pachyderm-proxy-89d5c4f65-pst9l   1/1     Running   0          71m
pod/pg-bouncer-5dd558c8dc-zjlpj       1/1     Running   0          71m
pod/postgres-0                        1/1     Running   0          70m
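
To avoid scanning the table by eye, a small filter can list only the pods that are not yet Running. This is a sketch that assumes the default column layout shown above (STATUS in the third column); the function name `pods_not_running` is hypothetical.

```shell
# Read `kubectl get pods` output on stdin and print the name of every pod
# whose STATUS column (3rd field) is not "Running".
# Prints nothing when all pods are Running. The header line is skipped.
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example (requires a live cluster, so it is left commented out):
#   kubectl get pods | pods_not_running
```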

Connect ‘pachctl’ To Your Cluster #

Assuming your pachd is running as shown above, you can now connect pachctl to your local cluster.

⚠️

Minikube users

Open a new tab in your terminal and run minikube tunnel. The command creates a network route on your host to the pachyderm-proxy service deployed with type LoadBalancer and sets its ingress to its ClusterIP (here, 127.0.0.1). You will be prompted to enter your password.

Changes to the S3 Gateway #

The pachyderm-proxy service also routes traffic to Pachyderm’s S3 gateway (which lets you access Pachyderm’s repos through the S3 protocol) on port 80 (note the endpoint in the diagram below).

Global S3 Gateway with Proxy
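
As a rough illustration of what the single endpoint means for S3 clients, the helper below builds the gateway URL from the proxy host. It assumes plain HTTP (no TLS) on port 80 as described above; the function name `s3_gateway_endpoint` and the sample address are hypothetical, and the credentials your cluster accepts are not covered here.

```shell
# Build the S3 gateway endpoint URL for a proxy host.
# $1: external IP or domain name of the proxy; $2: port (defaults to 80).
# Assumption: the gateway is served over plain HTTP.
s3_gateway_endpoint() {
  printf 'http://%s:%s\n' "$1" "${2:-80}"
}

# Example with the AWS CLI (assumed installed and configured; commented out):
#   aws s3 ls --endpoint-url "$(s3_gateway_endpoint 203.0.113.10)"
```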

Changes to the Enterprise Server Setup #

Your enterprise server is deployed in the same way as any regular cluster with a few differences (no object-store and two PostgreSQL databases required: dex and pachyderm). The same applies when deploying an enterprise server with a proxy.

Note that the enterprise server will be deployed behind its proxy, as will each cluster registered to this enterprise server.

⚠️

Enabling an embedded enterprise server with your pachd as part of the same helm installation will not work with the proxy.

You can use a standalone enterprise server instead.

Follow your regular enterprise server deployment and configuration instructions, except for the following steps: