Installation

This page describes the simplest way to install Data Mechanics on a new or existing GKE cluster. To set up an ingress or TLS, to let Data Mechanics handle authentication, or to review all installation options, see the Deployment section.

Check your gear

Before proceeding, make sure you have the following items:

  • Data Mechanics Helm chart data-mechanics-<version>.tgz
  • your Data Mechanics customer key (e.g. 0f2ccf041ba662cdfdf829a837fbb5e5fb56d15ed47f)
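
As a quick sanity check, you can verify that the chart archive is in your working directory before moving on:

ls data-mechanics-*.tgz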

Create a GKE cluster

Skip this section if you already have a cluster on which to deploy Data Mechanics.

The command below creates a GKE cluster using the Google Cloud Platform CLI gcloud. This command assumes that:

  • a default region or zone has been set in your GCP project. For more information, see Changing the Default Region or Zone.
  • the Kubernetes Engine API is enabled in your GCP project. If it is not, the command fails with an error message containing the URL to enable the API (you can also enable it up front, as shown after the parameter notes below).

gcloud container clusters create <your-cluster-name> \
--enable-autoscaling \
--min-nodes 0 --max-nodes 50 \
--machine-type n1-standard-4 \
--scopes default,storage-full

Some explanations about the parameters:

  • autoscaling is enabled on the cluster in order to reduce the cost of your Data Mechanics deployment. The number of nodes is constrained to the range 0 to 50. Spark automatically triggers the creation of additional nodes when there is no more room on the cluster, until max-nodes is reached.
  • we recommend using instances with at least 4 cores (n1-standard-4 here). Performance is best when every Spark executor (a Spark executor = a Kubernetes pod) has at least 4 cores available.
  • as your Spark applications will most likely read and write data from GCS, the scope storage-full is added to the instances of the cluster. Note that this does not mean that data can be accessed without additional setup: a service account with sufficient permissions must be used by the Spark pods (see Accessing data).
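
If the Kubernetes Engine API is not enabled yet, or if no default zone is set, you can take care of both up front with gcloud (replace <your-zone> with the zone you want to use):

gcloud config set compute/zone <your-zone>
gcloud services enable container.googleapis.com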

Deploy Data Mechanics

Environment setup

Please make sure that the Kubernetes CLI kubectl and the Helm CLI helm are installed and available in your shell.

Make sure that kubectl points to the correct Kubernetes cluster by running:

kubectl config current-context

The output should look something like gke_<project-name>_<region/zone>_<your-cluster-name>. Otherwise, you can let gcloud configure kubectl with:

gcloud container clusters get-credentials <your-cluster-name>
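
If gcloud has no default zone or project configured, you can pass them explicitly; the placeholders below are yours to fill in:

gcloud container clusters get-credentials <your-cluster-name> \
--zone <your-zone> \
--project <your-project>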

Namespaces

Data Mechanics requires two namespaces: one where the Data Mechanics services reside, data-mechanics by default, and one where Spark applications run, spark-apps by default.

The names can be adjusted to your needs; see the Helm chart options.
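
To see which values the chart exposes, including the namespace settings, you can inspect its defaults:

helm inspect values data-mechanics-<version>.tgz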

Let's create them:

kubectl create namespace data-mechanics
kubectl create namespace spark-apps
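
You can confirm that both namespaces exist:

kubectl get namespace data-mechanics spark-apps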

Install the Helm chart

Data Mechanics comes packaged as a Helm chart.

helm install data-mechanics-<version>.tgz \
--name data-mechanics \
--namespace data-mechanics \
--set customerKey=<your-customer-key>

The command above installs the Data Mechanics chart in namespace data-mechanics with release name data-mechanics.
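
You can check the state of the release at any time with Helm:

helm status data-mechanics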

You can verify that the different services are running by listing the pods in namespace data-mechanics:

kubectl get pods --namespace data-mechanics

It should output something like this:

NAME READY STATUS
data-mechanics-sparkoperator-8685c4c5f5-hvtq4 1/1 Running
submission-service-6f8b9d964f-bngmg 1/1 Running
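
If a pod stays in a non-Running status, the standard kubectl troubleshooting commands apply; substitute the pod name from your own output:

kubectl describe pod <pod-name> --namespace data-mechanics
kubectl logs <pod-name> --namespace data-mechanics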

The next step is to run your first Spark application on the cluster.