This page describes the simplest way to install Data Mechanics on a new or existing GKE cluster. To set up an ingress or TLS, let Data Mechanics handle authentication, or review all installation options, see the Deployment section.
Check your gear
Before proceeding, make sure you have the following items:
- the Data Mechanics Helm chart
- your Data Mechanics customer key
Create a GKE cluster
Skip this section if you already have a cluster on which to deploy Data Mechanics.
The command below creates a GKE cluster using the Google Cloud Platform CLI.
This command assumes that:
- a default region or zone has been set in your GCP project; for more information, see Changing the Default Region or Zone.
- the Kubernetes Engine API is enabled in your GCP project. If it is not, the command fails with an error message containing the URL where you can activate the API.
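If needed, both prerequisites can be handled from the CLI. The zone below (`us-central1-a`) is only an example; pick the region or zone that suits your project:

```shell
# Example only: set a default zone for the current project
# (replace us-central1-a with the zone closest to your data).
gcloud config set compute/zone us-central1-a

# Enable the Kubernetes Engine API for the current project.
gcloud services enable container.googleapis.com
```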
```shell
gcloud container clusters create <your-cluster-name> \
    --enable-autoscaling \
    --min-nodes 0 --max-nodes 50 \
    --machine-type n1-standard-4 \
    --scopes default,storage-full
```
Some explanations about the parameters:
- autoscaling is enabled on the cluster to reduce the cost of your Data Mechanics deployment. The number of nodes is constrained to the range 0 to 50. Spark automatically triggers the creation of additional nodes when there is no more room on the cluster, until the maximum of 50 nodes is reached.
- we recommend instances with at least 4 cores (`n1-standard-4` here). Performance is best when every Spark executor (one Spark executor = one Kubernetes pod) has at least 4 cores available.
- as your Spark applications will most likely read and write data from GCS, the scope `storage-full` is added to the instances of the cluster. Note that this does not mean that data can be accessed without additional setup: a service account with sufficient permissions must be used by the Spark pods (see Accessing data).
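As an illustration, one common setup (a sketch, not the only option) is to store a service account key in a Kubernetes secret and point the GCS connector at it through Spark's Hadoop configuration. The secret name (`gcs-sa-key`), key-file name (`key.json`), and mount path below are hypothetical, and the exact connector property names depend on your GCS connector version:

```shell
# Hypothetical example: make a service account key available to Spark pods.
# The secret name (gcs-sa-key) and file name (key.json) are placeholders.
kubectl create secret generic gcs-sa-key \
    --namespace spark-apps \
    --from-file=key.json=/path/to/key.json

# Sketch of spark-submit flags: mount the secret into driver and executor pods,
# then tell the GCS connector to authenticate with the key file.
spark-submit \
    --conf spark.kubernetes.driver.secrets.gcs-sa-key=/mnt/secrets \
    --conf spark.kubernetes.executor.secrets.gcs-sa-key=/mnt/secrets \
    --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
    --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/mnt/secrets/key.json \
    ...
```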
Deploy Data Mechanics
Make sure that `kubectl` points to the correct Kubernetes cluster by running:
```shell
kubectl config current-context
```
The output should look something like `gke_<your-project>_<your-zone>_<your-cluster-name>`. Otherwise, you can let `gcloud` configure `kubectl` to point to your cluster by running:

```shell
gcloud container clusters get-credentials <your-cluster-name>
```
Data Mechanics requires two namespaces: one where the Data Mechanics services reside, `data-mechanics` by default, and one where Spark applications run, `spark-apps` by default. The names can be adjusted to your needs; see the Helm chart options. Let's create them:
```shell
kubectl create namespace data-mechanics
kubectl create namespace spark-apps
```
Install the Helm chart
Data Mechanics comes packaged as a Helm chart.
The command below uses Helm 2 syntax (`--name` flag); with Helm 3, pass the release name as the first positional argument instead.

- Helm 2.x:

```shell
helm install data-mechanics.tgz \
    --name data-mechanics \
    --namespace data-mechanics \
    --set customerKey=<your-customer-key>
```

- Helm 3.x:

```shell
helm install data-mechanics data-mechanics.tgz \
    --namespace data-mechanics \
    --set customerKey=<your-customer-key>
```
The command above installs the Data Mechanics chart in namespace `data-mechanics` with release name `data-mechanics`.
You can verify that the different services are running by listing the pods in namespace `data-mechanics`:

```shell
kubectl get pods --namespace data-mechanics
```
It should output something like this:
```
NAME                                            READY   STATUS
data-mechanics-sparkoperator-8685c4c5f5-hvtq4   1/1     Running
submission-service-6f8b9d964f-bngmg             1/1     Running
```
The next step is to run a first Spark application on your cluster.