First application

This page describes how to run a Spark application on a cluster installed following the quickstart instructions in the Installation page.

Environment setup

Make sure that kubectl points to the correct Kubernetes cluster by running:

kubectl config current-context

For a GKE cluster, the output should look something like gke_<project-name>_<region/zone>_<your-cluster-name>. If it does not, you can let gcloud configure kubectl with:

gcloud container clusters get-credentials <your-cluster-name>

Open a connection to Data Mechanics service

The quickstart installation procedure does not set up an ingress on your cluster. As a result, the Data Mechanics service only exposes a cluster IP, and you will need port forwarding to access it.

kubectl port-forward -n data-mechanics service/submission-service 5000:80

This command forwards all requests to http://localhost:5000/ to the Data Mechanics submission service.

To avoid port-forwarding and set up an ingress (and TLS) on your cluster, refer to this page.

Navigate to the dashboard

The dashboard is available at http://localhost:5000/dashboard/. It should be empty for now.

[Screenshot: empty dashboard]

If the Data Mechanics service is open to the internet, the dashboard should be protected. Read more about this topic here.

Run a Spark application

The command below runs the Monte-Carlo Pi computation that ships with every Spark distribution.

curl -X POST \
  http://localhost:5000/api/apps/ \
  -H 'Content-Type: application/json' \
  -d '{
    "jobName": "spark-pi",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview.jar",
      "mainClass": "org.apache.spark.examples.SparkPi",
      "arguments": ["10000"],
      "executor": {
        "cores": 2
      }
    }
  }'

Here's a breakdown of the payload:

  • We assign the jobName spark-pi to the application. Applications with the same jobName are grouped in the Data Mechanics dashboard. A job is typically an application that runs every day, every hour, etc.
  • This is a Scala application running Spark 3.0.0.
  • The command to run is specified by mainApplicationFile, mainClass, and arguments.
  • We override the default configuration and request 2 cores per executor.
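If you would rather submit from code than from curl, the same request can be sketched in Python using only the standard library. This is a minimal sketch assuming the port-forward from above is running; the `submit_app` helper name is ours, not part of the Data Mechanics API.

```python
import json
import urllib.request

# Same payload as the curl command above, built as a Python dict.
payload = {
    "jobName": "spark-pi",
    "configOverrides": {
        "type": "Scala",
        "sparkVersion": "3.0.0",
        "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview.jar",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "arguments": ["10000"],
        "executor": {"cores": 2},
    },
}

def submit_app(payload, url="http://localhost:5000/api/apps/"):
    """POST the payload to the submission service (requires the port-forward)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# submit_app(payload)  # returns the created app config, including the generated appName
```

Calling `submit_app(payload)` should return the same kind of JSON document as the curl command.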

The API should return something like:

"appName": "spark-pi-20191208-154504-xh6x5",
"jobName": "spark-pi",
"configTemplateName": "",
"config": {
"type": "Scala",
"sparkVersion": "3.0.0",
"mode": "cluster",
"image": "",
"mainClass": "org.apache.spark.examples.SparkPi",
"mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview.jar",
"arguments": [
"sparkConf": {
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
"driver": {
"serviceAccount": "spark-driver",
"cores": 1,
"memory": "1g"
"executor": {
"instances": 1,
"cores": 2,
"memory": "4g"
"restartPolicy": {
"type": "Never"

Note that some additional configurations are automatically set by Data Mechanics.

To know more about the API routes and parameters, check out the API reference or navigate to http://localhost:5000/api/ in your browser.
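Once an application has been created, you will typically want to fetch it back by its appName. The helper below sketches how such a request could look; the exact route shape is an assumption on our part, so confirm it against the API reference at http://localhost:5000/api/ before relying on it.

```python
import json
import urllib.request

BASE = "http://localhost:5000/api"  # reachable while the port-forward runs

def app_url(app_name):
    # Route shape assumed from the submission endpoint (/api/apps/);
    # verify it in the API reference served at /api/.
    return f"{BASE}/apps/{app_name}"

def get_json(url):
    """GET a JSON document from the service (requires the port-forward)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# Example (not run here): fetch the app created earlier.
# get_json(app_url("spark-pi-20191208-154504-xh6x5"))
```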

The running application should automatically appear in the dashboard:

[Screenshot: an app running]

Clicking on the application opens the application page, which shows the app configuration as a JSON blob and a live log stream.

[Screenshot: application page]

Note how the number of executors increases over time. Data Mechanics enables dynamic allocation by default when running Spark 3.0.0.
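If you prefer a fixed number of executors, it should be possible to turn dynamic allocation off through the sparkConf section of configOverrides, since that is where Data Mechanics sets it. This is an unverified sketch of such a payload, not a documented recipe; the executor instance count here is an arbitrary example value.

```json
{
  "jobName": "spark-pi",
  "configOverrides": {
    "sparkConf": {
      "spark.dynamicAllocation.enabled": "false"
    },
    "executor": {
      "instances": 4,
      "cores": 2
    }
  }
}
```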

This example uses a JAR embedded in the Spark Docker image and neither reads nor writes data. For a more real-world use case, refer to this page.