Submitting an app

This page describes how to run a first Spark application programmatically through the Data Mechanics API.

The Data Mechanics gateway lives in your Kubernetes cluster and exposes a URL that was shared with you during the initial setup of the platform. Let's call it https://<your-cluster-url>/.

The Data Mechanics API is exposed at https://<your-cluster-url>/api/. You can run, configure, and monitor applications using the different endpoints available.

To know more about the API routes and parameters, check out the API reference or navigate to https://<your-cluster-url>/api/ in your browser.
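Since the API reference is reachable in a browser, you can also hit the same root from the command line as a quick sanity check. This is a minimal sketch, assuming the API root answers plain GET requests outside the browser as well:

# Quick connectivity check (response format may vary; see the API reference)
curl -H 'X-API-Key: <your-user-key>' https://<your-cluster-url>/api/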

For now, let's focus on running a first application!

The command below runs the Monte-Carlo Pi computation that ships with every Spark distribution:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d '{
    "jobName": "sparkpi2",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview.jar",
      "mainClass": "org.apache.spark.examples.SparkPi",
      "arguments": ["10000"],
      "executor": {
        "cores": 2
      }
    }
  }'

Here's a breakdown of the payload:

  • We assign the jobName "sparkpi2" to the application. Applications with the same jobName are grouped in the Data Mechanics dashboard. A job is typically a scheduled application that runs every day or every hour. In the dashboard, the Jobs view lets you track performance of jobs over time.
  • Default configurations are overridden in configOverrides:
    • This is a Scala application running Spark 3.0.0.
    • The command to run is specified by mainApplicationFile, mainClass, and arguments.
    • We override the default configuration and request 2 cores per executor.

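For longer configurations, it can be more convenient to keep the payload in a JSON file and let curl read it with -d @file. Here's a minimal sketch of the same submission; the file name is just an example:

# sparkpi.json contains the JSON payload shown above
curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d @sparkpi.json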
The API then returns something like:

{
  "trackingUuid": "73d64b1a-fef4-4828-ab3d-7adfc16c6b6a",
  "appName": "sparkpi2-20200624-133913-r7739",
  "userName": "julien@datamechanics.co",
  "jobName": "sparkpi2",
  "interactive": false,
  "resourceVersion": 60163238,
  "config": {
    "type": "Scala",
    "sparkVersion": "3.0.0",
    "image": "gcr.io/dm-docker/spark-gcs:3.0.0",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview.jar",
    "arguments": ["10000"],
    "sparkConf": {
      "spark.kubernetes.allocation.batch.size": "10",
      "spark.dynamicAllocation.enabled": "true",
      "spark.dynamicAllocation.shuffleTracking.enabled": "true",
      "spark.dynamicAllocation.maxExecutors": "100",
      "spark.dynamicAllocation.executorAllocationRatio": "0.33",
      "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "30"
    },
    "driver": {
      "cores": 1,
      "memory": "1g",
      "envVars": {
        "KUBERNETES_REQUEST_TIMEOUT": "30000",
        "KUBERNETES_CONNECTION_TIMEOUT": "30000"
      }
    },
    "executor": {
      "instances": 1,
      "cores": 2,
      "memory": "3g"
    },
    "restartPolicy": {
      "type": "Never"
    }
  }
}

Note that some additional configurations are automatically set by Data Mechanics.

In particular, the appName is a unique identifier of this Spark application on your cluster. Here it has been generated automatically from the jobName, but you can set it yourself in the payload of your request to launch an app.
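
For instance, you could pin the appName explicitly. This is a sketch only: the value is an illustration, and placing appName at the top level of the payload next to jobName is assumed from its position in the response above:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d '{
    "jobName": "sparkpi2",
    "appName": "sparkpi2-manual-run",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview.jar",
      "mainClass": "org.apache.spark.examples.SparkPi",
      "arguments": ["10000"],
      "executor": { "cores": 2 }
    }
  }'

Keep in mind that, since the appName identifies the application uniquely on the cluster, you would need a fresh value for each run.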

Besides the appName, Data Mechanics has also set some defaults to increase the stability and performance of the app. Learn more about configuration management and auto-tuning.

The running application should automatically appear in the dashboard at https://<your-cluster-url>/dashboard/:

An app running

Clicking on the application opens the application page. At this point, you can open the Spark UI, follow the live log stream, or kill the app.

Application page

This example uses a JAR embedded in the Spark Docker image and neither reads nor writes data. For a more realistic use case, learn how to Access your own data.