This page shows how to run your own code and access data hosted in your cloud account. It assumes that you know how to run a Spark application on Data Mechanics.
Specify data in your arguments
On Google Cloud Platform, suppose that:
- you want to run a word count Scala application hosted at
- that reads input files in
- and writes to
- the main class is `org.<your-org>.wordcount.WordCount`, and it takes two arguments: `<input> <output>`
Here is the payload you would submit:
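As a rough sketch, such a payload might look like the following. The JAR location, bucket paths, and job name are placeholders, and the exact field names should be checked against your Data Mechanics API reference:

```json
{
  "jobName": "word-count",
  "configOverrides": {
    "type": "Scala",
    "mainApplicationFile": "gs://<your-bucket>/word-count.jar",
    "mainClass": "org.<your-org>.wordcount.WordCount",
    "arguments": ["gs://<your-bucket>/input/", "gs://<your-bucket>/output/"]
  }
}
```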
The command above fails because the Spark pods do not have sufficient permissions to access the code and the data:
We'll now see two ways to grant access to the Spark pods.
Granting permissions to node instances
Spark pods running in Kubernetes inherit the permissions of the nodes they run on, so one solution is simply to grant access to those underlying nodes.
Find the service account used by GCE instances running as Kubernetes nodes. Depending on your setup, it can be the default Compute Engine service account, of the form
or another service account that you created yourself of the form
The error log in the previous section shows the service account currently used by the Spark pods, which is the GCE instances' service account.
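One way to grant this access is to bind Cloud Storage roles to that service account on the buckets involved. As a sketch, assuming the default Compute Engine service account and placeholder bucket names:

```shell
# Grant the node service account read access to the input bucket
# (service account email and bucket names are placeholders)
gsutil iam ch \
  "serviceAccount:<project-number>-compute@developer.gserviceaccount.com:roles/storage.objectViewer" \
  gs://<your-input-bucket>

# Grant read/write access to the output bucket
gsutil iam ch \
  "serviceAccount:<project-number>-compute@developer.gserviceaccount.com:roles/storage.objectAdmin" \
  gs://<your-output-bucket>
```

Note that this grants access to every pod scheduled on those nodes, not just your Spark application.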
The Spark application above should now work, without modifying the payload.
Granting permissions using Kubernetes secrets
Spark pods can impersonate a user (AWS) or a service account (GCP) by using their credentials. On Azure, a Spark pod can use a storage account access key to access the storage account's containers.
To protect those credentials, we will store them in Kubernetes secrets and configure Spark to mount those secrets into all the driver and executor pods.
This bash script creates an access key for the service account and stores it in a secret called `data-access` in the Kubernetes namespace where Data Mechanics Spark applications run:
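A minimal version of such a script could look like this; the service account email and namespace are placeholders for your own values:

```shell
#!/bin/bash
set -e

# Create a JSON key for the service account (placeholder email)
gcloud iam service-accounts keys create key.json \
  --iam-account="<your-service-account>@<your-project>.iam.gserviceaccount.com"

# Store the key in a Kubernetes secret named data-access,
# in the namespace where Spark applications run (placeholder)
kubectl create secret generic data-access \
  --from-file=key.json \
  --namespace="<your-namespace>"

# Remove the local copy of the key
rm key.json
```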
Modify the payload used to launch the Spark application so that it references the secret:
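For example, the Spark configuration below mounts the `data-access` secret into the driver and executor pods and points the Google credentials environment variable at the mounted key. The `spark.kubernetes.*.secrets.*` properties are standard Spark-on-Kubernetes settings; the `/mnt/secrets` mount path is an arbitrary choice, and the surrounding payload structure is a sketch:

```json
{
  "jobName": "word-count",
  "configOverrides": {
    "sparkConf": {
      "spark.kubernetes.driver.secrets.data-access": "/mnt/secrets",
      "spark.kubernetes.executor.secrets.data-access": "/mnt/secrets",
      "spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS": "/mnt/secrets/key.json",
      "spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS": "/mnt/secrets/key.json"
    }
  }
}
```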