Accessing data

This page shows how to run your own code and access data hosted in your cloud account. It assumes that you know how to run a Spark application on Data Mechanics.

Specify data in your arguments

On Google Cloud Platform, suppose that:

  • you want to run a word count Scala application hosted at gs://<your-bucket>/wordcount.jar
  • it reads input files from gs://<your-bucket>/input/*
  • it writes its results to gs://<your-bucket>/output
  • its main class is org.<your-org>.wordcount.WordCount, which takes two arguments: <input> and <output>

Here is the request you would submit:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "gs://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["gs://<your-bucket>/input/*", "gs://<your-bucket>/output"]
    }
  }'

The command above fails because the Spark pods do not have sufficient permissions to access the code and the data:

Caused by: com.google....json.GoogleJsonResponseException: 403 Forbidden
{
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "<service-account-name>@developer.gserviceaccount.com does not have storage.objects.get access to <your-bucket>/wordcount.jar.",
"reason" : "forbidden"
} ],
"message" : "<service-account-name>@developer.gserviceaccount.com does not have storage.objects.get access to <your-bucket>/wordcount.jar."
}

We'll now see two ways to grant access to the Spark pods.

Granting permissions to node instances

Spark pods running in Kubernetes inherit the permissions of the nodes they run on. One solution is therefore to grant access to the underlying nodes.

Find the service account used by GCE instances running as Kubernetes nodes. Depending on your setup, it can be the default Compute Engine service account, of the form

PROJECT_NUMBER-compute@developer.gserviceaccount.com

or a service account that you created yourself, of the form

SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com

The error log in the previous section shows the service account currently used by the Spark pods, which is the GCE instances' service account.
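On GKE, the node service account can also be read from the cluster description. The sketch below assumes a GKE cluster whose node configuration reports either a service account email or the literal value default for the Compute Engine default account. resolve_node_sa is a hypothetical helper, and the cluster, zone, and project names in the usage comments are placeholders.

```shell
# Hypothetical helper: turn the value reported by GKE into a service account email.
#   $1: value of nodeConfig.serviceAccount ("default" or an email)
#   $2: numeric project number
resolve_node_sa() {
  if [ -z "$1" ] || [ "$1" = "default" ]; then
    # Default Compute Engine service account
    echo "$2-compute@developer.gserviceaccount.com"
  else
    echo "$1"
  fi
}

# Usage sketch (placeholders; requires gcloud and access to the project):
#   SA=$(gcloud container clusters describe my-cluster --zone us-central1-a \
#          --format="value(nodeConfig.serviceAccount)")
#   NUM=$(gcloud projects describe my-project --format="value(projectNumber)")
#   resolve_node_sa "$SA" "$NUM"
```

Clusters with several node pools may use a different service account per pool, so check the pool that runs the Spark pods.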

Once you have found the service account, grant it sufficient permissions using IAM roles. The list of IAM roles for GCS is available in the Google Cloud Storage documentation.
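As an illustration, a role can be granted on a single bucket with gsutil. This is a minimal sketch: the service account email and bucket name are hypothetical placeholders, and the objectAdmin role is an assumption chosen because the job both reads and writes objects; a narrower role may be enough for your case.

```shell
# Hypothetical placeholders; replace with your own values.
NODE_SA="123456789012-compute@developer.gserviceaccount.com"
BUCKET="my-bucket"

# gsutil expects a binding of the form serviceAccount:<email>:<role>.
BINDING="serviceAccount:${NODE_SA}:objectAdmin"

# Printed for review; remove `echo` to apply the binding.
echo gsutil iam ch "$BINDING" "gs://${BUCKET}"
```

Granting the role on the bucket rather than on the whole project keeps the permissions limited to the data the application needs.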

The Spark application above should now work, without modifying the payload.

Granting permissions using Kubernetes secrets

Spark pods can impersonate a user (AWS) or a service account (GCP) by using their credentials. On Azure, a Spark pod can use a storage account access key to access the storage account's containers.

To protect those credentials, we will store them in Kubernetes secrets and configure Spark to mount those secrets into all driver and executor pods.

Create a service account in the GCP console, and grant it sufficient permissions using IAM roles. The list of IAM roles for GCS is available in the Google Cloud Storage documentation.
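The service account can also be created from the command line instead of the console. The sketch below assumes the gcloud CLI is installed and authenticated; the account name, project ID, and display name are hypothetical placeholders.

```shell
# Hypothetical placeholders; replace with your own values.
SA_NAME="spark-data-access"
PROJECT_ID="my-project"

# User-created service accounts get an email of this form.
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
echo "service account email: ${SA_EMAIL}"

# Printed for review; remove `echo` to create the account.
echo gcloud iam service-accounts create "$SA_NAME" \
  --project "$PROJECT_ID" \
  --display-name "Spark data access"
```

The resulting email is the value to pass as --iam-account when creating the access key.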

This bash script will create an access key for the service account, and store it in a secret called data-access in the Kubernetes namespace where Data Mechanics Spark applications are run:

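# Create a key for the service account and write it to a temporary file.
TMP_FILE=$(mktemp)
gcloud iam service-accounts keys create "$TMP_FILE" --iam-account <your-service-account>@<your-project>.iam.gserviceaccount.com
# Store the key file in a secret named data-access in the spark-apps namespace.
kubectl create secret -n spark-apps generic data-access --from-file="$TMP_FILE"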
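It can help to check which key the secret ends up with, because kubectl names the key after the source file unless told otherwise. The sketch below uses a hypothetical temporary path in place of the real mktemp output. The open-source Spark on Kubernetes operator is understood to expect a file named key.json for its GCPServiceAccount secret type; whether the same applies to this platform is an assumption to verify before relying on the last command.

```shell
# Hypothetical temporary path, standing in for the output of mktemp.
TMP_FILE="/tmp/tmp.Ab12Cd34Ef"

# With --from-file=<path>, kubectl uses the file's base name as the secret key.
KEY_NAME=$(basename "$TMP_FILE")
echo "secret key would be: ${KEY_NAME}"

# Printed for review; remove `echo` to list the keys stored in the secret.
echo kubectl describe secret data-access -n spark-apps

# Assumption: if a fixed file name such as key.json is required,
# name the key explicitly with --from-file=<key>=<path>.
echo kubectl create secret -n spark-apps generic data-access \
  --from-file="key.json=${TMP_FILE}"
```

Once the secret is created, the temporary key file is no longer needed and can be deleted from the local machine.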

Modify the payload that launches the Spark application so that it references the secret:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "gs://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["gs://<your-bucket>/input/*", "gs://<your-bucket>/output"],
      "driver": {
        "secrets": [
          {
            "name": "data-access",
            "path": "/mnt/secrets",
            "secretType": "GCPServiceAccount"
          }
        ]
      },
      "executor": {
        "secrets": [
          {
            "name": "data-access",
            "path": "/mnt/secrets",
            "secretType": "GCPServiceAccount"
          }
        ]
      }
    }
  }'