Accessing data

This page shows how to run your own code and access data hosted on GCS. It assumes that you know how to run a Spark application on Data Mechanics.

Environment setup

Make sure that kubectl points to the correct Kubernetes cluster by running:

kubectl config current-context

The output should look something like gke_<project-name>_<region/zone>_<your-cluster-name>. If it does not, let gcloud configure kubectl for you with:

gcloud container clusters get-credentials <your-cluster-name>
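
Depending on your gcloud configuration, you may need to pass the cluster location explicitly, for example:

gcloud container clusters get-credentials <your-cluster-name> --zone <your-zone>

For a regional cluster, use --region <your-region> instead of --zone.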

Specify data in your arguments

Suppose that:

  • you want to run a word count Scala application hosted at gs://<your-bucket>/wordcount.jar,
  • it reads input files from gs://<your-bucket>/input/*,
  • it writes its results to gs://<your-bucket>/output,
  • its main class is org.<your-org>.wordcount.WordCount and takes <input> and <output> as arguments.

Here is the payload you would submit:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "gs://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["gs://<your-bucket>/input/*", "gs://<your-bucket>/output"]
    }
  }'

The command above fails because the Spark pods do not have sufficient permissions to access the code and the data:

Caused by: com.google....json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "<service-account-name>@developer.gserviceaccount.com does not have storage.objects.get access to <your-bucket>/wordcount.jar.",
    "reason" : "forbidden"
  } ],
  "message" : "<service-account-name>@developer.gserviceaccount.com does not have storage.objects.get access to <your-bucket>/wordcount.jar."
}

Give permissions to Spark pods

There are two ways of adding permissions to the Spark pods.

The first one is to extend the permissions of the GCE instances acting as Kubernetes nodes. Though this solution works, it is not recommended from a security perspective, since the permissions of the GCE instances apply to all Spark applications.

A second and more fine-grained option is to create secrets in the Kubernetes cluster containing GCP credentials and let Spark use them to query GCS.

Permissions on GCE instances

Find the GCP service account used by the GCE instances running as Kubernetes nodes. Depending on your setup, it can be the default Compute Engine service account, of the form

PROJECT_NUMBER-compute@developer.gserviceaccount.com

or another service account that you created yourself, of the form

SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com

The error log in the previous section shows the service account currently used by the Spark pods, which is the GCE instances' service account.
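
You can also look it up with gcloud. The command below assumes your nodes use the cluster-level default node configuration (add --zone or --region if your gcloud config does not set a default location):

gcloud container clusters describe <your-cluster-name> --format="value(nodeConfig.serviceAccount)"

An output of default means the nodes run as the Compute Engine default service account.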

Once you have found the service account, grant it sufficient permissions using IAM roles. The list of IAM roles for Cloud Storage is available in the GCP documentation.
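
For example, assuming the word count application only needs to read and write objects in <your-bucket>, you could grant the service account the Storage Object Admin role on that bucket (adapt the role to your own security requirements):

gsutil iam ch serviceAccount:<service-account-email>:roles/storage.objectAdmin gs://<your-bucket>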

The Spark application above should now work, without modifying the payload.

GCP credentials in Kubernetes secrets

With proper configuration, Spark pods can assume a different GCP service account than that of the underlying GCE instances.

Here are the steps to achieve this:

  1. create or retrieve a service account to be assumed by the Spark pods
  2. grant it sufficient permissions (see the previous section)
  3. generate a service account key file for this service account
  4. store the service account key file in a Kubernetes secret in namespace spark-apps
  5. configure the Spark application to use the secret
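
For steps 1 to 3, a typical gcloud sequence might look like the following; the service account name spark-gcs is purely illustrative, and the role should match what your application actually needs:

# step 1: create a dedicated service account (the name is an example)
gcloud iam service-accounts create spark-gcs
# step 2: grant it access to your bucket, as in the previous section
gsutil iam ch serviceAccount:spark-gcs@<your-project-id>.iam.gserviceaccount.com:roles/storage.objectAdmin gs://<your-bucket>
# step 3: generate a JSON key file named key.json
gcloud iam service-accounts keys create key.json --iam-account=spark-gcs@<your-project-id>.iam.gserviceaccount.com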

We assume that you've been able to get to step 3 and generate a service account key file key.json. Let's now detail points 4 and 5.

Store the service account key as a secret with

kubectl create secret -n spark-apps generic gcs-svc-account --from-file=key.json

This command creates a secret gcs-svc-account in namespace spark-apps.

Note that the rest of the procedure will not work if the file is not named key.json: for secrets of type GCPServiceAccount, the Spark operator expects the key at <mount path>/key.json and points the GOOGLE_APPLICATION_CREDENTIALS environment variable there.
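
You can double-check that the secret exists and contains a key.json entry with:

kubectl describe secret gcs-svc-account -n spark-apps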

We now mount the secret into the Spark pods and set its type to be GCPServiceAccount so that the Spark operator configures Spark to use it:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "gs://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["gs://<your-bucket>/input/*", "gs://<your-bucket>/output"],
      "executor": {
        "secrets": [
          {
            "name": "gcs-svc-account",
            "path": "/mnt/secrets",
            "secretType": "GCPServiceAccount"
          }
        ]
      },
      "driver": {
        "secrets": [
          {
            "name": "gcs-svc-account",
            "path": "/mnt/secrets",
            "secretType": "GCPServiceAccount"
          }
        ]
      }
    }
  }'

The Spark application can now access code and data.
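
To verify that everything works, you can watch the application's pods and inspect the driver logs. The driver pod name below assumes the operator's default <jobName>-driver naming, so it may differ in your setup:

kubectl get pods -n spark-apps
kubectl logs -n spark-apps word-count-driver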