Accessing data

This page shows how to run your own code and access data hosted in your cloud account. It assumes that you know how to run a Spark application on Data Mechanics.

Specify data in your arguments

On Google Cloud Platform, suppose that:

  • you want to run a word count Scala Application hosted at gs://<your-bucket>/wordcount.jar
  • that reads input files in gs://<your-bucket>/input/*
  • and writes to gs://<your-bucket>/output
  • The main class is org.<your-org>.wordcount.WordCount, and it takes two arguments: <input> <output>

Here is the payload you would submit:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "gs://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["gs://<your-bucket>/input/*", "gs://<your-bucket>/output"]
    }
  }'

The command above fails because the Spark pods do not have sufficient permissions to access the code and the data:

Caused by: 403 Forbidden
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "<service-account-name> does not have storage.objects.get access to <your-bucket>/wordcount.jar.",
"reason" : "forbidden"
} ],
"message" : "<service-account-name> does not have storage.objects.get access to <your-bucket>/wordcount.jar."

Permissions on node instances

Find the service account used by the GCE instances running as Kubernetes nodes. Depending on your setup, it can be the default Compute Engine service account, of the form

<project-number>-compute@developer.gserviceaccount.com

or another service account that you created yourself, of the form

<service-account-name>@<project-id>.iam.gserviceaccount.com
The error log in the previous section shows the service account currently used by the Spark pods, which is the GCE instances' service account.
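As an alternative to reading the error log, you can look the service account up with the gcloud CLI. The sketch below is one way to do it, assuming a GKE cluster; <your-cluster> and <your-zone> are placeholders for your own setup:

```shell
# Print the service account attached to each node pool of the cluster.
# Nodes using the default Compute Engine service account show up as "default".
gcloud container clusters describe <your-cluster> \
  --zone <your-zone> \
  --format="value(nodePools[].config.serviceAccount)"
```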

Once you have found the service account, grant it sufficient permissions using IAM roles. For GCS, roles/storage.objectViewer grants read access to objects, and roles/storage.objectAdmin grants full control over objects; the full list of IAM roles for Cloud Storage is in the Google Cloud documentation.
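For example, assuming the wordcount application above only needs to read and write objects in a single bucket, you could bind a role at the bucket level with gsutil (the service account and bucket names are placeholders):

```shell
# Grant the node service account read/write access to the objects in the bucket.
# Use roles/storage.objectViewer instead if read-only access is enough.
gsutil iam ch \
  serviceAccount:<service-account-name>@<project-id>.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://<your-bucket>
```

Binding the role on the bucket rather than on the whole project keeps the grant scoped to the data the application actually uses.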

The Spark application above should now work, without modifying the payload.