Accessing data

This page shows how to run your own code and access data hosted in your cloud account. It assumes that you know how to run a Spark application on Data Mechanics.

Specify data in your arguments

On Google Cloud Platform, suppose that:

  • you want to run a word count Scala application hosted at gs://<your-bucket>/wordcount.jar,
  • it reads input files from gs://<your-bucket>/input/*,
  • it writes its results to gs://<your-bucket>/output,
  • its main class is org.<your-org>.wordcount.WordCount, which takes <input> and <output> as arguments.

Here is the payload you would submit:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "gs://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["gs://<your-bucket>/input/*", "gs://<your-bucket>/output"]
    }
  }'

The command above fails because the Spark pods do not have sufficient permissions to access the code and the data:

Caused by: com.google....json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "<service-account-name>@developer.gserviceaccount.com does not have storage.objects.get access to <your-bucket>/wordcount.jar.",
    "reason" : "forbidden"
  } ],
  "message" : "<service-account-name>@developer.gserviceaccount.com does not have storage.objects.get access to <your-bucket>/wordcount.jar."
}

We'll now see two ways to grant access to the Spark pods.

Granting permissions to node instances

Spark pods running in Kubernetes inherit the permissions of the nodes they run on. One solution is therefore to grant access to the underlying nodes.

Your data is in the same AWS account as the Data Mechanics cluster

To let your cluster nodes access your S3 buckets you need to perform the following steps:

  • create a data access policy for your S3 buckets
  • attach the policy to the IAM role associated with your cluster nodes

To create a policy for your cluster nodes:

  1. Sign in to the IAM console at https://console.aws.amazon.com/iam/ with a user that has administrator permissions.
  2. In the navigation pane, choose Policies.
  3. In the content pane, choose Create policy.
  4. Choose the JSON tab and define the policy. An example policy for cluster nodes could be:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    }
  ]
}
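
When several buckets need the same access, the policy document can be generated instead of edited by hand. The following is a minimal sketch in Python that mirrors the statements of the example policy above; the function name build_s3_access_policy and the bucket name my-data-bucket are hypothetical and only serve as illustration.

```python
import json


def build_s3_access_policy(bucket_names):
    """Return a policy dict mirroring the example policy for the given buckets."""
    resources = []
    for name in bucket_names:
        # Bucket-level ARN, then the ARN matching every object in the bucket.
        resources.append(f"arn:aws:s3:::{name}")
        resources.append(f"arn:aws:s3:::{name}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListAllMyBuckets"],
                "Resource": "arn:aws:s3:::*",
            },
            {"Effect": "Allow", "Action": "s3:*", "Resource": resources},
        ],
    }


if __name__ == "__main__":
    # "my-data-bucket" is a placeholder, not a real bucket.
    print(json.dumps(build_s3_access_policy(["my-data-bucket"]), indent=2))
```

The printed JSON can be pasted into the JSON tab mentioned in step 4.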

Attach the created policy to the IAM role associated with cluster nodes:

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane, choose Policies.
  3. In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  4. Choose Policy actions, and then choose Attach.
  5. Select the IAM role associated with the cluster nodes, then choose Attach policy.

The IAM role associated with the cluster nodes can be found in the Roles tab of the navigation pane. Note down the IAM role ARN, not the instance profile ARN:

AWS IAM role ARN

Your data is in a different AWS account from the Data Mechanics cluster

To let your cluster nodes access your S3 buckets you need to perform the following steps:

  • create an IAM role in the AWS account where the data lives
  • grant this IAM role permissions to access the data
  • authorize the IAM role associated with your cluster nodes to assume the above IAM role

A detailed explanation can be found in the "Cross-account IAM roles" section of the AWS documentation about cross-account access.

The IAM role associated with your cluster nodes (called "the IAM role in Account B" in the AWS documentation) can be found in the Roles tab of the navigation pane of the IAM console, available at https://console.aws.amazon.com/iam/. Use the IAM role ARN, not the instance profile ARN:

AWS IAM role ARN
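
As a rough illustration of the steps listed above, cross-account access typically relies on two JSON documents: a trust policy on the role in the data account, and a permission on the cluster node role that allows the sts:AssumeRole action. The sketch below builds both in Python. The function names and the example ARNs are hypothetical placeholders, and the AWS documentation referenced above remains the authority on the exact setup.

```python
import json


def build_trust_policy(node_role_arn):
    """Trust policy for the data-account role: lets the node role assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": node_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_assume_role_policy(data_role_arn):
    """Policy for the node role: allows assuming the data-account role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": data_role_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Both ARNs are made-up placeholders. Use IAM role ARNs here,
    # not instance profile ARNs.
    node_role = "arn:aws:iam::111111111111:role/cluster-node-role"
    data_role = "arn:aws:iam::222222222222:role/data-access-role"
    print(json.dumps(build_trust_policy(node_role), indent=2))
    print(json.dumps(build_assume_role_policy(data_role), indent=2))
```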

Granting permissions using Kubernetes secrets

Spark pods can impersonate a user (AWS) or a service account (GCP) by using their credentials. On Azure, a Spark pod can use a storage account access key to access the storage account's containers.

To protect those credentials, we will store them in Kubernetes secrets and configure Spark to expose those secrets to all the driver and executor pods as environment variables.

To let your Spark pods access your S3 buckets you need to perform the following steps:

  • create a data access policy for your S3 buckets
  • create a user that is granted the data access policy
  • create an access key for the user
  • create a Kubernetes secret that contains the access key
  • configure Spark to use the Kubernetes secret

Create a data access policy for your S3 buckets:

  1. Sign in to the IAM console at https://console.aws.amazon.com/iam/ with a user that has administrator permissions.
  2. In the navigation pane, choose Policies.
  3. In the content pane, choose Create policy.
  4. Choose the JSON tab and define the policy. An example policy could be:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    }
  ]
}

Create a user that is granted the data access policy:

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane, choose Users, and click on Add User.
  3. Give it a name and check "Programmatic access".
  4. Click on "Attach existing policies directly" and attach the policy you just created.
  5. Complete the user creation process.

A user with the correct policy is now created.

Create an access key for the user:

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane, choose Users, and click on the user you just created.
  3. Go to the "Security credentials" tab and click on "Create access key".
  4. Note down the access key ID and the access key secret.

Create a Kubernetes secret that contains the access key:

The command below generates a secret from the access key ID and the access key secret created in the previous steps:

kubectl create secret -n spark-apps generic data-access \
  --from-literal 'AWS_ACCESS_KEY_ID=<access-key-id>' \
  --from-literal 'AWS_SECRET_ACCESS_KEY=<access-key-secret>'
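
For readers who prefer a declarative manifest, the sketch below builds a Secret with the same name, namespace, and keys as the command above, with the values base64-encoded under the data field as Kubernetes expects. This is only an illustration: the helper name build_secret_manifest is hypothetical, and the key values are the same placeholders used in the kubectl command.

```python
import base64
import json


def build_secret_manifest(name, namespace, literals):
    """Build a Kubernetes Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in literals.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": data,
    }


if __name__ == "__main__":
    manifest = build_secret_manifest(
        "data-access",
        "spark-apps",
        {
            # Placeholders: substitute the real access key ID and secret.
            "AWS_ACCESS_KEY_ID": "<access-key-id>",
            "AWS_SECRET_ACCESS_KEY": "<access-key-secret>",
        },
    )
    print(json.dumps(manifest, indent=2))
```

Base64 is an encoding, not encryption, so a manifest generated this way should be treated as sensitive and kept out of version control.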

Configure Spark to use the Kubernetes secret:

You can now modify the payload that launches the Spark application so that it references the secret:

curl -X POST \
  https://<your-cluster-url>/api/apps/ \
  -H 'Content-Type: application/json' \
  -H 'X-API-Key: <your-user-key>' \
  -d '{
    "jobName": "word-count",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.0.0",
      "mainApplicationFile": "s3a://<your-bucket>/wordcount.jar",
      "mainClass": "org.<your-org>.wordcount.WordCount",
      "arguments": ["s3a://<your-bucket>/input/*", "s3a://<your-bucket>/output"],
      "driver": {
        "envSecretKeyRefs": {
          "AWS_ACCESS_KEY_ID": {
            "name": "data-access",
            "key": "AWS_ACCESS_KEY_ID"
          },
          "AWS_SECRET_ACCESS_KEY": {
            "name": "data-access",
            "key": "AWS_SECRET_ACCESS_KEY"
          }
        }
      },
      "executor": {
        "envSecretKeyRefs": {
          "AWS_ACCESS_KEY_ID": {
            "name": "data-access",
            "key": "AWS_ACCESS_KEY_ID"
          },
          "AWS_SECRET_ACCESS_KEY": {
            "name": "data-access",
            "key": "AWS_SECRET_ACCESS_KEY"
          }
        }
      }
    }
  }'
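
Because the driver and executor sections repeat the same secret references, the payload can also be generated to keep the two in sync. The sketch below is one possible way to do this in Python. The field names come from the payload above, while the helper names env_secret_key_refs and build_payload, and the example bucket and class values, are hypothetical.

```python
import json

SECRET_NAME = "data-access"
ENV_KEYS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def env_secret_key_refs(secret_name, keys):
    """Map each environment variable to the secret key of the same name."""
    return {key: {"name": secret_name, "key": key} for key in keys}


def build_payload(bucket, main_class):
    """Build the application payload with identical refs for driver and executor."""
    return {
        "jobName": "word-count",
        "configOverrides": {
            "type": "Scala",
            "sparkVersion": "3.0.0",
            "mainApplicationFile": f"s3a://{bucket}/wordcount.jar",
            "mainClass": main_class,
            "arguments": [f"s3a://{bucket}/input/*", f"s3a://{bucket}/output"],
            "driver": {
                "envSecretKeyRefs": env_secret_key_refs(SECRET_NAME, ENV_KEYS)
            },
            "executor": {
                "envSecretKeyRefs": env_secret_key_refs(SECRET_NAME, ENV_KEYS)
            },
        },
    }


if __name__ == "__main__":
    # Placeholder bucket and main class, for illustration only.
    payload = build_payload("my-bucket", "org.example.wordcount.WordCount")
    print(json.dumps(payload, indent=2))
```

If the output is saved to a file such as payload.json, it can be submitted with the same curl command by replacing the inline body with -d @payload.json.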