Mounting Volumes

This page shows how to add volumes to your Spark application using the Data Mechanics API.

Standard configuration

By default, when no external volume is present, Spark spills data to temporary disk storage during shuffles and other operations. With Spark on Kubernetes, pods are initialized with an emptyDir volume for each directory listed in spark.local.dir. When the pod is deleted, the ephemeral storage is cleared and the data removed.
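
For illustration, the executor pod spec that Spark generates includes one emptyDir volume per local directory, roughly like the sketch below. This is only a sketch, not output from a real pod: volume names, container names, and mount paths vary with the Spark version and the value of spark.local.dir.

# Illustrative sketch of the volumes Spark on Kubernetes adds for spark.local.dir
volumes:
  - name: spark-local-dir-1
    emptyDir: {}
containers:
  - name: spark-kubernetes-executor
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /var/data/spark-local-dir-1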

For certain Spark applications with larger shuffle workloads, or a need to persist data beyond the life of the application, it may be desirable to mount an external volume. Depending on your cloud provider, there are several volume options you can choose from to make your Spark workloads more dynamic and scalable.

By default, if you select a node type that includes built-in local storage, such as d instances on AWS (m5d, r5d, c5d, etc.), Data Mechanics will automatically mount this volume and configure Spark to operate within it.
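
For example, a minimal application config fragment selecting such an instance type might look like the following. The values are illustrative only; the executor, instances, and instanceType fields are the same ones used throughout this page.

{
  "executor": {
    "instances": 2,
    "instanceType": "r5d.xlarge"
  }
}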

Mount secrets as files in volumes

Instead of setting environment variables from Kubernetes secrets, you can also mount secrets directly as files into a volume.

First, create a Kubernetes secret in your cluster, making sure to use the spark-apps namespace so your application can access it. Here is an example basic-auth secret:

apiVersion: v1
kind: Secret
metadata:
  namespace: spark-apps
  name: basic-auth
data:
  password: cGFzc3dvcmQK # password
  user: dXNlcgo= # user
type: Opaque
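
Alternatively, you can create an equivalent secret directly with kubectl instead of applying a manifest (the literal values here match the example above):

kubectl create secret generic basic-auth \
  --namespace spark-apps \
  --from-literal=user=user \
  --from-literal=password=password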

In your application JSON config, add a volume object that references the secret name, the data key you would like to include in the volume, and a name for the volume reference. In your executor or driver config, add a volumeMounts object that includes the volume name from the previous step and the path where you would like the file to be mounted. The example application JSON config below creates a volume named volume-secret that references the user key in the basic-auth secret and mounts that secret file to /opt/spark/work-dir/secrets on the executor pod(s).

{
  "executor": {
    "cores": 1,
    "instances": 3,
    "instanceType": "r5.xlarge",
    "volumeMounts": [{
      "mountPath": "/opt/spark/work-dir/secrets",
      "name": "volume-secret"
    }]
  },
  "volumes": [{
    "secret": {
      "items": [{
        "key": "user",
        "path": "secret.yaml"
      }],
      "secretName": "basic-auth"
    },
    "name": "volume-secret"
  }]
}

If you kubectl exec into one of the executor pods, you will find the secret.yaml file located at /opt/spark/work-dir/secrets.
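
For example, you can check the mounted file with a command like the following (the executor pod name is a placeholder):

kubectl exec -it <executor-pod-name> -n spark-apps -- cat /opt/spark/work-dir/secrets/secret.yaml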

Mount persistent cloud volumes to your container

For larger volume needs, sharing volumes across applications, or persisting data beyond an application's lifecycle, you can mount a cloud provider volume to your container. Instead of passing in a mount path, find the appropriate key for the cloud volume of your choice and add the necessary data to the volume object. You can find the full list of supported volume options in the API reference for submitting applications.

To enable volume mounting, you must first create the volume in your cloud provider. You must then make sure your cluster role has read/write access to the volume.

You can create an AWS Elastic Block Store (EBS) volume in the AWS console, or by running the following command:

aws ec2 create-volume --availability-zone <your availability zone> --size <size in GiB>
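
As a concrete sketch, the following invocation creates a 100 GiB gp2 volume and names it with a tag. All values are illustrative and should be adjusted for your environment:

aws ec2 create-volume \
  --availability-zone us-east-1a \
  --size 100 \
  --volume-type gp2 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=spark-data}]'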

For more configuration options, see the docs for EBS - https://docs.aws.amazon.com/cli/latest/reference/ec2/create-volume.html

To mount an EBS volume to your Spark application, add the following JSON to your application submission:

{
  "volumes": [
    {
      "name": "spark-aws-dir",
      "awsElasticBlockStore": {
        "fsType": "<type of file system>",
        "readOnly": false,
        "volumeID": "<id of EBS volume>"
      }
    }
  ]
}
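
Depending on your setup, you may also want to control where the volume appears inside the container. As a sketch mirroring the secret example above, you can add a volumeMounts entry on the executor or driver that references the volume name; the mount path below is illustrative only:

{
  "executor": {
    "volumeMounts": [{
      "mountPath": "/data",
      "name": "spark-aws-dir"
    }]
  }
}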

Dynamically provision volumes using Persistent Volume Claims

As of Spark 3.1, you can dynamically provision and mount volumes to both your executor and driver pods using persistent volume claims. Persistent volume claims are Kubernetes resources that let you request elastic disk storage and mount it to your Kubernetes pods. When your application is complete, Kubernetes removes the pod and deletes the volume from your cloud provider. You can read more about the different configuration options in the Spark on Kubernetes documentation.

To enable PVC provisioning, you must add a few configuration settings to your sparkConf:

{
  "executor": {
    "envVars": {
      "SPARK_EXECUTOR_DIRS": "/tmp/data"
    }
  },
  "sparkConf": {
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName": "OnDemand",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass": "standard",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit": "500Gi",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path": "/tmp/data",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly": "false"
  }
}

In the sample above, add the environment variable SPARK_EXECUTOR_DIRS to your executor config and set it to your volume mount path. The claimName must be set to OnDemand. You can set the mount path to any location in the container, but make sure Spark is aware of the directory and actually uses it. You will need to create the directory in your Dockerfile and add the necessary permissions so Spark can write to it:

# Create the mount point for the on-demand PVC and make it writable
# by the Spark user (UID 185 in the standard Spark images)
RUN mkdir -p /tmp/data
RUN chmod -R 777 /tmp/data
RUN chown -R 185:root /tmp/data

The storage class standard is pre-defined in Kubernetes clusters on Google Cloud. For other cloud providers, you must create this storage class yourself. Add the following YAML to a file called sc.yaml, with the provisioner field referencing the elastic volume type associated with your cloud provider, and run kubectl apply -f sc.yaml:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
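
You can confirm that the storage class exists with the following command (output will vary by cluster):

kubectl get storageclass standard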

You must also attach the edit cluster role to the spark-driver service account. You can run the following command to create the cluster role binding:

kubectl create clusterrolebinding <cluster-role-binding-name> --clusterrole=edit --serviceaccount=spark-apps:spark-driver
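
As a sanity check, you can ask Kubernetes whether the service account is now allowed to create persistent volume claims (the namespace here assumes your applications run in spark-apps):

kubectl auth can-i create persistentvolumeclaims \
  --as=system:serviceaccount:spark-apps:spark-driver \
  --namespace spark-apps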

Dynamic PVC Reuse

With the release of Spark 3.2, you can now dynamically reuse persistent volume claims within the same application. This is particularly beneficial when using spot instances. In a normal Spark application leveraging spot instances for executors, if you lose a node to a spot kill, you lose any shuffle or output data that was stored on that node, and Spark is forced to recompute the results. With PVC reuse enabled, the Spark driver keeps the PVC(s) of the lost spot instance, adds new executors to the cluster, and attaches the PVC(s) to the new executors, preserving the work and progress of the previous node. This feature can significantly reduce application runtime and cost, offsetting one of the major drawbacks of using spot instances. To enable dynamic PVC reuse, add the following two lines, in addition to the configuration above, to your Spark conf:

{
  "sparkConf": {
    "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
    "spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
  }
}