This page shows how to add volumes to your Spark application using the Data Mechanics API.
By default, when no volume is present, Spark uses temporary storage to spill data to disk during shuffles and other operations. With Spark on Kubernetes, pods are initialized with an `emptyDir` volume for each directory listed in `spark.local.dir`. When the pod is deleted, this ephemeral storage is cleared and the data is removed.
For certain Spark applications with larger shuffle workloads, or a need to persist data beyond the life of the application, it may be desirable to mount an external volume. Depending on your cloud provider, there are several volume options you can choose from to make your Spark workspaces more dynamic and scalable.
By default, if you select a node type that includes scalable block storage, such as d instances on AWS (m5d, r5d, c5d, etc.), Data Mechanics will automatically mount this volume and configure Spark to operate within it.
Mount secrets as files in volumes
Instead of setting environment variables from Kubernetes secrets, you can also mount secrets directly as files in a volume.
First, create a Kubernetes secret in your cluster, making sure to use the `spark-apps` namespace so your application can access it. Here is an example basic-auth secret below:
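For example, a secret named basic-auth with user and password keys can be created with kubectl (the literal values here are placeholders):

```bash
kubectl create secret generic basic-auth \
  --namespace spark-apps \
  --from-literal=user=admin \
  --from-literal=password=<your-password>
```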
In your application JSON config, add a volume object that references the secret name, the data key you would like to include in the volume, and a name for the volume reference. In your executor or driver config, add a volumeMounts object that includes the volume name referenced in the previous step and the path where you would like the file to be mounted. You can find an example application JSON config below that creates a volume named `volume-secret`, references the `user` key in the basic-auth secret, and mounts that secret file to `/opt/spark/work-dir/secrets` on the executor pod(s).
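A minimal sketch of such a config, assuming the payload follows the standard Kubernetes `volumes`/`volumeMounts` schema (exact field placement may differ in your API version):

```json
{
  "volumes": [
    {
      "name": "volume-secret",
      "secret": {
        "secretName": "basic-auth",
        "items": [
          { "key": "user", "path": "secrets.yaml" }
        ]
      }
    }
  ],
  "executor": {
    "volumeMounts": [
      {
        "name": "volume-secret",
        "mountPath": "/opt/spark/work-dir/secrets"
      }
    ]
  }
}
```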
If you `kubectl exec` into one of the executor pods, you will find the secrets.yaml file located at the mount path.
Mount persistent cloud volumes to your container
For larger volume needs, sharing volumes across applications, or persisting data beyond an application's lifecycle, you can mount a cloud provider volume to your container. Instead of passing in a mount path, find the appropriate key for the cloud volume of your choice, and add the necessary data to the object. You can find the full list of supported volume options in the API reference for submitting applications.
To enable volume mounting, you must first create the volume instance with your respective cloud provider. Then, make sure your cluster role has read/write access to the volume.
You can create an AWS Elastic Block Store (EBS) volume in the AWS console, or by running the following command:
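For instance, with the AWS CLI (the availability zone, size, and volume type below are example values; pick ones matching your cluster):

```bash
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --size 100 \
  --volume-type gp2
```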
For more configuration options, see the docs for EBS - https://docs.aws.amazon.com/cli/latest/reference/ec2/create-volume.html
To mount an EBS volume to your Spark application, add the following JSON to your application submission:
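A sketch, assuming the Kubernetes `awsElasticBlockStore` volume schema; the volume name and mount path are illustrative, and the `volumeID` placeholder should be replaced with the ID returned by `create-volume`:

```json
{
  "volumes": [
    {
      "name": "ebs-volume",
      "awsElasticBlockStore": {
        "volumeID": "<your-volume-id>",
        "fsType": "ext4"
      }
    }
  ],
  "executor": {
    "volumeMounts": [
      { "name": "ebs-volume", "mountPath": "/data" }
    ]
  }
}
```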
Dynamically provision volumes using Persistent Volume Claims
As of Spark 3.1, you can dynamically provision and mount volumes to both your executor and driver pods using persistent volume claims (PVCs). Persistent volume claims are Kubernetes resources that allow you to allocate an elastic disk and mount it to your Kubernetes pods. When your application is complete, Kubernetes will remove the pod and delete the volume from your cloud provider. You can read more about the different configuration options here.
To enable PVC provisioning, you must add a few configuration settings to your sparkConf:
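The relevant properties are Spark's upstream 3.1+ persistent volume claim options; the volume name `data`, the storage class, the size, and the mount path below are illustrative:

```properties
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=standard
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
```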
In the sample above, add the environment variable `SPARK_EXECUTOR_DIRS` to your executor config and set the path equal to your volume mount path. The `claimName` must be set to `OnDemand`. You can set the mount path to anywhere you like on the container, but make sure Spark is aware of the drive and is actually using it. You will need to create the directory in your Dockerfile and add the necessary permissions so Spark can write to it:
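For example, in your Dockerfile (assuming `/data` as the mount path):

```dockerfile
# Create the mount directory and open up permissions so Spark can write to it
RUN mkdir -p /data && \
    chmod 777 /data
```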
The storage class `standard` is predefined in Kubernetes clusters on Google Cloud. For other cloud providers, you must create this storage class. Add the following YAML to a file called `sc.yaml`, with the provisioner field referencing the elastic volume provisioner associated with your cloud provider, and run `kubectl apply -f sc.yaml`:
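A sketch of `sc.yaml` for AWS, using the in-tree EBS provisioner (substitute the provisioner and parameters for your cloud):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
```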
In addition, you must attach a cluster role of `edit` to the `spark-driver` service account. You can run the following command to attach the cluster role:
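For example, assuming the `spark-driver` service account lives in the `spark-apps` namespace (the rolebinding name is arbitrary):

```bash
kubectl create rolebinding spark-driver-edit \
  --clusterrole=edit \
  --serviceaccount=spark-apps:spark-driver \
  --namespace=spark-apps
```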
Dynamic PVC Reuse
With the release of Spark 3.2, you can now dynamically reuse persistent volume claims within the same application. This is particularly beneficial when using spot instances. In a normal Spark application leveraging spot instances for executors, if you lose a node to a spot kill, you will lose any shuffle or output data that was stored on that node, and Spark will be forced to recompute the results. With PVC reuse enabled, the Spark driver will retain the PVC(s) of the lost spot instance, add new executors to the cluster, and attach the PVC(s) to the new executors, preserving the work and progress of the previous node. This feature can significantly reduce application runtime and cost, offsetting one of the major drawbacks of using spot instances. To enable dynamic PVC reuse, add the following two lines, in addition to the configuration above, to your Spark conf:
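These are the upstream Spark 3.2 settings for driver-owned PVCs and PVC reuse:

```properties
spark.kubernetes.driver.ownPersistentVolumeClaim=true
spark.kubernetes.driver.reusePersistentVolumeClaim=true
```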