Data Mechanics offers publicly available Docker images for Spark. Read on to learn all about them!
## What's a Docker image for Spark?
When Spark runs on Kubernetes, the driver and executors are Docker containers running an image built specifically for Spark.
## What are the Data Mechanics Docker images?
Data Mechanics offers a versatile Docker image for Spark: `spark:platform` comes with batteries included. Besides a Spark distribution, it contains connectors to popular object stores, Python support with conda, Jupyter notebook support, and more.
## Images to start with
| Full image name | Spark version | Scala version | Python version | Hadoop version |
|---|---|---|---|---|
## How to use these images for your apps and jobs?
When submitting Spark apps on the Data Mechanics platform, you can:
- Omit the `image` field: in this case, `spark:platform` will be used by default, according to the Spark version specified in the `sparkVersion` field. If both the `image` and `sparkVersion` fields are specified, the Spark version of the image takes precedence.
- Specify the image in your `configOverrides`.
- Specify a Data Mechanics Config Template that sets the image.
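As an illustrative sketch, pinning an image in a config override could look like the fragment below. Only the `image` and `sparkVersion` field names come from this page; the overall payload shape is an assumption, not the documented API schema.

```json
{
  "configOverrides": {
    "image": "gcr.io/datamechanics/spark:platform-3.1.2-latest"
  }
}
```

If you rely on `sparkVersion` instead, omit the `image` field and the platform picks a matching `spark:platform` image for you.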
## Need another image?
To match different dependency and version requirements, you can find more images at https://gcr.io/datamechanics/spark:platform.
All these dependencies can have different versions. A combination of dependency versions is called a flavor of `spark:platform` on this page. The image tag indicates the flavor of the image and can be adjusted to fit your needs. Here are two examples of image tags:

- `gcr.io/datamechanics/spark:platform-3.1.2-latest` (short form)
- `gcr.io/datamechanics/spark:platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest` (long form)

Image tags are described in more detail below.
## Need to build your own image?
You should use one of the `spark:platform` images as a base. Once your custom image is in your local Docker repository, you have to tag and push it: see Set up a Docker registry and push your image.
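For example, a minimal custom image might look like the sketch below. The base tag, the installed packages, and the file paths are illustrative assumptions, not a prescribed layout:

```dockerfile
# Start from a spark:platform image (tag chosen for illustration)
FROM gcr.io/datamechanics/spark:platform-3.1.2-latest

# Add extra Python dependencies your application needs (illustrative packages)
RUN pip install --no-cache-dir pandas==1.3.5

# Copy your application code into the image (hypothetical path)
COPY my_app.py /opt/application/my_app.py
```

You would then build it locally, e.g. `docker build -t <registry>/<repository>:my-tag .`, before tagging and pushing as described above.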
### Connecting your local Docker with your Cloud Repository
When developing Docker images, it can be useful to authorize your local Docker configuration to interact with your Cloud Repository, so you can continue to use the `docker` CLI tool and easily build, push, and update images.

Use the following command to connect to your Cloud Repository (AWS ECR example):

```shell
aws ecr get-login-password --region region | docker login --username AWS --password-stdin [aws_account_id].dkr.ecr.region.amazonaws.com
```
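If your repository lives in Google Container Registry instead of ECR, the equivalent setup is typically done with the `gcloud` CLI. This is a sketch of the usual flow, not a command taken from this page:

```shell
# Authenticate the gcloud CLI (opens an interactive browser login)
gcloud auth login

# Register gcloud as a Docker credential helper for gcr.io registries
gcloud auth configure-docker
```

After this, `docker push` and `docker pull` against your `gcr.io` repository use your Google credentials automatically.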
## Data source connectors
`gcr.io/datamechanics/spark:platform` supports the following data sources:
- AWS S3 (`s3a://`)
- Google Cloud Storage (`gs://`)
- Azure Blob Storage (`wasbs://`)
- Azure Datalake generation 1 (`adl://`)
- Azure Datalake generation 2 (`abfss://`)
- Delta Lake
The versions of these connectors depend on the versions of Spark and Hadoop. Here are the connector versions per flavor:
| Connector | | | | |
|---|---|---|---|---|
| S3 (`s3a://`) | Hadoop 3.1.0 - AWS 1.11.271 | Hadoop 3.2.0 - AWS 1.11.375 | Hadoop 3.2.0 - AWS 1.11.375 | Hadoop 3.3.1 - AWS 1.11.901 |
| ADLS gen1 (`adl://`) | Hadoop 3.1.0 - ADLS SDK 2.2.5 | Hadoop 3.2.0 - ADLS SDK 2.2.9 | Hadoop 3.2.0 - ADLS SDK 2.2.9 | Hadoop 3.3.1 - ADLS SDK 2.3.9 |
| Azure Blob Storage (`wasbs://`) | Hadoop 3.1.0 - Azure Storage 5.4.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.3.1 - Azure Storage 7.0.1 |
| ADLS gen2 (`abfss://`) | Hadoop 3.1.0 - Azure Storage 5.4.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.3.1 - Azure Storage 7.0.1 |
To check these versions, you may also run the image locally and list the JARs in the Spark distribution's `jars` folder.
## Image tags and flavors
Data Mechanics builds Spark Docker images for multiple combinations of the versions of the dependencies included with Spark. These combinations are called flavors.
Here's the matrix of versions that Data Mechanics provides:
| Dependency | Versions |
|---|---|
| Spark | 2.4.5 to 3.2.0 |
| Hadoop | 2.6, 2.7, 3.1, 3.2, and 3.3 |
| Java | 8 and 11 |
| Scala | 2.11 and 2.12 |
| Python | 2.7 to 3.8 |
Note that not all the combinations in the matrix exist. To list all the flavors for a given image, check out our Docker registry at https://gcr.io/datamechanics/spark:platform.
Data Mechanics provides long-form tags like `gcr.io/datamechanics/spark:platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest`, where all versions are exposed.

In most cases, we encourage starting with our short-form tags:

- `gcr.io/datamechanics/spark:platform-3.1.2-latest` contains a Spark `3.1.2` distribution, and all other dependencies are set to the latest compatible version. For example, it currently uses Scala `2.12` and Java `11`. We allow ourselves to upgrade the version of a dependency if a new, compatible version is released. For example, we may upgrade Hadoop to `3.3.0` once it is compatible with this Spark version.
- `gcr.io/datamechanics/spark:platform-3.1-latest` contains the latest Spark version of the `3.1` minor line. For example, `platform-3.1-latest` currently uses Spark `3.1.2` and will be upgraded once Spark `3.1.3` is released. For other dependencies (Hadoop, Python, etc.), it behaves like `platform-3.1.2-latest` (see the previous bullet point).
Please use a long-form tag only if you need a specific combination of versions. For instance, you may require one when migrating an existing Scala or Java project to Spark on Kubernetes. On the other hand, new JVM projects and PySpark projects should work just fine with short-form tags!
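The long-form tag layout above is regular enough to be parsed mechanically. As a small illustration, here is a Python helper that extracts the flavor from a long-form tag; the parsing logic is an assumption based on the single tag example on this page, not an official Data Mechanics tool:

```python
import re

# Pattern derived from the long-form example tag, e.g.
# platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest
LONG_FORM = re.compile(
    r"platform-(?P<spark>\d+\.\d+\.\d+)"
    r"-java-(?P<java>\d+)"
    r"-scala-(?P<scala>\d+\.\d+)"
    r"-hadoop-(?P<hadoop>\d+\.\d+\.\d+)"
    r"-python-(?P<python>\d+\.\d+)"
    r"-latest"
)

def parse_flavor(tag: str) -> dict:
    """Extract the dependency versions (the 'flavor') from a long-form tag."""
    match = LONG_FORM.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a long-form platform tag: {tag!r}")
    return match.groupdict()

flavor = parse_flavor(
    "platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest"
)
print(flavor["spark"], flavor["hadoop"])  # → 3.1.2 3.2.0
```

Short-form tags like `platform-3.1.2-latest` deliberately omit these components, which is why they can float to newer compatible dependency versions over time.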
For production workloads:

- We don't recommend using the moving `latest` tags, since the image they point to changes over time.
- To keep the image stable, you should use images with a Data Mechanics version tag like `dm16`.
- Long-form tag images without the Data Mechanics version can change, except for the Spark, Hadoop, Java, Scala, and Python versions specified in the image tag.

Head over to the release notes to learn what changes a Data Mechanics version tag introduces.
## Data Mechanics DockerHub repository
Data Mechanics also has a public DockerHub repository. Spark images published on DockerHub are also available on GCR. For example, `datamechanics/spark:3.2.0-latest` on DockerHub is the same as `gcr.io/datamechanics/spark:3.2.0-latest`. For Data Mechanics customers, we generally recommend going with the images hosted on GCR to avoid hitting rate limits.

We also recommend that Data Mechanics customers use the `gcr.io/datamechanics/spark:platform` images, which contain additional capabilities exclusive to the Data Mechanics platform, like Jupyter support.