Data Mechanics offers publicly available Docker images for Spark. Read on to learn all about them!
## What's a Docker image for Spark?
When Spark runs on Kubernetes, the driver and executors are Docker containers running an image built specifically for Spark.
## What are the Data Mechanics Docker images?
Data Mechanics offers a versatile Docker image for Spark: `spark:platform` comes with batteries included. Besides a Spark distribution, it contains connectors to popular object stores, Python support with conda, Jupyter notebook support, and more.
## Images to start with
| Full image name | Spark version | Scala version | Python version | Hadoop version |
|---|---|---|---|---|
## How to use these images for your apps and jobs?
When submitting Spark apps on the Data Mechanics platform, you can:
- Omit the `image` field: in this case, `spark:platform` will be used by default, according to the Spark version specified in the `sparkVersion` field. If both the `image` and `sparkVersion` fields are specified, the Spark version of the image takes precedence.
- Specify the image in your `configOverrides`.
- Specify a Data Mechanics Config Template that sets the image.
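As an illustrative sketch, pinning an image in a config override could look like the fragment below. Only the `image` and `sparkVersion` field names come from this page; the overall payload shape is an assumption, not the documented API schema.

```json
{
  "configOverrides": {
    "image": "gcr.io/datamechanics/spark:platform-3.1.2-latest"
  }
}
```

If you rely on `sparkVersion` instead, omit the `image` field and the platform picks a matching `spark:platform` image for you.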
## Need another image?
To match different dependency and version requirements, you can find more images at https://gcr.io/datamechanics/spark:platform.
All these dependencies can have different versions. A combination of dependency versions is called a flavor of `spark:platform` on this page. The image tag indicates the flavor of the image and can be adjusted to fit your needs. Here are two examples of image tags:

- `gcr.io/datamechanics/spark:platform-3.1.2-latest` (short form)
- `gcr.io/datamechanics/spark:platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest` (long form)

Image tags are described in more detail below.
## Need to build your own image?
You should use one of the `spark:platform` images as a base. Once your custom image is in your local Docker repository, you have to tag and push it: see Set up a Docker registry and push your image.
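For example, a minimal custom image might look like the sketch below. The base tag, the installed packages, and the file paths are illustrative assumptions, not a prescribed layout:

```dockerfile
# Start from a spark:platform image (tag chosen for illustration)
FROM gcr.io/datamechanics/spark:platform-3.1.2-latest

# Add extra Python dependencies your application needs (illustrative packages)
RUN pip install --no-cache-dir pandas==1.3.5

# Copy your application code into the image (hypothetical path)
COPY my_app.py /opt/application/my_app.py
```

You would then build it locally, e.g. `docker build -t <registry>/<repository>:my-tag .`, before tagging and pushing as described above.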
### Connecting your local Docker with your Cloud Repository
When developing Docker images, it can be useful to authorize your local Docker configuration to interact with your Cloud Repository, so you can continue to use the `docker` CLI tool and easily build, push, and update images.

Use the following command to connect to your Cloud Repository (AWS ECR example):

```shell
aws ecr get-login-password --region region | docker login --username AWS --password-stdin [aws_account_id].dkr.ecr.region.amazonaws.com
```
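If your repository lives in Google Container Registry instead of ECR, the equivalent setup is typically done with the `gcloud` CLI. This is a sketch of the usual flow, not a command taken from this page:

```shell
# Authenticate the gcloud CLI (opens an interactive browser login)
gcloud auth login

# Register gcloud as a Docker credential helper for gcr.io registries
gcloud auth configure-docker
```

After this, `docker push` and `docker pull` against your `gcr.io` repository use your Google credentials automatically.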
## Data source connectors
`gcr.io/datamechanics/spark:platform` supports the following data sources:
- AWS S3 (`s3a://`)
- Google Cloud Storage (`gs://`)
- Azure Blob Storage (`wasbs://`)
- Azure Datalake generation 1 (`adl://`)
- Azure Datalake generation 2 (`abfss://`)
- Delta Lake
The versions of these connectors depend on the versions of Spark and Hadoop. Here are the connector versions per flavor:
| Connector | | | | |
|---|---|---|---|---|
| S3 (`s3a://`) | Hadoop 3.1.0 - AWS 1.11.271 | Hadoop 3.2.0 - AWS 1.11.375 | Hadoop 3.2.0 - AWS 1.11.375 | Hadoop 3.3.1 - AWS 1.11.901 |
| ADLS gen1 (`adl://`) | Hadoop 3.1.0 - ADLS SDK 2.2.5 | Hadoop 3.2.0 - ADLS SDK 2.2.9 | Hadoop 3.2.0 - ADLS SDK 2.2.9 | Hadoop 3.3.1 - ADLS SDK 2.3.9 |
| Azure Blob Storage (`wasbs://`) | Hadoop 3.1.0 - Azure Storage 5.4.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.3.1 - Azure Storage 7.0.1 |
| ADLS gen2 (`abfss://`) | Hadoop 3.1.0 - Azure Storage 5.4.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.3.1 - Azure Storage 7.0.1 |
To check these versions, you may also run the image locally and list the JARs in the Spark distribution's `jars` folder.
## Image tags and flavors
Data Mechanics builds Spark Docker images for multiple combinations of the versions of the dependencies included with Spark. These combinations are called flavors.
Here's the matrix of versions that Data Mechanics provides:
| Dependency | Versions |
|---|---|
| Spark | 2.4.5 to 3.2.0 |
| Hadoop | 2.6, 2.7, 3.1, 3.2, and 3.3 |
| Java | 8 and 11 |
| Scala | 2.11 and 2.12 |
| Python | 2.7 to 3.8 |
Note that not all the combinations in the matrix exist. To list all the flavors for a given image, check out our Docker registry at https://gcr.io/datamechanics/spark:platform.
Data Mechanics provides long-form tags like `gcr.io/datamechanics/spark:platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest`, where all versions are exposed.

In most cases, we encourage starting with our short-form tags:

- `gcr.io/datamechanics/spark:platform-3.1.2-latest` contains a Spark `3.1.2` distribution, and all other dependencies are set to the latest compatible version. For example, it currently uses Scala `2.12` and Java `11`. We allow ourselves to upgrade the version of a dependency if a new, compatible version is released. For example, we may upgrade Hadoop to `3.3.0` once it is compatible with this Spark version.
- `gcr.io/datamechanics/spark:platform-3.1-latest` contains the latest Spark version of the `3.1` minor line. For example, `platform-3.1-latest` currently uses Spark `3.1.2` and will be upgraded once Spark `3.1.3` is released. For other dependencies (Hadoop, Python, etc.), it behaves like `platform-3.1.2-latest` (see the previous bullet point).
Please use a long-form tag only if you need a specific combination of versions. For instance, you may require one when migrating an existing Scala or Java project to Spark on Kubernetes. On the other hand, new JVM projects and PySpark projects should work just fine with short-form tags!
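The long-form tag layout above is regular enough to be parsed mechanically. As a small illustration, here is a Python helper that extracts the flavor from a long-form tag; the parsing logic is an assumption based on the single tag example on this page, not an official Data Mechanics tool:

```python
import re

# Pattern derived from the long-form example tag, e.g.
# platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest
LONG_FORM = re.compile(
    r"platform-(?P<spark>\d+\.\d+\.\d+)"
    r"-java-(?P<java>\d+)"
    r"-scala-(?P<scala>\d+\.\d+)"
    r"-hadoop-(?P<hadoop>\d+\.\d+\.\d+)"
    r"-python-(?P<python>\d+\.\d+)"
    r"-latest"
)

def parse_flavor(tag: str) -> dict:
    """Extract the dependency versions (the 'flavor') from a long-form tag."""
    match = LONG_FORM.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a long-form platform tag: {tag!r}")
    return match.groupdict()

flavor = parse_flavor(
    "platform-3.1.2-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest"
)
print(flavor["spark"], flavor["hadoop"])  # → 3.1.2 3.2.0
```

Short-form tags like `platform-3.1.2-latest` deliberately omit these components, which is why they can float to newer compatible dependency versions over time.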
For production workloads:

- We don't recommend using the moving `latest` tags, since the image they point to changes over time.
- To keep the image stable, you should use images with a Data Mechanics version tag like `dm16`.
- Long-form tag images without the Data Mechanics version can change, except for the Spark, Hadoop, Java, Scala, and Python versions specified in the image tag.

Head over to the release notes to learn what changes a Data Mechanics version tag introduces.
## Data Mechanics DockerHub repository
Data Mechanics also has a public DockerHub repository. Spark images published on DockerHub are also available on GCR. For example, `datamechanics/spark:3.2.0-latest` on DockerHub is the same as `gcr.io/datamechanics/spark:3.2.0-latest`. For Data Mechanics customers, we generally recommend going with the images hosted on GCR to avoid hitting rate limits.

We also recommend that Data Mechanics customers use the `gcr.io/datamechanics/spark:platform` images, which contain additional capabilities exclusive to the Data Mechanics platform, like Jupyter support.