Data Mechanics Docker images

Data Mechanics offers publicly available Docker images for Spark. Read on to learn all about them!

What's a Docker image for Spark?

When Spark runs on Kubernetes, the driver and executors are Docker containers that execute a Docker image specifically built to run Spark.

What are the Data Mechanics Docker Images?

Data Mechanics offers a versatile Docker image for Spark: spark:platform. It comes with batteries included. Besides a Spark distribution, it contains connectors to popular object stores, Python support with pip and conda, Jupyter notebook support, and more.

Images to start with

Full image name | Spark version | Scala version | Python version | Hadoop version

How to use those images for your apps and jobs?

When submitting Spark apps on the Data Mechanics platform, you can:

  • Omit the image field: in this case, the default spark:platform image is chosen according to the Spark version specified in the sparkVersion field. If both the image and sparkVersion fields are specified, the Spark version of the image takes precedence.
  • Specify the image in your configOverrides with the image field.
  • Specify a Data Mechanics Config Template with the image field.
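
The precedence between the image and sparkVersion fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name spark_version_source and the plain-dict payload are hypothetical, and only the image and sparkVersion field names come from this page. It does not model how configOverrides and config templates are merged.

```python
from typing import Optional, Tuple


def spark_version_source(config: dict) -> Tuple[str, Optional[str]]:
    """Return which field determines the Spark version, and its value.

    Hypothetical helper: `config` is assumed to be a plain dict holding the
    optional `image` and `sparkVersion` fields described on this page.
    """
    image = config.get("image")
    spark_version = config.get("sparkVersion")
    if image:
        # An explicit image wins, even if sparkVersion is also set.
        return ("image", image)
    # No image: the default spark:platform image follows sparkVersion.
    return ("sparkVersion", spark_version)
```

For instance, a payload that sets both fields resolves to the image, while a payload with only sparkVersion resolves to that version.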

Need another image?

To match different dependencies and version requirements you can find more images at

All these dependencies can have different versions. On this page, a combination of dependency versions is called a flavor of spark:platform. The image tag indicates the flavor of the image and can be adjusted to fit your needs.

Here are two examples of image tags:

Image tags are described in more detail below.

Need to build your own image?

Use one of the spark:platform images as a base. Once your custom image is in your local Docker repository, you need to tag and push it; see Set up a Docker registry and push your image.

Data source connectors

The images include connectors for the following data sources:

  • AWS S3 (s3a:// scheme)
  • Google Cloud Storage (gs:// scheme)
  • Azure Blob Storage (wasbs:// scheme)
  • Azure Data Lake generation 1 (adl:// scheme)
  • Azure Data Lake generation 2 (abfss:// scheme)
  • Snowflake
  • Delta Lake
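
As a quick illustration of how the URI schemes in this list map to data sources, here is a minimal Python sketch. The helper name data_source_for and the lookup table are hypothetical conveniences, not part of the images. Snowflake and Delta Lake are left out because no URI scheme is listed for them.

```python
from typing import Optional
from urllib.parse import urlparse

# URI schemes and data sources taken from the list on this page.
SCHEME_TO_SOURCE = {
    "s3a": "AWS S3",
    "gs": "Google Cloud Storage",
    "wasbs": "Azure Blob Storage",
    "adl": "Azure Data Lake generation 1",
    "abfss": "Azure Data Lake generation 2",
}


def data_source_for(uri: str) -> Optional[str]:
    """Return the data source matching the URI scheme, or None if not listed."""
    scheme = urlparse(uri).scheme.lower()
    return SCHEME_TO_SOURCE.get(scheme)
```

A path such as s3a://bucket/key resolves to AWS S3, while a local path without a scheme returns None.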

The versions of those connectors depend on the versions of Spark and Hadoop. Here are the versions per spark:platform image:

S3 (s3a://) | Hadoop 3.1.0 - AWS 1.11.271 | Hadoop 3.2.0 - AWS 1.11.375 | Hadoop 3.2.0 - AWS 1.11.375
ADLS gen1 (adl://) | Hadoop 3.1.0 - ADLS SDK 2.2.5 | Hadoop 3.2.0 - ADLS SDK 2.2.9 | Hadoop 3.2.0 - ADLS SDK 2.2.9
Azure Blob Storage (wasbs://) | Hadoop 3.1.0 - Azure Storage 5.4.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.2.0 - Azure Storage 7.0.0
ADLS gen2 (abfss://) | Hadoop 3.1.0 - Azure Storage 5.4.0 | Hadoop 3.2.0 - Azure Storage 7.0.0 | Hadoop 3.2.0 - Azure Storage 7.0.0

To check these versions, you may also run the image locally and list the JARs in /opt/spark/jars/ (replace <image> with the full name of the image you want to inspect):

$ docker run -ti <image> ls -1 /opt/spark/jars | grep delta

Python support

The images support PySpark applications. When building a custom image or working from a notebook, you can install additional Python packages with pip or conda.

Image tags and flavors

Data Mechanics builds Spark Docker images for multiple combinations of the versions of the dependencies included with Spark. These combinations are called flavors.

Here's the matrix of versions that Data Mechanics provides:

Component | Available versions
Spark | 2.4.5 to 3.1.1
Hadoop | 2.6, 2.7, 3.1, and 3.2
Java | 8 and 11
Scala | 2.11 and 2.12
Python | 2.7 to 3.8

Note that not all the combinations in the matrix exist. To list all the flavors for a given image, check out our Docker registry at

Data Mechanics provides long-form tags like where all versions are exposed.

In most cases, we encourage starting with our short-form tags like platform-3.1.1-latest or platform-3.1-latest:

  • platform-3.1.1-latest contains a Spark 3.1.1 distribution, and all other dependencies are set to the latest compatible version. For example, it currently contains Hadoop 3.2.0, Python 3.8, Scala 2.12, and Java 11. We allow ourselves to upgrade the version of a dependency if a new, compatible version is released. For example, we may upgrade platform-3.1.1-latest to Hadoop 3.3.0 once it is compatible with Spark 3.1.1.
  • platform-3.1-latest contains the latest Spark version of the 3.1 minor. For example, it currently uses Spark 3.1.1 and will be upgraded once Spark 3.1.2 is released. For other dependencies (Hadoop, Python, etc.), platform-3.1-latest behaves like platform-3.1.1-latest (see the previous bullet point).
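
To make the difference between the two short-form tags named in these bullets concrete, here is a minimal Python sketch. It assumes short-form tags always look like platform-3.1.1-latest or platform-3.1-latest; the regular expression and the helper name describe_short_tag are hypothetical and are not an official tag grammar.

```python
import re
from typing import Optional

# Matches short-form tags shaped like the two examples on this page.
_SHORT_TAG = re.compile(r"^platform-(\d+)\.(\d+)(?:\.(\d+))?-latest$")


def describe_short_tag(tag: str) -> Optional[str]:
    """Describe what a short-form tag pins, or return None if it doesn't match."""
    match = _SHORT_TAG.match(tag)
    if match is None:
        return None
    major, minor, patch = match.groups()
    if patch is not None:
        # Patch version given: the Spark version is fixed.
        return f"Spark {major}.{minor}.{patch}, other dependencies float"
    # Only major.minor given: the Spark patch version floats too.
    return f"latest Spark {major}.{minor}.x, other dependencies float"
```

In both cases the non-Spark dependencies may be upgraded over time, which is why these tags suit new projects better than migrations that need exact versions.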

Please use a long-form tag only if you need a specific combination of versions. For instance, you may require a specific combination when migrating an existing Scala or Java project to Spark on Kubernetes. On the other hand, new JVM projects and PySpark projects should work just fine with short-form tags!

For production workloads:

  • We don't recommend using the latest tags.
  • To keep the image stable, you should use images with a Data Mechanics version tag, like dm13 below. The following images are the same:
  • Images with long-form tags that lack a Data Mechanics version can still change, except for the Spark, Hadoop, Java, Scala, and Python versions specified in the image tag.
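
These recommendations can be turned into a simple pre-deployment check. The Python sketch below is illustrative only. It assumes that the Data Mechanics version appears as a hyphen-separated component such as dm13 and that floating tags contain a latest component; the exact tag layout is not specified on this page, and the helper name is_stable_tag is hypothetical.

```python
import re

# Assumption: the Data Mechanics version is a hyphen-separated component
# such as "dm13"; the exact tag layout is not specified on this page.
_DM_VERSION = re.compile(r"^dm\d+$")


def is_stable_tag(tag: str) -> bool:
    """Return True if the tag has a dmNN component and no "latest" component."""
    parts = tag.split("-")
    has_dm_version = any(_DM_VERSION.match(part) for part in parts)
    uses_latest = "latest" in parts
    return has_dm_version and not uses_latest
```

A check like this could run in a deployment pipeline to flag floating tags before they reach production.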

Head over to the release notes to learn what changes each Data Mechanics version tag introduced.

Data Mechanics Docker Hub repository

Data Mechanics also has a public Docker Hub repository.

Spark images published on Docker Hub are also available on GCR. For example, datamechanics/spark:3.1.1-latest on Docker Hub is the same as

For Data Mechanics customers, we generally recommend using images hosted on GCR to avoid hitting rate limits.

We also recommend that Data Mechanics customers use the images, which contain additional capabilities exclusive to the Data Mechanics platform, like Jupyter support.