Data Mechanics Docker images

Data Mechanics offers publicly available Docker images for Spark. Read on to learn all about them!

What's a Docker image for Spark?

When Spark runs on Kubernetes, the driver and executors are Docker containers that execute a Docker image specifically built to run Spark.

What are the Data Mechanics Docker Images?

Data Mechanics offers a versatile Docker image for Spark: spark:platform comes with batteries included. Besides a Spark distribution, it contains connectors to popular object stores, Python support with pip and conda, Jupyter notebook support, and more.

Images to start with

Full image name                                  Spark version  Scala version  Python version  Hadoop version
gcr.io/datamechanics/spark:platform-2.4-latest   2.4.7          2.12           3.7             3.1.0
gcr.io/datamechanics/spark:platform-3.0-latest   3.0.2          2.12           3.8             3.2.0
gcr.io/datamechanics/spark:platform-3.1-latest   3.1.1          2.12           3.8             3.2.0
gcr.io/datamechanics/spark:platform-3.2-latest   3.2.0          2.12           3.8             3.3.1

How to use those images for your apps and jobs?

When submitting Spark apps on the Data Mechanics platform, you can:

  • Omit the image field: in this case, spark:platform is used by default, with the variant chosen according to the Spark version specified in the sparkVersion field. If both the image and sparkVersion fields are specified, the Spark version of the image takes precedence.
  • Specify the image in your configOverrides with the image field.
  • Specify a Data Mechanics Config Template with the image field.
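As an illustration, a submission payload using these fields might look like the following sketch. Only the image and sparkVersion fields are taken from this page; the other field names and values are placeholders, so check them against your platform's API reference:

```json
{
  "appName": "daily-aggregation",
  "configOverrides": {
    "image": "gcr.io/datamechanics/spark:platform-3.1-latest",
    "sparkVersion": "3.1.1"
  }
}
```

Here the Spark version of the image (3.1.1) matches the sparkVersion field; if they disagreed, the Spark version of the image would take precedence.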

Need another image?

To match different dependency and version requirements, you can find more images at https://gcr.io/datamechanics/spark:platform.

All these dependencies can come in different versions. A combination of dependency versions is called a flavor of spark:platform on this page. The image tag indicates the flavor of the image and can be adjusted to fit your needs.

Here are two examples of image tags:

gcr.io/datamechanics/spark:platform-3.2.0-latest
gcr.io/datamechanics/spark:platform-3.2.0-hadoop-3.3.1-java-8-scala-2.12-python-3.8-latest

Image tags are described in more detail below.

Need to build your own Image?

You should use one of the spark:platform images as a base. Once your custom image is in your local Docker repository, you have to tag and push it; see Set up a Docker registry and push your image.
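A custom image can start from spark:platform as its base. The sketch below assumes a Python application; the package name and file paths are placeholders:

```dockerfile
# Base image: one of the spark:platform flavors
FROM gcr.io/datamechanics/spark:platform-3.2-latest

# Extra Python dependencies (placeholder package)
RUN pip3 install --no-cache-dir pandas

# Application code (placeholder path)
COPY main.py /opt/app/main.py
```

Build it locally with a command like `docker build -t my-registry/my-spark-app:dev .`, then tag and push it to your registry as described in Set up a Docker registry and push your image.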

Connecting your local Docker with Cloud Repository

When developing Docker images, it can be useful to authorize your local docker configuration to interact with your Cloud Repository so you can continue to use the docker cli tool and easily build, push, and update images.

For example, use the following command to authenticate with AWS ECR (replace region and [aws_account_id] with your own values):

aws ecr get-login-password --region region | docker login --username AWS --password-stdin [aws_account_id].dkr.ecr.region.amazonaws.com
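The command above targets AWS ECR. For registries on other clouds, the rough equivalents are sketched below; the registry name is a placeholder, and you should confirm the exact flow in your cloud provider's documentation:

```shell
# Google Container Registry / Artifact Registry:
# configure the local docker CLI to use gcloud credentials
gcloud auth configure-docker

# Azure Container Registry (replace my-registry with your registry name)
az acr login --name my-registry
```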

Data sources connectors

gcr.io/datamechanics/spark:platform supports the following data sources:

  • AWS S3 (s3a:// scheme)
  • Google Cloud Storage (gs:// scheme)
  • Azure Blob Storage (wasbs:// scheme)
  • Azure Datalake generation 1 (adl:// scheme)
  • Azure Datalake generation 2 (abfss:// scheme)
  • Snowflake
  • Delta Lake

The versions of those connectors depend on the versions of Spark and Hadoop. Here are the versions per spark:platform image:

Connector                      platform-2.4-latest                 platform-3.0-latest                 platform-3.1-latest                 platform-3.2-latest
S3 (s3a://)                    Hadoop 3.1.0 - AWS 1.11.271         Hadoop 3.2.0 - AWS 1.11.375         Hadoop 3.2.0 - AWS 1.11.375         Hadoop 3.3.1 - AWS 1.11.901
ADLS gen1 (adl://)             Hadoop 3.1.0 - ADLS SDK 2.2.5       Hadoop 3.2.0 - ADLS SDK 2.2.9       Hadoop 3.2.0 - ADLS SDK 2.2.9       Hadoop 3.3.1 - ADLS SDK 2.3.9
Azure Blob Storage (wasbs://)  Hadoop 3.1.0 - Azure Storage 5.4.0  Hadoop 3.2.0 - Azure Storage 7.0.0  Hadoop 3.2.0 - Azure Storage 7.0.0  Hadoop 3.3.1 - Azure Storage 7.0.1
ADLS gen2 (abfss://)           Hadoop 3.1.0 - Azure Storage 5.4.0  Hadoop 3.2.0 - Azure Storage 7.0.0  Hadoop 3.2.0 - Azure Storage 7.0.0  Hadoop 3.3.1 - Azure Storage 7.0.1
GCS                            2.1.5                               2.1.5                               2.1.5                               2.1.5
Delta                          0.6.1                               0.8.0                               1.0.0                               1.0.0
Snowflake                      2.9.1                               2.9.1                               2.9.1                               2.9.1
Pyarrow                        3.0.0                               3.0.0                               3.0.0                               3.0.0

To check these versions, you may also run the image locally and list the JARs in /opt/spark/jars/:

$ docker run -ti gcr.io/datamechanics/spark:platform-3.1-latest ls -1 /opt/spark/jars | grep delta
delta-core_2.12-0.7.0.jar

Python support

gcr.io/datamechanics/spark:platform supports Pyspark applications. When building a custom image or working from a notebook, additional Python packages can be installed with pip or conda.

Image tags and flavors

Data Mechanics builds Spark Docker images for multiple combinations of the versions of the dependencies included with Spark. These combinations are called flavors.

Here's the matrix of versions that Data Mechanics provides:

Component  Available versions
Spark      2.4.5 to 3.1.1
Hadoop     2.6, 2.7, 3.1, 3.2, and 3.3
Java       8 and 11
Scala      2.11 and 2.12
Python     2.7 to 3.8

Note that not all the combinations in the matrix exist. To list all the flavors for a given image, check out our Docker registry at https://gcr.io/datamechanics/spark:platform.

Data Mechanics provides long-form tags like gcr.io/datamechanics/spark:platform-3.1.1-java-8-scala-2.12-hadoop-3.2.0-python-3.8-latest where all versions are exposed.

In most cases, we encourage starting with our short-form tags like gcr.io/datamechanics/spark:platform-3.1-latest or gcr.io/datamechanics/spark:platform-3.1.1-latest.

  • gcr.io/datamechanics/spark:platform-3.1.1-latest contains a Spark 3.1.1 distribution and all other dependencies are set to the latest compatible version. For example, platform-3.1.1-latest contains Hadoop 3.2.0, Python 3.8, Scala 2.12, and Java 11. We allow ourselves to upgrade the version of a dependency if a new, compatible version is released. For example, we may upgrade platform-3.1.1-latest to Hadoop 3.3.0 once it is compatible with Spark 3.1.1.
  • gcr.io/datamechanics/spark:platform-3.1-latest contains the latest Spark version of the 3.1 minor. For example, platform-3.1-latest currently uses Spark 3.1.1 and will be upgraded once Spark 3.1.2 is released. For other dependencies (Hadoop, Python, etc), platform-3.1-latest behaves like platform-3.1.1-latest (see the previous bullet point).

Please use a long-form tag only if you need a specific combination. For instance, you may require a specific combination of versions when migrating an existing Scala or Java project to Spark on Kubernetes. On the other hand, new JVM projects and Pyspark projects should work just fine with short-form tags!
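To make the tag anatomy concrete, here is a small sketch that splits a tag into its components. parse_flavor is a hypothetical helper written for this page, not part of Data Mechanics tooling, and the component order is inferred from the tag examples above:

```python
def parse_flavor(tag: str) -> dict:
    """Split a spark:platform tag into its version components.

    The layout is inferred from the tag examples on this page:
    platform-<spark>[-<component>-<version>...]-<suffix>
    where <suffix> is "latest" or a Data Mechanics version like "dm15".
    """
    parts = tag.split("-")
    if parts[0] != "platform" or len(parts) < 3:
        raise ValueError(f"not a spark:platform tag: {tag}")
    flavor = {"spark": parts[1], "suffix": parts[-1]}
    # In long-form tags, the middle tokens alternate component name / version,
    # e.g. ...-hadoop-3.3.1-java-8-...
    middle = parts[2:-1]
    for name, version in zip(middle[::2], middle[1::2]):
        flavor[name] = version
    return flavor

# Long-form tag: every component version is explicit
print(parse_flavor("platform-3.2.0-hadoop-3.3.1-java-8-scala-2.12-python-3.8-latest"))
# Short-form tag: only the Spark version and the suffix
print(parse_flavor("platform-3.1-latest"))
```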

For production workloads:

  • We don't recommend using the latest tags.
  • To keep the image stable, use images with a Data Mechanics version tag, like dm15 below. The following images are the same:
    • gcr.io/datamechanics/spark:platform-3.2-dm15
    • gcr.io/datamechanics/spark:platform-3.2.0-dm15
    • gcr.io/datamechanics/spark:platform-3.2.0-hadoop-3.3.1-java-8-scala-2.12-python-3.8-dm15
  • Long-form tag images without the Data Mechanics version can change, except for the Spark, Hadoop, Java, Scala, and Python versions specified in the image tag.

Head over to the release notes to learn what changes a Data Mechanics version tag introduces.

Data Mechanics Docker Hub repository

Data Mechanics also has a public Docker Hub repository.

Spark images published on Docker Hub are also available on GCR. For example, datamechanics/spark:3.2.0-latest on Docker Hub is the same as gcr.io/datamechanics/spark:3.2.0-latest. For Data Mechanics customers, we generally recommend going with images hosted on GCR to avoid hitting rate limits.

We also recommend that Data Mechanics customers use the gcr.io/datamechanics/spark:platform images, which contain additional capabilities exclusive to the Data Mechanics platform, like Jupyter support.