Packaging code

This page describes how to package your Spark code so that it can run on Data Mechanics.

There are two options available:

  • building a Docker image containing your source code and pushing it to a Docker registry, or
  • uploading your source code to object storage.

Add your code to a Docker image

Using Docker images makes dependency management easy, particularly for Python workloads. They let you have tight control over your environment: you can run the same Docker image locally during development and on Data Mechanics for production.

In this section, we'll learn how to build a Docker image from your code, set up a container registry, and push the Docker image to the container registry.

Build a Docker image and run it locally

Make sure you have Docker installed on your machine.

For compatibility reasons, you must use one of our published Docker images as a base and add your dependencies on top. Building an entirely custom Docker image is not supported.

Data Mechanics Docker images are freely available at gcr.io/datamechanics/.

How to choose a Data Mechanics base Docker image?

This depends on the nature of your workload.

We offer different flavors, depending on the type of your workload (JVM-based language or Python) and on the desired Spark version (we support 2.4.4 and 3.0.0):

  • Spark 2.4.4
      • Java & Scala: gcr.io/datamechanics/spark-connectors:2.4.4-dm4
      • Python: gcr.io/datamechanics/spark-py-connectors:2.4.4-dm4
  • Spark 3.0.0
      • Java & Scala: gcr.io/datamechanics/spark-connectors:3.0.0-dm4
      • Python: gcr.io/datamechanics/spark-py-connectors:3.0.0-dm4

All our images contain Spark connectors to GCS, S3, Azure Blob Storage, and Snowflake.

The dm4 suffix indicates the version of this image. We sometimes upgrade our Spark images, for instance when there is a Docker security patch. dm4 is our latest version!

We'll assume your project directory has the following structure:

  • a main Python file, e.g. main.py
  • a requirements.txt file specifying the project dependencies
  • a global Python package named src containing all project sources. This package can contain modules and sub-packages and does not require source files to be flattened. Because src is a Python package, it must contain an __init__.py file.
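As a quick sanity check, the layout above can be exercised with plain Python. This sketch (with a hypothetical module name, jobs.py) builds the structure in a temporary directory and confirms that imports from the src package resolve, which is exactly what the __init__.py file enables:

```python
import importlib
import pathlib
import sys
import tempfile

# Create the expected layout in a temp dir ("jobs" is a hypothetical module name).
root = pathlib.Path(tempfile.mkdtemp())
(root / "src").mkdir()
(root / "src" / "__init__.py").write_text("")            # makes src a package
(root / "src" / "jobs.py").write_text("ANSWER = 42\n")   # stand-in for real sources
(root / "main.py").write_text("from src.jobs import ANSWER\n")

# With the project root on sys.path, `from src.jobs import ANSWER` resolves.
sys.path.insert(0, str(root))
jobs = importlib.import_module("src.jobs")
print(jobs.ANSWER)  # → 42
```

Without the __init__.py file, the `src.jobs` import would fail on older Python versions and Spark deployments that expect a regular package.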

Add a file called Dockerfile to the project directory with the following content:

FROM gcr.io/datamechanics/spark-py-connectors:3.0.0-dm4
WORKDIR /opt/application/
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY src/ src/
COPY main.py .
ENV PYSPARK_MAJOR_PYTHON_VERSION=3

Build the Docker image by running this command in the project directory:

docker build -t my-app:dev .

Run it locally with

docker run my-app:dev driver local:///opt/application/main.py <args>

where <args> are the arguments to be passed to the main script main.py.
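Inside main.py, those arguments arrive through sys.argv as usual. A minimal sketch (the key=value convention here is illustrative; a real job might use argparse):

```python
import sys

def parse_args(argv):
    """Toy parser for illustration: collects key=value arguments into a dict."""
    return dict(arg.split("=", 1) for arg in argv[1:] if "=" in arg)

# argv[0] is the script path; everything after it comes from <args>.
print(parse_args(["main.py", "date=2021-01-01", "mode=backfill"]))
# → {'date': '2021-01-01', 'mode': 'backfill'}

if __name__ == "__main__":
    parse_args(sys.argv)
```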

Set up a Docker registry and push your image

The simplest option on Google Cloud is to use the Container Registry of the project where the Data Mechanics platform is deployed. This way, the Spark pods can pull the Docker images without needing extra permissions.

To configure the project and the Google Cloud CLI (see the GCP documentation):

  1. Enable the Container Registry API in the project where the Data Mechanics platform is deployed.
  2. Configure Docker to interact with GCP by running gcloud auth configure-docker.

You can now re-tag and push your Docker image with

docker tag my-app:dev gcr.io/<project-id>/my-app:dev
docker push gcr.io/<project-id>/my-app:dev

where <project-id> is the GCP project where the Spark platform is deployed.

Please refer to the GCP documentation about Container Registry in case of issues.

Run your image on Data Mechanics

The Spark application can now be run on Data Mechanics:

curl --request POST https://<your-cluster-url>/api/apps/ \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <your-api-key>' \
  --data-raw '{
    "jobName": "my-app",
    "configOverrides": {
      "type": "Python",
      "image": "gcr.io/<project-id>/my-app:dev",
      "mainApplicationFile": "local:///opt/application/main.py",
      "arguments": [<args>]
    }
  }'

Host your code on object storage

In this section, we'll learn how to package your code, upload it to object storage, and make it accessible to the Data Mechanics platform.

If possible, prefer building a Docker image containing your source code. It is more robust and more convenient, especially for Python.

Python

Project structure

In order to run on your cluster, your Spark application project directory must fit the following structure:

  • a main Python file, e.g. main.py
  • a requirements.txt file specifying the project dependencies
  • a global Python package named src containing all project sources. This package can contain modules and sub-packages and does not require source files to be flattened. Because src is a Python package, it must contain an __init__.py file.

Package python libraries

Run the following commands at the root of your project, where the requirements.txt file is located:

rm -rf tmp_libs
pip wheel -r requirements.txt -w tmp_libs
cd tmp_libs
for file in *.whl; do
  unzip -o "$file"
  rm "$file"
done
zip -r ../libs.zip .
cd ..
rm -rf tmp_libs

All your dependencies are now zipped into a libs.zip file.
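This works because Python can import modules directly from a zip archive placed on sys.path, which is how Spark exposes pyFiles to your application. A minimal illustration (the module name and contents are hypothetical):

```python
import importlib
import pathlib
import sys
import tempfile
import zipfile

# Build a tiny stand-in for libs.zip containing one plain module.
tmp = pathlib.Path(tempfile.mkdtemp())
zip_path = tmp / "libs.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mylib.py", "VERSION = '1.0'\n")

# Adding the archive to sys.path lets zipimport load modules from it,
# just as Spark does for the pyFiles you declare.
sys.path.insert(0, str(zip_path))
mylib = importlib.import_module("mylib")
print(mylib.VERSION)  # → 1.0
```

Note that this mechanism handles pure-Python modules; packages with compiled extensions may need to be installed in the image instead.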

Package project source files

Zip your project source files from the global package src. This package will be consumed by your Spark application main file using Python imports such as:

  • import src.your_module
  • from src.your_package.your_module import your_object
  • ...

Zip the src global package:

zip -r ./src.zip ./src

All your source modules and packages are now zipped into a src.zip file.

Upload project files

Upload prepared files to your cloud storage:

gsutil cp libs.zip gs://<gcs-folder>/libs.zip
gsutil cp src.zip gs://<gcs-folder>/src.zip
gsutil cp <your_main_application_file.py> gs://<gcs-folder>/<your_main_application_file.py>

Run the application

All required files are uploaded to your cloud storage; the Spark application can now be started:

curl --request POST https://<your-cluster-url>/api/apps/ \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <your-api-key>' \
  --data-raw '{
    "jobName": "<job>",
    "configOverrides": {
      "type": "Python",
      "pythonVersion": "<python-version>",
      "sparkVersion": "<spark-version>",
      "mainApplicationFile": "gs://<gcs-folder>/<your_main_application_file.py>",
      "deps": {
        "pyFiles": [
          "gs://<gcs-folder>/libs.zip",
          "gs://<gcs-folder>/src.zip"
        ]
      }
    }
  }'

Note that Data Mechanics automatically chooses a Spark image for your app based on the application type (Python, Java, or Scala) and sparkVersion (2.4.4 or 3.0.0).

You can access the dashboard at https://<your-cluster-url>/dashboard/ in order to monitor your Spark application execution.

Java & Scala

The procedure is simpler for JVM-based languages, as Spark has been designed with these in mind.

Once your application is compiled, upload it to your cloud storage:

gsutil cp <main-jar>.jar gs://<gcs-folder>/<main-jar>.jar

Reference your JAR (and its dependencies if it has any) in the configuration of your Spark application:

curl --request POST https://<your-cluster-url>/api/apps/ \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <your-api-key>' \
  --data-raw '{
    "jobName": "<job>",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "<spark-version>",
      "mainApplicationFile": "gs://<gcs-folder>/<main-jar>.jar",
      "deps": {
        "jars": [
          "gs://<gcs-folder>/<dep1>.jar",
          "gs://<gcs-folder>/<dep2>.jar"
        ]
      }
    }
  }'

Note that Data Mechanics automatically chooses a Spark image for your app based on the application type (Python, Java, or Scala) and sparkVersion (2.4.4 or 3.0.0).

If you need to import a dependency directly from a repository like Maven, the deps.jars list also accepts URLs, such as:

https://repo1.maven.org/maven2/org/influxdb/influxdb-java/2.14/influxdb-java-2.14.jar