Package local code

This page describes how to package local code and upload it to your cloud storage so that it can be executed on your cluster.

Project structure

To run on your cluster, your Spark application's project directory must follow the structure below (an example layout follows the list):

  • a main Python file, e.g. main.py
  • a requirements.txt file specifying the project's dependencies
  • a global Python package named src containing all project sources. This package can contain modules and sub-packages and does not require source files to be flattened. Because src is a Python package, it must contain an __init__.py file.
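For illustration, such a project might be laid out as follows (all names except main.py, requirements.txt, src, and __init__.py are hypothetical):

my_spark_project/
├── main.py
├── requirements.txt
└── src/
    ├── __init__.py
    ├── your_module.py
    └── your_package/
        ├── __init__.py
        └── your_module.py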

Package Python libraries

Run the following commands at the root of your project, where the requirements.txt file is located.

rm -rf tmp_libs
pip wheel -r requirements.txt -w tmp_libs
cd tmp_libs
# Wheels are zip archives: unpack each one, then remove the archive
for file in *.whl ; do
    unzip -o "$file"
    rm "$file"
done
zip -r ../libs.zip .
cd ..
rm -rf tmp_libs

All your dependencies are now zipped into a libs.zip file.

Package project source files

Zip your project source files from the global src package. This package is consumed by your Spark application's main file through Python imports such as the following (a minimal main.py sketch follows the list):

  • import src.your_module
  • from src.your_package.your_module import your_object
  • ...
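For illustration, a minimal main.py might look like the sketch below; the src.your_package.your_module import matches the examples above, and the your_object name and its run method are hypothetical placeholders:

from pyspark.sql import SparkSession

# Resolved from src.zip at runtime; names are hypothetical placeholders
from src.your_package.your_module import your_object

def main():
    spark = SparkSession.builder.appName("<job>").getOrCreate()
    # Hypothetical entry point defined in your packaged sources
    your_object.run(spark)
    spark.stop()

if __name__ == "__main__":
    main()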

Zip the src global package:

zip -r ./src.zip ./src

All your source modules/packages are now zipped into a src.zip file.

Upload project files

Upload the prepared files to your cloud storage:

gsutil cp libs.zip gs://<gcs-folder>/libs.zip
gsutil cp src.zip gs://<gcs-folder>/src.zip
gsutil cp <your_main_application_file.py> gs://<gcs-folder>/<your_main_application_file.py>

Run the application

All required files are now in your cloud storage, and the Spark application can be started:

curl --location --request POST https://<your-cluster-url>/api/apps/ \
--header 'Content-Type: application/json' \
--header 'X-API-Key: <your-api-key>' \
--data-raw '{
  "jobName": "<job>",
  "configOverrides": {
    "type": "Python",
    "pythonVersion": "<python-version>",
    "sparkVersion": "<spark-version>",
    "mainApplicationFile": "gs://<gcs-folder>/<your_main_application_file.py>",
    "deps": {
      "pyFiles": [
        "gs://<gcs-folder>/libs.zip",
        "gs://<gcs-folder>/src.zip"
      ]
    },
    "sparkConf": {
      <any_spark_configuration_you_want_to_override>
    }
  }
}'
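The same request can also be sent from Python, which is convenient for scripting submissions. The sketch below assumes the requests library and reuses the placeholder values from the curl command above:

import requests

# Placeholders; replace with your actual cluster URL, API key, and GCS paths
CLUSTER_URL = "https://<your-cluster-url>"
API_KEY = "<your-api-key>"

payload = {
    "jobName": "<job>",
    "configOverrides": {
        "type": "Python",
        "pythonVersion": "<python-version>",
        "sparkVersion": "<spark-version>",
        "mainApplicationFile": "gs://<gcs-folder>/<your_main_application_file.py>",
        "deps": {
            "pyFiles": [
                "gs://<gcs-folder>/libs.zip",
                "gs://<gcs-folder>/src.zip",
            ]
        },
        "sparkConf": {},  # any Spark configuration you want to override
    },
}

response = requests.post(
    f"{CLUSTER_URL}/api/apps/",
    headers={"X-API-Key": API_KEY},
    json=payload,  # requests sets Content-Type: application/json automatically
)
response.raise_for_status()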

You can access the Dashboard to monitor your Spark application's execution.