Jupyter notebooks

This page assumes that you know how to create and manage configuration templates on Data Mechanics.

Data Mechanics provides an integration with Jupyter notebooks. It lets you run kernels with Spark support on your cluster.

The notebook service must be deployed on your cluster to proceed. This is controlled by the notebookService.enabled variable in the Data Mechanics Helm chart, which defaults to true.

Connect a Jupyter server to Data Mechanics

The Jupyter notebook server has an option to specify a gateway service in charge of running kernels on its behalf. The Data Mechanics notebook service can be used as this gateway.

Install the Jupyter server with:

pip install notebook --upgrade

Without an ingress

If you have not set up an ingress on your cluster, open a connection to the notebook service's port, just like we did with the submission service to run a First application:

kubectl port-forward -n data-mechanics service/notebook-service 5001:80

This command forwards all requests to http://localhost:5001/ to the Data Mechanics notebook service.

In another shell, you can now launch a Jupyter server pointing to your cluster:

jupyter notebook --gateway-url=http://localhost:5001/notebooks/

With an ingress

If you have set up an ingress for your Data Mechanics cluster, there is no need to use port-forwarding. The submission service and the notebook service share the same URL:

jupyter notebook --gateway-url=http(s)://<your-cluster-url>/notebooks/

If your ingress is exposed to the internet, you have probably set up Authentication on your cluster. Here is how to authenticate your Jupyter server, depending on your setup.

If you used the Data Mechanics authentication mechanism, the notebook service needs your customer key:

jupyter notebook --gateway-url=http(s)://<your-cluster-url>/notebooks/ \
  --GatewayClient.auth_token=<your-customer-key>

The key will be used by the notebook service to interact with the Data Mechanics API.

Define kernels with config templates

Jupyter uses kernels to provide support for different languages and enable the configuration of notebook behavior.

When a Jupyter server is connected to Data Mechanics, any Config template can be used as a kernel.

Here, we simply ensure that two config templates are available:

curl -X POST \
  http(s)://<your-cluster-url>/api/config-templates/ \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "spark-2-4-4-Python-2",
    "config": {
      "pythonVersion": "2",
      "sparkVersion": "2.4.4"
    }
  }'
curl -X POST \
  http(s)://<your-cluster-url>/api/config-templates/ \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "spark-3-0-0-Python-3",
    "config": {
      "pythonVersion": "3",
      "sparkVersion": "3.0.0"
    }
  }'

In your Jupyter dashboard, you should now be able to create a new notebook using kernels derived from those config templates:

New notebook

At the moment, Data Mechanics only supports Python kernels. Scala kernels are on the way.

Use the notebooks

When you open a notebook, you need to wait for the kernel (i.e. the Spark driver) to be ready. As long as the kernel is marked as "busy", it has not started yet. This may take up to 30 seconds.

Here are the objects you can use to interact with Spark:

  • the Spark context in variable sc
  • the Spark SQL context in variable sqlContext

If those objects are not ready yet, you should see something like this upon invocation:

<__main__.WaitingForSparkSessionToBeInitialized at 0x7f8c15f4f240>
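
Once the kernel is ready, these objects behave as they would in any PySpark application. As a minimal illustrative sketch (the dataset and column names below are arbitrary), a first cell could look like this:

# Check the Spark version exposed by the kernel
print(sc.version)

# Run a simple distributed computation through the Spark context
rdd = sc.parallelize(range(100))
print(rdd.sum())

# Build and query a small DataFrame through the SQL context
df = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])
df.show()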

Container images shipped with Data Mechanics come with the scipy stack installed:

  • numpy
  • scipy
  • pandas
  • matplotlib
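
For instance, you can aggregate data with Spark and bring a small result back into pandas for local inspection or plotting. This is only a sketch, assuming the sqlContext object described above; depending on your setup, you may also need the %matplotlib inline magic for plots to render in the notebook:

import matplotlib.pyplot as plt

# Aggregate with Spark, then collect the (small) result as a pandas DataFrame
counts = sqlContext.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)], ["key", "value"]
).groupBy("key").count().toPandas()

# Plot locally with matplotlib
counts.plot(kind="bar", x="key", y="count")
plt.show()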

You can install your own libraries by running:

!pip(3) install <some-library>

If you are new to Jupyter notebooks, you can use this tutorial as a starting point.

Notebooks are normal Spark apps

Data Mechanics makes no distinction between notebooks and Spark applications launched through API calls.

Notebooks appear in the Dashboard, so you can see their logs and configurations:

A notebook in the dashboard

Additionally, any configuration option available to Spark applications can be applied to notebooks through the config template mechanism. For instance, notebooks can be granted access to data sitting in object storage, just like any other Spark application.
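
As an illustration, once a config template grants a notebook access to an object storage bucket, a cell can read that data directly. The bucket and path below are hypothetical, and the exact URI scheme (s3a://, gs://, wasbs://, ...) depends on your object storage:

# Read a Parquet dataset from object storage; credentials and connectors
# come from the notebook's config template
df = sqlContext.read.parquet("s3a://my-bucket/path/to/dataset")
df.printSchema()
df.show(5)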