Jupyter notebooks

This page shows how to connect a local Jupyter notebook server to your cluster so that you can run Spark kernels interactively. It assumes that you know how to create and manage config templates on Data Mechanics.

Data Mechanics provides an integration with Jupyter notebooks. It lets you run kernels with Spark support on your cluster.

Connect a Jupyter server to Data Mechanics

The Jupyter notebook server has an option to delegate kernel execution to a gateway service. The Data Mechanics notebook service can act as this gateway.

Install the Jupyter server with:

pip install notebook --upgrade

Start the Jupyter server with the URL of your cluster's notebook service and your user key:

jupyter notebook --gateway-url=https://<your-cluster-url>/notebooks/ \
    --GatewayClient.auth_token=<your-user-key> \
    --GatewayClient.request_timeout=180

The Jupyter server uses this key to authenticate its calls to the Data Mechanics API.

Define kernels with config templates

Jupyter uses kernels to provide support for different languages and enable the configuration of notebook behavior.

When a Jupyter server is connected to Data Mechanics, any config template can be used as a kernel.

Here, we create two config templates that will serve as kernels:

curl -X POST \
    https://<your-cluster-url>/api/config-templates/ \
    -H 'Content-Type: application/json' \
    -H 'X-API-Key: <your-user-key>' \
    -d '{
        "name": "spark-2-4-4-Python-2",
        "config": {
            "pythonVersion": "2",
            "sparkVersion": "2.4.4"
        }
    }'

curl -X POST \
    https://<your-cluster-url>/api/config-templates/ \
    -H 'Content-Type: application/json' \
    -H 'X-API-Key: <your-user-key>' \
    -d '{
        "name": "spark-3-0-0-Python-3",
        "config": {
            "pythonVersion": "3",
            "sparkVersion": "3.0.0"
        }
    }'

In your Jupyter dashboard, you should now be able to create a new notebook using kernels derived from those config templates:

New notebook

At the moment, Data Mechanics only supports Python kernels. Scala kernels are on the way.

Use the notebooks

When you open a notebook, wait for the kernel (i.e. the Spark driver) to be ready. As long as the kernel is marked as "busy", it has not started yet. This may take up to 30 seconds.

Here are the objects you can use to interact with Spark:

  • the Spark context in variable sc
  • the Spark SQL context in variable sqlContext

If those objects are not ready yet, you should see something like this upon invocation:

<__main__.WaitingForSparkSessionToBeInitialized at 0x7f8c15f4f240>
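
Once the kernel is ready, you can use these objects directly. Here is a minimal sanity check (the data below is made up for illustration):

# Create a small RDD with the pre-initialized Spark context
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950

# Build a DataFrame with the pre-initialized SQL context
df = sqlContext.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()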

Container images shipped with Data Mechanics come with the SciPy stack preinstalled:

  • numpy
  • scipy
  • pandas
  • matplotlib
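
This makes it convenient to aggregate data with Spark and then explore a small result locally. For example, a sketch using pandas and matplotlib (the sample data is invented for illustration):

import matplotlib.pyplot as plt

# Aggregate with Spark, then bring the small result back to the driver as a pandas DataFrame
df = sqlContext.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])
pdf = df.groupBy("key").sum("value").toPandas()

# Plot the result locally in the notebook
pdf.plot.bar(x="key", y="sum(value)")
plt.show()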

You can install your own libraries from a notebook cell by running pip (or pip3, depending on the kernel's Python version):

!pip install <some-library>

If you are new to Jupyter notebooks, you can use this tutorial as a starting point.

Notebooks are normal Spark apps

Data Mechanics makes no distinction between notebooks and Spark applications launched through the API.

Notebooks appear in the Dashboard, so you can see their logs and configurations:

A notebook in the dashboard

Additionally, any configuration option available for Spark applications can be applied to notebooks through the config template mechanism. For instance, notebooks can be granted access to data stored in object storage just like any other Spark application.
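
As a sketch, if the notebook's config template grants the application credentials for a bucket, you can read from it directly in a cell. The bucket, path, and column name below are hypothetical, and the URI scheme (s3a://, gs://, wasbs://, ...) depends on your cloud provider:

# Assumes the config template grants access to this (hypothetical) bucket
df = sqlContext.read.parquet("s3a://<some-bucket>/events/")
df.printSchema()
df.groupBy("event_type").count().show()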