Common Spark configurations

Controlling the number of executors

For Spark versions 3.0 and above, dynamic allocation is enabled by default for your workloads. It causes the Spark driver to adjust the number of Spark executors at runtime based on load:

  • When there are pending tasks, the Spark driver will request more executors.
  • When an executor is idle (not running any task) for a while, it is removed.

Note that in Spark 3.0, an idle executor may be kept alive if it stores an active shuffle file (one that downstream tasks may still need). This limitation will be removed in Spark 3.1.
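This behavior comes from shuffle tracking: assuming a Kubernetes deployment without an external shuffle service, Spark 3.0 keeps track of shuffle files on the executors themselves. Here is a minimal PySpark sketch of the related settings; the 30-minute timeout is an illustrative assumption, not a recommended value:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    # Shuffle tracking lets dynamic allocation work without an
    # external shuffle service (the Kubernetes case in Spark 3.0).
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Illustrative: cap how long an idle executor holding shuffle
    # files is kept alive before it becomes eligible for removal.
    .config("spark.dynamicAllocation.shuffleTracking.timeout", "30min")
    .getOrCreate()
)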

Here's an example configuration fragment to enable dynamic allocation; the fields should be self-explanatory. Put this fragment in a config template or in config overrides.

{
  "sparkConf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "25",
    "spark.dynamicAllocation.initialExecutors": "1"
  }
}
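Once your application is running, you can check from a notebook or job that these settings took effect. A small sketch, using standard Spark configuration keys:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the dynamic allocation settings of the running application.
conf = spark.sparkContext.getConf()
print(conf.get("spark.dynamicAllocation.enabled"))       # expect "true"
print(conf.get("spark.dynamicAllocation.maxExecutors"))  # expect "25"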

Dynamic allocation works for both batch and streaming queries. You can read more about it in the official Spark documentation.

In general, dynamic allocation is a great way to save on your cloud costs, particularly for interactive workloads (notebooks), where using a fixed number of executors often leads to wasted capacity.

That said, you can disable dynamic allocation and control the exact number of executors allocated to your application using the following fragment (put it in a config template or in config overrides):

{
  "sparkConf": {
    "spark.dynamicAllocation.enabled": "false"
  },
  "executor": {
    "instances": 10
  }
}
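For reference, the executor instances field corresponds to Spark's spark.executor.instances setting. Here is a sketch of the same configuration applied when building a session directly:

from pyspark.sql import SparkSession

# Dynamic allocation off, a fixed pool of 10 executors.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "10")
    .getOrCreate()
)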

Enabling Adaptive Query Execution (AQE)

Adaptive Query Execution is a Spark performance optimization feature available starting with Spark 3.0. It is disabled by default, but you can enable it with the following configuration (put it in a config template or in config overrides):

{
  "sparkConf": {
    "spark.sql.adaptive.enabled": "true"
  }
}
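To see the effect, here is a minimal PySpark sketch: with AQE enabled, Spark can coalesce shuffle partitions at runtime, so a small aggregation typically ends up with far fewer partitions than the spark.sql.shuffle.partitions default of 200.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# A tiny aggregation that triggers a shuffle.
counts = spark.range(1_000_000).groupBy((F.col("id") % 10).alias("bucket")).count()

# With AQE on, the post-shuffle partition count is decided at runtime
# and is typically far below the static default of 200.
print(counts.rdd.getNumPartitions())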

We recommend turning it on for most workloads, as it can speed them up, sometimes significantly. In rare cases, stability issues or performance penalties have been reported, but AQE is expected to be enabled by default in future Spark versions.

To list all configurations you can set in Data Mechanics, check out the API reference.