Connectiong to an External Hive Metastore

This page shows how to connect your Data Mechanics Spark applications and jobs to an external Hive metastore database.

The process involves the following steps:

Make the jar file containing the JDBC driver of your database accessible to Data Mechanics
Configure the Spark Config

JDBC Driver jar file

For PostgreSQL the jar file can be found here. You have different options to resolve the dependency, among them the following two:

Option 1: Download and Copy the JDBC driver jar file to your Data Mechanics image

Download the jar file and create a new Docker image like this:

FROM gcr.io/datamechanics/spark:platform-3.0.2-latest
COPY ./jars/postgresql-42.2.20.jar /opt/spark/jars/postgresql-42.2.20.jar

Option 2: Reference the dependency to the jar file in your Data Mechanics template

"deps": {
    "jars": [
      "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.20/postgresql-42.2.20.jar"
    ]
  }

For a complete reference on template attributes see here

Spark Configuration

You have different options to configure the credentials, among them the following two:

Option 1: Specify the connection in a Data Mechanics template

The configuration can be specified in a Data Mechanics template or in the core-site.xml file of the Hadoop configuration.

"sparkConf": {
    "spark.sql.catalogImplementation": "hive",
    "spark.sql.hive.metastore.sharedPrefixes": "org.postgresql",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:postgresql://hive1...xxx...amazonaws.com:5432/hive1",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "xxx",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "xxx",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "spark.hadoop.hive.metastore.warehouse.dir": "... Accessible Cloud Storage ..." 
  }

Additionnaly if you use older version of Hive you can add:

   "spark.sql.hive.metastore.version": "... version number ...",
   "spark.sql.legacy.timeParserPolicy": "LEGACY"

Option 2: Specify the connection in the core-site.xml file

You can also specify the confidential information in the core-site.xml file. The other parameters remain in the Data Mechanics template.

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://hive1......amazonaws.com:5432/hive1</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>xxx</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>xxx</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
    </property>
</configuration>

The core-site.xml file is in the $HADOOP_CONF_DIR path. See configuring environment variables.

JDBC Driver jar file#

Option 1: Download and Copy the JDBC driver jar file to your Data Mechanics image#

Option 2: Reference the dependency to the jar file in your Data Mechanics template#

Spark Configuration#

Option 1: Specify the connection in a Data Mechanics template#