Databricks Integration

Databricks Introduction

Okera integrates with the Databricks Analytics Platform, which offers a commercial Spark and notebook service (among other things). Combining the two systems allows Databricks users to seamlessly use the data access and schema registry services provided by Okera.

The integration uses standard technologies and is mostly a configuration task for administrators. Both systems can be set up separately and then connected to each other with the steps outlined in the Configuration section. Once enabled, authenticated Databricks users can access all Okera-managed datasets, based on their personal authorization profile. The benefit of this integration is that all data access is attributed to the proper user account (as opposed to a shared technical account) and therefore shows up correctly in the Okera audit event log. In addition, the access granted to a user (based on the roles assigned to the user’s groups, as discussed in Authorization) is enforced as expected, and any security changes take effect immediately.

Databricks Authentication Flow

Before enabling the integration between Okera and Databricks, it is helpful to understand the overall authentication flow between the two platforms. The following diagram shows this in context, with numbered steps that are explained next:

The flow is as follows:

  1. Any user communicating with the Databricks Platform needs to authenticate first.
  2. The Databricks Platform uses its own database to track known users and their authorizations.
  3. As soon as a user requests or accesses data in Spark or a notebook from a data source backed by Okera, the Databricks Platform provides a JWT user token to the Okera client libraries. The token is signed with a private key owned by Databricks.
  4. The JWT token contains the Databricks user name, which is the user’s email address.
  5. Upon receiving the request, the Okera processes verify the provided token using the Databricks-provided public key.
  6. Okera then uses the configured group mapping method to retrieve the user groups associated with the email prefix.
  7. With the groups available, the Okera Planner can retrieve the associated Policy Engine roles and verify the request, while enforcing the matching role-based access control rules.

Notes:

  • Using only the email prefix, that is, the part of the email address before the “@” symbol, is common practice and matches how Kerberos principals are handled.
  • Looking up user groups based on the email prefix may require special handling, which is possible using the pluggable group mapping support.
  • Okera also supports pluggable JWT token verification mechanisms, including the option to call an external verification endpoint.
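
To make the token handling more concrete, the following Scala sketch illustrates the kind of token the flow works with; it is not Okera's actual verification code. The claim name "sub" and all values are made up for illustration, and real tokens are signed by Databricks and verified by Okera using the public key described below.

// A JWT consists of three base64url-encoded parts separated by dots:
// header.payload.signature. The payload is a JSON document that carries
// the Databricks user name, which is an email address.
import java.util.Base64
import java.nio.charset.StandardCharsets.UTF_8

def b64(s: String): String =
  Base64.getUrlEncoder.withoutPadding.encodeToString(s.getBytes(UTF_8))

// A made-up, unsigned example token -- real tokens carry a Databricks signature.
val token = b64("""{"typ":"JWT","alg":"RS512"}""") + "." +
            b64("""{"sub":"jane.doe@example.com"}""") + "." +
            "signature"

// Decode the payload (the middle part) to inspect the user name claim.
val payload = new String(Base64.getUrlDecoder.decode(token.split("\\.")(1)), UTF_8)
println(payload)                            // {"sub":"jane.doe@example.com"}

// The group lookup is keyed off the email prefix, i.e. the part before "@".
val emailPrefix = "jane.doe@example.com".takeWhile(_ != '@')
println(s"Group lookup key: $emailPrefix")  // jane.doe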

Databricks Integration Configuration

The next sections explain how administrators can enable the integration between the two platforms. Note that Databricks has built-in support for the integration, but because the two systems are developed separately (for the sake of flexibility), the setup requires a few manual steps.

Databricks Steps

The following general steps are required to configure the Databricks Platform and prepare the Okera setup:

Note: The steps are explained in more detail in the subsequent sections.

  1. Enable the provisioning of the JWT token to the Okera client libraries by setting the userJwt option in the Databricks configuration. This requires contacting Databricks support to enable the feature for your account.
  2. Acquire the Databricks provided public key that is used to verify the JWT tokens (used below in the Okera configuration steps).
  3. Copy the Okera provided libraries to the Databricks staging directory (usually in a shared location).
  4. Set up the init script that copies the provided JARs to the cluster specific, local directories.
  5. Configure, among other settings, the ODAS hostname and port number to use (see the per-component configuration documentation for details).

Notes:

  • Some of these steps require a Databricks administrator to perform the necessary actions.
  • Okera recommends using a Databricks Runtime version of at least 4.3. Earlier versions are known to have technical issues when the ODAS integration is enabled.
  • See the official Databricks information about init scripts and external Hive Metastores for more details.

Acquiring the Public SSL Key

For the userJwt feature to work, a key pair is needed for signing and verifying the tokens. In practice, Databricks reuses the same key pair that secures its web-based UIs with SSL. In other words, you can acquire the public key needed for Okera to validate the Databricks-generated JWT tokens by saving the public key used by its websites. For example, using a command-line tool, you can get the key like so:

$ openssl s_client -connect <customer_id>.cloud.databricks.com:443 | openssl x509 -pubkey -noout
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root CA
verify return:1
depth=1 C = US, O = DigiCert Inc, CN = DigiCert SHA2 Secure Server CA
verify return:1
depth=0 C = US, ST = California, L = San Francisco, O = "Databricks, Inc.", CN = *.cloud.databricks.com
verify return:1
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAo128B8UT3MyFSXiKCcCB
TGktHi9wWaj7lZeqS58C0JiwYSgq6hIrE9rfmlv3O1V73mpSKyL90DX8dJ4/8n8C
dtzzEMVz7pePm94KcZa+dReve0vqsrMp5uHJG51LOBDrMwDIgHvpm2jLsKrmdxdb
+ty97dqpmDNF46uEqbgAIWPMAH+7Et41XpifLgsFK/m37a99CukP8aVNL9EVWS2C
yy/xlIAEJmhzpedZIt9Ro/NGeTzXz+O/JNQiI9OU21Uij0bEc0JVXMJzRCMHyvzo
ugYnXmJxAH4n+ZMghKSKPM9SpfvPDDllqixlRygOp1VjWjyXc1ow1NVoF2A8Tx7p
PwIDAQAB
-----END PUBLIC KEY-----

The lines starting with dashes, and everything in between, are the public key, which needs to be saved to a new text file and configured below.

Preparing the Libraries and Init Script

The Databricks documentation recommends downloading the required Hive JARs to the provided, shared Databricks File System (DBFS) first, and then referring to the JARs using a configuration parameter.

Note: If you do not download the Hive JARs and provision them at cluster start time, Databricks fetches the JARs remotely on your behalf every time a cluster is started. This download incurs considerable extra time and noticeably slows down cluster startup.

The proper JAR files are staged in a temporary location that can be determined from the cluster log files. Follow the linked Databricks documentation to copy the JARs; the steps are listed here in an annotated and extended form, as applicable for the Okera setup:

  1. Create a new Databricks cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version set to match the version of your metastore. You can find the version, for example, by searching for hive-metastore in the Databricks 5.1 release notes. This results in adding the following Spark configuration parameters:
     spark.sql.hive.metastore.jars maven
     spark.sql.hive.metastore.version 1.2.1
    

    This should look similar to this screenshot:

    Setting these parameters causes Databricks to provision the necessary Hive JAR files in a temporary location once the cluster is deployed. Start the cluster and wait for it to be ready.

  2. When the Databricks cluster is running, go to the “Cluster” page and click on the newly created cluster. Then click on the “Driver Logs” tab, and search the logs to find a line like the following:

    Note: Staging the files happens when the cluster is already shown as running. In other words, you may need to wait a few minutes for the directory to be available.

     17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
    

    The directory <path> is the location of the downloaded JARs on the driver node of the cluster.

    Alternatively, you can run the following code in a Scala notebook to print the location of the JARs:

     %scala
     // Note: Retry this command until path is available
     import com.typesafe.config.ConfigFactory
     import com.databricks.backend.daemon.dbutils.FileInfo
     dbutils.fs.ls("file://" + ConfigFactory.load().getString("java.io.tmpdir"))
       .filter(_.path.contains("hive"))
       .foreach{x:FileInfo => println("Hive path is: " + x.path)}
    

    The output should look similar to this:

     Hive path is: file:/local_disk0/tmp/hive-v1_2-521e6cc2-d3ee-4623-8192-126ef03e7ed5/
    
  3. Once the Hive JARs are provisioned, it is necessary to copy them to a permanent location. To do that, create a new notebook named, for instance, “Okera Setup”, and run %sh cp -r <path> /dbfs/tmp/hive_<version>_jars, replacing <path> with your cluster’s specific path and <version> with the Hive version of the cluster, for example 1_2_1. This copies the temporary directory to a DBFS directory called hive_<version>_jars. The full command may look like this example:

     %sh
     # Note: Example only!
     cp -r /local_disk0/tmp/hive-v1_2-521e6cc2-d3ee-4623-8192-126ef03e7ed5 /dbfs/tmp/hive_1_2_1_jars
    

    Note: Throughout this document the DBFS path /tmp/ is used, which is the same as /dbfs/tmp/ when using the FUSE-mounted filesystem on each Databricks node. You can use another path as well, but you need to ensure you have read and write permissions for it. Using /tmp is always possible, which is why it is used here.

  4. Okera requires specific client libraries to be installed in the same JAR directory that is used for the Hive integration. It is recommended to create a copy of the persisted Hive JAR directory and add the Okera JARs to that copy. Using the same notebook as above, copy the following into the next cell:

     %sh
     #
     # This cell downloads this version of the Okera client JARs to a shared
     # location. This by itself does not bootstrap any clusters.
     #
     # This should be run once every time the JARs need to be updated and
     # can take up to a minute.
     #
     VERSION="1.3.0"
     HIVE_JARS="hive_1_2_1_jars"
    
     mkdir -p /dbfs/tmp/okera/$VERSION/jars
     cd /dbfs/tmp/okera/$VERSION/jars
     cp -r /dbfs/tmp/$HIVE_JARS/* .
    
     curl -O https://s3.amazonaws.com/okera-release-useast/$VERSION/client/okera-hive-metastore.jar
     curl -O https://s3.amazonaws.com/okera-release-useast/$VERSION/client/recordservice-spark-2.0.jar
     curl -O https://s3.amazonaws.com/okera-release-useast/$VERSION/client/recordservice-hive.jar
    

    Adjust the VERSION and HIVE_JARS variables to the version of Okera and Hive you are using.

  5. In the next cell, copy and paste the following code to create a cluster init script:

     %scala
     //
     // This creates an init script that will place all the JARs on new clusters.
     // This can be specified in the cluster create UI as an init script with
     // the path 'dbfs:/tmp/okera/init.sh'
     //
     val VERSION = "1.3.0"
    
     dbutils.fs.put(s"/tmp/okera/init.sh", s"""
     #!/bin/sh
     # Copy the staged Hive and Okera JARs to a node-local directory
     echo "Copying staged Okera JARs to node..."
     mkdir -p /databricks/okera_jars
     cp -r /dbfs/tmp/okera/$VERSION/jars/* /databricks/okera_jars/
     # Installs Okera JARs into the node-wide Databricks JARs directory
     echo "Copying Okera JARs to shared directory..."
     cp /databricks/okera_jars/okera-hive-metastore.jar /databricks/jars/
     cp /databricks/okera_jars/recordservice-spark-2.0.jar /databricks/jars/
     cp /databricks/okera_jars/recordservice-hive.jar /databricks/jars/
     """, true)
    

    Adjust the VERSION variable to the version of Okera you are using. Run both cells to prepare the JARs and init script.

  6. Edit the Spark configuration for the Databricks cluster again and set spark.sql.hive.metastore.jars to use the node-local directory the init script is creating.

     spark.sql.hive.metastore.jars /databricks/okera_jars/*
    

    Note: The location must include the trailing /*. This will instruct Spark to load all the JAR files from that directory.

    Remove the spark.sql.hive.metastore.version parameter as it is not needed anymore. In addition, add any other settings needed to configure the connectivity between the Databricks and ODAS clusters. This includes setting the endpoint for the ODAS Planner with its port, and the maximum number of tasks to run on the compute engine.

    For instance, a more complete set of parameters may look like this:

     hive.metastore.rawstore.impl com.cerebro.hive.metastore.CerebroObjectStore
     recordservice.planner.hostports <odas_planner_host>:<odas_planner_port>
     spark.recordservice.planner.hostports <odas_planner_host>:<odas_planner_port>
     recordservice.task.plan.maxTasks <max_tasks>
     spark.recordservice.task.plan.maxTasks <max_tasks>
     spark.sql.hive.metastore.jars /databricks/okera_jars/*
    

    Replace the placeholders as described throughout the document.

  7. Switch to the “Init Scripts” tab and add the init script created in the previous step. For example:

  8. Restart the cluster.
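
After the cluster has restarted (step 8), you can optionally run a quick sanity check from a notebook to confirm that the init script provisioned the client libraries. This is an illustrative sketch, not an official step; it simply lists the node-local directory that the init script populates on the driver, using the same dbutils calls shown earlier.

%scala
// Optional check: the init script should have copied the Hive and Okera JARs
// into this node-local directory on the driver.
dbutils.fs.ls("file:///databricks/okera_jars/")
  .foreach(f => println(f.path))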

Optional Steps

The following code can be used for older versions of Databricks that do not have the above options available. See the comments in the code for where it is applicable. Copy and paste the code into another cell in the Scala notebook and execute it as necessary.

Warning: Read the Databricks documentation carefully about using the deprecated cluster-named init script feature.

//
// On earlier versions of Databricks, where the init script cannot be
// configured on the cluster create page, the init script has to be copied
// to the per-cluster init directory. Note that in this case you will have
// to run this for each new cluster.
//
val CLUSTER_NAME = "<YOUR CLUSTER NAME>"
dbutils.fs.cp("dbfs:/tmp/okera/init.sh", s"dbfs:/databricks/init/$CLUSTER_NAME/")

//
// It is also possible to set the init script to be global, for all clusters.
// This can be used to make all clusters Okera enabled. We do this by copying
// the same init script to a global init directory.
//
// dbutils.fs.cp("dbfs:/tmp/okera/init.sh", s"dbfs:/databricks/init/")

Notes:

  • Adjust the <YOUR CLUSTER NAME> placeholder to match the name of the Databricks cluster you are targeting.
  • Uncomment the last line as needed.

Once all steps are completed, you can use the Databricks UI to spin up a cluster and continue with the Okera configuration, explained next.

Okera Steps

For Okera, the steps are as follows:

  1. When setting up an ODAS cluster environment using the Deployment Manager, you have to use the Databricks-provided public key and specify it as an environment variable.

    For example, set the following variables in your specific env.sh:

    export CEREBRO_JWT_PUBLIC_KEY="s3://acme-bucket/keys/databricks.pub"
    export CEREBRO_JWT_ALGORITHM="RSA512"
    
  2. Optionally, configure a group resolution mechanism that works for your environment.

Once these steps are completed, use the Deployment Manager to spin up an ODAS cluster with the specific environment. You should now be able to access data backed by Okera using Spark or a Notebook provided by Databricks.
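
To confirm the end-to-end setup, you can run a quick smoke test from a Databricks notebook. This is an illustrative sketch only; the database and dataset names are placeholders for datasets registered in your Okera catalog. If everything is wired up correctly, the request is attributed to your Databricks user in the Okera audit event log.

%scala
// List the databases visible through the Okera-backed metastore, then read a
// small sample from a dataset you have been granted access to.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM <okera_database>.<okera_dataset> LIMIT 10").show()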

Note: Okera configurations in Databricks (that is, connection and RPC timeout values) are applied at the notebook level (as opposed to per-user or per-cluster).
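
If you need different values for a particular notebook, the following minimal sketch shows how such settings could be applied in a notebook session, assuming the spark.recordservice.* keys shown earlier can be set at runtime; whether a specific setting takes effect this way depends on the Okera client version.

%scala
// Sketch: override Okera client settings for this notebook session using the
// configuration keys shown earlier in this document.
spark.conf.set("spark.recordservice.planner.hostports", "<odas_planner_host>:<odas_planner_port>")
spark.conf.set("spark.recordservice.task.plan.maxTasks", "<max_tasks>")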

For additional information, see the JWT documentation.

Updating ODAS

When you update your ODAS version, you also need to update the client libraries used on the Databricks side. Generally, the Okera client libraries are compatible across ODAS versions, but Okera recommends keeping both in sync to benefit from all improvements made in a new version.

Databricks Steps

All that is needed is to rerun the Databricks steps above, updating the version numbers used when staging the libraries and when referencing them in the init script. In particular, do the following:

  1. Update the VERSION variable in the setup notebook you used to provision the cluster, setting it to the version you are updating to. For example, here we updated from version 1.3.0 to 1.3.1 of Okera:

     %sh
     #
     # This cell downloads this version of the Okera client JARs to a shared
     # location. This by itself does not bootstrap any clusters.
     #
     # This should be run once every time the JARs need to be updated and
     # can take up to a minute.
     #
     VERSION="1.3.1" # <- Updated to 1.3.1
     HIVE_JARS="hive_1_2_1_jars"
    
     mkdir -p /dbfs/tmp/okera/$VERSION/jars
     cd /dbfs/tmp/okera/$VERSION/jars
     cp -r /dbfs/tmp/$HIVE_JARS/* .
    
     curl -O https://s3.amazonaws.com/okera-release-useast/$VERSION/client/okera-hive-metastore.jar
     curl -O https://s3.amazonaws.com/okera-release-useast/$VERSION/client/recordservice-spark-2.0.jar
     curl -O https://s3.amazonaws.com/okera-release-useast/$VERSION/client/recordservice-hive.jar
    
  2. Then update the init script cell to match the same version:

     %scala
     //
     // This creates an init script that will place all the JARs on new clusters.
     // This can be specified in the cluster create UI as an init script with
     // the path 'dbfs:/tmp/okera/init.sh'
     //
     val VERSION = "1.3.1" // <- Updated to 1.3.1
    
     dbutils.fs.put(s"/tmp/okera/init.sh", s"""
     #!/bin/sh
     # Copy the staged Hive and Okera JARs to a node-local directory
     echo "Copying staged Okera JARs to node..."
     mkdir -p /databricks/okera_jars
     cp -r /dbfs/tmp/okera/$VERSION/jars/* /databricks/okera_jars/
     # Installs Okera JARs into the node-wide Databricks JARs directory
     echo "Copying Okera JARs to shared directory..."
     cp /databricks/okera_jars/okera-hive-metastore.jar /databricks/jars/
     cp /databricks/okera_jars/recordservice-spark-2.0.jar /databricks/jars/
     cp /databricks/okera_jars/recordservice-hive.jar /databricks/jars/
     """, true)
    
  3. Run both cells and restart the Databricks cluster.