Databricks Integration

Integration Overview

Okera provides seamless integration with the Databricks Analytics Platform. Databricks end users keep the same experience they are used to; specifically:

  • Authentication to Okera is transparent. Databricks users continue to log on with their SSO provider, and that user identity is used to authenticate against Okera APIs. No extra steps are needed to log into Okera.
  • Okera enforces authorization policies via a Spark driver integration that runs at query planning time. This means Databricks advanced autoscaling, cluster management, and query optimizations are unchanged.
  • There is no change to query execution: Databricks performs all I/O and query processing, allowing you to use all Databricks functionality, such as Delta Lake.
  • The integration applies to Notebooks, Jobs, and the Data Explorer UI in exactly the same way.

Quickstart

Note

The following setup should be completed by a Databricks admin.

The following is a brief overview of the steps needed to integrate the two systems. All of these steps are performed in the Databricks cluster creation UI:

  1. Configure the cluster to start with Okera's init script.
  2. Configure the Okera server and authentication via cluster environment variables.
  3. Verify connectivity with a simple notebook query.
  4. Create a cluster policy to simplify cluster setup.

1. Configure Okera Init Script

The integration is done using Databricks' built-in support for init scripts and standard cluster configuration options.

The init script does the following:

  • Download the Okera libraries into the appropriate locations in DBFS
  • Update Spark and metastore cluster configurations to enable the Okera integration

Note

Configuration state is not persisted across cluster restarts, making it easy to update values.

Okera provides the init script as part of the Okera release.

For example:

s3://okera-release-uswest/2.8.0/dbx/init-okera.sh

Init Script on AWS S3

You will need to set up your cluster instance profile with read permissions on the S3 location, including the getObjectAcl permission. See the Databricks user guide for how to set up the cluster instance profile. A sample IAM policy statement appears after the note below.

[Screenshot: Databricks Integration Init Script]

Important

The cluster must have an AWS instance profile to read from the S3 destination.
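
For reference, a minimal IAM policy statement granting this access might look like the following sketch. It assumes the example release bucket shown above; substitute your own bucket and prefix.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectAcl"],
            "Resource": "arn:aws:s3:::okera-release-uswest/*"
        }
    ]
}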

Init Script on DBFS

Note

You will need to use DBFS for Azure deployments, since ADLS is not supported for init scripts. If you're familiar with the DBFS CLI and have configured it with your access token, you can upload the downloaded init script as shown in the steps below.
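
If you do not have the script locally yet, you can first fetch it from the release location, for example with the AWS CLI (using the example release path from above):

aws s3 cp s3://okera-release-uswest/2.8.0/dbx/init-okera.sh /path/to/local/init-okera.sh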

  1. Create a directory on DBFS that will contain the init script
    dbfs mkdirs dbfs:/okera-integration
    
  2. Upload the downloaded script to the new directory
    dbfs cp \
        /path/to/local/init-okera.sh \
        dbfs:/okera-integration/init-okera.sh
    
  3. Finally verify the file was successfully uploaded by running the following command:
    dbfs ls -l dbfs:/okera-integration
    
    The output should be similar to the following, where 1234 is the file size in bytes:
    file 1234 init-okera.sh
    

Once you've uploaded the init script to DBFS, you can link it in the cluster configuration.

[Screenshot: Databricks Integration Init Script DBFS]
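
For example, using the directory created above, the init script destination in the cluster configuration would be:

dbfs:/okera-integration/init-okera.sh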

2. Configure the Cluster

Supported Configurations

Okera supports all types of Databricks clusters:

  • High Concurrency
  • Standard
  • Single Node

However, for clusters that are meant to be used by multiple users simultaneously, it is strongly recommended to use High Concurrency clusters, to leverage the stronger security guarantees this cluster type provides.

If using a Standard or Single Node cluster, it is recommended to dedicate the cluster to a single user, to prevent users from interacting with each other's activity. This recommendation is consistent with Databricks' guidance in their documentation.

Supported Versions

Okera supports the Databricks 5.x, 6.x, 7.x, 8.x, and 9.x (up through 9.1) runtime versions (support is tested against Databricks LTS/Extended Support releases). Databricks 8.x and 9.x only support Okera's native Databricks integration, while Databricks 5.x, 6.x, and 7.x support both the native Databricks integration as well as the older (data path) integration.

In Databricks 8.x integrations that use Spark 3 or later, client-side compression is not currently supported.

Catalog Integration

Okera supports all of the metastore integrations supported by Databricks.

3. Specify Okera Cluster Connection Environment Variables

Specify the Host and Port of your Okera cluster in the OKERA_PLANNER environment variable. You can find the host and port information on the System tab in the Okera UI.

OKERA_PLANNER=<okera.yourcompanyhostname.com>:<port>

The default port number is 12050.
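
For example, assuming a hypothetical host okera.example.com and the default port:

OKERA_PLANNER=okera.example.com:12050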

Note

Environment variables should be added in the Environment Variables section under Clusters -> Advanced Options -> Spark.

[Screenshot: Databricks environment variables section]

All other normal Spark configurations are supported.

For example:

[Screenshot: Databricks Okera Planner env variable]

4. Configure Authentication

Okera provides a few options for configuring transparent authentication, so that users logged into Databricks can be seamlessly authenticated with Okera. While all options provide user authentication, certain options are recommended depending on the Databricks cluster mode. Okera works in conjunction with the Databricks cluster mode security properties to ensure proper authentication.

In all cases, Okera leverages JSON Web Tokens (JWTs) to communicate credentials.

At a high level, Okera provides three authentication options:

  1. Provide a signing key that is shared between the Databricks cluster(s) and the Okera cluster. This is recommended for High Concurrency Clusters.
  2. Provide a cluster-specific JSON Web Token (JWT) at cluster creation time. All users on the cluster will share the same credentials. This is recommended for Standard and Single Node clusters.
  3. Allow the Okera client library to self-sign the JWT. This should not be used in production, because a malicious user can impersonate others; it is suitable only for proof-of-concept instances and tests.

Each option requires you to specify appropriate environment variables, as described in the rest of this section.

High Concurrency Clusters

For High Concurrency clusters, configure the Databricks cluster with a private key that will be used to sign JWTs. Okera recommends you use a dedicated public/private key pair, which can be generated and configured using the instructions in JSON Web Tokens.

Okera highly recommends you use Databricks Secrets to store the private key and avoid having it in clear text in the Databricks configuration UI. The rest of this documentation assumes that Databricks Secrets are being used.

For High Concurrency clusters, add this value in the Environment Variables section under Clusters -> Advanced Options -> Spark:

OKERA_SIGNING_KEY={{secrets/okera/signing_key}}
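
If you still need to create the key pair and secret, the following sketch uses openssl and the legacy Databricks CLI, assuming the okera scope and signing_key key referenced above (see JSON Web Tokens and Databricks Secrets for the authoritative steps):

# Generate a dedicated RSA key pair for signing JWTs
openssl genrsa -out okera_jwt.pem 2048
openssl rsa -in okera_jwt.pem -pubout -out okera_jwt.pub

# Store the private key in a Databricks secret scope
databricks secrets create-scope --scope okera
databricks secrets put --scope okera --key signing_key --binary-file okera_jwt.pem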

A full configuration for a High Concurrency cluster might look like this:

[Screenshot: Databricks Integration Env Config HC]

Standard and Single Node Clusters

For Standard and Single Node clusters, users are not sufficiently isolated from each other, and Okera recommends these cluster types be used by a single user. Configure the Databricks cluster with a token specifically for the user who will be using it. This token will be used when communicating with Okera and authorizing the user's actions.

Okera highly recommends you use Databricks Secrets to store the token and avoid having it in clear text in the Databricks configuration UI. The rest of this documentation assumes that Databricks Secrets are being used.

For Standard and Single Node clusters, add this value in the Environment Variables section under Clusters -> Advanced Options -> Spark:

OKERA_USER=john.doe
OKERA_TOKEN={{secrets/okera/user_token}}
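
The token secret can be created with the Databricks CLI, for example (a sketch assuming the okera scope and user_token key used above; the value is a placeholder for the user's actual JWT):

databricks secrets put --scope okera --key user_token --string-value '<jwt-for-john.doe>'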

A full configuration for a Standard or Single Node cluster might look like this:

[Screenshot: Databricks Integration Env Config Standard]

Self-Signed JWTs (POC Quickstart)

For test and proof-of-concept clusters, you can configure the cluster in an insecure quickstart mode. This still provides per-user authentication, but a sophisticated user can impersonate other users.

Warning

This option is only suitable for testing, and should not be used in production.

For test and proof-of-concept clusters, add this value in the Environment Variables section under Clusters -> Advanced Options -> Spark:

OKERA_ENABLE_SELF_SIGNED_TOKEN=true

A full configuration for a Self-signed cluster (POC only) might look like this:

[Screenshot: Databricks Integration Env Config Self Signed]

5. Enable Okera File Access Control

To support Okera file access control, you need to enable the Okera file system driver and enforce path signing. In the Environment Variables section under Clusters -> Advanced Options -> Spark, set the following environment variables (the first two to true, and the third to the path of your sign key secret):

OKERA_ENABLE_OKERA_FS=true
OKERA_FS_REQUIRE_SIGNED_PATHS=true
OKERA_DBX_PATH_SIGN_KEY=<path>

The OKERA_ENABLE_OKERA_FS environment variable installs the Okera file system driver. The OKERA_FS_REQUIRE_SIGNED_PATHS environment variable requires that paths be signed for authorization purposes.

The OKERA_DBX_PATH_SIGN_KEY environment variable identifies the location of the Databricks secret sign key used to sign the URLs shared between Databricks and Okera. To set up your Databricks secret sign key, see Databricks Secrets. Once you have defined your secrets sign key, specify the path to it in the OKERA_DBX_PATH_SIGN_KEY environment variable. The <path> is usually specified within double braces. For example:

OKERA_DBX_PATH_SIGN_KEY={{secrets/your/sign-key}}
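
If the sign key secret does not exist yet, it can be stored like the other secrets in this guide, for example (a sketch; the scope, key name, and value shown are placeholders):

databricks secrets put --scope your --key sign-key --string-value '<random-signing-secret>'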

6. Verify the Integration

After you've completed the above setup, start your cluster. Once the cluster is running, you can verify connectivity to Okera by selecting from the okera_sample.whoami table in a Databricks notebook. This can be done using any of the Spark languages, for example:

%sql
SELECT * FROM okera_sample.whoami
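
Or, equivalently, in a Python cell (a minimal sketch):

%python
display(spark.sql("SELECT * FROM okera_sample.whoami"))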

This returns the single sign-on information for the Databricks user.

[Screenshot: Databricks verify integration]

7. Simplify Setup Using Cluster Policies

After you've completed the end-to-end setup, add the Okera configuration to a new or existing cluster policy template.

Create a new cluster policy. The following example shows a cluster policy for High Concurrency clusters, with the init script in S3.

[Screenshot: Databricks create cluster policy]

Make the following updates:

  • Ensure the correct environment variable and value are set for your authentication method (the example below is for High Concurrency clusters).
  • Update OKERA_PLANNER with your Okera host and port number.
  • If you want to enable file access control, set the OKERA_ENABLE_OKERA_FS and OKERA_FS_REQUIRE_SIGNED_PATHS environment variables to true.
  • Update init_scripts.*.s3.destination with the location of your S3 init script. Change this to init_scripts.*.dbfs.destination if you are using DBFS.
  • Only include aws_attributes.instance_profile_arn if your init script is in S3.
{
    "spark_env_vars.OKERA_SIGNING_KEY": {
        "type": "fixed",
        "value": "{{secrets/okera/signing_key}}"
    },
    "spark_env_vars.OKERA_PLANNER": {
        "type": "fixed",
        "value": "<okera.yourcompany.com:12050>"
    },
    "spark_env_vars.OKERA_ENABLE_OKERA_FS": {
        "type": "fixed",
        "value": "true"
    },
    "spark_env_vars.OKERA_FS_REQUIRE_SIGNED_PATHS": {
        "type": "fixed",
        "value": "true"
    },
    "init_scripts.*.s3.destination": {
        "type": "fixed",
        "value": "s3://okera-release-uswest/{{ okera_version }}/dbx/init-okera.sh"
    },
    "aws_attributes.instance_profile_arn": {
        "type": "fixed",
        "value": "arn:aws:iam::335456599346:instance-profile/dbx-role"
    }
}