Databricks Integration

Integration Overview

Okera provides seamless integration with the Databricks Analytics Platform. Databricks end users will continue to have the same experience they are used to, specifically:

  • Authentication to Okera is transparent. Databricks users continue to log on with their SSO provider, and that user identity is used to authenticate against Okera APIs. No separate steps are needed to log into Okera.
  • Okera authorizes queries against its policies through a Spark driver integration that runs at planning time. This means Databricks' advanced autoscaling, cluster management, and query optimizations are unchanged.
  • There is no change to query execution and Databricks performs all I/O and query processing, allowing you to use all Databricks functionality such as Delta Lake.
  • The integration applies to Notebooks, Jobs and the Data explorer UI in exactly the same way.

Quickstart

Note

The following setup should be completed by a Databricks admin.

The following is a brief overview of the steps needed to integrate the systems. These are all run in the Databricks cluster creation UI:

  1. Configure the cluster to start with Okera's init script
  2. Configure Okera server and authentication via cluster environment variables
  3. Verify connectivity with a simple notebook query
  4. Create a cluster policy to simplify cluster setup

1. Configure Okera Init Script

The integration is done using Databricks' built-in support for init scripts and standard cluster configuration options.

The init script performs the following steps:

  • Download the Okera libraries into the appropriate locations in DBFS
  • Update Spark and metastore cluster configurations to enable the Okera integration

Note

Configuration state is not persisted across cluster restarts, making it easy to update values.

Okera provides the init script as part of the Okera release.

For example:

s3://okera-release-uswest/2.3.0/dbx/init-okera.sh

Init script on AWS S3

You will need to set up your cluster instance profile with read permissions on the S3 location, and those permissions must include getObjectAcl. See the Databricks user guide for how to set up the cluster instance profile.

Databricks Integration Init Script

Important

The cluster must have an AWS instance profile to read from the S3 destination.
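As a rough illustration of the permissions described above, the following sketch builds an IAM policy document that could be attached to the role behind the instance profile. The helper name `build_init_script_read_policy` is hypothetical, and the exact set of actions your environment requires is an assumption; only read access and getObjectAcl are stated in this guide.

```python
import json


def build_init_script_read_policy(bucket: str, key: str) -> dict:
    """Return an IAM policy document granting read access to one S3 object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # GetObject reads the script; GetObjectAcl is the additional
                # permission called out in the text above.
                "Action": ["s3:GetObject", "s3:GetObjectAcl"],
                "Resource": f"arn:aws:s3:::{bucket}/{key}",
            }
        ],
    }


# Bucket and key taken from the example init script location in this guide.
policy = build_init_script_read_policy(
    "okera-release-uswest", "2.3.0/dbx/init-okera.sh"
)
print(json.dumps(policy, indent=2))
```

Scoping the resource to the single script object keeps the grant narrow; widen it only if your deployment needs more.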

Init script on DBFS

This approach is required for Azure deployments, because ADLS is not supported as an init script location. If you're familiar with the DBFS CLI and have configured it with your access token, you can use it to upload the downloaded init script.

1. Create a directory on DBFS that will contain the init script

dbfs mkdirs dbfs:/okera-integration

2. Upload the downloaded script to the new directory

dbfs cp \
    /path/to/local/init-okera.sh \
    dbfs:/okera-integration

3. Finally, verify the file was successfully uploaded by running the following command:

dbfs ls -l dbfs:/okera-integration

The output should look similar to the following (the file size will differ):

file 1234 init-okera.sh
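If you script the upload, the verification step can be automated as well. The following is a minimal sketch that assumes the listing format shown above (entry type, size, then name, separated by whitespace); the helper name `init_script_listed` is hypothetical, and other CLI versions may print additional columns.

```python
def init_script_listed(ls_output: str, name: str = "init-okera.sh") -> bool:
    """Return True if the listing contains a non-empty file called `name`."""
    for line in ls_output.splitlines():
        fields = line.split()
        # Expected layout: <type> <size> <name> [possibly more columns]
        if len(fields) >= 3 and fields[0] == "file" and name in fields[2:]:
            return fields[1].isdigit() and int(fields[1]) > 0
    return False


# Feed it the captured output of the listing command.
print(init_script_listed("file 1234 init-okera.sh"))
```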

Once you've uploaded the init script to DBFS, you can link it in the cluster config.

Databricks Integration Init Script DBFS

2. Configure the cluster

Supported Configurations

Okera supports all types of Databricks clusters:

  • High Concurrency
  • Standard
  • Single Node

However, for clusters that are meant to be simultaneously used by multiple users, it is strongly recommended to use High Concurrency clusters, to leverage the stronger security guarantees provided by this cluster type.

If using a Standard or Single Node cluster, it is recommended to dedicate the cluster to a single user, so that one user cannot interact with another user's activity. This recommendation is consistent with the guidance in the Databricks documentation.

Supported Versions

Okera supports the Databricks 5.x, 6.x and 7.x runtime versions.
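A quick pre-flight check against this list could look like the sketch below. It assumes runtime version strings of the form `7.3.x-scala2.12`, as reported in the `spark_version` field of a cluster definition; the helper name `is_supported_runtime` is hypothetical.

```python
import re

# Major runtime versions listed as supported in this guide.
SUPPORTED_MAJOR_VERSIONS = {5, 6, 7}


def is_supported_runtime(spark_version: str) -> bool:
    """Return True if the runtime's major version is 5, 6 or 7."""
    match = re.match(r"(\d+)\.", spark_version)
    return bool(match) and int(match.group(1)) in SUPPORTED_MAJOR_VERSIONS


print(is_supported_runtime("7.3.x-scala2.12"))
```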

Catalog integration

Okera supports all of the metastore integrations supported by Databricks.

3. Configure environment variables

You will need to configure two groups of environment variables, which you can add in the Environment Variables section under Advanced Options -> Spark:

  • Okera cluster connection information.
  • Authentication information.
Databricks environment variables section

Note

All other normal Spark configuration is supported.

Okera Cluster Connection

You will need to specify the Host and Port of your Okera cluster. You can find this value under the System tab in the Okera UI.

OKERA_PLANNER=okera.yourcompany.com:12050
Databricks Okera Planner env variable
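A malformed `OKERA_PLANNER` value is an easy mistake to make, so it can help to validate it before creating the cluster. This is a minimal sketch that only checks the `host:port` shape shown above; the helper name `parse_okera_planner` is hypothetical and it does not test connectivity.

```python
from typing import Tuple


def parse_okera_planner(value: str) -> Tuple[str, int]:
    """Split a 'host:port' string and validate both parts."""
    host, sep, port = value.strip().rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    if "/" in host:
        # Guards against pasting a full URL such as https://host:12050
        raise ValueError(f"host must not contain a URL scheme or path: {value!r}")
    port_number = int(port)
    if not 1 <= port_number <= 65535:
        raise ValueError(f"port out of range: {port_number}")
    return host, port_number


print(parse_okera_planner("okera.yourcompany.com:12050"))
```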

4. Configure Authentication

Okera provides a few options to configure transparent authentication, so that logged-in Databricks users are seamlessly authenticated with Okera. While all options provide user authentication, some are recommended over others depending on the Databricks cluster mode. Okera works in conjunction with the security properties of the Databricks cluster mode to ensure proper authentication.

In all cases, Okera leverages JSON Web Tokens (JWTs) to communicate credentials.

At a high level, Okera provides three authentication options:

  1. Providing a signing key, which is shared between the Databricks cluster(s) and Okera cluster. This is recommended for High Concurrency Clusters.
  2. Providing a cluster-specific JWT at cluster creation time. All users on the cluster will have the same credential. This is recommended for Standard and Single Node clusters.
  3. Allowing the Okera client library to self-sign JWTs. This should not be used in production, as a malicious user can impersonate others. This is suitable for proofs of concept and experiments.

Depending on which authentication method you choose, you need to input the corresponding environment variables.
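The mapping between cluster mode and environment variables can be summarized in code. The sketch below uses the variable names from the subsections that follow; the mode labels (`high_concurrency`, `standard`, `single_node`, `poc`) and the helper name `missing_env_vars` are hypothetical.

```python
from typing import Dict, List

# Environment variables each setup needs, per the sections of this guide.
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "high_concurrency": ["OKERA_PLANNER", "OKERA_SIGNING_KEY"],
    "standard": ["OKERA_PLANNER", "OKERA_USER", "OKERA_TOKEN"],
    "single_node": ["OKERA_PLANNER", "OKERA_USER", "OKERA_TOKEN"],
    # Insecure quickstart mode: testing and proofs of concept only.
    "poc": ["OKERA_PLANNER", "OKERA_ENABLE_SELF_SIGNED_TOKEN"],
}


def missing_env_vars(mode: str, env: Dict[str, str]) -> List[str]:
    """Return the required variables that are absent or empty for `mode`."""
    return [name for name in REQUIRED_ENV_VARS[mode] if not env.get(name)]


env = {"OKERA_PLANNER": "okera.yourcompany.com:12050"}
print(missing_env_vars("high_concurrency", env))
```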

High Concurrency Clusters

For High Concurrency clusters, you will configure the Databricks cluster with a private key that will be used to sign JWTs. Okera recommends you use a dedicated public/private key pair, which can be generated and configured using the instructions in the JWT section.

It is highly recommended to use Databricks Secrets to store the private key and avoid having it in clear text in the Databricks configuration UI. The rest of the documentation will assume that Databricks Secrets are being used.

If using this authentication option, add this value in the Environment Variables section under Advanced Options -> Spark:

OKERA_SIGNING_KEY={{secrets/okera/signing_key}}
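To guard against a key or token being pasted in clear text, a cluster definition can be checked for the `{{secrets/<scope>/<key>}}` reference form used above. This is a minimal sketch; the helper names and the choice of which variables count as sensitive are assumptions.

```python
import re
from typing import Dict, List, Sequence

# Matches a secret reference of the form {{secrets/<scope>/<key>}}.
SECRET_REFERENCE = re.compile(r"^\{\{secrets/[^/{}\s]+/[^/{}\s]+\}\}$")


def is_secret_reference(value: str) -> bool:
    """Return True if `value` is a secret reference rather than clear text."""
    return SECRET_REFERENCE.match(value) is not None


def clear_text_values(
    env: Dict[str, str],
    sensitive: Sequence[str] = ("OKERA_SIGNING_KEY", "OKERA_TOKEN"),
) -> List[str]:
    """List sensitive variables whose values are not secret references."""
    return [n for n in sensitive if n in env and not is_secret_reference(env[n])]


print(clear_text_values({"OKERA_SIGNING_KEY": "{{secrets/okera/signing_key}}"}))
```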

A full configuration for a High Concurrency cluster would look like this:

Databricks Integration Env Config HC

Standard and Single Node Clusters

For Standard and Single Node clusters, users are not sufficiently isolated from each other, and Okera recommends these cluster types be used by a single user. You will configure the Databricks cluster with a token specifically for the user who will be using it, and this token will be used when communicating with Okera and authorizing the user's actions.

It is highly recommended to use Databricks Secrets to store the token and avoid having it in clear text in the Databricks configuration UI. The rest of the documentation will assume that Databricks Secrets are being used.

If using this authentication option, add these values in the Environment Variables section under Advanced Options -> Spark:

OKERA_USER=john.doe
OKERA_TOKEN={{secrets/okera/user_token}}
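Before storing a token as a secret, it can be useful to confirm that it was issued for the user named in `OKERA_USER`. The sketch below decodes the JWT payload without verifying the signature, so it is a sanity check only. It assumes the user name is carried in the standard `sub` claim; your Okera deployment may use a different claim, and the helper names are hypothetical.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("a JWT has three dot-separated segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def token_matches_user(token: str, okera_user: str, claim: str = "sub") -> bool:
    """Return True if the token's `claim` equals the configured user name."""
    # Assumption: the user name lives in `claim`; adjust for your deployment.
    return decode_jwt_payload(token).get(claim) == okera_user
```

Run the check on a trusted machine, and avoid printing or logging the token itself.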

A full configuration for a Standard or Single Node cluster would look like this:

Databricks Integration Env Config Standard

Self-signed JWTs (POC Quickstart)

For test and proof of concept clusters, it is possible to configure the cluster for an insecure quickstart mode. This will still provide per user authentication, but a sophisticated user can impersonate other users.

Warning

This mode is only suitable for testing, and should not be used in production.

If using this authentication option, add this value in the Environment Variables section under Advanced Options -> Spark:

OKERA_ENABLE_SELF_SIGNED_TOKEN=true

A full configuration for a Self-signed cluster (POC only) would look like this:

Databricks Integration Env Config Self Signed

5. Verify the integration

Once you've completed the above setup, you can go ahead and start your cluster. Once the cluster has started you can verify connectivity to Okera by selecting from the okera_sample.whoami table in a Databricks notebook. This can be done using any of the Spark languages, for example:

%sql
SELECT * FROM okera_sample.whoami

This returns the Databricks user who is logged in through SSO.

Databricks verify integration
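The same check can be run from a Python notebook cell. The following is a minimal sketch that relies on the `spark` session Databricks predefines in notebooks; the helper name `okera_whoami` is hypothetical, and reading the user name from the first column is an assumption about the table's layout.

```python
def okera_whoami(spark):
    """Return the first-column values of okera_sample.whoami as a list."""
    rows = spark.sql("SELECT * FROM okera_sample.whoami").collect()
    # Assumption: the user name is in the first column of each row.
    return [row[0] for row in rows]


# In a Databricks notebook, `spark` is already defined:
# print(okera_whoami(spark))
```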

6. Simplify setup by using cluster policies

Once you've gone through the end-to-end setup, you can add the Okera configs to a new or existing cluster policy template.

Create a new cluster policy

Databricks create cluster policy

Example cluster policy for high concurrency clusters, with the init script in S3.

Replace the values:

  • Ensure the correct environment variable and value are set for your authentication method (the example below is for High Concurrency clusters)
  • OKERA_PLANNER with your host/port
  • init_scripts.*.s3.destination with the location of your init script. Change to init_scripts.*.dbfs.destination if in DBFS.
  • Only include aws_attributes.instance_profile_arn if your init script is in S3.
{
    "spark_env_vars.OKERA_SIGNING_KEY": {
        "type": "fixed",
        "value": "{{secrets/okera/signing_key}}"
    },
    "spark_env_vars.OKERA_PLANNER": {
        "type": "fixed",
        "value": "<okera.yourcompany.com:12050>"
    },
    "init_scripts.*.s3.destination": {
        "type": "fixed",
        "value": "s3://okera-release-uswest/2.3.0/dbx/init-okera.sh"
    },
    "aws_attributes.instance_profile_arn": {
        "type": "fixed",
        "value": "arn:aws:iam::335456599346:instance-profile/dbx-role"
    }
}
Sample Databricks cluster policy
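If you maintain several policies, the substitutions listed above can be applied programmatically. The sketch below generates a High Concurrency policy and switches between the S3 and DBFS init script keys based on the script location. The function name `build_okera_cluster_policy` and the example ARN are hypothetical; the policy keys mirror the sample above.

```python
import json
from typing import Dict, Optional


def _fixed(value: str) -> Dict[str, str]:
    """Build a policy entry that pins an attribute to a fixed value."""
    return {"type": "fixed", "value": value}


def build_okera_cluster_policy(
    planner: str,
    init_script: str,
    instance_profile_arn: Optional[str] = None,
) -> Dict[str, Dict[str, str]]:
    """Build a High Concurrency cluster policy for the Okera integration."""
    storage = "s3" if init_script.startswith("s3://") else "dbfs"
    policy = {
        "spark_env_vars.OKERA_SIGNING_KEY": _fixed("{{secrets/okera/signing_key}}"),
        "spark_env_vars.OKERA_PLANNER": _fixed(planner),
        f"init_scripts.*.{storage}.destination": _fixed(init_script),
    }
    if storage == "s3":
        # The instance profile is only needed when the script lives in S3.
        if not instance_profile_arn:
            raise ValueError("an instance profile ARN is required for S3 init scripts")
        policy["aws_attributes.instance_profile_arn"] = _fixed(instance_profile_arn)
    return policy


dbfs_policy = build_okera_cluster_policy(
    "okera.yourcompany.com:12050", "dbfs:/okera-integration/init-okera.sh"
)
print(json.dumps(dbfs_policy, indent=4))
```

For other authentication methods, swap the `OKERA_SIGNING_KEY` entry for the variables that match your cluster mode.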