Skip to content

Databricks Integration Steps

Okera provides seamless integration with the Databricks Analytics Platform. Databricks end users will continue to have the same experience they are used to, specifically:

  • Authentication to Okera is transparent. Databricks users continue to logon with their SSO provider and the user identity is used to authenticate against Okera APIs. No steps are needed to log in to Okera.
  • Okera authorizes the policy via a Spark driver integration done at planning time. This means Databricks advanced auto scaling, cluster management, and query optimizations are unchanged.
  • There is no change to query execution and Databricks performs all I/O and query processing, allowing you to use all Databricks functionality such as Delta Lake.
  • The integration applies to Notebooks, Jobs and the Data explorer UI in exactly the same way.

Performance Tip: Connecting to Databricks Using ODBC and JDBC Drivers

When using JDBC to connect to Databricks, Okera recommends that you use the jdbc:databricks scheme to improve performance instead of the legacy jdbc:spark scheme. This will greatly improve performance.

Setup Steps

Note: The following setup should be completed by a Databricks admin.

The following is a brief overview of the steps needed to integrate the systems. These are all run in the Databricks cluster creation UI:

  1. Configure the cluster to start with Okera's init script.
  2. Configure the Databricks cluster for Okera.
  3. Configure the required Okera cluster connection environment variables.
  4. Set up cluster authentication using cluster environment variables.
  5. Optionally enable Okera file access control (OkeraFS), as needed in your environment.
  6. Optionally support Databricks single-tenant clusters, as needed in your environment.
  7. Verify connectivity with a simple notebook query.
  8. Create a cluster policy to simplify cluster setup.

1. Configure the Okera Init Script

This integration is done using Databricks' built-in support for init scripts and standard cluster configurations options.

The init scripts perform the following steps:

  • They download the Okera libraries into the appropriate locations in DBFS.
  • They update Spark and metastore cluster configurations to enable the Okera integration.

Note: The configuration state is not persisted across cluster restarts, making it easy to update values.

Okera provides the init script as part of the Okera release.

For example:

s3://okera-release-uswest/2.10.0/dbx/init-okera.sh

Init Script on AWS S3

Set up your cluster instance profile to have read permissions on the S3 location and to include the getObjectAcl permission. See the databricks user guide for information on how to set up the cluster instance profile. Databricks Integration Init Script

Important

The cluster must have an AWS instance profile to read from the S3 destination.

Init Script on DBFS

You must do this for Azure deployments, since ADLS is not supported for init scripts. If you're familiar with the DBFS CLI and have configured it with your access token, you can upload the downloaded init script.

  1. Create a directory on DBFS that will contain the init script.
    dbfs mkdirs dbfs:/okera-integration
    
  2. Upload the downloaded script to the new directory.
    dbfs cp \
        /path/to/local/init-okera.sh \
        dbfs:/okera-integration
    
  3. Verify the file was successfully uploaded by running the following command:
    dbfs ls -l dbfs:/okera-integration
    
    The output should be:
    file 1234 init-okera.sh
    

Once you've uploaded the init script to DBFS, you can link it in the cluster config. Databricks Integration Init Script DBFS

2. Configure the Cluster

Supported Configurations

Okera supports all types of Databricks clusters:

  • High Concurrency
  • Standard
  • Single Node

However, for clusters that are meant to be simultaneously used by multiple users, Okera strongly recommends using High Concurrency clusters to leverage the stronger security guarantees provided by this cluster type.

When using a Standard or Single Node cluster, Okera recommends you dedicate the cluster to a single user, to avoid user interaction with activity by other users. This recommendation is consistent with the Databricks guidance in their documentation.

Supported Versions

Okera supports the Databricks 5.x, 6.x, 7.x, 8.x, and 9.x (up through 9.1) runtime versions (support is tested against Databricks LTS/Extended Support releases). Databricks 8.x and 9.x only support Okera's native Databricks integration, while Databricks 5.x, 6.x, and 7.x support both the native Databricks integration as well as the older (data path) integrations.

Note: In Databricks 8.x integrations that use Spark 3 or later, client-side compression is not currently supported.

Catalog Integration

Okera supports all of the metastore integrations supported by Databricks.

3. Specify Okera Cluster Connection Environment Variables

Specify the host and port of your Okera cluster in the OKERA_PLANNER environment variable. You can find the host and port information on the System tab in the Okera UI.

OKERA_PLANNER=<okera.yourcompanyhostname.com>:<port>

The default port number is 12050.

If the location of Okera's user-defined functions (UDFs) has been altered from the default of okera_udfs, specify the OKERA_BUILTINS_DB environment variable in Databricks. The value of this environment variable must match the setting of the EXTERNAL_OKERA_SECURE_POLICY_DB configuration parameter as specified in Okera's configuration file. If the EXTERNAL_OKERA_SECURE_POLICY_DB configuration parameter is not specified in the configuration file, the default is assumed and you do not need to specify OKERA_BUILTINS_DB.

Note: Environment variables should be added in the Environment Variables section under Clusters -> Advanced Options -> Spark.

Databricks environment variables section
All other normal Spark configurations are supported.

For example:

Databricks Okera Planner env variable

4. Configure Authentication

Okera provides a few options for configuring transparent authentication, so users logged into Databricks can be seamlessly authenticated with Okera. While all options provide user authentication, depending on the Databricks cluster mode, some authentication options are recommended. Okera works in conjunction with the Databricks cluster mode security properties to ensure proper authentication.

In all cases, Okera uses JSON Web Tokens (JWTs) to communicate credentials.

At a high level, Okera provides three authentication options:

  1. Provide a signing key that is shared between the Databricks cluster(s) and the Okera cluster. This is recommended for High Concurrency Clusters.
  2. Provide a per-cluster-specific JSON Web Token (JWT) at cluster creation time. All users on the cluster will have the same credentials. This is recommended for Standard and Single Node clusters.
  3. Allow the Okera client library to self-sign the JWT. This should not be used in production because a malicious user can impersonate others. This is suitable for proof of concept instances and tests.

Each option requires you to specify appropriate environment variables, as described in the rest of this section.

High Concurrency Clusters

For High Concurrency clusters, configure the Databricks cluster with a private key that will be used to sign JWTs. Okera recommends you use a dedicated public/private key pair, which can be generated and configured using the instructions in JSON Web Tokens.

Okera highly recommends you use Databricks Secrets to store the private key and avoid having it in clear text in the Databricks configuration UI. The rest of this documentation assumes that Databricks Secrets are being used.

For High Concurrency clusters, add this value in the Environment Variables section under Clusters -> Advanced Options -> Spark:

OKERA_SIGNING_KEY={{secrets/okera/signing_key}}

A full configuration for a High Concurrency cluster might look like this:

Databricks Integration Env Config HC

Standard and Single Node Concurrency Clusters

For Standard and Single Node clusters, users are not sufficiently isolated from each other, and Okera recommends these cluster types be used by a single user. Configure the Databricks cluster with a token specifically for the user who will be using it. This token will be used when communicating with Okera and authorizing the user's actions.

Okera highly recommends you use Databricks Secrets to store the token and avoid having it in clear text in the Databricks configuration UI. The rest of this documentation assumes that Databricks Secrets are being used.

For Standard and Single Node clusters, add this value in the Environment Variables section under Clusters -> Advanced Options -> Spark:

OKERA_USER=john.doe
OKERA_TOKEN={{secrets/okera/user_token}}

A full configuration for a Standard or Single Node cluster might look like this:

Databricks Integration Env Config Standard

Self-Signed JWTs (POC Quickstart)

For test and proof-of-concept clusters, you can configure the cluster in an insecure quickstart mode. This still provides per-user authentication, but a sophisticated user can impersonate other users.

Warning

This option is only suitable for testing, and should not be used in production.

For test and proof-of-concept clusters, add this value in the Environment Variables section under Clusters -> Advanced Options -> Spark:

OKERA_ENABLE_SELF_SIGNED_TOKEN=true

A full configuration for a self-signed cluster (POC only) might look like this:

Databricks Integration Env Config Self-Signed

5. (Optional) Enable Okera File Access Control (OkeraFS)

Okera provides an authorization layer that intercepts Databricks data requests to S3 to determine whether users have access to the file and data in the request. If they do, the request is passed to S3 for processing. If they do not have access to the file, the request is rejected and returned. For information on how to enable OkeraFS support for Databricks on AWS S3, see OkeraFS Deployment on Databricks.

6. (Optional) Support Databricks Single-Tenant Clusters

If you need to use Okera with single-tenant Databricks clusters, add the OKERA_TOKEN environment variable to your Databricks settings (in the Environment Variables section under Clusters -> Advanced Options -> Spark). Set this environment variable to the token for a connected Databricks user. If you use a service account, specify the token of the service account.

OKERA_TOKEN=<token>

7. Verify the Integration

After you've completed the above setup, start your cluster. After the cluster has started, you can verify connectivity to Okera by selecting from the okera_sample.whoami table in a Databricks notebook. This can be done using any of the Spark languages, for example:

%sql
SELECT * FROM okera_sample.whoami

This returns the single sign-on information for the Databricks user.

Databricks verify integration

8. Simplify Setup Using Cluster Policies

After you've completed the end-to-end setup, add the Okera configuration to a new or existing cluster policy template.

Create a new cluster policy. The following example shows a cluster policy for high-concurrency clusters, with the init script in S3.

Databricks create cluster policy

Make the following updates:

  • Ensure the correct environment variable and value is set for your authentication method (the below example is for High Concurrency clusters)
  • Update OKERA_PLANNER with your Okera host and port number.
  • If you want to enable file access control, set the OKERA_ENABLE_OKERA_FS and OKERA_FS_REQUIRE_SIGNED_PATHS environment variables to true.
  • Update init_scripts.s3.destination with the location of your S3 init script. Change this to init_scripts.dbfs.destination if you are using DBFS.
  • Only include aws_attributes.instance_profile_arn if your init script is in S3.
{

    "spark_env_vars.OKERA_SIGNING_KEY": {
        "type": "fixed",
        "value": "{{secrets/okera/signing_key}}"
    },
    "spark_env_vars.OKERA_PLANNER": {
        "type": "fixed",
        "value": "<okera.yourcompany.com:12050>"
    },
    "spark_env_vars.OKERA_ENABLE_OKERA_FS": {
        "type": "fixed",
        "value": "true"
    },
    "spark_env_vars.OKERA_FS_REQUIRE_SIGNED_PATHS": {
        "type": "fixed",
        "value": "true"
    },
    "init_scripts.*.s3.destination": {
        "type": "fixed",
        "value": "s3://okera-release-uswest/{{ okera_version }}/dbx/init-okera.sh"
    },
    "aws_attributes.instance_profile_arn": {
        "type": "fixed",
        "value": "arn:aws:iam::335456599346:instance-profile/dbx-role"
    }

}