Databricks Integration¶
Integration Overview¶
Okera provides seamless integration with the Databricks Analytics Platform. Databricks end users will continue to have the same experience they are used to, specifically:
- Authentication to Okera is transparent. Databricks users continue to log in with their SSO provider, and that identity is used to authenticate against the Okera APIs. No additional steps are needed to log into Okera.
- Okera enforces access policies via a Spark driver integration at query planning time. This means Databricks' advanced autoscaling, cluster management, and query optimizations are unchanged.
- There is no change to query execution and Databricks performs all I/O and query processing, allowing you to use all Databricks functionality such as Delta Lake.
- The integration applies to Notebooks, Jobs and the Data explorer UI in exactly the same way.
Quickstart¶
Note
The following setup should be completed by a Databricks admin.
The following is a brief overview of the steps needed to integrate the systems. These are all run in the Databricks cluster creation UI:
- Configure the cluster to start with Okera's init script
- Configure Okera server and authentication via cluster environment variables.
- Verify connectivity with a simple notebook query
- Create a cluster policy to simplify cluster setup
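For teams that script cluster creation, the same settings map onto the request body of the Databricks Clusters API. A minimal sketch (the cluster name, runtime version, and worker count are illustrative, and the field names assume the Clusters API 2.0):

```json
{
  "cluster_name": "okera-enabled-cluster",
  "spark_version": "7.3.x-scala2.12",
  "num_workers": 2,
  "init_scripts": [
    { "s3": { "destination": "s3://okera-release-uswest/2.3.0/dbx/init-okera.sh" } }
  ],
  "spark_env_vars": {
    "OKERA_PLANNER": "okera.yourcompany.com:12050",
    "OKERA_SIGNING_KEY": "{{secrets/okera/signing_key}}"
  }
}
```

The init script location and environment variables are the same values you would otherwise enter in the cluster creation UI, as described in the steps below.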
1. Configure Okera Init Script¶
The integration is done using Databricks' built-in support for init scripts and standard cluster configuration options.
The init script does the following:
- Download the Okera libraries into the appropriate locations in DBFS
- Update Spark and metastore cluster configurations to enable the Okera integration
Note
Configuration state is not persisted across cluster restarts, making it easy to update values.
Okera provides the init script as part of the Okera release.
For example:
s3://okera-release-uswest/2.3.0/dbx/init-okera.sh
Init script on AWS S3¶
You will need to set up your cluster instance profile with read permission on the S3 location, and it must include the getObjectAcl permission. See the Databricks user guide for how to set up the cluster instance profile.
Important
The cluster must have an AWS instance profile to read from the S3 destination.
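As a sketch, an instance profile policy granting the required read access might look like the following (the bucket matches the example release location above; adjust the resource ARN to your own script location):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectAcl"],
      "Resource": "arn:aws:s3:::okera-release-uswest/*"
    }
  ]
}
```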
Init script on DBFS¶
Note that you will need to use this approach for Azure deployments, since ADLS is not supported for init scripts. If you're familiar with the DBFS CLI and have configured it with your access token, you can upload the downloaded init script as follows.
1. Create a directory on DBFS that will contain the init script:
dbfs mkdirs dbfs:/okera-integration
2. Upload the downloaded script to the new directory:
dbfs cp \
/path/to/local/init-okera.sh \
dbfs:/okera-integration
3. Finally, verify the file was successfully uploaded by running the following command:
dbfs ls -l dbfs:/okera-integration
The output should be:
file 1234 init-okera.sh
Once you've uploaded the init script to DBFS, you can link it in the cluster config.
2. Configure the cluster¶
Supported Configurations¶
Okera supports all types of Databricks clusters:
- High Concurrency
- Standard
- Single Node
However, for clusters that are meant to be simultaneously used by multiple users, it is strongly recommended to use High Concurrency clusters, to leverage the stronger security guarantees provided by this cluster type.
If using a Standard or Single Node cluster, it is recommended to dedicate the cluster to a single user, so that users cannot interact with each other's activity. This recommendation is consistent with the guidance in the Databricks documentation.
Supported Versions¶
Okera supports the Databricks 5.x, 6.x and 7.x runtime versions.
Catalog integration¶
Okera supports all of the metastore integrations supported by Databricks.
3. Configure environment variables¶
You will need to configure two environment variables, which you can add in the Environment Variables section under Advanced Options -> Spark:
- Okera cluster connection information.
- Authentication information.

Note
All other normal Spark configuration is supported.
Okera Cluster Connection¶
You will need to specify the host and port of your Okera cluster. You can find this value under the System tab in the Okera UI.
OKERA_PLANNER=okera.yourcompany.com:12050
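Before wiring this value into the cluster configuration, you can sanity-check that the planner endpoint is reachable from your network. A minimal sketch using bash's built-in /dev/tcp (the host and port are placeholders to be replaced with your own values):

```shell
# Probe a host:port endpoint over TCP and report the result.
check_planner() {
  local host=$1 port=$2
  # /dev/tcp is a bash feature, so invoke bash explicitly.
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Replace with the host/port shown in the Okera UI System tab.
check_planner okera.yourcompany.com 12050
```

If the endpoint is unreachable, check your VPC routing and security groups before continuing with the cluster setup.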

4. Configure Authentication¶
Okera provides a few options for configuring transparent authentication, so that logged-in Databricks users are seamlessly authenticated with Okera. While all options provide user authentication, some are recommended over others depending on the Databricks cluster mode. Okera works in conjunction with the Databricks cluster mode security properties to ensure proper authentication.
In all cases, Okera leverages JSON Web Tokens (JWTs) to communicate credentials.
At a high level, Okera provides three authentication options:
- Providing a signing key, which is shared between the Databricks cluster(s) and Okera cluster. This is recommended for High Concurrency Clusters.
- Providing a cluster-specific JWT at cluster creation time. All users on the cluster share the same credential. This is recommended for Standard and Single Node clusters.
- Allowing the Okera client library to self-sign JWTs. This should not be used in production, as a malicious user could impersonate others; it is suitable for proofs of concept and experiments.
Depending on which authentication method you choose, you need to input the corresponding environment variables.
High Concurrency Clusters¶
For High Concurrency clusters, you will configure the Databricks cluster with a private key that will be used to sign JWTs.
Okera recommends you use a dedicated public/private key pair, which can be generated and configured using the instructions in the JWT section.
It is highly recommended to use Databricks Secrets to store the private key and avoid having it in clear text in the Databricks configuration UI. The rest of the documentation will assume that Databricks Secrets are being used.
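As a sketch, a dedicated RSA key pair can be generated with OpenSSL (the file names here are illustrative; see the JWT section for the full configuration steps):

```shell
# Generate a 2048-bit RSA private key for signing Okera JWTs,
# then extract the matching public key.
openssl genrsa -out okera_jwt_key.pem 2048
openssl rsa -in okera_jwt_key.pem -pubout -out okera_jwt_key.pub
```

The private key would then be stored as a Databricks secret (for example with `databricks secrets put --scope okera --key signing_key`, assuming the Databricks CLI is configured), and the public key configured on the Okera cluster.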
If using this authentication option, add this value in the Environment Variables section under Advanced Options -> Spark:
OKERA_SIGNING_KEY={{secrets/okera/signing_key}}
A full configuration for a High Concurrency cluster sets both OKERA_PLANNER and OKERA_SIGNING_KEY.
Standard and Single Node Clusters¶
For Standard and Single Node clusters, users are not sufficiently isolated from each other, and Okera recommends these cluster types be used by a single user.
You will configure the Databricks cluster with a token specifically for the user who will be using it, and this token will be used when communicating with Okera and authorizing the user's actions.
It is highly recommended to use Databricks Secrets to store the token and avoid having it in clear text in the Databricks configuration UI. The rest of the documentation will assume that Databricks Secrets are being used.
If using this authentication option, add these values in the Environment Variables section under Advanced Options -> Spark:
OKERA_USER=john.doe
OKERA_TOKEN={{secrets/okera/user_token}}
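If you need to confirm which user a given token represents, you can inspect its payload claims. A minimal sketch using a fake, unsigned token (real tokens are base64url-encoded and signed, and may need padding restored before decoding; the `sub` claim name is an assumption about the token layout):

```shell
# A JWT is three dot-separated segments: header.payload.signature.
# The fake token below carries the payload {"sub":"john.doe"}.
token="$(printf '%s' '{"alg":"none"}' | base64 | tr -d '=').$(printf '%s' '{"sub":"john.doe"}' | base64).x"
# Extract the middle (payload) segment and decode it.
payload=$(printf '%s' "$token" | cut -d. -f2)
printf '%s' "$payload" | base64 -d
```

The decoded subject should match the OKERA_USER value you configured for the cluster.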
A full configuration for a Standard or Single Node cluster sets OKERA_PLANNER together with OKERA_USER and OKERA_TOKEN.
Self-signed JWTs (POC Quickstart)¶
For test and proof-of-concept clusters, it is possible to configure the cluster in an insecure quickstart mode. This still provides per-user authentication, but a sophisticated user can impersonate other users.
Warning
This mode is only suitable for testing and should not be used in production.
If using this authentication option, add this value in the Environment Variables section under Advanced Options -> Spark:
OKERA_ENABLE_SELF_SIGNED_TOKEN=true
A full configuration for a self-signed (POC only) cluster sets OKERA_PLANNER and OKERA_ENABLE_SELF_SIGNED_TOKEN.
5. Verify the integration¶
Once you've completed the above setup, you can go ahead and start your cluster.
Once the cluster has started, you can verify connectivity to Okera by selecting from the okera_sample.whoami table in a Databricks notebook.
This can be done using any of the Spark languages, for example:
%sql
SELECT * FROM okera_sample.whoami
This will return the SSO logged-in Databricks user.

6. Simplify setup by using cluster policies¶
Once you've gone through the end-to-end setup, you can add the Okera configuration to a new or existing cluster policy template.
Create a new cluster policy in the Databricks admin console.

Below is an example cluster policy for High Concurrency clusters, with the init script in S3. Replace the values:
- Ensure the correct environment variable and value are set for your authentication method (the example below is for High Concurrency clusters).
- Replace OKERA_PLANNER with your host/port.
- Replace init_scripts.*.s3.destination with the location of your init script. Change it to init_scripts.*.dbfs.destination if the script is in DBFS.
- Only include aws_attributes.instance_profile_arn if your init script is in S3.
{
  "spark_env_vars.OKERA_SIGNING_KEY": {
    "type": "fixed",
    "value": "{{secrets/okera/signing_key}}"
  },
  "spark_env_vars.OKERA_PLANNER": {
    "type": "fixed",
    "value": "<okera.yourcompany.com:12050>"
  },
  "init_scripts.*.s3.destination": {
    "type": "fixed",
    "value": "s3://okera-release-uswest/2.3.0/dbx/init-okera.sh"
  },
  "aws_attributes.instance_profile_arn": {
    "type": "fixed",
    "value": "arn:aws:iam::335456599346:instance-profile/dbx-role"
  }
}
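If your init script is on DBFS instead (per the upload step earlier), the init script entry in the policy would look like this sketch (the path assumes the dbfs:/okera-integration directory created above):

```json
"init_scripts.*.dbfs.destination": {
  "type": "fixed",
  "value": "dbfs:/okera-integration/init-okera.sh"
}
```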
