Configuration

Okera is configured through a YAML configuration file, which is applied to a cluster using the okctl CLI tool. The parameters you can specify in this file are described in Okera Configuration Parameter Reference.

Configuration Example

Here is a sample configuration file:

ports:
  # Ports that must be exposed for clients connecting to Okera. These ports need to
  # be accessible from where the client is connecting from.

  # This is the port for the Okera REST API and Web UI. This needs to be accessible for
  # clients connecting from the browser.
  REST: 8083

  # The planner and worker API ports. These ports are required for all clients (e.g.
  # spark or python users) to access metadata and data.
  PLANNER_API: 12050
  WORKER_API: 13050

  # This is the port to access the presto API endpoint for users connecting via JDBC.
  PRESTO_API: 14050

cluster:
  #
  # These are configurations for the kubernetes cluster. The CIDR blocks are
  # used exclusively within the kubernetes cluster for internal communication.
  # The CIDR blocks should *not* overlap with CIDR blocks currently being used,
  # including the VPC. For example, these blocks should *not* fall within the
  # VPC range.
  #
  portRange: "1025-65535"
  podCidr: "172.23.0.0/16"
  serviceCidr: "172.34.0.0/16"

config:
  #
  # Configurations for deploying an Okera cluster. Dummy values are set as
  # examples and should be replaced.
  # For configs marked [Optional], simply comment that section out if that
  # configuration is not used.
  #

  #
  # General system-wide configs.
  #
  CLUSTER_NAME: Dev Cluster
  CLUSTER_LABEL: dev
  TZ: "America/New_York"
  UI_TIMEOUT_MS: 60000
  OKERA_LEGACY_TOKEN_ESCAPE: false
  ENABLE_PARAMETRIZED_URI_GRANTS: false
  ENABLE_PARAMETRIZED_GRANTS: false
  ENABLE_LEGACY_URI_CHECKS: false
  CUSTOM_GROUP_RESOLVERS: <java-path1>, <java-path2>
  AUTOTAGGER_CONFIGURATION: true
  TRANSFORM_UDF_PRIORITIES:
  OKERA_STAGING_DIR: <path-to-Okera-audit-logs>
  AUDIT_LOGS_SYNC_FREQUENCY_MINS: 30
  OKERA_SCRIPTS_DIR: /opt/scripts

  # 
  # Set threshold for large queries, in bytes. Queries larger than this are rejected.
  #
  MAX_REQUEST_SIZE_BYTES: 52751601

  #
  # Logging and auditing directories. Okera will need write access to this path prefix.
  #
  WATCHER_AUDIT_LOG_DST_DIR: s3://company/okera/audit
  WATCHER_LOG_DST_DIR: s3://company/okera/logs
  WATCHER_S3_REGION: us-east-1
  WATCHER_S3_ENCRYPT: true
  WATCHER_LOG_PARTITIONED_UPLOADS: false
  REST_SERVER_LOG_LEVEL: DEBUG

  #
  # Users and groups (comma-separated) that have admin privileges on the catalog
  #
  CATALOG_ADMINS: admin

  #
  # Proxy pushdown mode policy enforcement parameters
  #
  PRESTO_ENABLE_PROXY: true
  PRESTO_ENABLE_QUERY_LOGGING: false
  PRESTO_PROXY_JDBC_PUSHDOWN: true
  OKERA_CTE_REWRITE_ENABLED_ENGINES:
  PRESTO_PROXY_DEBUG_ENABLED: true
  PRESTO_RESOURCE_GROUP_FILE_LOCATION: 
  PRESTO_SHOULD_USE_RESOURCE_GROUPS: false

  #
  # Snowflake policy synchronization parameters
  #
  POLICY_SYNC_INTERVAL: 1800
  POLICY_SYNC_USERS_ALLOWED_LIST:
  POLICY_SYNC_ROLE_PATTERN: OKERA_%s
  POLICY_SYNC_SCHEDULER_ENABLED: true

  #
  # MySQL database url and connection credentials.
  #
  CATALOG_DB_ENGINE: mysql
  CATALOG_DB_URL: aurora.xyz.us-east-1.rds.amazonaws.com:3306
  CATALOG_DB_USER: dbusername
  CATALOG_DB_PASSWORD: password

  #
  # Names of databases within the database instance where Okera stores metadata. Okera
  # will need read and write access to these databases and they must all be unique.
  #
  # CATALOG_DB_HMS_DB can be set to the name of your existing Hive Metastore (HMS) database
  # (often this is called 'hive') to have the Okera catalog share the existing HMS objects.
  #
  CATALOG_DB_HMS_DB: okera_hms
  CATALOG_DB_SENTRY_DB: okera_sentry
  CATALOG_DB_USERS_DB: okera_users

  # 
  # Enable Hive Metastore (HMS) 2 Schema
  # 
  ENABLE_HMS_2_SCHEMA: false

  #
  # Enable OkeraFS AWS S3 Access Proxy
  #
  REST_SERVER_ENABLE_ACCESS_PROXY: true

  #
  # Configure the JWKS endpoint
  #
  JWT_JWKS_URL: <URL to OAuth identity provider>

  #
  # [Optional] Configuration to enable JWT authentication
  #
  ENABLE_JWT: true
  JWT_ALGORITHM: RSA512
  JWT_PUBLIC_KEY: s3://company/okera/conf/id_rsa.512.pub
  SYSTEM_TOKEN: s3://company/okera/conf/okera.token

  #
  # [Optional] Set RS_ARGS
  #
  RS_ARGS: <options>

  # [Optional] LDAP configuration
  #
  LDAP_HOST: ldap.company.com
  LDAP_PORT: 636
  LDAP_BIND_TEMPLATE: cn=%s,ou=users,dc=company,dc=com

  # [Optional] OAUTH configuration
  #
  OAUTH_PROVIDER: google
  OAUTH_SECRETS: file:///etc/okera/client_secrets.json
  OAUTH_SCOPES: openid profile email api://<xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx>/okera/okera_auth_scope

  # 
  # [Optional] Presto configuration
  # 
  PRESTO_HTTP_CLIENT_MAX_CONNECTIONS_PER_SERVER: 
  PRESTO_HTTP_CLIENT_MAX_REQUESTS_QUEUED_PER_SERVER:

Ports Configuration

This section of the file defines the public ports on which the cluster is accessible. This includes the UI/REST port, Planner API port, Worker API port and the Presto/JDBC API port. You can modify these and then run okctl update to update an existing Okera cluster.

Cluster Creation Configuration

This section is used only when the Okera installer deploys the Kubernetes cluster. It is not used when deploying Okera on an existing Kubernetes cluster (such as AKS or EKS).

Settings in this section rarely need to be modified, and they can be changed only before the cluster is created. The okctl prepare command uses these values to prepare the Kubernetes cluster for creation.
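
The comments in the cluster section of the sample file state that podCidr and serviceCidr must not overlap with CIDR blocks already in use, including the VPC. The following is a minimal Python sketch of how you might check this before creating the cluster. It is not part of okctl; the find_overlaps helper and the VPC range shown are hypothetical and used only for illustration.

```python
import ipaddress


def find_overlaps(cluster_cidrs, in_use_cidrs):
    """Return (setting, cluster_cidr, in_use_cidr) for every overlapping pair."""
    overlaps = []
    for name, cidr in cluster_cidrs.items():
        cluster_net = ipaddress.ip_network(cidr)
        for used in in_use_cidrs:
            if cluster_net.overlaps(ipaddress.ip_network(used)):
                overlaps.append((name, cidr, used))
    return overlaps


# Values taken from the sample configuration file.
cluster_cidrs = {"podCidr": "172.23.0.0/16", "serviceCidr": "172.34.0.0/16"}

# Hypothetical VPC range, for illustration only.
vpc_cidrs = ["10.0.0.0/16"]

print(find_overlaps(cluster_cidrs, vpc_cidrs))  # []

# A range that sits inside podCidr is reported as a conflict.
print(find_overlaps(cluster_cidrs, ["172.23.128.0/20"]))
```

An empty list means none of the cluster blocks collide with the ranges you supplied. Any tuple in the result names the setting that would need a different block.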

Okera Configuration

The config section contains the Okera configuration settings. The sample file above lists a variety of configuration options. You can modify these and then run okctl update to update an existing Okera cluster.

RS_ARGS Options

A variety of configuration options can be specified in RS_ARGS. Some of them are described in the following table. Specify options separated by a space. When specified, each option begins with a double-dash (--).

| Option | Values | Description |
| --- | --- | --- |
| abort_on_error | true, false | In 2.1.x and later versions, many data correctness issues fail queries, as opposed to being silently ignored (for example, by converting data into NULL). To revert to the earlier behavior, add --abort_on_error=false to RS_ARGS. |
| allow_nl_in_csv | true, false | Enables CSVs with embedded newlines within records that are enclosed within the quote separator. To enable such CSVs, specify --allow_nl_in_csv=true. |
| audit_request_query | true, false | Enables or disables Spark query logging. See Enable Spark Query Logging for Databricks. |
| batch_check | integer | Specifies the amount of cluster memory used for a query. When querying tables that have columns with very large values (e.g. 100KB), specify --batch_check=64 (or another low number). |
| defer_signed_urls_initial_num_tasks | integer | Specifies the number of nScale tasks for which deferred URI signing is performed. The default is 64. Okera defers signing for either the number of tasks specified by defer_signed_urls_initial_num_tasks or for the percentage of tasks specified by defer_signed_urls_initial_percent_tasks, whichever is lower. |
| defer_signed_urls_initial_percent_tasks | percentage | Specifies the percentage of nScale tasks for which deferred URI signing is performed. The default is 25%. Okera defers signing for either the number of tasks specified by defer_signed_urls_initial_num_tasks or for the percentage of tasks specified by defer_signed_urls_initial_percent_tasks, whichever is lower. |
| encryption_key_path | path | Specifies the path to the file containing the encryption and decryption key for nScale. See Configuration File Encryption Settings for nScale. There are no constraints on the size of this file, except that it cannot be empty. Okera applies SHA256 to the contents to derive a 32-byte (256-bit) key. |
| idle_query_timeout | integer | The time, in seconds, that a query may be idle (no processing work is done and no updates are received from the client) before it is cancelled. If set to 0, idle queries never expire. The query option QUERY_TIMEOUT_S overrides this setting, but, if set, --idle_query_timeout represents the maximum allowable timeout. |
| idle_session_timeout | integer | The time, in seconds, that a session may be idle before it is closed (and all running queries are cancelled). If set to 0, idle sessions never expire. |
| ssl_enable | true, false | Indicates whether SSL is enabled. |
| ssl_private_key | path | Specifies the path to the SSL private key. |
| ssl_server_certificate | path | Specifies the path to the SSL certificate. |
| zstd_default_compression_level | integer | Specifies the default zstd compression level. |

For example:

RS_ARGS: --ssl_enable=true --ssl_private_key=/path/in/pod/to/key --ssl_server_certificate=/path/in/pod/to/cert --idle_session_timeout=0 --idle_query_timeout=0
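
Two of the options in the table describe small computations: the key derived from the encryption_key_path file, and the "whichever is lower" rule for deferred URI signing. The Python sketch below illustrates both as described in the table. It is not Okera's implementation; the function names are hypothetical, and rounding the percentage down to a whole number of tasks is an assumption.

```python
import hashlib


def derive_key(file_contents: bytes) -> bytes:
    """Derive a 32-byte key by applying SHA256 to the key file contents."""
    if not file_contents:
        # The table states that the key file cannot be empty.
        raise ValueError("the encryption key file cannot be empty")
    return hashlib.sha256(file_contents).digest()


def deferred_task_count(total_tasks, initial_num_tasks=64, initial_percent_tasks=25):
    """Return the lower of the fixed task count and the percentage of tasks.

    Defaults mirror the documented defaults (64 tasks, 25%).
    Rounding down is an assumption made for this sketch.
    """
    by_percent = total_tasks * initial_percent_tasks // 100
    return min(initial_num_tasks, by_percent)


key = derive_key(b"any non-empty file contents")
print(len(key))                    # 32

print(deferred_task_count(100))    # 25, because 25% of 100 is lower than 64
print(deferred_task_count(1000))   # 64, because 64 is lower than 25% of 1000
```

Because the key is a hash of the file contents, a key file of any length produces a key of the same size.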

Examples

For the following examples we assume you have a file called odas.yaml that contains your existing configuration.

Modify Ports

Suppose you want to change the UI/REST port from 8083 to 8000. To do this, edit the odas.yaml file and change the REST value (in the ports section) to 8000.

After saving the file, issue okctl update to update the cluster:

$ ./okctl update --config odas.yaml

This restarts the Okera cluster and applies your updated port to your Okera configuration.

Modify the Catalog Database

Suppose you want to change the database server used to back your Okera cluster. To do this, edit the odas.yaml file and change the following values in the config section:

CATALOG_DB_ENGINE: mysql
CATALOG_DB_URL: odasdb.cyn8yfvyuugz.us-west-2.rds.amazonaws.com
CATALOG_DB_USER: odas
CATALOG_DB_PASSWORD: odas12345!

After saving the file, issue okctl update to update the cluster:

$ ./okctl update --config odas.yaml

This restarts the Okera cluster and applies your database server change to your Okera configuration.

Configuration Kubernetes Model

Okera is a Kubernetes-native application, and uses the ConfigMap and Secret objects in Kubernetes to store its configuration and make it available to the running cluster. The configuration file discussed above is translated by okctl into these two objects.

It may be helpful to understand how this translation happens, in case you want to update your configuration manually or use a different system to set and update it (e.g. Helm).

The configuration is mounted into each running Pod as follows:

...
envFrom:
- configMapRef:
    name: default-odas-config
- configMapRef:
    name: odas-config
...
volumeMounts:
- mountPath: /etc/secrets
  name: secrets
  readOnly: true
...
volumes:
- name: secrets
  secret:
    defaultMode: 420
    secretName: secrets

In other words:

  1. The default-odas-config ConfigMap is mounted as environment variables into each pod. This ConfigMap object stores default values that are necessary for the cluster to be functional but can be overridden.
  2. The odas-config ConfigMap is mounted as environment variables into each pod. This ConfigMap object stores values set by the user.
  3. The secrets Secret is mounted as a set of files under the /etc/secrets folder. This Secret object stores more sensitive values.
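
Both ConfigMaps are listed under envFrom, with default-odas-config first. When a key appears in more than one envFrom source, Kubernetes uses the value from the last source, which is why user settings override the defaults. The Python sketch below models that precedence as a dictionary merge. The default values shown are hypothetical and are not the actual contents of default-odas-config.

```python
def effective_env(default_config, user_config):
    """Merge two ConfigMaps the way envFrom does: later sources win."""
    merged = dict(default_config)
    merged.update(user_config)
    return merged


# Hypothetical contents of default-odas-config, for illustration only.
default_odas_config = {"TZ": "UTC", "UI_TIMEOUT_MS": "60000"}

# A value set by the user in the config section of the YAML file.
odas_config = {"TZ": "America/New_York"}

env = effective_env(default_odas_config, odas_config)
print(env["TZ"])             # America/New_York
print(env["UI_TIMEOUT_MS"])  # 60000
```

Keys that the user does not set keep their default values, and keys that appear in both take the user's value.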

When okctl update is run, it does the following:

  1. For each value in the ports section, it updates the Service object with the updated port value.
  2. For each value in the config section, it:

    1. Updates the odas-config ConfigMap, if it is a non-sensitive value.
    2. Updates the secrets Secret, if it is a sensitive value, and puts a reference to that file in the odas-config ConfigMap. For example, if the configuration file has the following setting:

      SYSTEM_TOKEN: file:///path/to/system.token
      

      It will be stored like this in secrets:

      SYSTEM_TOKEN_0: <base64 contents of /path/to/system.token>
      

      It will be stored like this in odas-config:

      SYSTEM_TOKEN: /etc/secrets/SYSTEM_TOKEN_0
      
  3. It restarts each pod by updating an annotation with a SHA256 hash of the contents of odas-config and secrets.
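
The translation of a sensitive value and the checksum used to trigger a restart can be sketched in Python as follows. This is an illustration of the steps above, not okctl's actual code. The function names are hypothetical, and the way the two objects are serialized before hashing is an assumption.

```python
import base64
import hashlib
import json

SECRETS_MOUNT = "/etc/secrets"  # mount path from the pod spec shown earlier


def translate_sensitive(key, file_contents, index=0):
    """Split one sensitive setting into a Secret entry and a ConfigMap entry."""
    secret_name = f"{key}_{index}"
    # The Secret holds the base64-encoded file contents.
    secret_entry = {secret_name: base64.b64encode(file_contents).decode("ascii")}
    # The ConfigMap holds the path where the Secret is mounted in the pod.
    config_entry = {key: f"{SECRETS_MOUNT}/{secret_name}"}
    return secret_entry, config_entry


def restart_checksum(config_map, secret):
    """SHA256 over both objects; a changed value yields a new annotation.

    Serializing as sorted JSON is an assumption made for this sketch.
    """
    payload = json.dumps({"config": config_map, "secret": secret}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


secret, config = translate_sensitive("SYSTEM_TOKEN", b"example-token-contents")
print(config)  # {'SYSTEM_TOKEN': '/etc/secrets/SYSTEM_TOKEN_0'}

before = restart_checksum(config, secret)
after = restart_checksum({**config, "TZ": "UTC"}, secret)
print(before != after)  # True
```

Because the annotation is part of the pod template, a different checksum gives Kubernetes a reason to replace the running pods.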

If you need to update the contents of these objects yourself, follow a similar pattern. After updating the Service port definitions, odas-config or secrets, you can restart all the pods by requesting Kubernetes to delete the existing ones:

kubectl delete pods --all

Notes: Run this delete command in the Kubernetes namespace in which you installed Okera (by default, this is the namespace named default).

Any value updated in odas-config or secrets will be the same in all Okera pods. To make a change for a specific set of pods (e.g., only Planner pods), edit that specific object type (e.g., Deployment or DaemonSet). This is not recommended and should only be done in consultation with Okera support.

Path Support

For any setting that is considered sensitive, you can supply one of the following types of values:

  1. A local fully qualified path to the file, e.g. file:///path/to/file.
  2. An S3 path to the file, e.g. s3://bucket/path/to/file.
  3. An ADLS Gen2 path to the file, e.g. abfss://<file_system>@<account_name>.dfs.core.windows.net/mypath/.
  4. A base64-encoded version of the value, e.g. base64://<base64 contents>.
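
The four value types above are distinguished by their prefix. The Python sketch below shows one way to classify a value and decode the inline base64 form. It is an illustration only and not how Okera parses these values; the function names and the labels returned are hypothetical.

```python
import base64

# Prefixes listed above, mapped to a descriptive label (labels are hypothetical).
SCHEMES = {
    "file://": "local file",
    "s3://": "S3 object",
    "abfss://": "ADLS Gen2 path",
    "base64://": "inline base64 value",
}


def classify(value):
    """Return (kind, remainder) for a sensitive setting value."""
    for prefix, kind in SCHEMES.items():
        if value.startswith(prefix):
            return kind, value[len(prefix):]
    raise ValueError(f"unsupported value type: {value!r}")


def decode_inline(value):
    """Decode a base64://<contents> value into raw bytes."""
    kind, remainder = classify(value)
    if kind != "inline base64 value":
        raise ValueError("not a base64:// value")
    return base64.b64decode(remainder)


print(classify("file:///path/to/file"))      # ('local file', '/path/to/file')
print(classify("s3://bucket/path/to/file"))  # ('S3 object', 'bucket/path/to/file')

inline = "base64://" + base64.b64encode(b"example-token").decode("ascii")
print(decode_inline(inline))                 # b'example-token'
```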