
Release Notes

2.5.1 (05/05/2021)

Bug Fixes and Improvements

  • Added the ability to control the partition split optimization threshold from Hive/Spark using the recordservice.planner.partition-split-threshold parameter.
  • Fixed an issue when rewriting in_set(needle, haystack) when the haystack was NULL.
  • Improved the performance of symlink tables (e.g. Delta and Hudi).
  • Fixed handling of SQL comments in Workspace.
  • Added support for BINARY datatype for ORC files.
  • Fixed an issue when using MySQL 8 or higher as the backing database.
  • Fixed an issue when updating view lineage.
  • Fixed an issue when dropping some ABAC grants created in prior versions of Okera.
  • Fixed an issue where error messages would not be cleared when re-running a data registration crawler.
  • Improved performance of loading table metadata when many attributes are present.
  • Fixed an issue when initiating a SAML login from the IdP.
  • Improved handling of quoted values in in_set rewrite.
  • Improved Okera cookie security by setting it to httpOnly.
  • Fixed an issue when using multiple Snowflake warehouses in the same Snowflake account in the same Okera deployment.
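As a sketch of the new partition-split control above, the parameter could be set per session from Hive or Spark SQL (the threshold value shown is purely illustrative):

```sql
-- Hypothetical value: tune the partition split optimization threshold
SET recordservice.planner.partition-split-threshold=256;
```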

Notable and Incompatible changes

The Okera UI has changed its cookie to use httpOnly, and it is now managed by the server. This approach improves security, as no JavaScript code will be able to access information stored in the cookie.

Due to this, downgrading from 2.5.1 to 2.5.0, or to any prior version of Okera, in the same cluster will cause a UI login failure since the cookies are not compatible. To resolve this, you can clear all storage for the site, including cookies.

Snowflake Case-Insensitive Identifiers

By default, Okera will now set QUOTED_IDENTIFIERS_IGNORE_CASE to true when communicating with Snowflake, treating all identifiers as case-insensitive. This behavior can be disabled by setting the OKERA_JDBC_CONNECTION_SNOWFLAKE_CASE_INSENSITIVE_COLUMNS configuration parameter to false.
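If you need to preserve case-sensitive identifier behavior, the setting can be disabled in the cluster configuration. A minimal sketch (the surrounding configuration structure depends on your deployment):

```yaml
# Restore case-sensitive Snowflake identifiers
OKERA_JDBC_CONNECTION_SNOWFLAKE_CASE_INSENSITIVE_COLUMNS: false
```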

2.5.0 (03/03/2021)

New Features

New data source connections experience

New UI experience to easily create connections to data sources. Read more on Connections.

Data registration now supports creating crawlers on any connection source (such as JDBC-backed datasources), not just object storage. Read more on Crawlers.

UI Connections page

New access levels and scopes for data management delegation

Okera can now be used to grant granular permissions in order to delegate access to role, crawler, and data connection creation and management. As part of this, the ROLE, CRAWLER and DATACONNECTION object scopes have been added to the permissions model. These new access levels and scopes are now available in the UI Policy Builder. In addition, grants on URIs are now also available in the Policy Builder.

These changes impact who can see the Roles page, Registration page, and Connections page in the UI. See Access Delegation for more information and for the full set of available Access Levels.

Access levels in policy builder


Please read the Notable and Incompatible changes section below about the impact this change can have on how you manage permissions.

Dremio integration (beta)

Dremio is now a supported JDBC data source type, and you can read more about configuring it here.

User Attributes Improvements

A user's attribute values are now visible to that user from the homepage, including the user attribute source that provided them (e.g. ldap).

It is also now possible to configure a script (or multiple scripts) to source user attributes, which is useful if they need to be sourced from a bespoke location such as a custom REST API or storage.

To enable this, set the USER_ATTRIBUTES_SCRIPT configuration setting to a path where the script is available (or comma-separated paths), e.g. on S3 or a local file, and okctl will ensure those scripts are available inside the running pod. Alternatively, if you are not using okctl, ensure that the value is set to a path (or comma-separated paths) that are available inside the container.
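For example, a minimal configuration sketch (the script paths are hypothetical):

```yaml
# Hypothetical paths: one script on S3 and one local to the container
USER_ATTRIBUTES_SCRIPT: s3://company-config/user_attrs.py,/opt/okera/user_attrs.py
```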

You can read more about user attributes here, and specifically about using custom scripts for sourcing them here.

ORC format support

Added support for the ORC file format, allowing you to register and query data stored in this format.


When using the Data Registration experience to crawl for data in object storage, ORC data will not be automatically discovered. This will be added in a future release.

Easier diagnostics collection

Added the ability to use the EXECUTE DIAGNOSTICS DDL to automatically collect logs and other diagnostics to an S3 path.

The command will return immediately with the location the diagnostics are being written to, and they will be collected in the background.

By default, these diagnostics will be uploaded to a write-only Okera-owned bucket, okera-diagnostics, but this can be overridden:

  1. On a per invocation basis using EXECUTE DIAGNOSTICS LOCATION 's3://...'.
  2. By setting the DIAGNOSTICS_COLLECTION_DEFAULT_LOCATION configuration setting to a desired value (e.g. s3://company-okera-diags/).
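For example, a per-invocation override could look like the following (the bucket path is hypothetical):

```sql
-- Collect diagnostics into a custom S3 location for this invocation only
EXECUTE DIAGNOSTICS LOCATION 's3://company-okera-diags/2021-05-05/'
```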


Only administrators can run the EXECUTE DIAGNOSTICS command.

Improvements to okctl

  • Added the ability to see the token claims using okctl tokens describe <token>, where <token> is the name of the token file (in the .auth folder by default).
  • Added the ability to refresh a token using okctl tokens refresh <token>, where <token> is the name of the token file (in the .auth folder by default). This will use the generated private key for refreshing, while preserving the groups in the token.
  • Added a --duration flag for the init, refresh and create sub-commands to control the duration that the token will be valid for. The default duration is one year.
  • Added the ability to disable specific validators (which may not be applicable to the environment) in the configuration YAML, by setting the validator to false:
      disk_space: true
      database: true
      ldap: true
      storage_read: true
      storage_write: true
      dns: true
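The token sub-commands above could be combined as in this sketch (the token file name and the duration format are assumptions, not documented syntax):

```shell
# Inspect the claims in a token file from the .auth folder
okctl tokens describe mytoken

# Refresh the token, preserving its groups, with a 90-day validity
okctl tokens refresh mytoken --duration 90d
```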

Bug Fixes and Improvements

  • Added the ability to specify a list of groups during role creation in the Okera UI.
  • Added ability to add/remove multiple groups at the same time in the Okera UI.
  • Added the ability to filter the permissions table by role name in the Okera UI.
  • Fixed an issue where the Policy Builder summary would scroll away in lower resolutions in the Okera UI.
  • Added ability to quarantine specific databases using the QUARANTINED_DATABASES configuration key, which can be set to a comma-separated list of database names.
  • Added ability to specify a list of groups to grant to when creating roles, using CREATE ROLE <role> WITH GROUPS group1 group2 ....
  • Added ability to specify more than one group when granting or revoking a role, using GRANT ROLE <role> TO GROUPS group1 group2 ... and REVOKE ROLE ... FROM GROUPS group1 group2 ....
  • Improved the default timeout for communication between the built-in PrestoDB and the rest of the Okera services.
  • Added ability to configure per-task and per-worker memory limit on nScale workers from various clients.
  • Fixed an issue when using CREATE EXTERNAL TABLE ... LIKE TEXTFILE with a custom delimiter, where that delimiter would not be recognized for inferring the table schema.
  • Added the ability to specify a set of default delimiters to use when inferring schemas in TEXTFILE by setting the OKERA_DEFAULT_FIELD_TERMINATOR configuration setting to a list of characters.
  • Fixed an issue where the underlying connection to a JDBC-backed datasource could leak in some cases when an exception happened.
  • Added ability to control the default fetch size for JDBC-backed datasources by setting the OKERA_JDBC_RECORDS_BATCH_MAX_CAPACITY configuration setting to the desired value.
  • Improved handling for data sources that do not support transparent query pushdown.
  • Performance improvements for planning and execution of queries against JDBC-backed datasources.
  • Fixed an issue when revoking a URI grant with a privilege level other than ALL.
  • Added the ability to enable LDAP/SAML/OAuth UI authentication at the same time if more than one authentication mode is necessary.
  • Fixed an issue where fully unnesting a table could produce column names that exceed the metastore's limit of 128 characters. These columns are now auto-truncated to retain as much of the original name as possible while fitting within the character limit. If truncation is not possible, an error is thrown.
  • Fixed an issue when doing transparent query pushdown where it would incorrectly use the cluster-external load balancer (e.g. ELB) rather than the cluster-local service.
  • Fixed an issue where a failed crawler would sometimes not report failure properly.
  • Added the ability to specify the default zstd compression level by adding --zstd_default_compression_level=<level> to RS_ARGS.
  • Fixed an issue where creating JDBC-backed tables could fail with a permission error.
  • Fixed an issue in PyOkera where it would fail to properly parse the JWT if pyjwt was installed.
  • Fixed an issue in PyOkera where if both db and a filter were passed to list_datasets, it would incorrectly omit the db parameter.
  • Fixed an issue in the Okera UI where the "Use this dataset" sample text would not escape the database and table names.
  • Fixed an issue when using scan_as_pandas where it would incorrectly reset the row index for every batch.
  • Improved the behavior in PyOkera for refreshing the token when it is expired during a scan_as_json or scan_as_pandas invocation using the presto dialect.
  • Fixed an issue where the value of the JWT_TOKEN_EXPIRATION configuration setting would not always be used, instead using the default of 1 day expiration.
  • Improved memory accounting and dynamic batching when computing queries over large rows.
  • Improved performance of JWT signature validation.
  • Added the in_set(<needle>, <comma-separated haystack>) builtin.
  • Fixed an issue where the generated SQL could be missing enclosing parentheses when containing multiple predicates.
  • Improved error handling when registering a set of tables using ALTER DB ... LOAD DEFINITIONS(). By default (and changed from prior releases), an error will not be fatal and will continue registering tables. The number of tables added/skipped/failed can be seen by looking at the okera.jdbc.tables-XXX properties on the database. To revert to the previous behavior of aborting on any error, set the configuration value of JDBC_LOAD_DEFINITION_ABORT_ON_ERROR to true, either globally or per database (using DBPROPERTIES).
  • Fixed an issue where the identifier quoting character for MySQL and PostgreSQL would not always be used when doing a query rewrite.
  • Fixed an issue where timestamps too early to be represented correctly would cause incorrect values to be returned; such values are now clamped to the start of the Gregorian calendar.
  • Improved handling of concurrent partition fetching for large partition counts.
  • Fixed an issue where Okera could fail to drop a partitioned table when the DROP TABLE DDL referenced the table in non-lowercase form.
  • Fixed an issue where creating a table that specified an INPUTFORMAT but no ROW FORMAT caused that INPUTFORMAT to be used by default for subsequent table creations that did not specify a ROW FORMAT.
  • Fixed an issue where worker discovery could fail if one of the worker pods was stuck in Pending state.
  • Added the ability to specify the result limit for Presto mode queries as well in the Okera Workspace.
  • Fixed an issue when rendering complex types when running Presto mode queries in the Okera Workspace.
  • Fixed a bug where Policy Builder failed to properly update policies when there was a conflict.
  • Improved granularity of error reporting in Data Registration. There is now a distinction between an error during background crawling execution and an error regarding a specific table.
  • Added the ability to search for crawlers by their source in the Okera Data Registration UI.
  • Fixed a bug where crawler names that contained reserved characters were not being escaped.
  • Removed the /__api/log endpoint, which was used by the Okera UI to log errors.
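One of the fixes above, auto-truncating over-long unnested column names to the metastore's 128-character limit, can be illustrated with a minimal Python sketch. The hash-suffix strategy here is an assumption for keeping truncated names distinct, not Okera's documented behavior:

```python
import hashlib

# Metastore column-name limit referenced in the release notes
METASTORE_MAX_LEN = 128

def truncate_column_name(name: str, max_len: int = METASTORE_MAX_LEN) -> str:
    """Trim an auto-generated column name to the metastore limit,
    keeping as much of the original name as possible."""
    if len(name) <= max_len:
        return name
    # Hypothetical: keep a short hash suffix so truncated names stay distinct
    suffix = hashlib.md5(name.encode()).hexdigest()[:8]
    return name[: max_len - len(suffix) - 1] + "_" + suffix

print(len(truncate_column_name("a" * 200)))  # 128
```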

Notable and Incompatible changes

Data Connections

  • Connection names are now validated to only include allowed characters by default ([a-zA-Z_0-9]+). You can disable this behavior by setting the OKERA_ENABLE_CONNECTION_NAME_VALIDATION setting to false.
  • The user and password parameters when creating connections via DDL have been renamed to user_key and password_key, to aid in understanding they do not store the credentials themselves but only the reference to them (e.g. in AWS Secrets Manager or a Kubernetes Secret).
  • In Okera 2.2.x and 2.3.x, when creating JDBC-backed datasources using a property file, Okera would implicitly create a data connection for it. This behavior is now disabled, as all new registration should happen using data connections. These automatically created data connections cannot be used in the Data Registration flow and should ideally be replaced with explicitly created connections.


Starting in Okera 2.5.0, there are several permission-related behavior changes relative to prior releases. These are generally part of the new permission delegation capabilities, but the notable changes are included below:

  • To create a Crawler, a user now requires either the CREATE_CRAWLER_AS_OWNER or ALL privilege on the CATALOG scope.
  • To use a data connection when creating a table or database, the user must have the USE privilege on that data connection.
  • To grant access to an object that a user has WITH GRANT OPTION on, that user will also need MANAGE_PERMISSIONS on the role they want to grant that permission to. To revert back to the old behavior, set the ENABLE_LEGACY_GRANTABLE_ROLES configuration setting to true.
  • Starting in 2.5.0, access to the Okera Workspace will be granted to all users (it is granted by default to okera_public_role). If you wish to limit access to Workspace to specific users:

    1. Revoke access to Workspace by removing it from okera_public_role. You can do this from the Roles UI or by running the DDL:

      REVOKE SELECT ON TABLE okera_system.ui_workspace from ROLE okera_public_role;

    2. Edit your cluster configuration to set GRANT_WORKSPACE_TO_PUBLIC_ROLE to false.

    You can then grant the okera_workspace_role to any specific groups or users that you want to have access to the workspace feature.

  • Starting in 2.5.0, access to the following pages is controlled by whether the user has access to the relevant object, as opposed to explicitly granting access to that page (read more about this here):

    • Roles: access to the Roles page is now available if the user has permission to manage any ROLE object.
    • Tags: access to the Tags page is now available if the user has permission to manage any ATTRIBUTE NAMESPACE object.


  • Database, dataset, and catalog filters have been removed from the Roles page. Permissions now appear on their respective objects on the Data page.
  • If a user has two grants on the same data, one on an entire scope (e.g. table/database/catalog) with no WHERE clause and one with a WHERE clause, the WHERE clause will no longer be applied, as the broader grant provides full access.

SQL Keywords

The following terms are now keywords, starting in 2.5.0:

  • DENY


Bug Fixes and Improvements

  • Fixed an issue when connecting from Databricks and using the Databricks-signed JWT could fail when a query was run multiple times.
  • Fixed an issue where partitioned symlink tables (e.g. Delta) would fail to plan if the number of partitions was high.


Bug Fixes and Improvements

  • Improved logging in the PrestoDB connector to properly log both the Presto query ID as well as the Okera task IDs when available.
  • Added the ability to set the default quote character (the default is ") for CSV files when using the built-in CSV SerDe. This can be set in the following ways:

    1. On the SERDEPROPERTIES when creating or altering a table (e.g. to disable quote handling by removing the quote character):



      ALTER TABLE mydb.mytable SET SERDEPROPERTIES('quoteChar'='')

    2. On the TBLPROPERTIES to set the default value (this can be overwritten with the above SERDEPROPERTIES):


    3. Change the global default for the cluster by setting TEXT_TABLE_DEFAULT_QUOTE_CHAR to the desired value, e.g. '' to disable the quote character.

  • Fixed an issue with handling of CSV files when split across multiple tasks and running count(*).

  • Upgraded the packaged Snowflake JDBC driver to v3.2.17.


Bug Fixes and Improvements

  • Fixed an issue where, when using AWS Glue, a specific database in the UI would take a long time to load.
  • Improved handling of S3 connection errors (e.g. retries, service unavailable), including the ability to set new values via configuration.
  • Increased the default PrestoDB TaskUpdate limit.


Bug Fixes and Improvements

  • Fixed an issue where if a view and the underlying base table had mismatched types on a column, Okera would produce data that matched the underlying table type and not the view type, causing an issue for upstream engines (e.g. Presto). The new behavior is that an implicit cast will be added if possible, and if not, the query will be failed.
  • Fixed an issue in the PrestoDB and PrestoSQL client libraries, where if a column name was also a reserved keyword (e.g. database or metadata) AND the column was a complex type (e.g. STRUCT), the client library would produce an invalid planning request.
  • Fixed an issue in the transparent Snowflake access where it would use an external LB (if configured) rather than the cluster-local cerebro-worker service address.
  • Fixed an issue in the transparent Snowflake access where queries that used IF were not properly rewritten.
  • Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for TIMESTAMP columns (they would correctly retain microsecond and nanosecond values if present).


Bug Fixes and Improvements

  • Updated the PostgreSQL driver to resolve a security vulnerability.
  • Fixed an issue when querying tables that have columns with very large values (e.g. 100KB), where a simple query that references that column would fail due to exhausting the cluster memory. To resolve this, set RS_ARGS to include --batch_check=64 (or another relatively low number). In 2.3.x, this value is set to -1 (i.e. no limit) by default, but in future Okera releases (2.4.x and above) it will be set to a low number by default.


Bug Fixes and Improvements

  • Added an option in EMR bootstrap to specify a custom image location using --local-worker-image.
  • Fixed an issue where Presto would report an error of Could not compute splits and not specify the underlying Okera error.
  • Improved S3 IO retry handling for improved latency when errors occur.
  • Fixed an issue in co-located workers that would attempt to open a connection to the planner unnecessarily.
  • Added the ability to specify DROP as a privilege for attribute namespaces, databases, tables and views.
  • Added the ability to control the number of Okera tasks for a query in Presto using the okera.max_tasks Presto session property.
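The new Presto session property above could be set as follows, assuming standard Presto SET SESSION syntax (the value is illustrative):

```sql
-- Cap the number of Okera tasks for subsequent queries in this session
SET SESSION okera.max_tasks = 16;
```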

Notable and Incompatible changes

Automatic estimated table statistics

Okera will now automatically collect and store estimated table statistics. These can be optionally enabled (they are disabled by default) and leveraged by Hive, Spark and Presto for query planning and cost-based optimization.

To enable for Spark and Hive, edit hive-site.xml and add:


To enable for Presto, you can do either of the following options:

  1. Edit the Okera connector's configuration and add okera.task.plan.enable-okera-stats=HMS_OKERA.
  2. Set the okera.stats_mode Presto session property to HMS_OKERA.
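Option 2 could look like the following in a Presto session, assuming standard Presto SET SESSION syntax:

```sql
SET SESSION okera.stats_mode = 'HMS_OKERA';
```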

Note that these estimated statistics are complementary to the normal Hive Metastore statistics, and there is no change in behavior if those statistics are currently being utilized (if set, they take precedence over Okera's estimated statistics).

Okera JDBC Driver Update

Okera has added support for specifying TimeZoneID as a URL property when using Okera's Presto JDBC driver to connect via JDBC clients. For example, the connection property can be set as TimeZoneID:UTC. If this value is not specified, the driver will use the system's current time zone ID.

Valid values for this property are the time zone IDs specified in the IANA Time Zone Database.

Default Docker Repository Changed to Quay

Okera has changed the Docker repository that images are pushed to from DockerHub to Quay, due to the impact of the newly enforced rate limits in DockerHub.

Okera's images are available with the prefix (the image names have not changed).


New Features

Okera Co-located Compute (EMR)

You can now run Okera's scalable data plane co-located with your EMR cluster(s), allowing you to transparently (and at zero or marginal cost) scale your Okera secure compute capacity as you provision more EMR capacity, whether by scaling a single cluster or running multiple independent clusters. For supported data sources and queries, secure data access happens on the EMR nodes, benefiting from network and compute locality and allowing you to maintain a much smaller central Okera cluster, dramatically reducing TCO.

EMR clusters running with Okera's co-located compute do not need to have direct S3 access (via IAM), as the co-located data plane gets temporary secure access to the data it needs, thereby reducing the surface area of data access and allowing you to maintain high security, while not sacrificing usability (such as prohibiting SSH access to EMR).


Okera's co-located data plane is supported beyond EMR. To learn how to leverage it in other deployment environments, such as Kubernetes, please contact Okera Support.

New UI Databases Page

Okera has a new catalog browsing and management experience, centered around Databases and the Datasets in them. Users can now create and manage Okera databases, as well as permissions and tags at the database level.

To search across all datasets, click on Search all datasets to leverage the new dataset search page.

Click here to learn more about the new functionality.

Transparent Snowflake Access (Beta)

Okera now supports improved access control on Snowflake data sources, pushing down full queries (including joins and aggregations) to Snowflake while enforcing the complete access policy as well as audit log entries.

Users, such as data analysts, can connect their favorite SQL tool (e.g. DBeaver, Tableau, Looker) via Okera’s ODBC/JDBC endpoint, and their queries will be automatically sent to Snowflake, after being authorized and audited by Okera (and if the user does not have permission to access the data they are trying to access, the query will be rejected). With this new capability, you get the benefit of Snowflake's native performance scale and Okera's complete policy and auditing capabilities.

In future releases, more data sources will be supported for transparent access integration as well.

Read more here.

Improved Databricks Integration

Okera has an improved integration with Databricks, enforcing full fidelity policies while maintaining complete compatibility with Spark and Databricks, including Databricks Delta Lake. The new integration is transparent in its execution, and allows Databricks Spark to fully control the data access, thus retaining its performance and functionality.

This new functionality is on by default, and you can read more about how to easily integrate a Databricks cluster (or clusters) with Okera here.

PrestoSQL Support

Okera now supports PrestoSQL (both the open-source and Starburst variants) in addition to PrestoDB. This allows you to connect your existing PrestoSQL clusters to Okera, benefiting from Okera's unified catalog, access control and auditing capabilities.


PrestoSQL 338 is supported.

EMR 6.1 Support

EMR 6.1 is now supported, allowing you to leverage the latest functionality on EMR, such as Spark 3, Hive 3 and PrestoSQL.

You can read more about integrating with EMR 6.1 here.


Integration with EMR 6.1 clusters is only supported with Okera clusters 2.3.0 and higher.

Bug Fixes and Improvements

  • Fixed a UI bug where updating a permission without any changes caused an error and would remove the permission.
  • Added a clear error message when a user that does not have permission to create an attribute namespace tries to create one in the UI.
  • Fixed an issue where a LEFT OUTER JOIN would cause an error when querying two unnested columns.
  • Fixed an issue where in some cases, a user that was granted WITH GRANT OPTION could grant a higher access level on that object.
  • Okera UDFs that are used by external systems (such as Spark) are now registered in the okera_udfs database.
  • Ensure that the automatic Presto tuning generates default task counts which are a power of 2 (as required by Presto).
  • Added a request ID to the audit logs for Presto and Spark queries, making it possible to link together all the audit log entries for a single query.
  • Added the ability to specify a specific password to use for the Presto connection when using PyOkera, to allow for connecting to non-token enabled Presto clusters.
  • Improved autotuning that automatically detects cluster resizing for the Okera client libraries for Presto, Hive and Spark.
  • Fixed an issue in PyOkera where custom user claims were not properly taken into account when using a token_func after a token expired.
  • Improved handling of spaces and periods in database, table, and column names.
  • Fixed an issue when running count(*) on JSON data when multiple splits are generated.
  • Added support for setting database description via DDL:

    ALTER DATABASE <db_name> SET COMMENT '<database comment>'

  • Fixed an issue for partitioned Delta tables.

  • Improved handling in CREATE TABLE ... LIKE PARQUET for partitioned tables:
    • A data file will automatically be found inside one of the partitions without needing to be manually specified.
    • The partition scheme can be auto-inferred from the on-storage structure (similar to the behavior in data registration crawlers), without needing to explicitly be set.
  • Reject all unparseable view statements when creating or altering the view definition, and improve error handling if an unparseable view is already present in the catalog.
  • In PyOkera, scan_as_json and scan_as_pandas now take an optional presto_headers dict keyword argument for custom headers to use when making the Presto request.
  • Improved metadata fetching performance when executing Presto queries, especially ones that reference many catalog objects.
  • Don't automatically populate large table statistics for Spark and Hive if no real statistics are present. The prior behavior can be enabled by setting the Hive configuration property to true.
  • Increase the default timeout when creating an Okera connection in the client libraries to 30 seconds (prior value was 10 seconds).
  • Fixed an issue where user attributes were not read correctly if the source system (e.g. LDAP) had them in non-lowercase.
  • Fixed an issue in okctl that did not properly handle validation of parameters that supported multiple path values (e.g. `JWT_PUBLIC_KEY: s3://path1,s3://path2`).
  • Added the ability to control the timeout for the Kubernetes liveness and readiness probes by setting the OKERA_HEALTHCHECK_TIMEOUT_MS configuration value.
  • Fixed an issue for feature flag toggling for non-catalog administrators.
  • Improved role conflict detection for grants on differing scopes that don't overlap in their ABAC conditions.
  • Improved handling for ALTER DATABASE ... LOAD DEFINITIONS OVERWRITE to not remove tag assignments (at either the table or column level) if they are already present.
  • HAVING ATTRIBUTE conditions will now be taken into account for grants that also contain WHERE filters. The prior behavior can be enabled by setting the IGNORE_HAVING_EXPR_ON_FILTER configuration setting to true.

Notable and Incompatible changes

Oracle NUMBER type

In 2.3.0 and higher, the NUMBER type in an Oracle table will be represented as a DECIMAL(38,6) in Okera.

Credential files for JDBC-backed data sources

In 2.3.0 and higher, when creating a JDBC-backed data source using a credentials file, the creating user must have permissions on that URI (expressed as a URI grant).

For example, suppose your credentials file is located at s3://mycompany/config/ and you execute a command that contains:

  'credentials.file' = 's3://mycompany/config/',

This will fail with an error if you do not have a URI grant that gives you access to s3://mycompany/config/.

You can create such a grant with:

GRANT ALL ON URI s3://mycompany/config TO ROLE <some role>

Note that you can also grant access to the entire bucket (or any prefix-level you desire).


Bug Fixes and Improvements

  • Fixed an issue in Hive and Spark client libraries when generating planning SQL that contained DATE types.
  • Fixed an issue in scanning partitioned Delta tables.
  • Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for TIMESTAMP columns (they would correctly retain microsecond and nanosecond values if present).
  • Fixed an issue in the PrestoDB client library, where if a column name was also a reserved keyword (e.g. database or metadata) AND the column was a complex type (e.g. STRUCT), the client library would produce an invalid planning request.

Notable and Incompatible changes

Default Docker Repository Changed to Quay

Okera has changed the Docker repository that images are pushed to from DockerHub to Quay, due to the impact of the newly enforced rate limits in DockerHub.

Okera's images are available with the prefix (the image names have not changed).


Bug Fixes and Improvements

  • Fixed an issue in PrestoDB split computation in very large clusters.
  • Removed the restriction on column comments by default (limit was 256 characters). Note that this changes the underlying HMS schema, and if connected to a shared HMS, should be disabled by setting the HMS_REMOVE_LENGTH_RESTRICTION configuration value to false.
  • Improved resilience in handling crawling errors.
  • Fixed an issue with WITH GRANT OPTION on non-ALL privileges.
  • Restricted querying of datasets with nested types when policies are applied via tags on the nested types.
  • Fixed an issue when paginating in the Datasets page.
  • Fixed an issue for the /api/get-token endpoint.


New Features

JDBC Data Sources

Custom JDBC Driver Support

Okera has added support for specifying custom JDBC data sources beyond those that ship out of the box. If you would like to connect to a custom JDBC data source, please work with Okera Support to define the JDBC connection information appropriately for your driver.

Secure values for JDBC properties

Okera has added support for referring to secret values in the JDBC properties file from local secret sources such as Kubernetes secrets, as well as secure Cloud services such as AWS Secrets Manager and AWS SSM Parameter Store.

For example:


This will look up in AWS SSM Parameter Store the value for /mysql/username and /mysql/password. You can similarly use file:// for local files (using Kubernetes mounted secrets) or awssm:// to use AWS Secrets Manager.
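A minimal sketch of such a properties file (the key names and secret paths are hypothetical; only the file:// and awssm:// schemes are named above):

```properties
# Hypothetical keys and paths; values are resolved from the secret store
user = awssm://mysql/username
password = awssm://mysql/password
```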


If using AWS SSM Parameter Store or AWS Secrets Manager, you will need to provide the correct IAM credentials to access these values.

Predicate pushdown enabled by default for JDBC-backed data sources

Starting in 2.2.0, predicate pushdown for JDBC-backed data sources is enabled by default (this was previously available as an opt-in property on a per-data source level), and will be used whenever appropriate.

To disable predicate pushdown for a particular JDBC-backed database or table, you can specify 'jdbc.predicates.pushdown.enabled' = 'false' in the DBPROPERTIES or TBLPROPERTIES (you can read more here).
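For example, a sketch of disabling pushdown for a single table (the table name is illustrative; the property key comes from this release note):

```sql
ALTER TABLE salesdb.orders SET TBLPROPERTIES('jdbc.predicates.pushdown.enabled' = 'false');
```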

BLOB/CLOB Datatype support

Okera now supports BLOB and CLOB datatypes for Oracle JDBC Data Sources.

Autotagging for JDBC-backed data sources

When registering JDBC-backed data sources and loading the tables, Okera will now run its autotagger by default when registering.

You can disable this behavior by specifying okera.autotagger.skip=true in your DBPROPERTIES.
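For example, a sketch (the database name is illustrative, and this assumes the property can also be applied via Hive-style ALTER DATABASE; it can equally be supplied in DBPROPERTIES at registration time):

```sql
ALTER DATABASE my_jdbc_db SET DBPROPERTIES('okera.autotagger.skip' = 'true');
```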

UI Improvements for Tabular Results

The UI now makes it easy to copy or download results as CSV from tables. This can be used in the Workspace and when previewing a dataset.

Operability Improvements

  • Okera will now generate correlated IDs for the planner and worker tasks to make it easier to correlate the task information in the logs. For example, the planner may have a task of the form 9b45f8b08c76352e:85a51f5579300000, and if N worker tasks were generated, they would be of the form 9b45f8b08c76352e:85a51f5579300001, 9b45f8b08c76352e:85a51f5579300002, and so on.

  • System administrators can now easily access the Planner and Worker debug UIs from the System page in the Okera UI.

  • System administrators can now see how many unique users have accessed data via Okera in the System page in the Okera UI, both all-time and in the last 30 days.
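As an illustration of the correlated-ID scheme above, the following sketch (a hypothetical helper, not part of Okera) groups task IDs by the shared prefix that identifies the query:

```python
def correlate_tasks(task_ids):
    """Group task IDs by the query prefix before the colon.

    IDs look like '9b45f8b08c76352e:85a51f5579300001': the part before
    the colon identifies the query, the suffix identifies the task.
    """
    groups = {}
    for tid in task_ids:
        prefix, _, _suffix = tid.partition(":")
        groups.setdefault(prefix, []).append(tid)
    return groups
```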

Domino Data Labs integration

When run in Domino Data Lab environments (starting with Domino version 4.3.0), PyOkera now has built-in integration that leverages the automatically generated JWT tokens, enabling transparent authentication between Domino environments and Okera deployments.

import os
from okera.integration import domino

# The Domino context picks up the environment's auto-generated JWT token
ctx = domino.context()

# Authenticate transparently and scan a dataset into a pandas DataFrame
with ctx.connect(host=os.environ['OKERA_HOST'], port=int(os.environ['OKERA_PORT'])) as conn:
    df = conn.scan_as_pandas('drug_xyz.trial_july2020')

PrestoDB Improvements

  • Several internal improvements were made to Okera's PrestoDB connector to increase performance in areas such as data deserialization, asynchronous processing, improved memory allocation, etc.
  • Several improvements were made to auto-tune Okera's built-in PrestoDB cluster to better match its environment's capabilities.
  • When filtering on columns of DATE type, the PrestoDB connector will now push those filters down into the Okera workers.
  • Okera's PrestoDB connector has added support for table statistics if these are set on the table in the Okera catalog. These can be set by setting the numRows table property, e.g.:

    ALTER TABLE mydb.mytable SET TBLPROPERTIES('numRows'='12345')

These table statistics will be taken into account by Presto's cost-based optimizer (e.g. for JOIN reordering).

User Attributes

Okera has added the user_attribute(<attribute>) builtin, which retrieves attribute values on a per-user basis. These can be used in policy definitions, e.g. to apply dynamic per-user filters.

These attributes can be fetched from AD/LDAP by setting the LDAP_USER_ATTRIBUTES configuration value to a comma-separated list of attributes to fetch from AD/LDAP, e.g.:

LDAP_USER_ATTRIBUTES: region,manager,businessUnit

If the attribute is missing for the executing user, the function returns null.
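For example, a sketch of a per-user row filter in an ABAC grant (the table, column, and role names are illustrative; consult the Okera policy documentation for the exact grant syntax):

```sql
GRANT SELECT ON TABLE sales.transactions
WHERE region = user_attribute('region')
TO ROLE sales_analyst_role;
```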

Hudi and Delta Lake Support (Experimental)

Okera has added experimental support for Delta Lake and Apache Hudi tables.

You can create Apache Hudi tables using the CREATE EXTERNAL TABLE DDL, e.g.:

CREATE EXTERNAL TABLE mydb.my_hudi_tbl
LIKE PARQUET 's3://path/to/dataset/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet'
PARTITIONED BY (year int, month int, day int)
LOCATION 's3://path/to/dataset';

You can create Delta Lake tables using the CREATE EXTERNAL TABLE DDL, e.g.:

CREATE EXTERNAL TABLE mydb.my_delta_tbl (id BIGINT)
LOCATION 's3://path/to/dataset/';

The following limitations should be kept in mind:

  • In both cases, tables need to be explicitly registered, as crawling will not properly identify these tables as Hudi or Delta Lake.
  • For Apache Hudi, Okera only supports Snapshot Queries on Copy-on-Write tables and Read Optimized Queries on Merge-on-Read tables.

New Privacy Functions

Okera has added several privacy functions typically used in health- and medical-related environments:

  • phi_zip3
  • phi_age
  • phi_date
  • phi_dob

These are compliant with the HIPAA safe-harbor standard.
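For example, a sketch of applying these functions in a query (the table and column names are illustrative, and the argument types are assumptions):

```sql
SELECT phi_zip3(zip_code), phi_age(age), phi_date(admission_date), phi_dob(date_of_birth)
FROM hospital.patients;
```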

Nested Type Tagging (Beta)

Okera has added the capability (disabled by default) to tag nested types (specifically, ARRAY and STRUCT types), and have those tags be inherited when used in views that unnest the nested portion.

For example, if you have a table with the following schema:

id  bigint  
s1  struct<
      a1: array<struct<
        f1: string,
        f2: string,
        a2: array<string>>>>

You could tag s1.a1.f1, s1.a1.f2 and s1.a1.a2, and when unnested, they will retain their tags.

Additionally, Okera has added the ability to fully unnest a table, and inherit the tags on that object into the view - this is done using the SELECT ** operator.

For example, using the table with the schema above, you could create the following view (with tags on the three leaf fields):

CREATE VIEW mydb.unnested_view AS SELECT ** FROM mydb.nested_table

This will create a view which has the following schema:

id  bigint  
s1_a1_item_f1   string  
s1_a1_item_f2   string  
s1_a1_item_a2_item   string

with s1_a1_item_f1, s1_a1_item_f2 and s1_a1_item_a2_item retaining their tags. You can then grant access to this view and use normal attribute-based policies and transformations.

To enable this feature, set the FEATURE_UI_TAGS_COMPLEX_TYPES and ENABLE_COMPLEX_TYPE_TAGS configuration values to true.
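For example, in the cluster configuration:

```
FEATURE_UI_TAGS_COMPLEX_TYPES: true
ENABLE_COMPLEX_TYPE_TAGS: true
```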


ABAC policies that apply to tags assigned to nested types will not be enforced on the base table, so take care to only give access to unnested views in these cases.

Bug Fixes and Improvements

  • Fixed an issue in the Okera Presto connector where some queries against information_schema could cause an exception and fail.
  • Fixed an issue in the Tags management UI where the number of tagged datasets could be incorrect.
  • Improved suppression of internal databases in the UI.
  • Added the ability to run GRANT for ADD_ATTRIBUTE and REMOVE_ATTRIBUTE at the CATALOG scope.
  • Removed the need to have ALTER permissions on tables and databases in order to run add/remove attributes (ADD_ATTRIBUTE and REMOVE_ATTRIBUTE are now sufficient).
  • Added an implementation of listTableNamesByFilter in the HMS connector.
  • Added the ability to configure timeouts for internal checks to account for large network latency; this can be set using OKERA_PINGER_TIMEOUT_SEC.
  • Added support for implicitly upcasting Parquet columns of type INT32 to BIGINT in the table schema.
  • Improved experience when previewing JDBC-backed datasets by limiting the amount of data fetched.
  • Added DELETE, UPDATE and INSERT as grantable privileges.
  • Fixed an issue in okctl where it would not report an error and abort if there was an error updating the ports.
  • Improved handling of small files for Parquet-backed datasets.
  • Fixed an issue where the Autotagger would not correctly handle columns with DATE type.
  • Improved handling for JDBC-backed tables where the table name contained . characters.
  • When running a worker load balancer (default for EKS and AKS environments), the built-in Presto cluster will by default use the internal cluster-local load balancer and not the external one.
  • Fixed an issue with pagination on the Datasets page where paging to the end of the list and back showed an inaccurate count.
  • Improved diagnostic information available when registering a JDBC-backed table that has unsupported types or invalid characters.
  • Improved filter push down for Oracle tables for columns of DATE and TIMESTAMP type.
  • Improved handling of DECIMAL, NCHAR and FLOAT datatypes for JDBC-backed data sources.
  • Improved inference of BIGINT values in text values (e.g. CSV).
  • Fixed an issue where workers were not generating SCAN_END audit events.
  • Fixed an issue where table/view lineage information could be duplicated.
  • Upgraded Gravity to 6.1.39.
  • Removed the hardcoded fetch_size in PyOkera and added the ability to explicitly set it using the fetch_size keyword argument to exec_task.
  • Fixed an issue where pagination on the Datasets UI could get into an inconsistent state when filtering by tags.
  • When using tokenize, referential integrity will now also be maintained for INTEGER columns.
  • Added IF NOT EXISTS and IF EXISTS modifiers to the GRANT and REVOKE DDLs, respectively.
  • Fixed an issue when doing writes in EMR Spark when metadata bypass was enabled for non-partitioned tables.
  • Added limited support for Avro files with recursive schemas, which will allow a maximum cycle depth of 2.

Notable and Incompatible changes

Upgrading from 2.1.x

When upgrading from an Okera 2.1.x version lower than 2.1.10, some functionality may stop working in the 2.1.x deployment if running side-by-side or downgrading back to 2.1.x. If preserving this behavior is important, please upgrade to 2.1.10 first or work with Okera Support.

Container user is no longer root

Starting in 2.2.0, the process user inside all Okera containers (running as Kubernetes pods) is no longer root; processes now run under dedicated users.

As part of this change, any files that are downloaded into the container (e.g. from S3 for custom certificates) are now placed in /etc/okera and not /etc.

SQL Keywords

The following terms are now keywords, starting in 2.2.0:



Bug Fixes and Improvements

  • Fixed a forward-compatibility issue with 2.2.0.


Bug Fixes and Improvements

  • Fixed an issue where a user could create external views in any database using Presto's CREATE VIEW DDL, even though they may not have the appropriate grant on that database.


Bug Fixes and Improvements

  • Fixed an issue where schema inference (used in Data Registration and CREATE TABLE LIKE FILE) for JSON-based tables would incorrectly remove leading underscores and double underscores from column names.


Bug Fixes and Improvements

  • Added the ability to specify additional Presto configuration values using the PRESTO_ARGS configuration value, e.g. PRESTO_ARGS: "task.concurrency=16 task.http-response-threads=100". Using this capability should be done in coordination with Okera Support.
  • Fixed an issue where the REST Server pod would not restart quickly enough if a failure happened on startup.
  • Fixed an issue where the error dialog in the Data Registration page could be uncloseable.
  • Improved Presto behavior on creating and closing connections to the ODAS workers.
  • Changed the default Presto maximum stage count to 400.
  • Fixed an issue where an ABAC policy that included row filters would generate a WHERE clause missing parentheses around the filter expression.
  • Fixed an issue where newer Parquet files that included the INT_32 and INT_64 logical types would cause a Parquet read error.


Bug Fixes and Improvements

  • Fixed an issue in Data Registration UI that made pagination behave erratically when using Glue as the backing metastore.
  • Fixed an issue in Data Registration UI where auto-discovered tags would not show up if the column was not editable.
  • Fixed an issue where the in-memory group cache would be overridden with empty groups.
  • Fixed an issue where CSV files that had empty strings would not be automatically converted to NULL values.


Bug Fixes and Improvements

  • Added the ability to increase the REST and UI timeouts to arbitrary values (previously limited to 60 seconds).
  • Removed a restriction when unnesting nested types that did not allow WHERE clauses to be used in those queries.


Bug Fixes and Improvements

  • The HMS length restriction removal will now run on startup for all clusters (unless disabled), not just upgraded clusters.
  • Fixed an issue where keywords were not always escaped in ABAC transforms and filters.
  • Fixed an issue in the UI where the privacy function dropdown in the Visual Policy Builder had the wrong default.
  • Fixed an issue where ODAS errors were not propagating to Presto when creating an external view from Presto.


Bug Fixes and Improvements

  • Updated Presidio to not require any network connectivity in all cases.
  • Fixed an issue where the Datasets UI would render table headers over some dropdowns.
  • Improved the performance of the Datasets page when loading individual datasets.


Bug Fixes and Improvements

  • Fixed an issue when creating a crawler with single-file datasets, causing the registered datasets to use the directory path instead of the file path.
  • Fixed an issue where editing policies in the Policy Builder could in some cases cause an error on saving the edited policy.
  • Fixed an issue where restricted keywords used in the Policy Builder were not escaped properly in some cases.
  • Fixed an issue where using MySQL as the backing database could cause some data types to not be converted correctly via JDBC in some cases, causing exceptions.


Bug Fixes and Improvements

  • Several improvements to handling of S3 errors and failure conditions for very large files.
  • Fixed an issue where in some cases (typically large) Parquet files would cause an error when being queried.
  • Fixed an issue in the Databricks connector where a table would be missing the SerDe path parameter when the table was not cluster local.
  • Fixed an issue in policies where if you had two ABAC policies, one which included a transform and one which did not, they would not compose correctly (this resulted in giving less access than desired in all cases).
  • Fixed an issue when upgrading from 1.5.x where the DB schema upgrade could fail under certain conditions.
  • Fixed an issue in the Presto connector where if a JDBC client issued a query against INFORMATION_SCHEMA with underscores, Presto would error out.


New Features

Extending attribute-based access control policies to support data transformation functions and row filtering

Attribute-based access control policies now support data transformation functions and row filtering. This is supported with an extension to the current ABAC grant syntax. Read more here.

This can significantly simplify how policies can be managed, reduce or eliminate the need to create views and make it much easier to manage complex policies. You can easily create these policies in the UI by specifying ABAC access conditions in the policy builder. See examples of the different policies you can create using Okera's Policy Engine here.

Tag Cascading and Inheritance

Attributes assigned on tables/views and their columns will automatically cascade to all descendant views. Read more about this capability here.

View Lineage

Okera will now maintain the lineage of datasets created after 2.1. For a given dataset (table or view), it is now possible to know all the views that descend from it, and for a view, all of its ancestors. This information is also exposed in the UI. Read more here.

Improved Privacy Functions

Okera has a revamped set of privacy-related functions to aid in anonymization with different guarantees. Read more about Okera's privacy and security functions here.

Users page and inactivity report

The web UI now offers a Users page, where all the users that have authenticated in the system can be viewed, along with their groups as of the last time they made a request through Okera. This makes it easier to understand whether a user should have access to something or not.

The Users page also lets you generate a user inactivity report, which shows all the users who have some level of access on a database but have not queried that database within a certain time period. This report helps identify users who may no longer need access to data since they are not utilizing it, thereby improving least privilege.

Enable access to the users page in the UI by granting a user or group access to the okera_access_review_role.

GRANT ROLE okera_access_review_role TO GROUP marketing_steward_group;

Read more about this capability here.

Access control for attribute namespaces

You can now control access to tag management by namespace. ATTRIBUTE NAMESPACE has been added as a new object type, and the CREATE, ADD_ATTRIBUTE, and ALL access levels are supported on it. For example, if you wanted to give a role the ability to create, drop, and assign attributes from a particular attribute namespace, you would use the following:

GRANT ALL on ATTRIBUTE NAMESPACE marketing TO ROLE marketing_steward;

In addition, if you wish to grant access to the Tags page in the UI so that a user can create and manage tags there, grant okera_tags_role to that user's group. Note that in order to assign attributes on data, the user will still need the correct privileges on the data they are assigning to. See the docs for more details.

Other tag management updates

  • Only editable tags (ones a user has CREATE or ALL on) show up on the Tags page.
  • Adding/removing tags from a dataset will ignore tags a user does not have privileges on.
  • Tags page now requires SELECT access on okera_system.ui_tags. The built-in okera_tags_role has this privilege by default.

VIEW_AUDIT privilege to control access to audit logs

You can now grant the VIEW_AUDIT privilege on data to enable a user to view audit log information for that object. For example, if a user only has VIEW_AUDIT on two databases, they will only see reports for those two databases in the UI or when querying the okera_system.reporting_audit_logs view. Note that to see the Reports page in the UI, the user also needs the okera_reports_role; see the docs for more details.


The default privilege required to view audit log records for objects has been changed from SELECT to VIEW_AUDIT.

Presto SQL in workspace

The workspace now features Presto SQL mode, which allows executing queries against an Okera cluster using Presto. See the docs for more details.

Creating Views using Presto

It is now possible to create and delete external views via Presto directly. These views will be stored in the Okera catalog (as external views) and be accessible via Presto.

To do this, execute a DDL like this in Presto (e.g. via the Okera Workspace or an application such as SQL Workbench or DBeaver):

CREATE VIEW some_db.some_view AS SELECT ....

In order to support this, Okera has added extensions to the CREATE VIEW DDL statement when executed in Okera:

    CREATE VIEW <db>.<view name> (
        <col name> <col type>,
        ...
    ) AS <view statement>

This DDL requires the user to specify the full set of columns that the view statement produces (including types), as the view statement is not parsed or analyzed.

Improved JSON file format support

  • Starting from 2.1.0, Okera uses simdjson to read JSON file format data.
  • Several improvements to the auto-inference of JSON file formats, with support for appropriate data types, validated by extensive testing on auto-generated JSON files and files from several internet sources.

Oracle data source support

Oracle is now supported as a JDBC data source. The Oracle JDBC driver will need to be configured as a custom driver. Read more on how to configure this here.

More metadata available on dataset details

There are a number of improvements to the dataset details view in the UI:

  • Much more detailed technical metadata is included
  • It is now possible to edit the description of a dataset
  • It is now possible to edit column comments in the dataset schema
  • View parent/child lineage information is available for views created in Okera
  • Column-level tags are included in the details view along with table-level tags
  • The dataset schema can be filtered to columns with column-level tags

Ability to create views from the UI

Admin users can now create an internal view based on an existing view or table from the datasets page. Choosing the destination database and view name and selecting the columns to be included in the new view are supported. For more see the Datasets page documentation.

Permission management improvements

  • A Permissions tab has been added to all details cards on the Datasets page. Like on the Roles page, you'll be able to fully manage permissions associated with the specified dataset. You can read more about this on the Datasets page.
  • Data transforms and row filtering added to Policy Builder UI.
  • Ability to edit existing policies in the UI. To learn more about editing and managing policies, go to Editing Permissions.
  • An admin user can now create a view from a dataset

Reports page improvements

The Reports page has a number of major improvements, including:

  • New reports for Activity overview, Active users over time, Top accessed tags, and Recent queries.
  • SQL used to generate the reports is available in-page and can be run in Workspace.
  • Custom time ranges are available within the last 90 days.
  • Reports queries use human-readable times instead of unix timestamps.
  • Reports can now be filtered by dataset and tag as well as database.
  • Reports filters now allow for multi-selection.
  • Visual updates.

For more details, see the Reports page documentation.

UI Visual and interaction updates

  • There are small visual updates and improvements throughout the UI, focused on clarity and better use of screen real estate.
  • The output area of the Workspace has been reworked to better keep a user's context and show history.

Updates to reporting and audit views

  • New audit table and view have been added to the okera_system database: analytics_audit_logs and reporting_ui_analytics. These are populated by the cdas-rest-server container and are primarily used to track and analyze usage of the UI. For now, the UI only writes to them on page visits. The data is stored in the same logging directory as regular audit logs, in its own subfolder.

  • The view used by reports, okera_system.reporting_audit_logs, now includes start_time_utc and end_time_utc columns of type TIMESTAMP_NANOS for better readability.

Improved REST server diagnostics logging

  • Logs now include timestamp and log level.
  • Log level can be set via REST_SERVER_LOG_LEVEL. DEBUG, INFO, WARNING, ERROR, and CRITICAL are all valid.
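For example, in the cluster configuration:

```
REST_SERVER_LOG_LEVEL: DEBUG
```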

Bug Fixes and Improvements

  • Upon renaming a table, the attributes from the old table are now carried over to the renamed table.
  • Performance improvements to parallelize queries containing UNION ALL. Such queries are now executed as multiple tasks across workers, rather than the single task used prior to this fix.
  • Performance improvements when dropping tables with a large number of partitions.
  • Performance improvements for DROP DATABASE CASCADE to drop all tables under the database.
  • For JDBC data sources, large numeric/decimal types (precision > 38) are now handled. Precision is capped at 38 for larger numerics/decimals or when precision/scale is unspecified in the source; for unspecified scale, the default is 6 starting in Okera 2.1.0.
  • Fix to handle negative decimals in JDBC data scans. Large-scale decimals are rounded using HALF_DOWN.
  • For the CREATE VIEW command, if no database is specified, the default database is used for the view.
  • Fix for parse errors on views with JDBC tables in the view definition (joins between JDBC and non-JDBC tables).
  • Arrays of arrays are now supported in Okera.
  • log4j2 support: Okera now uses log4j2 as the default logging framework. A backwards-compatibility bridge, as recommended by the Apache project, is used for libraries that still use log4j, such as certain Hadoop/Hive libraries.
  • Support for LIMIT on JDBC data sources. This improves previewing data from JDBC data sources, where the Okera WebUI limits results to 100 rows by default.
  • Better error handling for JDBC data source auto-inference errors on unsupported datatypes. More info here.
  • Fix for a regression on authorization of CTEs (WITH clause) with aggregations in the query.
  • For views on Avro-backed tables whose column definitions (e.g. complex structs) exceed 4,000 characters, the schema from the Avro file is used instead of creating the physical columns in the database. Note that DESCRIBE FORMATTED for such tables/views still does not show the column details; DESCRIBE <table/view> shows the correct definitions.
  • Several bug fixes to handle Parquet file format issues gracefully. For example, Parquet files with the unsupported DataPageHeaderV2 would previously crash the workers; these are now handled with a graceful error message.
  • Reduced the Sentry/Hive pinger log level from error to warn. This improves error diagnostics for real catalog exceptions; previously, invalid errors flooded the logs.
  • A bug fix for count(*) on a JDBC view to return results instead of failing.
  • Ability to specify the Glue AWS region, which can be separate from the cluster's default region.
  • The recordservice catalog in Presto is disabled by default starting from 2.1.0.
  • Additional controls for JDBC (PrestoDB) -> Okera configurations. For example, the RPC timeouts can now be controlled via the environment settings OKERA_PRESTO_PLANNER_RPC_MS and OKERA_PRESTO_WORKER_RPC_MS.
  • Minor improvement to remove SerDe info from the SHOW CREATE TABLE output. Previously, re-running the output of SHOW CREATE TABLE would error out due to duplication of the SerDe and FILE FORMAT info; with this fix, the output no longer contains the SerDe info and can be re-run as is.
  • Fix for an Avro file format error when a union has default values in it.
  • UI: Better row hover state highlighting on grouped table rows.
  • UI error boundaries introduced for increased JavaScript stability.
  • Policy Builder layout and formatting improvements.
  • Contextual restrictions on Policy Builder UI including conditional disabled create/edit/delete.
  • More nuanced permission conflict reasons.
  • Upgraded node to 12.15.0.
  • The Presto connector has several improvements for performance, utilizing more efficient APIs and serialization/deserialization formats.
  • Several performance improvements for queries over Parquet files and queries with joins.
  • In the Okera Planner/Worker debug UI, the number of queries displayed has been increased to 256.
  • The audit log has a new field added to it, ae_attribute, which captures all attributes accessed as part of this query.
  • Fixed an issue in the /scan API where some Decimal values would not be serialized correctly.
  • Several improvements to schema detection for TEXT-based files (especially CSV).
  • Added support for md5() (based on the Hive UDF).
  • The has_access() builtin function now supports checking against all privilege levels (previously it only supported ALL and SELECT).
  • Fixed an issue where it was not checked whether an attribute existed or not in some DDL statements that modified attributes.
  • Fixed an issue where the CREATE_AS_OWNER privilege at the catalog level incorrectly gave the SHOW privilege at that scope as well.
  • Improvements to error handling and recovery of metadata operations.
  • Improved default tuning parameters in large memory environments.
  • PyOkera now properly converts all values to JSON-serializable types when scan_as_json is used.
  • Improved admission control when workers are over-subscribed on either active connections or memory metrics.
  • For Gravity-based deploys, Gravity has been upgraded to 6.1.16 LTS.
  • Improved error handling and recovery of the data registration crawler in case of failures.
  • Added the ability to increase the timeout for initializing the catalog on cluster startup by setting the CATALOG_INIT_STARTUP_TIMEOUT configuration value.
  • Fixed an issue where some system tables were not dropped prior to creating them on startup, which can cause an issue on upgrades.
  • Fixed an issue where the audit logs would have incorrect values in case of an error during initialization of an incoming request.
  • Add the ability to specify a column list when executing ALTER VIEW, similar to CREATE VIEW.
  • Improved error message when using non-absolute S3 bucket paths.
  • Improved error handling when parsing a view definition that Okera cannot parse for an external view.
  • Fix an issue where service discovery would consider Kubernetes objects in a different namespace.
  • Fixed an issue where the system would generate unnecessary baseline queries, creating log noise.
  • Added the ability to specify a privilege level filter for the GetTables and GetDatabases APIs.
  • Fixed an issue in PyOkera when handling the CHAR type when there are null values in the data.
  • Fixed an issue where the ae_role column was not always populated for some role-related DDLs.
  • Improved the logging in the Okera REST Server.
  • Added the ability to configure the Planner and Worker RPC timeouts in Okera's Presto, using the OKERA_PRESTO_PLANNER_RPC_MS and OKERA_PRESTO_WORKER_RPC_MS configuration values respectively. The defaults are 300000ms and 1800000ms respectively.
  • Improved retry handling for retryable S3 errors (such as Server Busy, etc).
  • Fixed a bug where database names were not escaped when created in the registration UI.
  • The 're-autotag' button on the Datasets page now causes the new tags to be fetched upon completion.
  • The UI has a number of new icons.
  • Workspace now includes an execution timer for queries.
  • Improved errors are reported for bad schemas found during registration.
  • Fixed a bug where the UI allowed users to 'tag' partitioning columns, even though such tags had no effect.
  • All dataset views now show their view string.
  • "Queries by duration of planner request" is no longer part of the Reports page.

Notable and Incompatible changes

  • Starting from 2.1.0, the published Okera client libraries for PrestoDB support PrestoDB versions 0.234.2 and above.
  • ZooKeeper has been removed as a system component - Okera will now leverage Kubernetes to maintain the worker membership list.
  • The default per-user okera_sandbox database has been removed.
  • When creating Okera views (i.e. internal/secure views), it is now required for the creator to have the ALL privilege on all referenced datasets. This is done to ensure that these tables cannot be incorrectly exposed by users with lesser permissions.
  • Removed the 4000 character limitation on column types. Note that this changes the underlying HMS schema, and if connected to a shared HMS, should be disabled by setting the HMS_REMOVE_LENGTH_RESTRICTION configuration value to false. This is only done for new HMS databases - if you have an existing one from a prior installation, please contact Okera Support for migration procedures.
  • The default privilege required to view audit log records for objects has been changed from SELECT to VIEW_AUDIT. This means some users may no longer be able to see audit logs for their data (if they previously only had SELECT access to it), and will need to be granted VIEW_AUDIT on data they wish to view audit logs for.
  • ML and decision-tree-based autotagging is now enabled by default.
  • OKERA_REPORTING_TIME_RANGE can no longer be used to restrict the available time range in Okera reports.
  • In 2.1.x, many data correctness issues will now fail queries as opposed to silently ignoring them (e.g. converting data into NULL, etc) as in previous versions. To revert the behavior, add --abort_on_error=false to RS_ARGS.

SQL Keywords

The following terms are now keywords, starting in 2.1.0:

  • DO

Known Issues

  • The Okera PrestoDB Connector shipped with this version is compatible with PrestoDB 0.233 and higher. This connector is currently not compatible with any released version of PrestoDB on EMR, as the version of PrestoDB shipped is older than 0.233. This will be fixed in a subsequent 2.1.x maintenance release.


Bug Fixes and Improvements

  • Fixed an issue where many concurrent CREATE TABLE or CREATE VIEW statements could be slowed down waiting on a shared resource.
  • Fixed an issue when authorizing queries on views with complex types.
  • Added an option to use the SYSTEM_TOKEN as the shared HMAC secret for signing and validating tasks (in the Planner and Worker services) rather than using ZooKeeper. This option can be enabled by setting SYSTEM_TOKEN_HMAC: true in the configuration file.
  • Fixed an issue where it was not possible to connect to a Postgres instance that did not have public in the default search_path.
  • Added the ability to specify whether the connection to the database should be done using SSL (this was typically auto-discovered, but in some cases the auto-discovery failed). This can be enabled by setting CATALOG_DB_SSL: true in the configuration file.
  • Fixed an issue where schema upgrades did not work for remote Postgres instances.
  • Fixed an issue where the Workspace UI would scroll beyond the window if there was a long error.


Bug Fixes and Improvements

  • Added the ability to edit dataset and column descriptions in the Okera UI.
  • Fixed an issue where datasets discovered by the crawler that had columns whose type definition exceeded 4,000 characters couldn't be registered.
  • Added more control options for LDAP group resolution configuration.
  • Fixed an issue where Avro datasets that had a union type with a single child (e.g. union(int)) would throw an error. These types of unions are now fully supported.
  • Fixed an issue where decimals that were stored as a byte_array in Parquet files were not read correctly.
  • Added a configuration option to control the maximum number of allowed Sentry and HMS connections.
  • Fixed an issue where changing the description of the view (or a column in it) via DDL was not supported.
  • Fixed an issue where columns that contained arrays or maps with embedded null values were not handled correctly in the Java-based clients.
  • Fixed an issue in PyOkera where it would incorrectly decode negative decimal values with a precision higher than 18.
  • Fixed an issue when --allow_nl_in_csv=True was set and the CSV file used a quote character other than " - the " character would improperly be used to escape line breaks.
  • Improved the Crawler's ability to automatically use the OpenCSV SerDe when necessary.
  • Fixed issues for handling complex types that had several nested arrays/structs/maps with null values interspersed.
  • Fixed an issue where reserved keywords could not be used (as escaping them wouldn't work) as attribute namespaces and attribute keys (e.g. myns.true).
  • Added the ability to use CREATE TABLE LIKE TEXTFILE, which will automatically deduce the schema from the CSV file (this assumes the headers are the first line).
  • Improved handling of non-parsable SQL statements when accessing a view that was created outside Okera (e.g. in Hive). This capability is enabled by an environment flag ALLOW_NONPARSEABLE_SQL_IN_VIEWS: true set in the configuration file for the cluster.
  • Fixed an issue where the same tag could appear twice in the UI.
  • Fixed an issue where dropping an external table referencing a bucket that does not exist would fail.
  • Fixed an issue where the crawler Data Registration page for a given crawler would display incorrect "Registered" tables if their path was a simple prefix of the crawler root path.
  • Added support for using a dedicated Postgres server (e.g. on RDS) as the backing metadata database.
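
The --allow_nl_in_csv quote-character fix above is easy to reproduce with Python's csv module; this is an illustrative sketch only (Okera's scanner is not implemented in Python), showing why a reader must be told the file's actual quote character:

```python
import csv
import io

def parse_csv(data, quotechar='"'):
    """Parse CSV text, honoring the file's actual quote character so that
    line breaks inside quoted fields stay inside the field."""
    return list(csv.reader(io.StringIO(data), quotechar=quotechar))

# This file quotes fields with '|' and contains a newline inside a field.
data = 'id,name\n1,|line one\nline two|\n'

right = parse_csv(data, quotechar='|')   # 2 rows; newline kept in the field
wrong = parse_csv(data)                  # 3 rows; the record is split in half
```

With the wrong quote character, the embedded newline is treated as a record boundary and the row is split in two.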


New Features

Bucketed Tables

ODAS now supports bucketed tables and applying efficient joins to them. You can find more details here.

AWS Glue

ODAS now supports using AWS Glue as the metastore storage, allowing you to connect ODAS to an existing Glue catalog. You can read more about this support and enabling it in the Glue Integration page.

Auto-tagging Improvements

  • ODAS now employs an ML-based engine for some of the out-of-the-box auto-tagging rules, such as address and phone number detection.

  • You can now create and manage the regular expression-based rules that are used by the auto-tagging engine in the UI. You can read more about this in the Tags page.

  • The number of datasets tagged with a tag is now shown in the UI.

  • ODAS can continuously auto-tag your existing catalog in the background. You can enable this via the ENABLE_CATALOG_MAINTENANCE setting in your configuration file.

  • ODAS will now auto-tag the data inside nested complex types, and apply the discovered tag(s) at the root column-level.


ODAS now supports ADLS Gen2 data storage for both querying and data crawling. You can register these data sources by specifying a path with either the abfs:// or abfss:// prefixes.

Web UI

  • The ODAS Web UI has been revamped to be easier to use and to refresh the look-and-feel.

  • A Roles page has been added, allowing you to fully manage roles (create/update/delete) and their group and permission assignments. You can read more about this on the Roles page.

  • The 'About' dialog has been replaced by a System page.

JDBC Data Sources

  • Redshift External Tables are now supported for JDBC data sources of type redshift.


  • There are now DDL statements to work with tags, namely:

    • DESCRIBE <table>, DESCRIBE FORMATTED <table>, DESCRIBE DATABASE <database> will now output tag assignments.
    • CREATE ATTRIBUTE <attr> and DROP ATTRIBUTE <attr> will create/remove attributes (note that namespaces will be automatically created if they don't already exist).
    • SHOW ATTRIBUTE will show the list of currently existing attributes.
    • ALTER TABLE and ALTER VIEW now have new operations of ADD ATTRIBUTE <attr>, REMOVE ATTRIBUTE <attr>, ADD COLUMN ATTRIBUTE <col> <attr> and REMOVE COLUMN ATTRIBUTE <col> <attr> to add/remove attributes at the table-/view- and column-levels respectively.
    • ALTER DATABASE now has new operations of ADD ATTRIBUTE <attr>, REMOVE ATTRIBUTE <attr> to add/remove attributes at the database-level.
    • CREATE TABLE and CREATE VIEW can now take an optional set of attributes during table creation. For example:
      CREATE TABLE mydb.mytable (
          col1 int COMMENT "some comment1" ATTRIBUTE myns.myattr1,
          col2 int COMMENT "some comment2" ATTRIBUTE myns.myattr2,
          col3 int COMMENT "some comment3" ATTRIBUTE myns.myattr3
      )
  • Rule definitions now accept a "name" field. For backwards compatibility and convenience, the "name" is auto-generated if not specified.

Bug Fixes and Improvements

  • ODAS has updated Docker images that update many dependencies including the base OS, Python, OpenSSL and more.
  • Added a way to configure the structure of the data files the crawler will use while crawling. See the docs on creating a crawler for more.
  • Added crawler search box on the data registration page.
  • Added additional validation for the crawler name and path when creating a new crawler.
  • The autotagging rules can now be re-run on an individual dataset within the Datasets page, by using the Re-autotag button.
  • Fixed an issue where datasets with complex types that had a MAP embedded in a STRUCT embedded in an ARRAY would not be handled correctly.
  • Added the ability to revoke grants on objects that no longer exist.

Incompatible changes

  • Previously, by default, users would only see reports for datasets they had ALL access to. Since many stewards may not have ALL access on the data, this has now been changed so they will see reports for all data they have SELECT access to. If necessary, this can be configured back to ALL by editing the view definition of the okera_system.steward_audit_logs dataset.
  • Starting from 2.0.0, Okera supports EMR versions from 5.11.0 up to 5.28.0. Note that EMR versions below 5.10.0 should still continue to work, but upgrading to a recent EMR version is recommended for the latest ODAS compatibility.
  • The behavior of using REVOKE on permissions (e.g. REVOKE SELECT) has been changed to not cascade by default. For example, in 1.5.x and earlier versions, REVOKE SELECT ON TABLE mytable would also revoke any narrower grants (such as column-level SELECT) on that table.
  • Starting in 2.0.0, the published Okera client libraries for PrestoDB support PrestoDB versions 0.225 and above. You can use published Okera client libraries from prior Okera versions (which will continue to work against an ODAS 2.0.x and higher cluster) to support earlier PrestoDB versions.
  • The Permissions page has been removed - all links to it (e.g. in bookmarks) will no longer work.
  • Private tags on datasets have been removed. Datasets can no longer be filtered by private tags.

SQL Keywords

The following terms are now keywords, starting in 2.0.0:


Deprecation Notice

  • Starting in 2.0.0, we are deprecating the ocadm and odb CLI utilities. If you wish to continue using odb, the binary from 2.0.x and prior releases should continue to work. However, in future releases we will not ship new binaries of these utilities.


Bug Fixes and Improvements

  • Fixed an issue where writing to non-partitioned tables from Spark would fail if Spark bypass was enabled.
  • Improved error handling when doing unsupported operations on complex types.
  • Fixed an issue where running count(struct_field.some_value) would fail when run inside views.
  • Fixed an issue where using ORDER BY in an external view could fail an authorization check.
  • Fixed an issue where some decimals were not serialized properly when accessed via the /scan API.
  • Improved some error handling on the node-remover CronJob for Gravity-based clusters.
  • Fixed an issue where CTEs that contained aggregations would fail.
  • Added the ability to disable Zookeeper-based worker membership and instead leverage the Kubernetes metadata. This can be enabled by setting OKERA_KUBERNETES_MEMBERSHIP: true in the configuration file.


Bug Fixes and Improvements

  • Fixed several issues related to access control on tables and views with complex types.
  • Fixed an issue when registering JDBC tables with many columns.
  • Fixed an issue where small decimals would not be returned correctly when queried via the Presto endpoint.

Notable and Incompatible Changes

  • In PyOkera, scan_as_json now defaults strings_as_utf8 to True, matching the behavior prior to 1.5.2.


Bug Fixes and Improvements

  • Fixed an issue in PyOkera where scan_as_json and scan_as_pandas would ignore the tz option supplied on the context object.
  • Fixed an issue in the Presto client library where it did not properly handle null checks on STRUCT columns.


Bug Fixes and Improvements

  • Fixed an issue where queries on views that referenced STRUCT columns could fail when an ABAC permission applied to it.


Bug Fixes and Improvements

  • Fixed an issue where many concurrent CREATE TABLE or CREATE VIEW statements could be slowed down waiting on a shared resource.
  • Fixed an issue when authorizing queries on views with complex types.
  • Fixed an issue where the server was not properly clearing the effective user when different users utilize the same underlying planner connection (this typically only happens in PyOkera scripts that switch between different users, such as tests).
  • Added an option to use the SYSTEM_TOKEN as the shared HMAC secret for signing and validating tasks (in the Planner and Worker services) rather than using ZooKeeper. This option can be enabled by setting SYSTEM_TOKEN_HMAC: true in the configuration file.


Bug Fixes and Improvements

  • Fixed an issue where in certain EKS environments, the CPU scheduler was not properly saturating the CPU capacity.
  • Fixed an issue where scanning Parquet files would fail if their dictionary_offset was after the data page_offset.
  • Added an improvement for SerDes that use field delimiters, to allow field delimiters to appear within double-quoted values.


Bug Fixes and Improvements

  • Fixed an issue in PyOkera where it would incorrectly decode negative decimal values with a precision higher than 18.
  • Fixed an issue when --allow_nl_in_csv=True was set and the CSV file used a quote character other than " - the " character would improperly be used to escape line breaks.
  • Improved the Crawler's ability to automatically use the OpenCSV SerDe when necessary.
  • Fixed issues for handling complex types that had several nested arrays/structs/maps with null values interspersed.
  • Fixed an issue where reserved keywords could not be used (as escaping them wouldn't work) as attribute namespaces and attribute keys (e.g. myns.true).
  • Added the ability to use CREATE TABLE LIKE TEXTFILE, which will automatically deduce the schema from the CSV file (this assumes the headers are the first line).


Bug Fixes and Improvements

  • Added the ability to edit dataset and column descriptions in the Okera UI.
  • Fixed an issue where datasets discovered by the crawler that had columns whose type definition exceeded 4,000 characters couldn't be registered.
  • Added more control options for LDAP group resolution configuration.
  • Fixed an issue where Avro datasets that had a union type with a single child (e.g. union(int)) would throw an error. These types of unions are now fully supported.
  • Fixed an issue where decimals that were stored as a byte_array in Parquet files were not read correctly.
  • Added a configuration option to control the maximum number of allowed Sentry and HMS connections.
  • Fixed an issue where changing the description of the view (or a column in it) via DDL was not supported.
  • Fixed an issue where columns that contained arrays or maps with embedded null values were not handled correctly in the Java-based clients.


Bug Fixes and Improvements

  • Improved ZooKeeper membership registration and cluster health check capabilities. The cluster can now identify more cases where a node gets incorrectly deregistered and self-heal.
  • Improved handling of non-parsable SQL statements when accessing a view that was created outside Okera (e.g. in Hive). This capability is enabled by an environment flag ALLOW_NONPARSEABLE_SQL_IN_VIEWS: true set in the configuration file for the cluster.


Bug Fixes and Improvements

  • Fixed an issue where Hive/Hue could not load the table listing for a database if it contained a view that Okera could not parse.
  • JWT tokens with a group claim can now supply that claim as a simple string denoting the group, rather than as an array.
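
The JWT group-claim change above means the claim may arrive either as an array of group names or as a bare string. A minimal normalization sketch (normalize_groups is a hypothetical helper, not Okera's code):

```python
def normalize_groups(claim):
    """Accept a JWT group claim that is either a list of group names or a
    single string, and always return a list of groups."""
    if claim is None:
        return []
    if isinstance(claim, str):
        return [claim]  # bare string: a single group
    return list(claim)
```

For example, both {"groups": "admins"} and {"groups": ["admins"]} resolve to the same group list.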


Bug Fixes and Improvements

  • Improved performance of attribute access checks on wide views.
  • Fixed an issue where an attribute-based grant on a view with a complex type might not properly omit the complex type column.
  • Added support for CSVs with embedded newlines within records that are enclosed within the quote separator. To enable this, specify --allow_nl_in_csv=true for RS_ARGS in your ODAS configuration.


Bug Fixes and Improvements

  • Fixed an issue where joining or unioning a dataset with itself could cause an invalid query plan to be generated, preventing that query from being run.
  • Fixed an issue where a column-level grant on a view could allow joining on columns other than those granted.
  • Improved the detection in PyOkera of whether Pandas and NumPy are installed, and if not, still allow usage of all functionality that does not require them.
  • Fixed an issue where an external view in Hive which has both row_number() and an ORDER BY clause could cause the query to not succeed.
  • Fixed an issue where non-conformant Parquet files that have a mismatch between the number of records specified in the dictionary header vs. the actual batch would cause the file to not be queryable.
  • Added the ability to specify the CATALOG_DB_PASSWORD, LDAP_GROUP_RESOLVER_PASSWORD and LDAP_USER_QUERY_SERVICE_PASSWORD in a Kubernetes secret.
  • Added the ability to okctl to specify CATALOG_DB_PASSWORD, LDAP_GROUP_RESOLVER_PASSWORD and LDAP_USER_QUERY_SERVICE_PASSWORD as file paths in the configuration file.


Bug Fixes and Improvements

  • Fixed an issue for Parquet files where TIMESTAMP and TIMESTAMP_MILLIS columns that were backed by int64 were not supported.
  • Fixed an issue where an invalid plan could cause the worker to crash.
  • Added two new DDLs that allow changing the comment on a table and column:
    • ALTER TABLE <table> CHANGE COMMENT '<comment>'
    • ALTER TABLE <table> CHANGE COLUMN COMMENT <col> '<comment>'
  • Added APIs to get and set the description on a dataset and column:
    • GET/PUT /datasets/<name>/description
    • GET/PUT /datasets/<name>/columns/<column>/description
  • For PyOkera, execute_ddl now takes an optional requesting_user parameter, similar to the plan and scan_as_... functions.
  • Fixed an issue where a column-level grant on a view could allow filtering (but not viewing) on columns other than those granted when executing a query in Workspace.


Bug Fixes and Improvements

  • Fixed an issue where DECIMAL columns in Avro schemas would not get detected properly.
  • Added the ability to provide a default clamp value for DECIMAL columns whose precision exceeds the maximum precision allowed (38). This can be set using the AVRO_SCHEMA_TOO_HIGH_PRECISION_FALLBACK configuration value.
  • Added support for skip.footer.line.count table property.
  • Performance improvements in the case of many small files in a single partition (NOTE: it is still recommended to avoid having small files).
  • Fixed an issue where some sensitive values would be exposed in the Planner and Worker debug UIs.
  • Added the ability to enable setting X-Frame-Options: DENY for all requests by setting the FRAME_OPTIONS_DENY_ENABLED configuration value.
  • Added the ability to enable the Secure flag on the session cookie using the OKERA_SHARED_COOKIE_SECURE configuration value.
  • Improved default cipher support for TLS1.2.
  • Added the ability to control the duration of the generated JWT when logging in by setting JWT_TOKEN_EXPIRATION to the desired number of seconds (minimum is 300 seconds).


New Features

JDBC Data Sources

  • Added support for Sybase.
  • Added support for filter pushdown.
  • Added support for count(*) for JDBC data sources.
  • Added support for case sensitive column names.
  • Added support for specifying custom SSL CAs to use to validate when making SSL connections to the JDBC data source.

Audit Log Uploads

It is now possible to configure audit logs to be uploaded in an immutable fashion. When enabled, audit logs are uploaded with a .staging.audit or .staging.reporting suffix, and re-uploaded without the .staging portion once finalized.

To enable this, set WATCHER_AUDIT_LOG_STAGING_FILES to true or 1.

Additionally, it is possible to force the audit logs to be uploaded after a certain number of seconds have passed, by specifying WATCHER_AUDIT_LOG_MAX_UPLOAD_SEC.
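
The staging scheme described above is essentially write-then-rename: a file keeps its .staging name until it is complete, so consumers of the final name never observe a partially written log. A simplified local-filesystem sketch (the actual uploads target object storage, and staged_name/finalize are hypothetical helpers):

```python
import os
import tempfile

STAGING_INFIX = ".staging"

def staged_name(final_name):
    """log-0001.audit -> log-0001.staging.audit"""
    base, ext = os.path.splitext(final_name)
    return base + STAGING_INFIX + ext

def finalize(directory, final_name):
    """Promote a staged upload to its final, immutable name via rename."""
    os.rename(os.path.join(directory, staged_name(final_name)),
              os.path.join(directory, final_name))

# Simulate one upload cycle on the local filesystem.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, staged_name("log-0001.audit")), "w") as f:
        f.write("audit records...")
    finalize(d, "log-0001.audit")
    final_listing = sorted(os.listdir(d))
```

After finalization, only the final name remains visible in the target directory.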


  • PyOkera now has full support for complex types (ARRAY, MAP, STRUCT).
  • context.enable_token_auth now accepts an optional argument called token_func, which can reference a no-argument function that when called, returns a valid token to be used. Note that this function must be pickle-able (and an error will be returned if it isn't), as it will be used across multiprocessing calls.
  • PyOkera now supports running scan_as_json and scan_as_pandas using Presto.
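
The pickle requirement on token_func above follows from how Python's multiprocessing ships callables to worker processes: a module-level function pickles by reference, while a lambda does not. A quick plain-Python illustration (independent of PyOkera):

```python
import pickle

def fetch_token():
    """A module-level, zero-argument token provider. Functions like this
    are pickled by reference (module + name), so they can cross process
    boundaries the way multiprocessing requires."""
    return "my-jwt-token"

# Lambdas and nested functions have no importable qualified name, so
# pickling them fails - exactly the case token_func must avoid.
try:
    pickle.dumps(lambda: "my-jwt-token")
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```

In short: pass a named, importable function as token_func, not a lambda or closure.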

Bug Fixes and Improvements

  • Added the ability to ignore LDAPS certificate errors when doing group resolution.
  • Added the ability to set Presto tuning variables.
  • Improved handling for Date type in JDBC data sources.
  • Improved handling of broadcast joins using cross-task caching.
  • Fixed an issue where JDBC data sources that had USING VIEW AS did not properly handle single quotes in the view.
  • Fixed an issue where JDBC data sources did not close the connection properly when no more events were necessary, causing poor performance.
  • ODAS Web UI will now automatically redirect to the https URL if a user navigates to the http one.
  • Added the ability to control how long the Web UI waits before timing out a request to the server (default is 30000, in milliseconds), by setting the UI_TIMEOUT_MS configuration.
  • ODAS Web UI will now break out the inner portions of ARRAY and MAP complex type columns.
  • Added the ability to configure ODAS to look for user-specified claims in the JWT to determine the user (JWT_USER_CLAIM_KEY, default is sub) and groups (JWT_GROUP_CLAIM_KEY, default is groups).
  • Added support for partitioning schemes on S3 that do not contain the partition column name in the folder, e.g. s3://company/dataset/2019 vs s3://company/dataset/year=2019. This can be enabled by setting okera.hms.allow-no-name-partitions to true in hive-site.xml.
  • Fixed an issue where array and map indexing in an external view definition would cause ODAS to fail to parse.
  • Added support to specify strings_as_utf8=True when using scan_as_json in PyOkera.
  • Fixed an issue in PyOkera when converting a CHAR column to UTF-8.
  • Upgraded several dependencies.
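
The okera.hms.allow-no-name-partitions option above amounts to mapping bare folder names to partition columns positionally. A hypothetical sketch of that mapping (parse_partition is illustrative, not Okera's implementation):

```python
def parse_partition(path_parts, partition_cols):
    """Map trailing path segments to partition columns.

    Segments may be either 'col=value' (standard Hive layout, e.g.
    year=2019) or a bare 'value' (positional layout, e.g.
    s3://company/dataset/2019)."""
    values = {}
    for col, part in zip(partition_cols, path_parts):
        if "=" in part:
            name, _, value = part.partition("=")
            values[name] = value
        else:
            values[col] = part  # positional: the folder name is the value
    return values
```

Both layouts resolve to the same partition values for a table partitioned by (year, month).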

Notable and Incompatible Changes

  • The bundled Presto service now exposes an additional "catalog" (in Presto terms) called okera (in addition to the existing recordservice one). These are identical and contain the same datasets. In a future version, the recordservice catalog will be removed and is now deprecated. All clients should shift usage to the okera one.

  • Removed the default from deserializer column comment that would appear for Parquet and Avro files when created using CREATE TABLE LIKE FILE.

  • In PyOkera, when using scan_as_json, date columns are now serialized to millisecond precision without the corresponding timezone, to match output of other APIs.

  • The driver type of redshift is now required to connect to Redshift, and the postgresql type will no longer work. This was done as the drivers have deviated and they were updated for security and performance reasons.
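
The scan_as_json date change above (millisecond precision, no timezone suffix) can be reproduced with standard datetime formatting; serialize_date below is an illustrative sketch, not PyOkera's code:

```python
from datetime import datetime, timezone

def serialize_date(dt):
    """Render a datetime at millisecond precision without a timezone
    suffix, e.g. 2019-07-01T12:30:45.123."""
    if dt.tzinfo is not None:
        # Normalize to UTC, then drop the offset so no suffix is emitted.
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt.isoformat(timespec="milliseconds")
```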


New Features

SAML Support

It is now possible to configure authentication to ODAS with SAML providers.

JDBC Data Sources

  • Added support for MS SQL Server.
  • Added support for Redshift External Tables.

LDAP Authentication

It is now possible to configure LDAP authentication to do two-step authentication (DN lookup followed by authentication).

Bug Fixes and Improvements

  • Data Registration Crawler improvements:
    • Increased performance on large partitioned tables.
    • Improved filetype classification.
    • Avro schema comment fields (i.e. description) will now be inherited by ODAS when registered.
  • Azure improvements and fixes:
    • Added support for Azure MySQL connections where SSL is required.
    • Fixed an issue where CREATE TABLE LIKE FILE was not properly loading Avro schema files from ADLS.
  • Fixed a bug where ODAS was caching UDFs when a pattern was set in a call to SHOW FUNCTIONS.
  • Added the table property to control whether automatic partition recovery is enabled for a particular table: 'okera.auto_partition_recovery.disable'='true'.
  • Improved handling of DROP DATABASE CASCADE on a database that does not exist.
  • ODAS will now respect the LOCATION field set on a database.
  • Kubernetes liveness and readiness probes have been tuned to cause less load on the system.
  • Added tables in okera_system to expose role and group information.
  • Fixed an issue in the Hive SerDe to properly initialize the header skip flag.
  • Fixed an issue where the compiler was generating invalid CPU instructions for Decimal types due to bad memory alignment.
  • ODAS now respects the value of OKERA_WORKER_LOAD_BALANCER if it is passed in.
  • Disabled an optimization when doing a join where the second table was larger than 128MB.
  • Fixed an issue in the Avro parser that did not allow for default values of empty arrays and maps.
  • Fixed an issue where partition names were not properly escaped in Hive.
  • Fixed an issue in the Kubernetes resource files for Presto to reference the correct version.
  • Improved system availability when registering a high number of partitions.

Notable and Incompatible Changes

  • Previously, changes to the CATALOG_ADMINS setting would not get fully reflected on a cluster that had previously configured these. In this release, users and groups referred to by CATALOG_ADMINS will be automatically granted admin_role on startup. If you have users that you no longer want to be admins, you should remove them from CATALOG_ADMINS.


New Features

Policy builder

A new interactive policy builder is now available in the Okera Portal. Table access policies and fine-grained permissions can now be granted through the UI.

Attribute-based access control updates

Updated syntax and other improvements to attribute-based access control (ABAC).

See the ABAC docs for more.

Other improvements

  • Azure: added experimental support for ADLS Gen2 - users can now CREATE EXTERNAL TABLE on data that is stored in Gen2 storage, and query that data.
  • Added IF EXISTS to DROP ROLE, so you can now do DROP ROLE IF EXISTS <role>.
  • Changed how we deploy ZooKeeper on Kubernetes to better handle node failures.
  • Updated the underlying Thrift library to version 11 to stay more current. This should have no user-visible impact.
  • Improvements to ALTER TABLE <table> RECOVER PARTITIONS to improve its runtime. There is more work planned for future releases.
  • Added a new table property that allows CREATE TABLE <name> LIKE <FILETYPE> to handle cases where a partition column and data column exist with the same name.
  • Improved handling of automatic file type detection in crawlers for Avro and JSON files.
  • The mask() UDF is now always available.
  • Permission model now supports CREATE_AS_OWNER, which lets users create objects in the catalog and be given owner (i.e. ALL) privileges on the new object. This can be used to create per user (staging) tables or to support distributed stewardship.
  • Fixed a bug where it was not possible to override the database name used for the CATALOG_DB_OKERA_DB database.
  • Fixed a bug where you could create grants that were invalid and would fail downstream - we now fail them at the point of creation.
  • Added a num_results_read column to okera_system.audit_logs, denoting the number of records read during a particular operation.
  • Support for special characters in column names. Okera now expands the special characters supported in column names to be on par with the ANSI SQL specification. Characters that are still not supported in column names are ., `, :, and !. Special characters in a column name can be escaped with backticks. For example, a column named Special Chars (name) can be specified as: CREATE TABLE special_chars.sample (`Special Chars (name)` STRING)
  • The cerebro-web Kubernetes service was removed. All functionality is now consolidated into the cdas-rest-server service. Note: on using the Deployment Manager to upgrade from previous versions to 1.5.0, the cerebro-web service will continue to exist after the upgrade. The service is vestigial, however, and should not be used. If there is need to remove this service entirely, please open a support ticket.
  • Improved robustness of service discovery in several places.
  • Added CEREBRO_EXTERNAL_PLANNER_HOST and CEREBRO_EXTERNAL_PLANNER_PORT, which can be set to override the planner's external host/port shown in the UI.

Incompatible changes

  • Any external tooling checking for the existence of the cerebro-web service will no longer function. These tools should be updated to point at the cdas-rest-server service, which now encompasses the functionality.
  • Removed okera_system.weekly_audit_logs and okera_system.monthly_audit_logs views, since the UI preview was not functioning properly for them.
  • OKERA_PORT_CONFIGURATION, set for Deployment Manager installs, no longer recognizes the cerebro_web:webui port. Please change this value to cdas_rest_server:webui for new clusters.


New Features

Improved cluster deployment

Okera clusters can now be created without using the Deployment Manager.

Support for granting column access to views

In previous Okera versions, it was not possible to grant column-level access on views, only tables. It is now possible to grant on columns in views as well.

See the docs for more.

LDAP group resolution

Okera can now issue an ldapsearch to retrieve the groups associated with the username contained in a JWT if no groups are embedded in the JWT.

See the docs

Other Improvements

  • Added a new way to set up automatic multi-tenant authentication for EMR and CDH integrations.
  • Added the ability to create one-node quickstart clusters that have out-of-the-box configuration, including SSL, JWT, and user/group settings.
  • Improved automatic service discovery for inter-service communication, allowing us to increase resiliency in the case of node failures.
  • Improved handling of unsupported or invalid views, typically inherited from an existing metastore. The view metadata can now be returned (but such views are still un-queryable).
  • Okera now supports HMS-escaped partition paths. Additional characters that were not escaped previously can now be used in the partition path, for example spaces and hyphens: timestamp-partition/time_val=2019-06-11 00:00:00. Note, partition paths with '=' or '/' are not yet supported.
  • Full support for complex map types in Parquet data.
  • Added support for complex types of map<string, array<string>>.
  • Added a new builtin function, current_date, which is like current_timestamp but just returns the date portion.
  • Enabled selecting current_date and current_timestamp as columns, e.g. select current_timestamp vs select current_timestamp().
  • Upgraded kube-prometheus to 0.1.0 (latest at time of publishing).
  • Added support for timestamps outside of typical data ranges. While we don't expect a lot of user data from the dark ages, sentinel values in those ranges as well as year 0 are valid. They will be passed through without transformation so that the data values can be read.
  • Added better support for Hue when some fields are null.
  • REPORTING_TIME_RANGE can now be set directly in the configuration file.
  • Reduced number of retries and yield time for HDFS connection attempts.
  • Okera now escapes partition columns to support keywords as partition column names.
  • Fixed a bug where data registration crawlers were treating hidden files as possible dataset files.
  • Fixed a bad error message in the UI when a database was not found on the permissions page. The error is clearer now.
  • Removed CORS headers from the REST Server, fixing a security bug where the REST server was returning a wildcard hostname in its CORS headers.
  • Fixed a bug where if a view had any constant-time expressions such as decode we would not do any access checks.
  • Fixed a bug where in some cases, select count(*) did not work if a user only had column-level access.
  • Fixed the table format check to be skipped for views.
  • Fixed the storage descriptor path for Databricks based on the Spark provider.
  • Fixed column access check for count(*) on views.
  • Fixed an issue with spark and presto clients where select * queries returned incorrect results for users with partial access to views.
  • If defaultdb property is not provided, JDBC connections will now use as default db for connecting.
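
The HMS-escaped partition path support noted above boils down to percent-encoding special characters in values before they become path segments, while still rejecting '=' and '/'. A hypothetical sketch (escape_partition_value is illustrative; the exact character set Hive/HMS escapes differs):

```python
from urllib.parse import quote, unquote

def escape_partition_value(value):
    """Percent-encode a partition value for use in a storage path, keeping
    now-supported characters like spaces, hyphens, and colons readable.
    '=' and '/' remain unsupported, per the note above."""
    if "=" in value or "/" in value:
        raise ValueError("'=' and '/' are not supported in partition values")
    return quote(value, safe=" -:_.")

escaped = escape_partition_value("2019-06-11 00:00:00")

# '=' in a value is rejected outright rather than silently mangled.
try:
    escape_partition_value("time_val=bad")
    rejected = False
except ValueError:
    rejected = True
```

Percent-encoding is reversible, so values round-trip cleanly: unquote(escape_partition_value(v)) == v.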


New features


Tags can now be assigned to datasets or columns to mark the type of data they contain. For example, a ‘Sensitive’ tag can be created and assigned to any columns containing sensitive data. The Datasets page can be filtered by these tags to view only datasets or columns with certain attributes. Complex-type columns can be tagged, but not nested elements within a complex type.

Tags may only be created and assigned by users in admin roles and will be visible to all users. Admin users may also give other roles the ability to assign tags in the Workspace page.

Any user may still create Private Tags for their own use.

  • See the docs for more details.


In order to reduce the manual work of tagging, an Auto-Tagger can be configured to detect when a column is likely to contain a certain type of formatted data, such as a Phone Number or Social Security Number, and apply the relevant tag to that column. This occurs when a new dataset is discovered on the Data Registration page.

Attribute-Based Access Grants (ABAC)

Admin users can now grant access to tables based on tags. For example, an admin may grant users access to all data tagged as ‘Sales’ inside a particular database. This allows access grants to be based on data attributes instead of only on technical metadata (e.g., database name or dataset name). Please note that ABAC grants are currently only fully supported on tables, not views: ABAC grants on views are enforced only when tags are at the view level, not at the column level. For ABAC grants on tables, both table-level and column-level grants are fully supported, and full support for views is coming soon. All existing RBAC grants remain unaffected, and you can still create RBAC grants. ABAC and RBAC grants are additive: if either grant gives the user access, the user will be able to see that table.
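The additive ABAC/RBAC semantics can be sketched as a simple OR over the two grant types. This is a minimal illustration only; the data structures and the `has_access` helper are hypothetical, not Okera's implementation:

```python
def has_access(user, table, rbac_grants, abac_grants, table_tags):
    """A user may read a table if either grant type allows it (grants are additive)."""
    # RBAC: direct grants on the table's technical name.
    rbac_ok = any(g["table"] == table for g in rbac_grants.get(user, []))
    # ABAC: grants expressed over tags assigned to the table.
    abac_ok = any(g["tag"] in table_tags.get(table, set())
                  for g in abac_grants.get(user, []))
    return rbac_ok or abac_ok

# Example: alice has no RBAC grant on sales.orders, but an ABAC grant on the 'Sales' tag.
rbac = {"alice": []}
abac = {"alice": [{"tag": "Sales"}]}
tags = {"sales.orders": {"Sales"}}
print(has_access("alice", "sales.orders", rbac, abac, tags))  # True
```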

JDBC Support

Added a JDBC endpoint and native Presto support. A new cluster type, STANDALONE_JDBC_CLUSTER, is now available. Specifying STANDALONE_JDBC_CLUSTER will bring up a cluster that includes Presto and exposes a JDBC endpoint for use with Tableau and other JDBC-enabled analytics clients.

JSON file format

  • JSON file formats are now supported by ODAS.
  • All data types supported for Avro and Parquet are supported for JSON as well, with the exception of maps; maps can instead be represented as a plain JSON structure.
  • JSON tables can be created via auto-inference or the STORED AS JSON syntax.
  • See the docs for more details.
  • JSON files are now supported in the data registration wizard.
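Auto-inference over JSON data might conceptually work like the sketch below, which merges per-column types across sampled newline-delimited records. The `infer_schema` helper and its type-widening rule are illustrative assumptions, not Okera's actual inference logic:

```python
import json

def infer_type(value):
    # Map a JSON value to a simplified catalog type (illustrative only).
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, list):
        return "ARRAY"
    if isinstance(value, dict):
        return "STRUCT"
    return "STRING"

def infer_schema(lines):
    """Merge column types over sampled newline-delimited JSON records."""
    schema = {}
    for line in lines:
        for col, val in json.loads(line).items():
            t = infer_type(val)
            # Widen to STRING when types conflict across records.
            if schema.setdefault(col, t) != t:
                schema[col] = "STRING"
    return schema

sample = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "score": 1.5}']
print(infer_schema(sample))  # {'id': 'BIGINT', 'name': 'STRING', 'score': 'DOUBLE'}
```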

DATE type

  • DATE type is now supported.
  • See the docs for more details.

AWS CloudTrail Integration

  • Okera can consume AWS CloudTrail API event logs to more accurately determine when it is appropriate to perform maintenance operations. For example, the automatic discovery of new datasets and dataset partitions can occur faster and more efficiently when Okera receives direct notifications from AWS regarding S3 write operations. Without CloudTrail event consumption, Okera will fall back onto a polling model for detection of dataset changes. Refer to the Quick Start Guide: AWS CloudTrail Integration document for details.
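The event-driven vs. polling fallback can be sketched roughly as follows; `paths_to_refresh` and the event shape are hypothetical simplifications of the behavior described above:

```python
def paths_to_refresh(table_paths, s3_write_events):
    """Pick which table paths need a metadata refresh.

    With CloudTrail events available, only tables whose S3 prefixes saw write
    activity are refreshed; without events, fall back to polling every path.
    (Illustrative sketch, not Okera's implementation.)
    """
    if not s3_write_events:
        return list(table_paths)  # polling model: check everything
    touched = {e["key"] for e in s3_write_events}
    return [p for p in table_paths
            if any(key.startswith(p) for key in touched)]

events = [{"key": "s3://bucket/sales/2021/05/part-0.parquet"}]
print(paths_to_refresh(["s3://bucket/sales/", "s3://bucket/hr/"], events))
# ['s3://bucket/sales/']
```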

Performance Improvements

  • This release includes specific optimizations for partition metadata handling, improving performance when scanning data with partition filters.
  • Introduced a new compression method (zstd) for efficient transfer between the ODAS cluster and clients such as Spark and Hive. The default compression is now zstd.
  • Introduced Okera SQL Extensions for our Spark client.
  • This is an extension capability provided by Spark that lets us augment the Spark plan to pass additional information to ODAS.
  • At this point, this enables two optimizations:
    • Pushing down functions supported by ODAS, such as CAST/UPPER/LOWER/UNIX_TIMESTAMP.
    • A metadata-only optimization for queries that aggregate only on partition columns, inspired by Spark's own version of this optimization.
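The metadata-only optimization rests on the fact that a query aggregating only on partition columns can be answered from partition metadata without touching data files. A minimal sketch (hypothetical helper, not the Spark extension itself):

```python
def distinct_partition_values(partitions, column):
    """Answer SELECT DISTINCT <partition column> from partition metadata alone,
    without scanning any data files (illustrative sketch)."""
    return sorted({p[column] for p in partitions})

parts = [{"year": 2018, "month": 1},
         {"year": 2018, "month": 2},
         {"year": 2019, "month": 1}]
print(distinct_partition_values(parts, "year"))  # [2018, 2019]
```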

Other Improvements

  • AWS Athena can be registered and used as a JDBC data source. See docs
  • New CREATE_AS_OWNER privilege that grants ability to create a database and automatically receive ALL privileges on that database. Note: CREATE_AS_OWNER does not cascade to all tables. You will not be able to create tables inside databases you have not created with this privilege.
  • Cluster name may be customized and will display in the navigation bar.
  • Crawlers may now be deleted on the Data Registration page.
  • Crawlers can now discover JSON data types on the Data Registration page.
  • The Permission page now displays the full list of permissions for the column, dataset, database, and server scopes affecting a given database. For example, if there is a group that only has access to the selected database, then that group will appear in the full list.
  • The Permission page indicates any Attribute Based Access Control expressions granting a group's level of access.
  • Improved error messaging throughout the Okera Web UI, specifically in the Workspace page and Dataset Preview.
  • Decimal types stored as i32 and i64 are supported in recent versions of Parquet, in addition to fixed_length_byte_array. Starting with version 1.4.0, ODAS handles these additional i32 and i64 decimal storage formats along with fixed_length_byte_array.
  • ODAS can share existing HMS instances that contain ORC tables created by Hive; previously, the metadata load would fail for such tables. Starting with version 1.4.0, ODAS supports the ORC file format for metadata load. Note that scans will still fail for ORC files with an 'ORC files are not currently supported.' error.
  • Extended support for MAP complex types in the PARQUET file format. It is now possible to use MAP<STRING, STRUCT> and MAP<STRING, ARRAY>. Note that this is still not available for Avro.
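For the i32/i64 decimal storage formats, the stored integer is the unscaled value; the logical decimal is recovered by applying the column's scale. A small sketch of that decoding (illustrative only):

```python
from decimal import Decimal

def decode_int_decimal(raw, scale):
    """Decode a Parquet DECIMAL stored as i32/i64: the integer is the unscaled
    value, so dividing by 10**scale yields the logical decimal (sketch)."""
    return Decimal(raw).scaleb(-scale)

print(decode_int_decimal(12345, 2))  # 123.45
```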

Incompatible changes

  • The default resolution of Parquet schemas has changed to be by name; the default_parquet_resolve_by_name flag now defaults to true. Prior to 1.4, the default was resolution by ordinal (position).
  • The way access is controlled for the Workspace and Reports features in the UI has changed. Current users may need their access updated as a result:
  • Where before a user needed ALL or SELECT access on any dataset in the Okera catalog to access Workspace, that user now needs SELECT access on okera_system.ui_workspace. See docs for more info.
  • Where before a user needed ALL or SELECT access on okera_system.reporting_audit_logs to access Reports, that user now needs SELECT access on okera_system.ui_reports. See docs for more info.
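The difference between by-name and by-ordinal Parquet schema resolution can be sketched as below; `resolve_columns` is a hypothetical simplification, using case-insensitive name matching as in parquet-mr:

```python
def resolve_columns(table_schema, file_schema, by_name=True):
    """Map each table column to a file column index, by name (the new default)
    or by ordinal position (the pre-1.4 behavior). Illustrative sketch;
    unmatched columns resolve to None."""
    if by_name:
        # Case-insensitive name lookup, matching parquet-mr behavior.
        pos = {name.lower(): i for i, name in enumerate(file_schema)}
        return [pos.get(name.lower()) for name in table_schema]
    # Ordinal: the i-th table column reads the i-th file column.
    return [i if i < len(file_schema) else None
            for i in range(len(table_schema))]

table = ["id", "name", "email"]
file_cols = ["name", "ID", "email"]
print(resolve_columns(table, file_cols, by_name=True))   # [1, 0, 2]
print(resolve_columns(table, file_cols, by_name=False))  # [0, 1, 2]
```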

Known Issues

  • The following Okera configurations cannot be set directly in and must instead be listed in the SERVICE_ENVIRONMENT_CONFIGS environment variable in

Example for this case:

  • ABAC grants on views are enforced only when tags are at the view level, not at the column level. That is, if you assign tags to columns of a view and then create a grant on that view for only those columns, the grant will not be enforced; if the tag is at the view level, it will be. For ABAC grants on tables, both table-level and column-level grants are fully supported.

1.3.4 (Mar 2019)

This release contains the following changes:

  • Enhancement to ALTER TABLE statement to allow partition location change.
  • Support for scanning alternate partition location outside the table base path.
  • Adjust health-check frequency to accommodate longer cluster start times.
  • Optimized concurrent loading of metadata in workers, to prevent overloading the catalog with calls.
  • Reduce noise from logs from in-memory cache management and repeated log entries from custom UDF log errors.
  • Sped up UI preview for large tables with many partitions to avoid timeouts. Preview now shows results from the last partition.
  • Control docker log size in containers with log size restrictions and log rotation policy.
  • Gracefully handle the presence of unsupported complex type fields in text format data.
  • Fixed a memory leak that occurs in the REST container when a query invoked via Workspace times out.
  • Fixed an env variable that controls the number of PyOkera worker processes in the REST container.
  • Increased the number of Gunicorn worker processes in the REST container from 4 to 8.
  • Support for EMR 5.20. ODAS now handles backward-incompatible changes in the Presto SPI.

Known issues

  • Hive does not support scanning partitions where the partition name and the physical location do not match, e.g., scanning via Hive is not supported if the partition is year=2010,month=2,date=29 and the partition location is s3://foo/year=2012/month=4/date=21/ or s3://a/b/.
  • Hive does not support scanning partitions located outside the table base directory, e.g., if the table base dir is s3://foo/loc1/ and the partition is at s3://foo/loc2. For both cases, you may use Spark, Databricks, PyOkera, or the Workspace instead.

1.3.1 (Feb 2019)

  • This release contains a hotfix for the large-partitions optimization introduced in the 1.3.0 release. Due to this issue, filters on partition columns could, in certain cases, trigger a full table scan and return incorrect results.

1.3.0 (January 2019)

1.3.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Dataset Registration

Datasets can now be registered in bulk through the Okera Web Portal. Choose an S3 path to crawl, and ODAS will inspect all files in that S3 path, finding possible datasets. Those datasets can be verified, modified, and registered one at a time or in bulk as needed. See docs for details.


Monitoring and Grafana

ODAS clusters leverage Grafana for monitoring, which has been updated substantially in this release. The metrics are now backed by Prometheus, and the out-of-the-box monitoring dashboards have been improved.

Support for Parquet formats

Parquet formats are now fully supported, with the exception of the MAP complex type. For full details, see the docs.

Other improvements

  • AWS EMR support extended and can be seen on

  • PyOkera is now supported on python 3.6 and 3.7.
    We recommend all clients update to the 1.3.0 PyOkera version.

  • Added support for DESCRIBE DATABASE <db_name>. See describe database

  • Partitioned column information in DESCRIBE FORMATTED <view_name> command output. See describe statement

  • Base tables referenced in the view in DESCRIBE FORMATTED <view_name> command output. See describe statement

  • Global UDF support.
    User-defined functions can now be created and shared across databases, and accessed without the need to fully qualify them every time. More details here.

  • Planner UDF caching.
    Hadoop clients (for example, Hive and Spark running on EMR) load the UDFs on startup. For large catalogs (specifically, many databases), this can impact client startup time. In this release, the registered UDFs are cached on the planner, with a default time to live (TTL) of 30 seconds. This should significantly speed up client startup time in these cases.

  • Added support for SHOW GRANT GROUP and SHOW GRANT USER.
    These provide convenient ways to list all the grants for a group or user, in addition to SHOW GRANT ROLE.

  • It is now possible to create a table against a fully qualified path.
    Previously, tables (and partitions) had to be created over a directory. It is now possible to create a table over a single file, simply using the full path as the LOCATION. For more details, see supported sql.

  • ODAS clusters will now default to starting up with multiple planners.
    For clusters (larger than 1 node), the default number of planners will be greater than 1. This offers better availability and load balancing. This value can still be controlled as before, by specifying the --numPlanners option when creating the cluster.

  • ocadm now supports restarting a single service in a running cluster.
    Previously, it was only possible to restart the entire cluster (all services). See ocadm clusters restart help for details.

  • Grafana and monitoring improvements.
    ODAS clusters leverage Grafana for monitoring, which has been updated substantially in this release. The metrics are now backed by Prometheus, and the out-of-the-box monitoring dashboards have been improved.

  • Idle clients now time out after 120 seconds.
    A client is considered idle if it has had no active requests for more than the configured time. Idle clients now time out after 120 seconds, and the queries associated with that client are cancelled. Note that requests that take a long time and keep the server busy are not considered idle. This can be controlled by the idle_session_timeout service config.
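Idle-session tracking of this kind can be sketched with a last-activity map, as below. `SessionTracker` is an illustrative stand-in, not Okera's implementation:

```python
import time

class SessionTracker:
    """Track last-activity time per session and expire idle ones (a sketch of
    behavior governed by idle_session_timeout)."""
    def __init__(self, idle_timeout_s=120):
        self.idle_timeout_s = idle_timeout_s
        self.last_active = {}

    def touch(self, session_id, now=None):
        # Called on every client request to reset the idle clock.
        self.last_active[session_id] = now if now is not None else time.time()

    def expire_idle(self, now=None):
        now = now if now is not None else time.time()
        expired = [s for s, t in self.last_active.items()
                   if now - t > self.idle_timeout_s]
        for s in expired:
            del self.last_active[s]  # queries for this session would be cancelled
        return expired

tracker = SessionTracker(idle_timeout_s=120)
tracker.touch("s1", now=0)
tracker.touch("s2", now=100)
print(tracker.expire_idle(now=150))  # ['s1']
```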

  • (beta) ODAS workers cache bytes from storage
    ODAS workers now support a variant of LRU caching which automatically caches bytes from the storage system. This is only supported for the file system datasources: S3 and HDFS. This cache is enabled by default but defaults to a small size (1GB per worker). The cache size can be configured via the worker config io_cache_size which controls the size in bytes. Setting it to a value <= 0 disables the cache.
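A size-bounded LRU byte cache, as described above, can be sketched like this (a hypothetical `ByteCache`, not the worker's actual cache; note how a capacity <= 0 disables caching, mirroring io_cache_size):

```python
from collections import OrderedDict

class ByteCache:
    """LRU cache of byte ranges keyed by (path, offset), evicting the
    least-recently-used entries when over capacity (illustrative sketch)."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.entries = OrderedDict()
        self.size = 0

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, key, data):
        if self.capacity <= 0:
            return  # io_cache_size <= 0 disables the cache
        if key in self.entries:
            self.size -= len(self.entries.pop(key))
        self.entries[key] = data
        self.size += len(data)
        while self.size > self.capacity:
            _, evicted = self.entries.popitem(last=False)  # evict LRU entry
            self.size -= len(evicted)

cache = ByteCache(capacity_bytes=8)
cache.put(("s3://f", 0), b"aaaa")
cache.put(("s3://f", 4), b"bbbb")
cache.get(("s3://f", 0))            # touch the first entry
cache.put(("s3://f", 8), b"cccc")   # evicts the LRU entry at offset 4
print(sorted(k[1] for k in cache.entries))  # [0, 8]
```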

  • Performance enhancements for heavily partitioned tables.
    This release has some significant performance improvements for operations on partitioned tables. The automatic partition recovery that scans for new folders added on S3 is optimized to run faster than before. Similarly, operations like ALTER TABLE ADD PARTITION and ALTER TABLE RECOVER PARTITIONS are optimized by efficiently scanning changes in S3 buckets and by managing HMS partitions with more parallelism. On the scan side, if the number of partitions is greater than 200, the partition metadata is now loaded in the workers instead of the planner; loading metadata for heavily partitioned tables in the planner was causing query timeouts. The planner now round-robins the partitions to the workers, and each worker loads partition metadata only for the partitions whose records it has to fetch. This ensures queries do not time out during the planning phase, and overall query execution on partitioned tables is faster.
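The round-robin handoff of partition metadata loading can be sketched as follows; the 200-partition threshold comes from the text above, while `assign_partitions` itself is an illustrative simplification:

```python
def assign_partitions(partitions, workers, planner_threshold=200):
    """If the partition count exceeds the threshold, round-robin partitions
    across workers so each worker loads metadata only for its share; otherwise
    the planner loads everything itself (illustrative sketch)."""
    if len(partitions) <= planner_threshold:
        return {"planner": list(partitions)}
    assignment = {w: [] for w in workers}
    for i, part in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(part)
    return assignment

parts = [f"day={d}" for d in range(300)]
out = assign_partitions(parts, ["w1", "w2", "w3"])
print({w: len(ps) for w, ps in out.items()})  # {'w1': 100, 'w2': 100, 'w3': 100}
```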

  • JDBC queries can now run in parallel.
    JDBC queries now run in parallel, provided a suitable numeric type field is specified via the mapred.jdbc.scan.key table property on the catalog table. See scan records in parallel.
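Parallelizing a JDBC scan over a numeric scan key amounts to splitting the key's [min, max) range into contiguous sub-ranges, one per task. A minimal sketch (hypothetical helper; not the documented split logic):

```python
def split_scan_ranges(min_key, max_key, num_tasks):
    """Split a numeric scan-key range into contiguous [lo, hi) sub-ranges so
    JDBC tasks can run in parallel, e.g. over the column named in the
    mapred.jdbc.scan.key table property (illustrative sketch)."""
    span = max_key - min_key
    step, rem = divmod(span, num_tasks)
    ranges, lo = [], min_key
    for i in range(num_tasks):
        hi = lo + step + (1 if i < rem else 0)  # spread the remainder evenly
        ranges.append((lo, hi))
        lo = hi
    return ranges

print(split_scan_ranges(0, 10, 3))  # [(0, 4), (4, 7), (7, 10)]
```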

Incompatible changes

  • Package path for Java client has been renamed to com.okera.*.
    This should not impact typical use cases as backward compatible classes have been added. For example, there exists two copies of RecordServiceHiveInputFormat in the old and new namespace. Clients that were developed against the Java client library will need to be updated.

  • The default LDAP port changed from 389 to 636.
    Previously, if unspecified, ODAS defaulted to an SSL-enabled connection to the server on port 389 for LDAP. This configuration is atypical, as 389 is the standard non-SSL port. This release changes the default to the standard SSL port (636), since SSL is enabled by default. For users who explicitly specify this configuration (LDAP_PORT), there is no impact.

1.2.3 (December 2018)

1.2.3 is a point release with some fixes to critical issues.

Bug Fixes

  • Fix web UI's 'Preview Dataset' by making scans with record limits much faster for partitioned datasets, significantly reducing the likelihood of timeouts. In the event of a timeout, a more accurate error message is now shown.

  • Significantly improve the performance of the web UI's Dataset List when the total number of datasets is large (1000+).

  • The machines in the ODAS cluster will now install Java 8 if Java is not already installed. ODAS has always required Java 8, but some newer Linux distros have updated the default Java version to Java 11, which is not compatible. This version is now properly pinned to Java 8.

  • ODAS clusters will by default start up with multiple planners. This previously could be optionally specified when creating the cluster but defaulted to a single planner. As part of this change, a client now has sticky sessions meaning clients will be pinned to a planner for some duration, allowing APIs such as scan_paged to work correctly.

  • Fixed issue with idle session expiry. Previously some idle sessions were not tracked correctly and did not expire as promptly as expected.

  • Fixed client side issue scanning some complex schemas with a particular combination of nested structs.

1.2.2 (October 2018)

1.2.2 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed 'ocadm agent start-minion' to aid in manual cluster repair

  • Properly return an error message for queries that contain a LEFT ANTI JOIN

  • Idle sessions now timeout by default with a timeout of 120 seconds. A session is considered idle if the client did not make any request in that time window. This config can be controlled via the planner or workers idle_session_timeout config.

  • A fix to optimize processing of datasets with large numbers of partitions.

  • Fix web UI's 'Preview Dataset' so that it relies on LAST PARTITION

1.2.1 (September 2018)

1.2.1 is a point release with some fixes to critical issues.

Bug fixes

  • Fixed a critical issue with scanning nested collections.

  • Added support for AVRO schema files specified using an HTTPS URI.

  • Fixed some error handling in PyOkera.

  • Increased the default connection limit to 512.

1.2.0 (September 2018)

1.2.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Data usage and reporting

Using the Okera Portal, users can now understand how the datasets in the system are being used. This can be useful for system administrators and data owners to understand which datasets are being used most often, by whom and with which applications. The reporting insights are built on the audits and automatically capture system activity. For more details, see here.

Support for data sources using JDBC

Okera Data Access Platform (ODAS) now supports data sources connected via JDBC, typically relational databases. These datasets can be registered in the Okera Catalog and then read and managed like any other Okera dataset. For more details on how to register and configure these sources, see here.

Improved access level granularity

ODAS now supports richer access levels, in addition to SELECT and ALL. It is now possible, for example, to grant users only the ability to find and look at metadata, or only to alter dataset properties. We've also added the concept of a public role, which can simplify permission management. For details and best practices, see here.

Access Control Builtins

ODAS now supports a family of access control builtins. These are intended to be used in view definitions and can dramatically simplify implementing fine grained access control. See this document for more details.

Improvements to LAST SQL clause

ODAS supports the LAST PARTITION clause to facilitate sampling large datasets. In this release, this support was extended to support LAST N PARTITIONS and LAST N FILES. In addition, it is now possible to set this as metadata on the catalog object, to prevent queries trying to read all partitions. See here for best practices.
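Conceptually, LAST N PARTITIONS samples the tail of the partition list in partition-key order. A toy sketch (illustrative only, not the SQL implementation):

```python
def last_n_partitions(partitions, n):
    """Pick the last N partitions in partition-key order, mimicking the
    LAST N PARTITIONS sampling clause (illustrative sketch)."""
    return sorted(partitions)[-n:]

parts = ["dt=2018-09-01", "dt=2018-09-02", "dt=2018-09-03"]
print(last_n_partitions(parts, 2))  # ['dt=2018-09-02', 'dt=2018-09-03']
```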

Improvements to Workspace

Workspace can now run multiple queries at once and supports monospace format outputs and datetime queries.


  • PyCerebro has been renamed PyOkera. The API is effectively unchanged except that now instead of importing 'from cerebro' you will need to import 'from okera'.

  • Parallel Execution of Tasks. PyOkera will now schedule and execute worker tasks in parallel to minimize network latency. The scan_as_pandas() and scan_as_json() API calls will by default spawn worker processes to concurrently execute tasks where possible. The default number of local worker processes is defined as 2 times the number of CPU Cores on the machine on which it is being processed. This has demonstrated a reduction in run-time duration for queries by minimizing the network latency involved with establishing network connections with the Okera Worker Nodes.
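The pool sizing and fan-out described above can be sketched as follows. For a self-contained example this uses threads rather than the worker processes PyOkera spawns, and both helpers are hypothetical:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def default_pool_size():
    # Default worker count: 2x the CPU core count, matching the behavior
    # described above (sketch).
    return 2 * (os.cpu_count() or 1)

def run_tasks(tasks, fetch, pool_size=None):
    """Fetch results for all planner tasks concurrently, hiding the latency of
    establishing connections to the worker nodes."""
    with ThreadPoolExecutor(max_workers=pool_size or default_pool_size()) as pool:
        return list(pool.map(fetch, tasks))

print(run_tasks([1, 2, 3], lambda t: t * 10))  # [10, 20, 30]
```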


  • Improved planner task generation. One of the responsibilities of the planner is to break up the files that need to be read into tasks. In this release, we've implemented a new cost-based algorithm which should result in tasks that are more even. This should lead to less execution skew across tasks and overall reduction in job completion times.
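A cost-based task generator of this flavor can be sketched with the classic longest-processing-time heuristic: sort files by size and always hand the next file to the lightest task. This is a standard algorithm used for illustration, not necessarily Okera's:

```python
import heapq

def build_tasks(file_sizes, num_tasks):
    """Greedy cost-based assignment: give the next-largest file to the
    currently lightest task, yielding more even task sizes (illustrative
    longest-processing-time sketch)."""
    heap = [(0, i, []) for i in range(num_tasks)]  # (total cost, task id, files)
    heapq.heapify(heap)
    for size in sorted(file_sizes, reverse=True):
        total, i, files = heapq.heappop(heap)
        files.append(size)
        heapq.heappush(heap, (total + size, i, files))
    return [files for _, _, files in sorted(heap, key=lambda t: t[1])]

tasks = build_tasks([100, 90, 10, 10, 5, 5], num_tasks=2)
print([sum(t) for t in tasks])  # [110, 110]
```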

  • Improved planning time for queries on tables with a large number of partitions. The planner now loads metadata more lazily, deferring as much as possible until after partition pruning. This can result in significantly better latency for queries that scan only a few partitions of a heavily partitioned table.

  • Improved expression handling in the planner. The planner will fold constant expression trees and reorder expressions. Queries that push complex expressions to ODAS should see improvements.

  • Dramatically improved planner and worker RPC handling. Server side RPC handling is much more robust to slow clients or if there are transient slowdowns in dependent authentication services.

  • The worker fetch RPC now keeps connections alive for clients, eliminating the need to set high client RPC timeouts. Users previously worked around this by setting a very high value for recordservice.worker.rpc.timeoutMs.

  • Support for caching authenticated JWT tokens and increasing the timeout.

Other improvements

  • Added support for ALTER DATABASE <db_name> SET DBPROPERTIES

  • Added support for 'ALTER TABLE RENAME TO' in Hive.

  • EMR support included through EMR 5.16.0

  • Support for MySQL 5.7 as the backing database for the Okera catalog.

  • Support Avro bytes data type. This is treated by ODAS as STRING

  • EXPLAIN is now supported as a DDL command and can be run via odb or from the web portal.

  • Support for the UNION ALL operator. Note that UNION DISTINCT is not supported.

  • Updated Kubernetes to 1.10.4 and the Kubernetes dashboard to 1.8.3.

  • Users can now use the keyword CATALOG as an alias for SERVER to grant access to the entire catalog. For example, GRANT SHOW ON CATALOG TO ROLE common_role would enable metadata browsing of the entire catalog.

  • (Beta) DeploymentManager now supports deploying ODAS clusters on externally managed Kubernetes clusters. In this case, the DeploymentManager just deploys ODAS services without managing machines or the Kubernetes.

  • The Okera portal can now store token credentials using cookies, allowing users to share credentials across web applications.

  • Added support for naming the sentryDB when creating a cluster. The name can be specified with the "--sentryDbName" flag when using ocadm.

  • Partitioned tables can now have their partitions automatically recovered.

  • Diagnostic Bundler now captures network route and firewall details

  • Support for IF NOT EXISTS in CREATE ROLE and SHOW ROLES LIKE SQL statements.

Changes in behavior

  • View usage for Python now uses the pyokera client instead of the REST API.

  • Okera views now inherit stats from the base tables and views the view is created from. These can be overwritten using ALTER TABLE but will provide better behavior for the vast majority of use cases.

  • Okera portal no longer estimates the total number of datasets as this can cause performance issues with very large catalogs.

  • Workspace will no longer render more than 500 rows per query.

  • Workspace terminal will drop older queries if the terminal output exceeds more than 750 rows of queries in total. This will improve workspace rendering speeds.

  • Post_scan request now utilizes the pyokera client.

Bug fixes

  • Improved error handling for invalid dataset metadata, for example an Avro schema path that is no longer valid.

  • When using field name schema resolution for parquet files, the field comparison is now case insensitive. This matches the behavior from the Apache Parquet Java implementation (parquet-mr).

  • The double datatype now returns as many digits as possible when scanned via the REST API. Previously, this would round or return scientific notation for some ranges of values.

  • Allow revoking URIs to non-existent (typically deleted) S3 buckets. This would previously error out.

  • Fixed issues with creating SELECT * views in some cases. Previously, this could fail when layers of views were used.

  • Table property, 'skip.header.line.count', is now properly respected.

Incompatible and Breaking Changes

  • Deprecated the CEREBRO_INSTALL_DIR and DEPLOYMENT_MANAGER_INSTALL_DIR environment variables. OKERA_INSTALL_DIR should be used.

Known issues

  • Partitioned tables where the user only has SHOW access will mistakenly show the user as having access to the partition columns in the UI. The UI properly shows that the other columns are inaccessible.
  • While multiple ODAS clusters can be configured to share their HMS and Sentry databases, datasets created by a 1.2.0 ODAS cluster cannot be read by ODAS clusters running earlier versions (1.1.x or earlier).

1.1.0 (June 2018)

The 1.1.0 release introduces two major items but no significant alterations to existing features or functionality. It includes all of the fixes from 1.0.1.

Support for Array and Map collection data types

This completes the complex types support started in 0.9.0, when the struct type was introduced. This release adds support for Arrays and Maps. As with struct support, only data stored in Avro or Parquet is supported. See the docs for more details.

Migration from company rename

In 1.0.0, we renamed the product but maintained backwards compatibility; for example, system paths that contained the product or company name were not changed. In 1.1.0, we have completed the product renaming, and users upgrading will need to migrate.

Incompatible and Breaking Changes

  • AWS AutoScalingGroup launch scripts (the parameter passed to the --clusterLaunchScript flag when creating an ODAS environment) should accept the --numHosts parameter. This is a breaking change from 1.0.0 when the flag was introduced as --hosts.

  • Okera Portal (UI) Workspace page no longer accepts queries in the URL. Bookmarks and links from previous versions that included a query will still go to the page, but query arguments will not populate the input box.

  • Pyokera (fka Pycerebro) no longer returns a byte array to represent strings from calls to scan_as_json. Instead, it returns a UTF-8 encoded Python string, which is automatically serializable to JSON.

Known issues

  • Support for AWS AutoScalingGroups (ASGs) is in beta and not recommended for production use. There may be issues scaling an ASG cluster down, depending on which EC2 VM is terminated.

1.0.1 (May 2018)

1.0.1 is a patch release that fixes some critical issues in 1.0.0. We recommend all 1.0.0 users switch to this version for both the server and java client libraries. The java client library for this release is also 1.0.1.


  • Fixed an issue health-checking planners and workers with Kerberos enabled. The health checks were failing continuously, causing cluster stability issues.

  • Fixed the diagnostic bundler when some of the collected log files are either very large or corrupt. In some cases, log files on the host OS in /var/log could be corrupt or very large; this used to cause the bundle to fail, and such files are now skipped with the issue logged.

  • Fixed an issue when dropping favorited datasets in the UI. In some cases, favorited datasets that no longer exist or are no longer accessible were displayed incorrectly.

  • Enabled a small row-group optimization when reading Parquet. Parquet files with small row groups had very poor performance in some cases, particularly if the table had many columns. The implementation for how these kinds of files are read has been changed to handle this case better.

  • Fixed some queries when scanning partitioned tables from Hive with filters. Some queries using filters on partitioned tables resulted in the client library generating an invalid planner request. This issue was specific to some very particular Hive queries and has been resolved in this version.

1.0.0 (May 2018)

The 1.0.0 release is a major version, introducing significant new functionality and improvements across the platform.

Name Change Notice

With the 1.0.0 generally available (GA) release, our company and product name changes as well. Cerebro Data, Inc is now Okera, Inc, and the Cerebro Data Access Platform is now the Okera Active Data Access Platform. Component names have mostly been updated, and documentation should reflect the current state of all component names. In this release, we have maintained backwards compatibility and existing automation will continue to work. The binary paths continue to use the Cerebro name.

Deprecations Notice

  • Environment variables that begin with CEREBRO_ will continue to be supported only until version 1.2.0 is released. For standard environment variables established during installation, CEREBRO will alias OKERA.

Upgrading from prior versions

It is not possible to upgrade an existing ODAS cluster to 1.0.0 (i.e., ocadm clusters upgrade will not work). Instead, a new ODAS cluster must be created. Note that this applies only to the ODAS cluster itself; the catalog metadata from older clusters can be read with no issues.

Diagnosability and Monitoring

  • Support for collecting logs across all services and machines for diagnostics. This bundle can be sent to Okera or used by users to improve the troubleshooting experience. Cluster administrators can use the CLI to generate this support bundle across cluster services with one command.

  • Support for the Kubernetes dashboard and Grafana for UI-based administration and monitoring. ODAS now installs and enables the Kubernetes dashboard, which helps cluster admins manage an ODAS cluster (such as restarting a service, inspecting configs, etc.), and Grafana, which surfaces metrics across the cluster (CPU usage, memory usage, etc.). These can now be optionally enabled at cluster creation time.

Robustness and Stability

  • Resolved issue where occasionally multiple workers can be assigned to the same VM, causing load skew and cluster membership issues.

  • Improved healthchecking to be able to detect service availability, triggering repair of individual containers more reliably. The healthcheck now more closely emulates what a client would do. If the healthcheck detects an issue, the individual container is restarted.

  • Beta support for Amazon Web Services Auto Scaling Groups (ASG). Users can now provide an ASG launch script instead of the instance launch script and the ODAS cluster will be built using the ASG. This should improve cluster launch times at scale, make it easier to manage the VMs (a single ODAS cluster is one ASG) as well as handle node failures. When the ASG relaunches a machine for the failed nodes, the Deployment Manager will automatically detect this, remove the failed nodes and add the new ones.


Documentation improvements

In this release, the product documentation has been significantly improved. Search is now supported throughout the documentation. Content has been added to cover FAQs, a product overview, and many other topics.

UI improvements

  • Workspace is now enabled for admin users. The workspace provides an easy interface for data producers and data stewards to issue DDL and metadata queries against the system, without the need to use another tool. Workspace provides basic query capabilities but is not intended to be used for analytics. Features include: scan queries, DDL queries, and query history.

  • A Permissions page has been added to help users understand how permissions have been configured. For data stewards, it provides the ability to look up which users have been granted access and how. For data consumers, it allows them to look up the information needed to acquire access to more datasets.

  • Users can now tag and favorite datasets to be able to easily work with them again in the future. These tags and favorites are currently unique per user.

  • Performance and scale improvements. The UI scales much better with larger catalogs.

Performance and scalability

  • Significant performance improvements for metadata operations on large catalogs. Numerous improvements were made to the responsiveness of metadata operations (DDL) against large catalogs and tables with large number of partitions. This includes operations such as show partitions and show tables as well as the planning phase at the start of each scan request.

  • Improvements to task generation in the planner. The planner is responsible for generating tasks run by the workers. The tasks generated by the planner should now be more even across more cases - for example, skew in partition sizes, data file sizes, etc.

  • Improvements to worker scalability with high concurrency. The workers are now more efficient under high concurrency (200+ clients per worker).

  • Better batch size handling in workers. Workers produce results in batches which are returned to the client. Improvements were made to manage the batch size and memory usage better across a wider range of schemas.

  • The default replication factor of some of the services (e.g. odas-rest-server) has been increased to improve fault tolerance and scalability.

EMR integration improvements

  • Spark: SparkSQL can be run directly against tables in the catalog with no need to create a temporary view. While this worked previously, the integration is improved and now performs as well as the temporary view usage pattern. For example, SparkSQL users can just run spark.sql('SELECT * FROM okera_sample.users').

  • Hive: Improvements to handling of partitioned tables to improve planning performance. Queries that scan a few partitions in a dataset with many partitions should see significant improvements.

  • Deprecating single-tenant install support. While it is still supported in this release, the single-tenant EMR integration is being deprecated. Instead it is recommended to use the multi-tenant install but only bootstrap a single (typically hadoop) user. Improvements were made to the user bootstrapping and setup scripts. See the [emr docs][emr-integration].

  • EMR support extended to include 5.1 and 5.2.


  • Added support for BINARY and REAL data types. See the data types docs.

  • Support for registering bucketed tables and views using lateral views. Note that since ODAS does not support these, the user must have all access on these tables and views, as well as direct access to the underlying files in the file system.

  • Extended ODAS SQL to add DROP_TABLE_OR_VIEW. See supported-sql for more details.

  • Audit logs now include the Okera client version as part of the application field. For example, instead of presto, new clients will identify as presto (1.0.0). Note that this is the version of the Okera client, not the version of presto.
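As a sketch of how downstream tooling might consume the new field: the `name (version)` shape is taken from the example above, but the helper name and regex are ours, not part of the product.

```python
import re

def parse_application(field):
    """Split an audit-log application field such as 'presto (1.0.0)'
    into (client_name, okera_client_version). Older clients log only
    the bare name, in which case the version is None."""
    match = re.fullmatch(r"(.+?)\s+\((\S+)\)", field)
    if match:
        return match.group(1), match.group(2)
    return field, None

print(parse_application("presto (1.0.0)"))  # ('presto', '1.0.0')
print(parse_application("presto"))          # ('presto', None)
```

This lets audit reports group by client name while still distinguishing old and new Okera client versions.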

Incompatible and Breaking Changes

REST server default timeout increased to 60 seconds from 30

This is mostly to accommodate long DDL commands, such as alter table recover partitions.

New log filename format

The log files that are uploaded to S3 have a new naming scheme. The old naming scheme was <service>-<pid>-<id>-<timestamp>-<guid>.log. The new naming scheme is: <timestamp>-<service>-<ip>-<id>-<guid>.log.

This makes it more efficient to find logs from a given time window.
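To illustrate why the timestamp-first scheme helps: lexicographically sorted object names are now in chronological order, so a time window is a contiguous slice that can be found by binary search. The file names and timestamp format below are invented for illustration; the real format may differ.

```python
import bisect

# Toy object names in the new <timestamp>-<service>-<ip>-<id>-<guid>.log
# scheme (timestamps and GUIDs invented for illustration).
keys = sorted([
    "20180501T0930-planner-10.0.0.1-0-aaaa.log",
    "20180501T1005-worker-10.0.0.2-1-bbbb.log",
    "20180501T1130-worker-10.0.0.3-2-cccc.log",
])

def logs_in_window(keys, start, end):
    """Timestamp-first names sort chronologically, so the logs for a
    time window form a contiguous run located via binary search."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return keys[lo:hi]

window = logs_in_window(keys, "20180501T1000", "20180501T1200")
```

Under the old `<service>-<pid>-...` scheme, the same query required scanning every object name.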

ODAS REST Service returns decimal data as string, instead of double

Decimal data is converted to string when accessed via the REST APIs. The JSON result set now returns decimal values as strings to prevent any precision loss. Clients can control the rounding behavior in the application and cast the type as needed.
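A minimal sketch of handling this on the client side; the JSON payload below is illustrative only, not the exact REST response shape.

```python
import json
from decimal import Decimal, ROUND_HALF_UP

# Illustrative payload only; the real REST response shape may differ.
payload = '{"rows": [{"price": "19.9900000000000001"}]}'

rows = json.loads(payload)["rows"]

# Parsing the string into Decimal preserves every digit; converting
# to float would silently lose precision.
price = Decimal(rows[0]["price"])

# The client now decides the rounding behavior, e.g. to two places:
rounded = price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Had the service returned a JSON double instead, the trailing digits would already be gone before the client could choose how to round.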

Deployment Manager now requires Java 8

Known issues

Using AWS autoscaling groups

If an ODAS cluster is created using a clusterLaunchScript, it will instantiate an autoscaling group of the specified size in AWS. Scaling an ODAS cluster that is running on an ASG is not supported in 1.0.0. Specifically, scaling down is known to have issues. This will be remedied in the next release.

Earlier Releases

Release notes for 0.9.0 (April 2018) or earlier.