
Okera Version 1 Release Notes

This topic provides Release Notes for all 1.x versions of Okera:


Bug Fixes and Improvements

  • Fixed an issue where writing to non-partitioned tables from Spark would fail if Spark bypass was enabled.
  • Improved error handling when doing unsupported operations on complex types.
  • Fixed an issue where running count(struct_field.some_value) would fail when run inside views.
  • Fixed an issue where using ORDER BY in an external view could fail an authorization check.
  • Fixed an issue where some decimals were not serialized properly when accessed via the /scan API.
  • Improved some error handling on the node-remover CronJob for Gravity-based clusters.
  • Fixed an issue where CTEs that contained aggregations would fail.
  • Added the ability to disable Zookeeper-based worker membership and instead leverage the Kubernetes metadata. This can be enabled by setting OKERA_KUBERNETES_MEMBERSHIP: true in the configuration file.
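A minimal sketch of enabling this in the cluster configuration file (the key name comes from the note above; the surrounding layout is illustrative):

```yaml
# Sketch: use Kubernetes metadata for worker membership instead of ZooKeeper.
OKERA_KUBERNETES_MEMBERSHIP: true
```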


Bug Fixes and Improvements

  • Fixed several issues related to access control on tables and views with complex types.
  • Fixed an issue when registering JDBC tables with many columns.
  • Fixed an issue where small decimals would not be returned correctly when queried via the Presto endpoint.

Notable and Incompatible Changes

  • In PyOkera, scan_as_json now defaults strings_as_utf8 to True, matching the behavior prior to 1.5.2.


Bug Fixes and Improvements

  • Fixed an issue in PyOkera where scan_as_json and scan_as_pandas would ignore the tz option supplied on the context object.
  • Fixed an issue in the Presto client library where it did not properly handle null checks on STRUCT columns.


Bug Fixes and Improvements

  • Fixed an issue where queries on views that referenced STRUCT columns could fail when an ABAC permission applied to it.


Bug Fixes and Improvements

  • Fixed an issue where many concurrent CREATE TABLE or CREATE VIEW statements could be slowed down waiting on a shared resource.
  • Fixed an issue when authorizing queries on views with complex types.
  • Fixed an issue where the server was not properly clearing the effective user when different users utilize the same underlying planner connection (this typically happens only in PyOkera scripts that switch between different users, such as tests).
  • Added an option to use the SYSTEM_TOKEN as the shared HMAC secret for signing and validating tasks (in the Planner and Worker services) rather than using ZooKeeper. This option can be enabled by setting SYSTEM_TOKEN_HMAC: true in the configuration file.
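As with other cluster settings, this can be sketched in the configuration file (key name from the note above; layout illustrative):

```yaml
# Sketch: use SYSTEM_TOKEN as the shared HMAC secret for signing and
# validating tasks between the Planner and Worker services.
SYSTEM_TOKEN_HMAC: true
```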


Bug Fixes and Improvements

  • Fixed an issue where in certain EKS environments, the CPU scheduler was not properly saturating the CPU capacity.
  • Fixed an issue where scanning Parquet files would fail if their dictionary_offset was after the data_page_offset.
  • Added an improvement for SerDes that use field delimiters, allowing field delimiters to be specified within double quotes.


Bug Fixes and Improvements

  • Fixed an issue in PyOkera where it would incorrectly decode negative decimal values with a precision higher than 18.
  • Fixed an issue where, when --allow_nl_in_csv=true was set and the CSV file used a quote character other than ", the " character was still improperly used to escape line breaks.
  • Improved the Crawler's ability to automatically use the OpenCSV SerDe when necessary.
  • Fixed issues for handling complex types that had several nested arrays/structs/maps with null values interspersed.
  • Fixed an issue where reserved keywords could not be used as attribute namespaces and attribute keys (e.g. myns.true), because escaping them did not work.
  • Added the ability to use CREATE TABLE LIKE TEXTFILE, which automatically deduces the schema from the CSV file (this assumes the first line contains the headers).


Bug Fixes and Improvements

  • Added the ability to edit dataset and column descriptions in the Okera UI.
  • Fixed an issue where datasets discovered by the crawler that had columns whose type definition exceeded 4,000 characters couldn't be registered.
  • Added more control options for LDAP group resolution configuration.
  • Fixed an issue where Avro datasets that had a union type with a single child (e.g. union(int)) would throw an error. These types of unions are now fully supported.
  • Fixed an issue where decimals that were stored as a byte_array in Parquet files were not read correctly.
  • Added a configuration option to control the maximum number of allowed Sentry and HMS connections.
  • Fixed an issue where changing the description of a view (or of a column in it) via DDL was not supported.
  • Fixed an issue where columns that contained arrays or maps with embedded null values were not handled correctly in the Java-based clients.


Bug Fixes and Improvements

  • Improved ZooKeeper membership registration and cluster health check capabilities. The cluster can now identify more cases where a node gets incorrectly deregistered and self-heal.
  • Improved handling of non-parsable SQL statements when accessing a view that was created outside Okera (e.g. in Hive). This capability is enabled by an environment flag ALLOW_NONPARSEABLE_SQL_IN_VIEWS: true set in the configuration file for the cluster.


Bug Fixes and Improvements

  • Fixed an issue where Hive/Hue could not load the table listing for a database if it contained a view that Okera could not parse.
  • JWT tokens with a group claim can now have that claim be a simple string denoting the group rather than having it be an array.


Bug Fixes and Improvements

  • Improved performance of attribute access checks on wide views.
  • Fixed an issue where an attribute-based grant on a view with a complex type might not properly omit the complex type column.
  • Added support for CSVs with embedded newlines within records that are enclosed within the quote separator. To enable this, specify --allow_nl_in_csv=true for RS_ARGS in your ODAS configuration.
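A sketch of what this flag might look like in an ODAS configuration, assuming RS_ARGS takes a flag list (the exact RS_ARGS formatting may differ per deployment):

```yaml
# Sketch: enable embedded newlines in quoted CSV records.
RS_ARGS: "--allow_nl_in_csv=true"
```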


Bug Fixes and Improvements

  • Fixed an issue where joining or unioning a dataset with itself could cause an invalid query plan to be generated, preventing that query from being run.
  • Fixed an issue where a column-level grant on a view could allow joining on columns other than those granted.
  • Improved the detection in PyOkera of whether Pandas and NumPy are installed, and if not, still allow usage of all functionality that does not require them.
  • Fixed an issue where an external view in Hive that has both row_number() and an ORDER BY clause could cause the query to fail.
  • Fixed an issue where non-conformant Parquet files that have a mismatch between the number of records specified in the dictionary header vs. the actual batch would cause the file to not be queryable.
  • Added the ability to specify the CATALOG_DB_PASSWORD, LDAP_GROUP_RESOLVER_PASSWORD and LDAP_USER_QUERY_SERVICE_PASSWORD in a Kubernetes secret.
  • Added the ability in okctl to specify CATALOG_DB_PASSWORD, LDAP_GROUP_RESOLVER_PASSWORD and LDAP_USER_QUERY_SERVICE_PASSWORD as file paths in the configuration file.


Bug Fixes and Improvements

  • Fixed an issue for Parquet files where TIMESTAMP and TIMESTAMP_MILLIS columns that were backed by int64 were not supported.
  • Fixed an issue where an invalid plan could cause the worker to crash.
  • Added two new DDLs that allow changing the comment on a table and column:
    • ALTER TABLE <table> CHANGE COMMENT '<comment>'
    • ALTER TABLE <table> CHANGE COLUMN COMMENT <col> '<comment>'
  • Added APIs to get and set the description on a dataset and column:
    • GET/PUT /datasets/<name>/description
    • GET/PUT /datasets/<name>/columns/<column>/description
  • For PyOkera, execute_ddl now takes an optional requesting_user parameter, similar to the plan and scan_as_... functions.
  • Fixed an issue where a column-level grant on a view could allow filtering (but not viewing) on columns other than those granted when executing a query in Workspace.
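The comment DDLs above can be sketched as follows (table, column, and comment values are illustrative):

```sql
-- Change the table-level comment.
ALTER TABLE sales.transactions CHANGE COMMENT 'Raw transaction feed';

-- Change the comment on a single column.
ALTER TABLE sales.transactions CHANGE COLUMN COMMENT txn_amount 'Amount in USD';
```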


Bug Fixes and Improvements

  • Fixed an issue where DECIMAL columns in Avro schemas would not get detected properly.
  • Added the ability to provide a default clamp value for DECIMAL columns whose precision exceeds the maximum precision allowed (38). This can be set using the AVRO_SCHEMA_TOO_HIGH_PRECISION_FALLBACK configuration value.
  • Added support for skip.footer.line.count table property.
  • Performance improvements in the case of many small files in a single partition.

    Note: Okera still recommends that you avoid having small files.

  • Fixed an issue where some sensitive values would be exposed in the Planner and Worker debug UIs.
  • Added the ability to enable setting X-Frame-Options: DENY for all requests by setting the FRAME_OPTIONS_DENY_ENABLED configuration value.
  • Added the ability to enable the Secure flag on the session cookie using the OKERA_SHARED_COOKIE_SECURE configuration value.
  • Improved default cipher support for TLS1.2.
  • Added the ability to control the duration of the generated JWT when logging in by setting JWT_TOKEN_EXPIRATION to the desired number of seconds (minimum is 300 seconds).


New Features

JDBC Data Sources

  • Added support for Sybase.
  • Added support for filter pushdown.
  • Added support for count(*) for JDBC data sources.
  • Added support for case sensitive column names.
  • Added support for specifying custom SSL CAs to use to validate when making SSL connections to the JDBC data source.

Audit Log Uploads

It is now possible to configure audit logs to be uploaded in an immutable fashion. When enabled, audit logs will be uploaded with a .staging.audit and .staging.reporting suffix until they are finalized, and will then be uploaded without the .staging portion when finalized.

To enable this, set WATCHER_AUDIT_LOG_STAGING_FILES to true or 1.

Additionally, it is possible to force the audit logs to be uploaded after a certain number of seconds have passed, by specifying WATCHER_AUDIT_LOG_MAX_UPLOAD_SEC.
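Combining the two settings, an illustrative configuration fragment might look like this (the 900-second value is just an example):

```yaml
# Sketch: upload audit logs immutably, with .staging.* suffixes until finalized.
WATCHER_AUDIT_LOG_STAGING_FILES: true
# Force an upload at least every 15 minutes (value in seconds).
WATCHER_AUDIT_LOG_MAX_UPLOAD_SEC: 900
```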


  • PyOkera now has full support for complex types (ARRAY, MAP, STRUCT).
  • context.enable_token_auth now accepts an optional argument called token_func, which can reference a no-argument function that when called, returns a valid token to be used.

    Note: This function must be pickle-able (and an error will be returned if it isn't), as it will be used across multiprocessing calls.

  • PyOkera now supports running scan_as_json and scan_as_pandas using Presto.
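The pickle-ability requirement for token_func can be illustrated with plain Python (no PyOkera needed; fetch_token is a stand-in for real token retrieval):

```python
import pickle

# Stand-in token provider: a no-argument, module-level function.
# Module-level functions are pickle-able, so they can cross the
# multiprocessing boundaries that token_func must cross.
def fetch_token():
    # A real implementation might re-read a refreshed JWT from disk.
    return "example-jwt"

# Round-trips through pickle and still returns a token when called.
restored = pickle.loads(pickle.dumps(fetch_token))
print(restored())  # example-jwt

# By contrast, a lambda is not pickle-able, so passing one as
# token_func would fail with an error.
try:
    pickle.dumps(lambda: "example-jwt")
except (pickle.PicklingError, AttributeError) as exc:
    print("lambda rejected:", type(exc).__name__)
```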

Bug Fixes and Improvements

  • Added the ability to ignore LDAPS certificate errors when doing group resolution.
  • Added the ability to set Presto tuning variables.
  • Improved handling for Date type in JDBC data sources.
  • Improved handling of broadcast joins using cross-task caching.
  • Fixed an issue where JDBC data sources that had USING VIEW AS did not properly handle single quotes in the view.
  • Fixed an issue where JDBC data sources did not close the connection properly when no more events were necessary, causing poor performance.
  • ODAS Web UI will now automatically redirect to the https URL if a user navigates to the http one.
  • Added the ability to control how long the Web UI waits before timing out a request to the server (default is 30000, in milliseconds), by setting the UI_TIMEOUT_MS configuration.
  • ODAS Web UI will now break out the inner portions of ARRAY and MAP complex type columns.
  • Added the ability to configure ODAS to look for user-specified claims in the JWT to determine the user (JWT_USER_CLAIM_KEY, default is sub) and groups (JWT_GROUP_CLAIM_KEY, default is groups).
  • Added support for partitioning schemes on S3 that do not contain the partition column name in the folder, e.g. s3://company/dataset/2019 vs s3://company/dataset/year=2019. This can be enabled by setting okera.hms.allow-no-name-partitions to true in hive-site.xml.
  • Fixed an issue where array and map indexing in an external view definition would cause ODAS to fail to parse.
  • Added support to specify strings_as_utf8=True when using scan_as_json in PyOkera.
  • Fixed an issue in PyOkera when converting a CHAR column to UTF-8.
  • Upgraded several dependencies.
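The no-name partition scheme setting described above goes in hive-site.xml; a minimal illustrative fragment:

```xml
<!-- Sketch: allow S3 partition folders without the column name, e.g. /2019/. -->
<property>
  <name>okera.hms.allow-no-name-partitions</name>
  <value>true</value>
</property>
```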

Notable and Incompatible Changes

  • The bundled Presto service now exposes an additional "catalog" (in Presto terms) called okera (in addition to the existing recordservice one). These are identical and contain the same datasets. In a future version, the recordservice catalog will be removed and is now deprecated. All clients should shift usage to the okera one.

  • Removed the default from deserializer column comment that would appear for Parquet and Avro files when created using CREATE TABLE LIKE FILE.

  • In PyOkera, when using scan_as_json, date columns are now serialized to millisecond precision without the corresponding timezone, to match output of other APIs.

  • The driver type of redshift is now required to connect to Redshift, and the postgresql type will no longer work. This was done because the drivers have diverged and were updated for security and performance reasons.


New Features

SAML Support

You can now configure authentication to ODAS with SAML providers.

JDBC Data Sources

  • Added support for MS SQL Server.
  • Added support for Redshift External Tables.

LDAP Authentication

You can now configure LDAP authentication to do two-step authentication (DN lookup followed by authentication).

Bug Fixes and Improvements

  • Data Registration Crawler improvements:
    • Increased performance on large partitioned tables.
    • Improved filetype classification.
    • Avro schema comment fields (i.e. description) will now be inherited by ODAS when registered.
  • Azure improvements and fixes:
    • Added support for Azure MySQL connections where SSL is required.
    • Fixed an issue where CREATE TABLE LIKE FILE was not properly loading Avro schema files from ADLS.
  • Fixed a bug where ODAS was caching UDFs when a pattern was set in a call to SHOW FUNCTIONS.
  • Added the table property to control whether automatic partition recovery is enabled for a particular table: 'okera.auto_partition_recovery.disable'='true'.
  • Improved handling of DROP DATABASE CASCADE on a database that does not exist.
  • ODAS will now respect the LOCATION field set on a database.
  • Kubernetes liveness and readiness probes have been tuned to cause less load on the system.
  • Added tables in okera_system to expose role and group information.
  • Fixed an issue in the Hive SerDe to properly initialize the header skip flag.
  • Fixed an issue where the compiler was generating invalid CPU instructions for Decimal types due to bad memory alignment.
  • The value of OKERA_WORKER_LOAD_BALANCER is now respected if it is passed in.
  • Disabled an optimization for joins where the second table is larger than 128MB.
  • Fixed an issue in the Avro parser that did not allow for default values of empty arrays and maps.
  • Fixed an issue where partition names were not properly escaped in Hive.
  • Fixed an issue in the Kubernetes resource files for Presto to reference the correct version.
  • Improved system availability when registering a high number of partitions.
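The partition-recovery table property above can be set with standard Hive-style TBLPROPERTIES syntax (the table name is illustrative, and the ALTER TABLE form is an assumption based on common metastore conventions):

```sql
-- Sketch: disable automatic partition recovery for one table.
ALTER TABLE sales.events
SET TBLPROPERTIES ('okera.auto_partition_recovery.disable'='true');
```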

Notable and Incompatible Changes

  • Previously, changes to the CATALOG_ADMINS setting would not get fully reflected on a cluster that had previously configured these. In this release, users and groups referred to by CATALOG_ADMINS will be automatically granted admin_role on startup. If you have users that you no longer want to be admins, you should remove them from CATALOG_ADMINS.


New Features

Policy Builder

The Okera Portal now includes a new interactive policy builder. Table access policies and fine-grained permissions can now be granted through the UI.

Attribute-Based Access Control Updates

Updated syntax and other improvements to attribute-based access control (ABAC).

See Attribute-Based Access Control (ABAC) for more information.

Other Improvements

  • Azure: added experimental support for ADLS Gen2 - users can now CREATE EXTERNAL TABLE on data that is stored in Gen2 storage, and query that data.
  • Added IF EXISTS to DROP ROLE, so you can now do DROP ROLE IF EXISTS <role>.
  • Changed how we deploy ZooKeeper on Kubernetes to better handle node failures.
  • Updated the underlying Thrift library to version 11 to stay more current. This should have no user-visible impact.
  • Improvements to ALTER TABLE <table> RECOVER PARTITIONS to improve its runtime. There is more work planned for future releases.
  • Added a new table property that allows CREATE TABLE <name> LIKE <FILETYPE> to handle cases where a partition column and data column exist with the same name.
  • Improved handling of automatic file type detection in crawlers for Avro and JSON files.
  • The mask() UDF is now always available.
  • Permission model now supports CREATE_AS_OWNER, which lets users create objects in the catalog and be given owner (i.e. ALL) privileges on the new object. This can be used to create per user (staging) tables or to support distributed stewardship.
  • Fixed a bug where it was not possible to override the database name used for the CATALOG_DB_OKERA_DB database.
  • Fixed a bug where it was possible to create invalid grants that would fail downstream; these are now rejected at the point of creation.
  • Added a num_results_read column to okera_system.audit_logs, denoting the number of records read during a particular operation.
  • Support for special characters in column names. Okera now expands the set of special characters supported in column names to be on par with the ANSI SQL specification. The characters still not supported in column names are ., `, :, and !. Special characters in a column name can be escaped with backticks. For example, a column named Special Chars (name) can be declared as CREATE TABLE special_chars.sample (`Special Chars (name)` STRING).
  • The cerebro-web Kubernetes service was removed. All functionality is now consolidated into the cdas-rest-server service.

    Note: When using the Deployment Manager to upgrade from versions prior to 1.5.0, the cerebro-web service continues to exist after the upgrade. The service is vestigial, however, and should not be used. If there is need to remove this service entirely, please open a support ticket.

  • Improved robustness of service discovery in several places.
  • Added CEREBRO_EXTERNAL_PLANNER_HOST and CEREBRO_EXTERNAL_PLANNER_PORT, which can be set to override the planner's external host/port shown in the UI.

Incompatible Changes

  • Any external tooling checking for the existence of the cerebro-web service will no longer function. These tools should be updated to point at the cdas-rest-server service, which now encompasses the functionality.
  • Removed okera_system.weekly_audit_logs and okera_system.monthly_audit_logs views, since the UI preview was not functioning properly for them.
  • OKERA_PORT_CONFIGURATION, set in for Deployment Manager installs, no longer recognizes the cerebro_web:webui port. Please change this value to cdas_rest_server:webui for new clusters.


New Features

Improved Cluster Deployment

Okera clusters can now be created without using the Deployment Manager.

Support for Granting Column Access to Views

In previous Okera versions, it was not possible to grant column-level access on views, only tables. It is now possible to grant on columns in views as well.

See Managing Data Access for more information.

LDAP Group Resolution

Okera can now issue an ldapsearch to retrieve the groups associated with the username contained in a JWT if no groups are embedded in the JWT.

See LDAP for more information.

Other Improvements

  • Added a new way to set up automatic multi-tenant authentication for EMR and CDH integrations.
  • Added an ability to create one-node quickstart clusters that have out-of-the-box configuration including SSL, JWT, user/group settings.
  • Improved automatic service discovery for inter-service communication, allowing us to increase resiliency in the case of node failures.
  • Improved handling of unsupported or invalid views, typically inherited from an existing metastore. The view metadata can now be returned (though such views remain unqueryable).
  • Okera now supports HMS-escaped partition paths. Additional characters that were not escaped previously can now be used in the partition path, for example spaces and hyphens: timestamp-partition/time_val=2019-06-11 00:00:00.

    Note: Partition paths with '=' or '/' are not yet supported.

  • Full support for complex map types in parquet data.
  • Added support for complex types of map<string, array<string>>.
  • Added a new builtin function, current_date, which is like current_timestamp but just returns the date portion.
  • Enabled selecting current_date and current_timestamp as columns, e.g. select current_timestamp vs select current_timestamp().
  • Upgraded kube-prometheus to 0.1.0 (latest at time of publishing).
  • Added support for timestamps outside of typical date ranges. While we don't expect a lot of user data from the dark ages, sentinel values in those ranges, as well as year 0, are valid. They will be passed through without transformation so that the data values can be read.
  • Added better support for Hue when some fields are null.
  • REPORTING_TIME_RANGE can now be set directly in
  • Reduced number of retries and yield time for HDFS connection attempts.
  • Okera now escapes partition columns to support keywords as partition column names.
  • Fixed a bug where data registration crawlers were treating hidden files as possible dataset files.
  • Fixed a bad error message in the UI when a database was not found on the permissions page. The error is clearer now.
  • Fixed a security bug where the REST server was returning a wildcard hostname in its CORS headers. This has been fixed by removing CORS headers from the REST server entirely.
  • Fixed a bug where if a view had any constant-time expressions such as decode we would not do any access checks.
  • Fixed a bug where in some cases, select count(*) did not work if a user only had column-level access.
  • The table format check is now skipped for views.
  • Fixed the storage descriptor path for Databricks, based on the Spark provider.
  • Fixed column access check for count(*) on views.
  • Fixed an issue with spark and presto clients where select * queries returned incorrect results for users with partial access to views.
  • If defaultdb property is not provided, JDBC connections will now use as default db for connecting.
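The current_date and current_timestamp additions described above can be selected as bare columns; a short illustration:

```sql
-- Both forms are accepted: with or without parentheses.
SELECT current_date, current_timestamp;
SELECT current_timestamp();
```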

Notable Changes

  • Use of is deprecated. Use the configuration YAML file and the okctl CLI tool instead. See Configuration.


New features


Tags can now be assigned to datasets or columns to mark the type of data they contain. For example, a ‘Sensitive’ tag can be created and assigned to any columns containing sensitive data. The Datasets page can be filtered by these tags to view only datasets or columns with certain attributes. Complex-type columns can be tagged, but not nested elements within a complex type.

Tags may only be created and assigned by users in admin roles and will be visible to all users. Admin users may also give other roles the ability to assign tags in the Workspace page.

Any user may still create Private Tags for their own use.


In order to reduce the manual work of tagging, an autotagger can be configured to detect when a column is likely to contain a certain type of formatted data, such as a phone number or Social Security number, and apply the relevant tag to that column. This occurs when a new dataset is discovered on the Data Registration page.

Attribute-Based Access Grants (ABAC)

Admin users can now grant access to tables based on tags. For example, an admin can grant users access to all data tagged as Sales inside a particular database. This allows access grants to be based on data attributes instead of only on technical metadata (e.g., database name or dataset name).

Note: ABAC grants are currently only fully supported on tables and not views. ABAC grants on views will only be enforced when tags are on the view level, but not on the column level. For ABAC grants on tables, both table-level and column-level grants are fully supported. Full support for views is coming soon. All existing RBAC grants remain unaffected and you can still create RBAC grants. ABAC and RBAC grants are additive, which means if either grant gives the user access, the user will be able to see that table.

See Attribute-Based Access Control (ABAC) for more details as well as the ABAC FAQs.

JDBC Support

Added a JDBC endpoint and native Presto support. A new cluster type, STANDALONE_JDBC_CLUSTER, is now available. Specifying STANDALONE_JDBC_CLUSTER will bring up a cluster that includes Presto and exposes a JDBC endpoint for use with Tableau and other JDBC-enabled analytics clients.

JSON File Format

  • JSON file formats are now supported by ODAS.
  • All data types supported for Avro and Parquet are supported, with the exception of maps; maps can already be represented as a valid JSON structure.
  • JSON tables can be created via auto-inference or STORED AS JSON syntax.
  • See JSON File Format for more details.
  • JSON files are now supported in the data registration wizard.


  • DATE type is now supported.
  • See Data Types for more details.

AWS CloudTrail Integration

  • Okera can consume AWS CloudTrail API event logs to more accurately determine when it is appropriate to perform maintenance operations. For example, the automatic discovery of new datasets and dataset partitions can occur faster and more efficiently when Okera receives direct notifications from AWS regarding S3 write operations. Without CloudTrail event consumption, Okera will fall back onto a polling model for detection of dataset changes. Refer to the Quick Start Guide: AWS CloudTrail Integration document for details.

Performance Improvements

  • These improvements include specific optimizations for partition metadata handling, to improve performance when scanning data with partition filters.
  • Introduced a new compression method (zstd) for efficient transfer between the ODAS cluster and clients like Spark and Hive. The default compression is now zstd.
  • Introduced Okera SQL Extensions for our Spark client.
  • This is an extension capability provided by Spark that lets us augment the Spark plan to pass additional information to ODAS.
  • At this point it is primarily used for two optimizations:
    • Pushing down functions supported by ODAS, such as CAST/UPPER/LOWER/UNIX_TIMESTAMP.
    • A metadata-only optimization for queries that aggregate on just partition columns, inspired by Spark's own version of this optimization.

Other Improvements

  • AWS Athena can be registered and used as a JDBC data source. See Athena.
  • New CREATE_AS_OWNER privilege that grants ability to create a database and automatically receive ALL privileges on that database.

    Note: CREATE_AS_OWNER does not cascade to all tables. You will not be able to create tables inside databases you have not created with this privilege.

  • Cluster name may be customized and will display in the navigation bar.
  • Crawlers may now be deleted on the Data Registration page.
  • Crawlers can now discover JSON data types on the Data Registration page.
  • The Permission page now displays the full list of permissions for the column, dataset, database, and server scopes affecting a given database. For example, if there is a group that only has access to the selected database, then that group will appear in the full list.
  • The Permission page indicates any Attribute Based Access Control expressions granting a group's level of access.
  • Improved error messaging throughout the Okera Web UI, specifically in the Workspace page and Dataset Preview.
  • Decimal types stored as i32 and i64 are supported in recent versions of Parquet, in addition to fixed_length_byte_array. Starting with version 1.4.0, ODAS handles these additional i32 and i64 decimal storage formats.
  • ODAS can share existing HMS instances that contain ORC tables created by Hive; previously, metadata load would fail in such cases. With version 1.4.0 of ODAS, metadata load is supported for the ORC file format. Note that scans will still fail for ORC files with an 'ORC files are not currently supported.' error.
  • Extended support for MAP complex types for PARQUET file formats. It is now possible to use MAP<STRING, STRUCT> and MAP<STRING, ARRAY>. This is still not available for AVRO types.

Incompatible Changes

  • The default resolution of Parquet schemas has changed to be by name. To be explicit, the default_parquet_resolve_by_name flag now defaults to true. Prior to 1.4, the default was by ordinal (position).
  • The way access is controlled for the Workspace and Reports features in the UI has changed. Current users may need their access updated as a result:
  • Where before a user needed ALL or SELECT access on any dataset in the Okera catalog to access Workspace, that user now needs SELECT access on okera_system.ui_workspace. See Access to the Workspace for more info.
  • Where before a user needed ALL or SELECT access on okera_system.reporting_audit_logs to access Insights (Reports), that user now needs SELECT access on okera_system.ui_reports. See Access to the Insights Page for more info.

Known Issues

  • The following Okera configurations cannot be set directly in and must instead be listed in the SERVICE_ENVIRONMENT_CONFIGS environment variable in

Example for this case:

  • ABAC grants on views will only be enforced when tags are at the view level, not at the column level; i.e., if you assign tags to columns of a view and then create a grant on that view for only those columns, the grant will not be enforced. If, however, the tag is at the view level, the grant will be enforced. For ABAC grants on tables, both table-level and column-level grants are fully supported.

1.3.4 (March 2019)

This release contains the following changes:

  • Enhancement to ALTER TABLE statement to allow partition location change.
  • Support for scanning alternate partition location outside the table base path.
  • Adjust health-check frequency to accommodate longer cluster start times.
  • Optimized concurrent loading of metadata in workers, to prevent overloading the catalog with calls.
  • Reduced log noise from in-memory cache management and from repeated custom UDF error log entries.
  • Sped up UI preview for large tables with many partitions to avoid timeouts. The preview shows results from the last partition.
  • Control Docker log size in containers with log-size restrictions and a log rotation policy.
  • Gracefully handle unsupported complex type fields in text format data.
  • Fixed a memory leak that occurs in the REST container when a query invoked via Workspace times out.
  • Fixed an env variable that controls the number of PyOkera worker processes in the REST container.
  • Increased the number of Gunicorn worker processes in the REST container from 4 to 8.
  • Support for EMR 5.20. ODAS now handles backward-compatibility-breaking changes in the Presto SPI.
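The partition-location enhancement above can be sketched with standard Hive-style syntax (table, partition, and location are illustrative):

```sql
-- Point an existing partition at data outside the table base path.
ALTER TABLE sales.txns PARTITION (year=2019, month=1)
SET LOCATION 's3://other-bucket/txns/2019-01/';
```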

Known Issues

  • Hive does not support scanning partitions where the partition name and the physical location do not match. For example, we do not support scanning via Hive if the partition is year=2010,month=2,date=29 and the partition location is s3://foo/year=2012/month=4/date=21/ or s3://a/b/.
  • Hive does not support scanning partitions located outside the table base directory. For example, if the table base directory is s3://foo/loc1/ and the partition is at s3://foo/loc2. For both cases, you may use Spark, Databricks, PyOkera, or the Workspace instead.

1.3.1 (February 2019)

  • This release contains a hot fix for the large-partitions optimization introduced in the 1.3.0 release. Due to this issue, filters on partition columns could in certain cases result in a full table scan and return incorrect results.

1.3.0 (January 2019)

1.3.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Dataset Registration

Datasets can now be registered in bulk through the Okera Web Portal. Choose an S3 path to crawl, and ODAS will inspect all files in that S3 path, finding possible datasets. Those datasets can be verified, modified, and registered one at a time or in bulk as needed. See Registering Data With Crawlers for details.


Monitoring Improvements

ODAS clusters leverage Grafana for monitoring, and this has been updated substantially in this release. Metrics are now backed by Prometheus, and the out-of-the-box monitoring dashboards have been improved.

Support for Parquet Formats

Parquet formats are now fully supported, with the exception of the Map complex type. For full details, see Complex Data Types.

Other Improvements

  • AWS EMR support has been extended.

  • PyOkera is now supported on Python 3.6 and 3.7.
    We recommend all clients update to the 1.3.0 PyOkera version.

  • Added support for DESCRIBE DATABASE <db_name>. See describe database

  • Partition column information is now included in DESCRIBE FORMATTED <view_name> command output. See describe statement

  • Base tables referenced by the view are now included in DESCRIBE FORMATTED <view_name> command output. See describe statement

  • Global UDF support.
    User-defined functions can now be created and shared across databases, and accessed without the need to fully qualify the name every time. More details here.

  • Planner UDF caching.
    Hadoop clients (for example, Hive and Spark running on EMR) load the UDFs on startup. For large catalogs (specifically, those with many databases), this can impact client startup time. In this release, the registered UDFs are cached on the planner, with a default time to live (TTL) of 30 seconds. This should significantly speed up client startup time in these cases.

  • Added support for SHOW GRANT GROUP and SHOW GRANT USER.
    These provide convenient ways to list all the grants for a group or user, in addition to SHOW GRANT ROLE.

  • It is now possible to create a table against a fully qualified path.
    Previously, tables (and partitions) had to be created over a directory. It is now possible to create a table over a single file, simply using the full path as the LOCATION. For more details, see supported sql.

  • ODAS clusters will now default to starting up with multiple planners.
    For clusters (larger than 1 node), the default number of planners will be greater than 1. This offers better availability and load balancing. This value can still be controlled as before, by specifying the --numPlanners option when creating the cluster.

  • ocadm now supports restarting a single service in a running cluster.
    Previously, it was only possible to restart the entire cluster (all services). See ocadm clusters restart help for details.

  • Grafana and monitoring improvements.
    ODAS clusters leverage grafana for monitoring and has been updated substantially in this release. The metrics are now backed by Prometheus and the out of box monitoring dashboards have been improved.

  • Idle clients now timeout after 120 seconds.
    A client is considered idle if it has no active requests for more than the configured time. Idle clients will now timeout after 120 seconds and the queries associated with that client will be cancelled. This config can be controlled by the idle_session_timeout service config.

    Note: Requests that take a long time and keep the server busy are not considered idle.

  • (Beta) ODAS workers can cache bytes from storage.
    ODAS workers now support a variant of LRU caching which automatically caches bytes from the storage system. This is only supported for the file-system data sources: S3 and HDFS. The cache is enabled by default but defaults to a small size (1GB per worker). The cache size can be configured via the worker config io_cache_size, which controls the size in bytes. Setting it to a value <= 0 disables the cache.

  • Performance enhancements for heavily partitioned tables.
    This release includes significant performance improvements for operations on partitioned tables. Automatic partition recovery, which scans for new folders added on S3, is now faster. Operations such as ALTER TABLE ADD PARTITION and ALTER TABLE RECOVER PARTITIONS are optimized by scanning S3 bucket changes more effectively and managing HMS partitions with greater parallelism. On the scan side, if the number of partitions is greater than 200, partition metadata is now loaded in the workers instead of the planner; loading metadata for heavily partitioned tables in the planner could cause query timeouts. The planner now round-robins partitions to the workers, and each worker loads metadata only for the partitions whose records it will fetch. This prevents timeouts during the planning phase and reduces overall query execution time for partitioned tables.

  • JDBC queries can now run in parallel.
    JDBC queries are now run in parallel, provided a suitable numeric field is specified via the mapred.jdbc.scan.key table property on the catalog table. See scan records in parallel
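The round-robin distribution described in the partitioned-table improvements above can be sketched as follows (illustrative only; the 200-partition threshold and assignment policy are Okera internals, not a public API):

```python
# A minimal sketch of round-robin partition assignment: the planner hands
# partitions to workers, and each worker loads metadata only for the
# partitions it will actually scan.

def assign_partitions(partitions, num_workers):
    """Round-robin partitions across workers; returns one list per worker."""
    assignments = [[] for _ in range(num_workers)]
    for i, part in enumerate(partitions):
        assignments[i % num_workers].append(part)
    return assignments

parts = [f"year=2019/month={m}" for m in range(1, 7)]
print(assign_partitions(parts, 3))
```

Each worker then fetches metadata for its own slice, so no single process has to load all partitions up front.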

Incompatible Changes

  • Package path for Java client has been renamed to com.okera.*.
    This should not impact typical use cases, as backward-compatible classes have been added; for example, there exist two copies of RecordServiceHiveInputFormat, one in the old namespace and one in the new. Clients that were developed against the Java client library will need to be updated.

  • The default LDAP port changed from 389 to 636.
    Previously, if unspecified, ODAS connected to the LDAP server on port 389 with SSL enabled. This configuration is atypical, as 389 is the standard non-SSL port. This release changes the default to the standard SSL port (636), since SSL is enabled by default. Users who explicitly specify this configuration (LDAP_PORT) are unaffected.

1.2.3 (December 2018)

1.2.3 is a point release with some fixes to critical issues.

Bug Fixes

  • Fix web UI's 'Preview Dataset' by making scans with record limits much faster for partitioned datasets, significantly reducing the likelihood of timeouts. In the event of a timeout, a more accurate error message is now shown.

  • Significantly improve the performance of the web UI's Dataset List when the total number of datasets is large (1000+).

  • The machines in the ODAS cluster will now install Java 8 if Java is not already installed. ODAS has always required Java 8, but some newer Linux distros have changed the default Java version to Java 11, which is not compatible. The version is now properly pinned to Java 8.

  • ODAS clusters will by default start up with multiple planners. This previously could be optionally specified when creating the cluster but defaulted to a single planner. As part of this change, a client now has sticky sessions meaning clients will be pinned to a planner for some duration, allowing APIs such as scan_paged to work correctly.

  • Fixed issue with idle session expiry. Previously some idle sessions were not tracked correctly and did not expire as promptly as expected.

  • Fixed client side issue scanning some complex schemas with a particular combination of nested structs.

1.2.2 (October 2018)

1.2.2 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed ocadm agent start-minion to aid in manual cluster repair.

  • Properly return an error message for queries that contain a LEFT ANTI JOIN.

  • Idle sessions now timeout by default with a timeout of 120 seconds. A session is considered idle if the client did not make any request in that time window. This config can be controlled via the planner or workers idle_session_timeout config.

  • A fix to optimize processing of datasets with large numbers of partitions.

  • Fixed the web UI's 'Preview Dataset' so that it relies on LAST PARTITION.

1.2.1 (September 2018)

1.2.1 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed a critical issue with scanning nested collections.

  • Added support for AVRO schema files specified using an HTTPS URI.

  • Fixed some error handling in PyOkera.

  • Increased the default connection limit to 512.

1.2.0 (September 2018)

1.2.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Data Usage and Reporting

Using the Okera Portal, users can now understand how the datasets in the system are being used. This can be useful for system administrators and data owners to understand which datasets are being used most often, by whom and with which applications. The reporting insights are built on the audits and automatically capture system activity. For more details, see here.

Support for data sources using JDBC

Okera Data Access Platform (ODAS) now supports data sources connected via JDBC, typically relational databases. These datasets can now be registered in the Okera Catalog and then read and managed like any other Okera dataset. For more details on how to register and configure these sources, see here.

Improved Access-Level Granularity

ODAS now supports richer access levels, in addition to SELECT and ALL. It is now possible, for example, to grant users only the ability to find and view metadata, or only to alter dataset properties. We've also added the concept of a public role, which can simplify permission management. For details and best practices, see here.

Access Control Built-Ins

ODAS now supports a family of access control builtins. These are intended to be used in view definitions and can dramatically simplify implementing fine grained access control. See this document for more details.

Improvements to LAST SQL Clause

ODAS supports the LAST PARTITION clause to facilitate sampling large datasets. In this release, this was extended to LAST N PARTITIONS and LAST N FILES. In addition, it is now possible to set this as metadata on the catalog object, to prevent queries from trying to read all partitions. See here for best practices.
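As a rough client-side illustration of the sampling idea (not Okera's server-side implementation), selecting the last N partitions of a dataset might look like:

```python
# Illustrative approximation of LAST N PARTITIONS: read only the trailing
# N partitions of a partitioned dataset rather than scanning everything.
# This sketch assumes partition names sort lexicographically by recency.

def last_n_partitions(partition_names, n):
    """Return the last n partitions in sorted (lexicographic) order."""
    return sorted(partition_names)[-n:]

parts = ["day=2018-09-01", "day=2018-09-02", "day=2018-09-03"]
print(last_n_partitions(parts, 2))  # the two most recent partitions
```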

Workspace Improvements

Workspace can now run multiple queries at once and supports monospace format outputs and datetime queries.

PyOkera Improvements


  • PyCerebro has been renamed PyOkera. The API is effectively unchanged except that now instead of importing 'from cerebro' you will need to import 'from okera'.

  • Parallel Execution of Tasks. PyOkera will now schedule and execute worker tasks in parallel to minimize network latency. The scan_as_pandas() and scan_as_json() API calls will by default spawn worker processes to concurrently execute tasks where possible. The default number of local worker processes is twice the number of CPU cores on the machine on which PyOkera runs. This reduces query run times by minimizing the latency of establishing network connections to the Okera worker nodes.
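The default worker-process count described above (twice the number of CPU cores) can be sketched as follows; the function name is illustrative, not part of the PyOkera API:

```python
# Sketch of PyOkera's default local worker-process count: 2x CPU cores.
# Illustrative only; PyOkera computes this internally.

import os

def default_worker_count(cores=None):
    """Return 2x the given core count, defaulting to this machine's cores."""
    cores = cores if cores is not None else (os.cpu_count() or 1)
    return 2 * cores

print(default_worker_count(4))  # → 8
```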

Performance Improvements

  • Improved planner task generation. One of the responsibilities of the planner is to break up the files that need to be read into tasks. In this release, we've implemented a new cost-based algorithm which should result in tasks that are more even. This should lead to less execution skew across tasks and overall reduction in job completion times.

  • Improved planning time for queries on tables with a large number of partitions. The planner now loads metadata more lazily, deferring as much as possible until after partition pruning. This can result in significantly better latency for queries that scan only a few partitions of a heavily partitioned table.

  • Improved expression handling in the planner. The planner will fold constant expression trees and reorder expressions. Queries that push complex expressions to ODAS should see improvements.

  • Dramatically improved planner and worker RPC handling. Server side RPC handling is much more robust to slow clients or if there are transient slowdowns in dependent authentication services.

  • The worker fetch RPC now keeps client connections alive, eliminating the need to set high client RPC timeouts. Users previously worked around this by setting a very high value for recordservice.worker.rpc.timeoutMs.

  • Support for caching authenticated JWT tokens and increasing the timeout.

Other Improvements

  • Added support for ALTER DATABASE <db_name> SET DBPROPERTIES

  • Added support for ALTER TABLE <name> RENAME TO <new_name> in Hive.

  • EMR support included through EMR 5.16.0

  • Support for MySQL 5.7 as the backing database for the Okera catalog.

  • Support for the Avro bytes data type, which ODAS treats as STRING.

  • EXPLAIN is now supported as a DDL command and can be run via odb or from the web portal.

  • Support for the UNION ALL operator. UNION DISTINCT is not supported.

  • Updated Kubernetes to 1.10.4 and the Kubernetes dashboard to 1.8.3.

  • Users can now use the keyword CATALOG as an alias for SERVER to grant access to the entire catalog. For example, GRANT SHOW ON CATALOG TO ROLE common_role enables metadata browsing of the entire catalog.

  • (Beta) DeploymentManager now supports deploying ODAS clusters on externally managed Kubernetes clusters. In this case, the DeploymentManager just deploys ODAS services without managing machines or the Kubernetes.

  • The Okera portal can now store token credentials using cookies, allowing users to share credentials across web applications.

  • Added support for naming the sentryDB when creating a cluster. The name can be specified with the "--sentryDbName" flag when using ocadm.

  • Partitioned tables can now have their partitions automatically recovered.

  • Diagnostic Bundler now captures network route and firewall details

  • Support for IF NOT EXISTS in CREATE ROLE and SHOW ROLES LIKE SQL statements.

Changes in Behavior

  • View usage for Python now uses the pyokera client instead of the REST API.

  • Okera views now inherit stats from the base tables and views the view is created from. These can be overwritten using ALTER TABLE but will provide better behavior for the vast majority of use cases.

  • Okera portal no longer estimates the total number of datasets as this can cause performance issues with very large catalogs.

  • Workspace will no longer render more than 500 rows per query.

  • Workspace terminal will drop older queries if the terminal output exceeds more than 750 rows of queries in total. This will improve workspace rendering speeds.

  • The post_scan request now uses the PyOkera client.

Bug Fixes

  • Improved error handling for invalid dataset metadata, for example, an Avro schema path that is no longer valid.

  • When using field name schema resolution for parquet files, the field comparison is now case insensitive. This matches the behavior from the Apache Parquet Java implementation (parquet-mr).

  • The double data type now returns as many digits as possible when scanned via the REST API. Previously, values in some ranges were rounded or returned in scientific notation.

  • Allow revoking URIs to non-existent (typically deleted) S3 buckets. This would previously error out.

  • Fixed issues with creating SELECT * views in some cases. Previously, this could fail when layers of views were used.

  • The table property skip.header.line.count is now properly respected.
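As an illustration of what the skip.header.line.count property means (Okera applies this server-side; the helper below is a hypothetical sketch):

```python
# Sketch of honoring skip.header.line.count for a text-format table: the
# first N lines of each data file are treated as headers and skipped.
# Illustrative only, not Okera's implementation.

import io

def read_data_rows(fileobj, skip_header_line_count=0):
    """Yield data lines, skipping the configured number of header lines."""
    for i, line in enumerate(fileobj):
        if i < skip_header_line_count:
            continue
        yield line.rstrip("\n")

csv_file = io.StringIO("id,name\n1,alice\n2,bob\n")
print(list(read_data_rows(csv_file, skip_header_line_count=1)))  # → ['1,alice', '2,bob']
```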

Incompatible and Breaking Changes

  • Deprecated the CEREBRO_INSTALL_DIR and DEPLOYMENT_MANAGER_INSTALL_DIR environment variables. OKERA_INSTALL_DIR should be used.

Known Issues

  • Partitioned tables where the user only has SHOW access will mistakenly show the user as having access to the partition columns in the UI. The UI properly shows that the other columns are inaccessible.

  • While multiple ODAS clusters can be configured to share their HMS and Sentry databases, datasets created by a 1.2.0 ODAS cluster cannot be read by ODAS clusters running earlier versions (1.1.x or earlier).

1.1.0 (June 2018)

The 1.1.0 release introduces two major items but no significant alterations to existing features or functionality. It includes all of the fixes from 1.0.1.

Support for Array and Map Collection Data Types

This completes the complex types support started in 0.9.0, when the struct type was introduced. This release adds support for Arrays and Maps. As with struct support, only data stored in Avro or Parquet is supported. See Complex Data Types for more details.

Migration From Company Rename

In 1.0.0, we renamed the product but maintained backwards compatibility. For example, system paths that contained the product or company name were not changed. In 1.1.0, we have completed the product renaming, and users upgrading will need to migrate.

Incompatible and Breaking Changes

  • AWS AutoScalingGroup launch scripts (the parameter passed to the --clusterLaunchScript flag when creating an ODAS environment) should accept the --numHosts parameter. This is a breaking change from 1.0.0 when the flag was introduced as --hosts.

  • Okera Portal (UI) Workspace page no longer accepts queries in the URL. Bookmarks and links from previous versions that included a query will still go to the page, but query arguments will not populate the input box.

  • Pyokera (fka Pycerebro) no longer returns a byte array to represent strings from calls to scan_as_json. Instead, it returns a UTF-8 encoded Python string, which is automatically serializable to JSON.
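This matters because Python byte arrays are not JSON-serializable, while UTF-8 strings are:

```python
# Why the change helps: json.dumps accepts str values but rejects bytes,
# so scan_as_json results are now directly serializable. Values below are
# illustrative.

import json

print(json.dumps({"name": "alice"}))        # str values serialize cleanly

try:
    json.dumps({"name": b"alice"})          # the old byte-array style fails
except TypeError:
    print("bytes are not JSON serializable")
```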

Known Issues

  • Support for AWS AutoScalingGroups (ASGs) is in beta and not recommended for production use. There may be issues scaling an ASG cluster down, depending on which EC2 VM is terminated.

1.0.1 (May 2018)

1.0.1 is a patch release that fixes some critical issues in 1.0.0. We recommend all 1.0.0 users switch to this version for both the server and Java client libraries. The Java client library for this release is also 1.0.1.

Bug Fixes


  • Fixed an issue health-checking planners and workers with Kerberos enabled. The health checks were failing continuously, causing cluster stability issues.

  • Fixed the diagnostic bundler when some of the collected log files are very large or corrupt. In some cases, log files in /var/log on the host OS could be corrupt or very large; this used to cause the bundle to fail. Such files are now skipped, and the issue is logged.

  • Fixed an issue when dropping favorited datasets in the UI. In some cases, favorited datasets that no longer exist or are no longer accessible were displayed incorrectly.

  • Enabled the small row-group optimization when reading Parquet. Parquet files with small row groups had very poor performance in some cases, particularly if the table had many columns. The implementation for reading these kinds of files has been changed to handle this case better.

  • Fixed some queries when scanning partitioned tables from Hive with filters. Some queries using filters on partitioned tables would result in the client library generating an invalid planner request. This issue was specific to some very particular Hive queries but has been resolved in this version.

1.0.0 (May 2018)

The 1.0.0 release is a major version, introducing significant new functionality and improvements across the platform.

Name Change Notice

With the 1.0.0 generally available (GA) release, our company and product name changes as well. Cerebro Data, Inc is now Okera, Inc, and the Cerebro Data Access Platform is now the Okera Active Data Access Platform. Component names have mostly been updated, and documentation should reflect the current state of all component names. In this release, we have maintained backwards compatibility and existing automation will continue to work. The binary paths continue to use the Cerebro name.

Deprecations Notice

  • Environment variables that begin with CEREBRO_ will continue to be supported only until version 1.2.0 is released. For standard environment variables established during installation, CEREBRO will alias OKERA.

Upgrading From Prior Versions

You cannot upgrade an existing ODAS cluster to 1.0.0 (i.e. ocadm clusters upgrade will not work). Instead, a new ODAS cluster must be created.

Note: This is only the ODAS cluster itself. The catalog metadata from older clusters can be read with no issues.

Diagnosability and Monitoring

  • Support for collecting logs across all services and machines for diagnostics. This bundle can be sent to Okera or used directly to improve the troubleshooting experience. Cluster administrators can use the CLI to generate this support bundle across cluster services with one command.

  • Support for the Kubernetes dashboard and Grafana for UI-based administration and monitoring. ODAS now installs and enables the Kubernetes dashboard, which helps cluster admins manage an ODAS cluster (such as restarting a service or inspecting configs), and Grafana, for viewing metrics across the cluster (CPU usage, memory usage, etc.). These can now be optionally enabled at cluster creation time.

Robustness and Stability

  • Resolved issue where occasionally multiple workers can be assigned to the same VM, causing load skew and cluster membership issues.

  • Improved healthchecking to be able to detect service availability, triggering repair of individual containers more reliably. The healthcheck now more closely emulates what a client would do. If the healthcheck detects an issue, the individual container is restarted.

  • Beta support for Amazon Web Services Auto Scaling Groups (ASG). Users can now provide an ASG launch script instead of the instance launch script and the ODAS cluster will be built using the ASG. This should improve cluster launch times at scale, make it easier to manage the VMs (a single ODAS cluster is one ASG) as well as handle node failures. When the ASG relaunches a machine for the failed nodes, the Deployment Manager will automatically detect this, remove the failed nodes and add the new ones.

Documentation Improvements


In this release, the product documentation has been significantly improved. Search is now supported throughout the documentation. Content has been added to cover FAQs, product overview and many other topics.

UI Improvements

  • Workspace is now enabled for admin users. The workspace provides an easy interface for data producers and data stewards to issue DDL and metadata queries against the system, without the need to use another tool. Workspace provides basic query capabilities but is not intended to be used for analytics. Features include: scan queries, DDL queries, and query history.

  • The Permissions page has been added to help users understand how permissions are configured. For data stewards, it provides the ability to look up which users have been granted access and how. For data consumers, it shows the information needed to acquire access to more datasets.

  • Users can now tag and favorite datasets to be able to easily work with them again in the future. These tags and favorites are currently unique per user.

  • Performance and scale improvements. The UI scales much better with larger catalogs.

Performance and Scalability

  • Significant performance improvements for metadata operations on large catalogs. Numerous improvements were made to the responsiveness of metadata operations (DDL) against large catalogs and tables with a large number of partitions. This includes operations such as SHOW PARTITIONS and SHOW TABLES, as well as the planning phase at the start of each scan request.

  • Improvements to task generation in the planner. The planner is responsible for generating tasks run by the workers. The tasks generated by the planner should now be more even across more cases - for example, skew in partition sizes, data file sizes, etc.

  • Improvements to worker scalability with high concurrency. The workers are now more efficient under high concurrency (200+ clients per worker).

  • Better batch size handling in workers. Workers produce results in batches which are returned to the client. Improvements were made to manage the batch size and memory usage better across a wider range of schemas.

  • The default replication factor of some of the services (e.g. odas-rest-server) has been increased to improve fault tolerance and scalability.

EMR Integration Improvements

  • Spark: SparkSQL can be run directly against tables in the catalog with no need to create a temporary view. While this worked previously, the integration is improved and now performs as well as the temporary-view usage pattern. For example, SparkSQL users can simply run spark.sql('SELECT * FROM okera_sample.users').

  • Hive: Improvements to handling of partitioned tables to improve planning performance. Queries that scan a few partitions in a dataset with many partitions should see significant improvements.

  • Deprecating single-tenant install support. While it is still supported in this release, the single-tenant EMR integration is being deprecated. Instead it is recommended to use the multi-tenant install but only bootstrap a single (typically hadoop) user. Improvements were made to the user bootstrapping and setup scripts. See the [emr docs][emr-integration].

  • EMR support extended to include 5.1 and 5.2.

Miscellaneous Improvements

  • Added support for BINARY and REAL data types. See Data Types for more information.

  • Support for registering bucketed tables and views using lateral views.

    Note: Since ODAS does not support these, you must have full access to these tables and views and direct access to these files in the file system.

  • Extended ODAS SQL to add DROP_TABLE_OR_VIEW. See supported-sql for more details.

  • Audit logs now include the Okera client version as part of the application field. For example, instead of presto, new clients will identity as presto (1.0.0).

    Note: This is the version of the Okera client, not the version of Presto.

Incompatible and Breaking Changes

REST server default timeout increased from 30 to 60 seconds

This is mostly to accommodate long DDL commands, such as alter table recover partitions.

New log filename format

The log files that are uploaded to S3 have a new naming scheme. The old naming scheme was <service>-<pid>-<id>-<timestamp>-<guid>.log. The new naming scheme is: <timestamp>-<service>-<ip>-<id>-<guid>.log.

This makes it more efficient to find logs from a given time window.
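Because the timestamp now leads the filename, a plain lexicographic sort (or an S3 key-prefix listing) orders logs by time. The filenames below are illustrative, including the timestamp format:

```python
# Old scheme: <service>-<pid>-<id>-<timestamp>-<guid>.log
old_name = "worker-1234-0-20180901T120000-abcd.log"

# New scheme: <timestamp>-<service>-<ip>-<id>-<guid>.log
new_names = [
    "20180901T120000-worker-10.0.0.5-0-abcd.log",
    "20180901T110000-planner-10.0.0.4-0-ef01.log",
]

# Lexicographic order is now chronological order:
print(sorted(new_names))
```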

ODAS REST Service returns decimal data as string, instead of double

Decimal data is converted to string when accessed via the REST APIs. The JSON result set now returns decimal values as strings to prevent precision loss. Clients can control rounding behavior in the application and cast the type as needed.
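On the client side, the string representation can be parsed with Python's decimal module to preserve full precision (the payload below is illustrative):

```python
# Because the REST API now returns decimals as strings, clients can parse
# them with decimal.Decimal to avoid float precision loss, then round or
# cast as the application requires.

import json
from decimal import Decimal

payload = '{"amount": "1234567890.123456789"}'
row = json.loads(payload)
amount = Decimal(row["amount"])   # exact; no float rounding
print(amount)                     # → 1234567890.123456789
```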

Deployment Manager now requires Java 8

Known Issues

Using AWS autoscaling groups

If an ODAS cluster is created using a clusterLaunchScript, it will instantiate an autoscaling group of the specified size in AWS. Scaling an ODAS cluster that is running on an ASG is not supported in 1.0.0. Specifically, scaling down is known to have issues. This will be remedied in the next release.