
Okera Version 2 Release Notes

This topic provides Release Notes for all 2.x versions of Okera.

2.8.4 (1/12/2022)

Bug Fixes and Improvements

  • Fixed an issue introduced in 2.8.3 that caused some policy permissions to fail.

  • Fixed an issue that occurred when using toPandas() in Databricks.

2.8.3 (1/10/2022)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

DDL Improvements

Okera now supports dropping multiple partitions in a single DDL statement. For example:

ALTER TABLE page_view DROP 
PARTITION (dt='2008-08-08', country='us')
PARTITION (dt='2008-08-09', country='us');

See Support for Adding or Dropping Multiple Partitions in a Single ALTER TABLE Statement.

In addition, Okera now supports dropping partitions by specifying only part of the partition specification. For example, if a table is partitioned on year=INT/month=INT/day=INT, you can specify only some of the columns and all matching partitions will be dropped. The following statement drops all partitions with year set to 2020:

ALTER TABLE my_table DROP PARTITION(year=2020);

See Support for Dropping Partial Partitions in an ALTER TABLE Statement.

Bug Fixes and Improvements

  • Fixed an issue when providing an invalid BUCKET_TO_ROLE_MAP_FILE configuration.
  • Arrays with duplicate names are now unnested successfully.
  • Okera no longer attempts to autotag external views, as this is an undefined operation.
  • Improved performance for queries that require many metadata calls.
  • The database Permissions tab is no longer cut off when a safeguard policy message displays.

2.8.2 (12/20/2021)

Bug Fixes and Improvements

  • This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.
  • Several enhancements were made in this release to support Databricks file access control. Specifically, support was added for configuring the signing key used by the integration, which can be configured using Databricks Secrets. See Enable Okera File Access Control for more information. Errors will occur if a signing key is not found and no static secret is provided.
  • With this release, Okera now supports Databricks versions 8.3, 8.4, 9.0 and 9.1. See Supported Versions.

  • Fixed several issues related to connecting to Databricks via JDBC/ODBC.

  • Fixed an issue with adding the full Spark query to the Okera audit log when issuing queries on Databricks. This capability is enabled by default. For more information, see Enable Spark Query Logging for Databricks.

Notable and Incompatible Changes

  • In past versions, Okera's operational logs did not use a partitioning scheme when uploading. This made it hard to locate the logs you needed and, in some environments, increased the time to list the log files. With this release, a new configuration option, WATCHER_LOG_PARTITIONED_UPLOADS, has been added to the configuration file to enable partitioned log uploads (see the example after this list). Valid values are true and false. When enabled (true), operational log files use the ymd=YMD/h=H/component=C partitioning scheme for uploads. By default, this setting is disabled (false) so older clusters are not affected. However, in a future version of Okera, it will be enabled by default, so Okera recommends adopting it in new deployments.

  • For Databricks version 8.0 and up, Okera now only supports the native Databricks integration (where Okera is not on the data path).
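
For example, a minimal configuration sketch enabling the new option (the resulting upload path shown below is illustrative, not taken from these notes):

WATCHER_LOG_PARTITIONED_UPLOADS: true

With this enabled, an operational log file for a given component uploads under a path such as .../ymd=20211220/h=14/component=planner/<file>.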

2.8.1 (12/13/2021)

This release contains an upgrade of log4j to resolve the Log4Shell vulnerability.

2.8.0 (12/10/2021)

File Access Control

In past versions, Okera supported authorization for data access via SQL. However, big data environments and cloud object storage environments allow users to access files directly, which introduces the need to control who should be able to perform operations on files and what operations should be allowed. This release of Okera introduces the ability to perform file access control.

Okera's first implementation of file access control is for S3 in a Databricks environment. With this feature, Okera provides an authorization layer that intercepts Databricks data requests to S3 to determine whether users have access to the file and data in the request. If they do, the request is passed to S3 for processing. If they do not have access to the file, the request is rejected and returned.

This feature is implemented automatically, except for two environment variables that must be set in Databricks: OKERA_ENABLE_OKERA_FS and OKERA_FS_REQUIRE_SIGNED_PATHS. For information on how to set these environment variables, see Enable Okera File Access Control.
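
For example, a sketch of the Databricks cluster's environment variable configuration (the values shown are assumptions for illustration; see Enable Okera File Access Control for the supported values):

OKERA_ENABLE_OKERA_FS=true
OKERA_FS_REQUIRE_SIGNED_PATHS=true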

Timebound Permissions

You can now add start and end dates and times for permissions. The permission definitions are only enforced during the specified date and time range. You can also specify only a start date and time or only an end date and time. If you specify both, the end date and time must be later than the start date and time. See Set Time-Based Conditions.

Enable and Disable Permissions

You can now enable and disable permissions for a role. By default, a permission is enabled when it is created. Disabled permissions remain assigned to the role but are not enforced. A new toggle has been added to the Permissions dialog that you can use to enable and disable the permission. In addition, a new Enabled column has been added to Permissions lists.

See Disable Permissions and Enable Permissions.

Databricks 8 Support

This release introduces support for Databricks 8 through 8.3. It mostly provides the same scope of functionality as Okera's support for Databricks 7. However, in Databricks 8 integrations that use Spark 3 or later, client-side compression is not currently supported. See Supported Versions.

Improved OAuth Authentication Using a JSON Web Key Set (JWKS) Endpoint

You can now configure Okera with a JWKS endpoint that will be used to dynamically fetch the appropriate public key needed for OAuth authentication from the JWKS content supplied by OAuth services.

We recommend that all OAuth users configure this endpoint to improve OAuth authentication in Okera. This is an improvement over past releases in which you had to manually configure Okera with the appropriate public key from OAuth services that did not provide it in an easily consumable format.

Use the JWT_JWKS_URL configuration setting to supply the URL of your OAuth identity provider (for example, Okta, Auth0, or AzureAD). For more information, see OAuth Authentication.
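
For example, a minimal configuration sketch (the Okta hostname is illustrative):

JWT_JWKS_URL: https://mycompany.okta.com/oauth2/default/v1/keys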

Pushdown Processing for Optimized Row Column (ORC) Table Query Predicates

Numeric data type query predicates for optimized row column (ORC) tables are now pushed down to ORC libraries for processing.
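
For example, given a query such as the following (the table and columns are illustrative), the numeric predicate on amount is now evaluated inside the ORC libraries rather than after the rows are materialized:

-- amount > 100 is pushed down to the ORC reader
SELECT id, amount
FROM sales_orc
WHERE amount > 100;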

Novice User Experience

Tooltips have been added to the Okera Portal (the UI) in this release. The first time you access a page, the tooltips display automatically. Thereafter, they do not display automatically but are available by selecting the tooltip icon.

Contextual User Inactivity Analysis

The User Inactivity Report has been retitled User Inactivity Analysis and is now located as a fourth tab on the Databases page and on the dataset details page. It is no longer available as the second tab on the Users page. See User Inactivity Analysis.

Permission List Consistency Enhancements

The permission lists in the Okera Portal (UI) on the Roles page and on the Data page now look and work consistently. See View and Manage Permissions in the UI.

Schema Edit Usability Improvement

This release improves the user experience while editing a dataset schema. You are no longer required to select the checkmark icon after making an edit. Instead, just clicking anywhere else on the screen will save your changes. Selecting the checkmark icon continues to save your changes as well.

Okera Portal Table Consistency Enhancements

This release improves the usability and consistency of table behavior in the Okera Portal (the UI). Specifically:

  • The user experience when creating a new table object (such as when creating a new database, a new role, or a new connection) has been made consistent across the UI.
  • The headers are formatted consistently for all tables in the UI.
  • The search (filter) bar and table header remain fixed in a consistent location as you scroll through tables.
  • The look and feel of all tables in the UI are consistent.
  • You can now filter the Roles page table by multiple groups; previously only a single group could be used. In addition, the group and user filter boxes on the Roles page now include dropdown lists from which you can select groups and users for the filter.

Bug Fixes

  • Fixed several issues when rewriting queries (e.g., for Snowflake) related to casing and escaping.
  • Fixed an issue when connecting to an external HMS using Thrift where the Databases page would be blank.
  • Fixed an issue that prevented revoking permissions on tables that no longer existed.
  • Fixed an issue when running SHOW CREATE TABLE when a column name was a reserved keyword.
  • Fixed an issue when creating a crawler over a Hadoop Distributed File System (HDFS) path.
  • Fixed an issue when creating connections where the username or password paths were prefixed with white space.

2.7.8 (1/10/2022)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

Bug Fixes and Improvements

  • Arrays with duplicate names are now unnested successfully.

2.7.7 (12/20/2021)

Bug Fixes and Improvements

  • This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

Notable and Incompatible Changes

  • In past versions, Okera's operational logs did not use a partitioning scheme when uploading. This made it hard to locate the logs you needed and, in some environments, increased the time to list the log files. With this release, a new configuration option, WATCHER_LOG_PARTITIONED_UPLOADS, has been added to the configuration file to enable partitioned log uploads. Valid values are true and false. When enabled (true), operational log files use the ymd=YMD/h=H/component=C partitioning scheme for uploads. By default, this setting is disabled (false) so older clusters are not affected. However, in a future version of Okera, it will be enabled by default, so Okera recommends adopting it in new deployments.

2.7.6 (12/13/2021)

This release contains an upgrade of log4j to resolve the Log4Shell vulnerability.

2.7.5 (11/18/2021)

Bug Fixes and Improvements

  • Made corrections to Parquet file optimization.

2.7.4 (11/01/2021)

Bug Fixes and Improvements

  • Added support for scanning Avro data files with enum data types that contain default values. In previous versions, only enum data types with no default values were supported.
  • Parquet file scan speed improvements now also work for files with statistics stored in deprecated fields.

2.7.3 (10/19/2021)

Bug Fixes and Improvements

  • Resolved a crash that occurred when scanning ORC data.
  • This release improves Parquet file scan speeds for queries by evaluating filters against file statistics. Currently, this only supports filters on integer-type columns.

  • Query rewrites for Snowflake now support using array functions in Okera policies.

  • This release resolves incorrect processing of the Create Temporary Table/View statement by the Spark client library. The fix specifically handles cases where the RecordServiceTable is specified with a direct table name and a LAST clause, or where the RecordServiceTable is specified using a SELECT statement.

  • A CREATE_TIME column was added to the all_tables system view.

  • Corrected the date offset conversion used for dates before UNIX epoch time. This previously could be off by one for certain time zones.

  • With this release, Databricks connections via the JDBC endpoint connect users correctly.

  • The Java client libraries can now fail fast on misconfiguration-related authentication errors. These errors have always been rejected on the server; checking for them on the client improves error diagnosis. This check is optional but is enabled by default for Databricks. To enable it elsewhere, set the REQUIRE_AUTHENTICATED_CONNECTIONS environment variable to true.

2.7.2 (09/28/2021)

Bug Fixes and Improvements

  • Fixed an issue that occurred while listing databases when Okera is connected to an external Hive metastore (HMS).
  • Fixed an issue in which crawlers for the Hadoop Distributed File System (HDFS) were not working.
  • Added a guard that rejects queries larger than a specified byte size (see the example after this list). Use the MAX_REQUEST_SIZE_BYTES environment variable to configure the byte size limit. The default is 52751601 bytes, approximately 52MB. Queries smaller than this limit are not guaranteed to succeed; they may still be rejected by other guards.
  • Added options to log the current application stack if a crash occurs.
  • Fixed a crash that occurred when reading Apache Avro data.
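
For example, a configuration sketch raising the request size limit to roughly 100MB (the value shown is illustrative):

MAX_REQUEST_SIZE_BYTES: 104857600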

2.7.1 (09/09/2021)

Bug Fixes and Improvements

  • Fixed an issue when creating non-Delta Parquet tables in Databricks when connected to Okera.
  • Improved the performance of authorizing very wide tables with many tags on them.
  • Fixed an issue when using Symlink tables where the manifest file contained s3a://... URIs instead of s3://... URIs.
  • Fixed an issue where new partitions were not being discovered for the audit log tables when automatic partition recovery was disabled.
  • Fixed an issue when trying to create tables in Databricks using non-S3 paths (e.g., abfs://, gs://, etc.).
  • Fixed an issue when generating a JWT that would use the server's time zone instead of UTC to calculate the expiry date.
  • Improved support for creating views in Databricks that contain SQL that cannot be analyzed by Okera.
  • Fixed an issue when autotagging tables that had STRUCT values.

Notable and Incompatible Changes

  • When privacy functions such as mask() and tokenize() are applied to a column of a complex type (e.g., STRUCT) in Databricks, those columns are now converted to null by default. To revert to the previous behavior, set the COMPLEX_TYPES_NULL_ALL_TRANSFORMS configuration setting to false.
  • Okera now leverages HTTP Strict Transport Security (HSTS) when serving pages over HTTPS, instructing the browser to never request them over HTTP for any subdomain. If Okera's domain is hosted over HTTPS, accessing web pages in subdomains of that domain via HTTP is no longer possible; host all subdomains over HTTPS instead.
  • The Okera REST API (cdas-rest-server service) now explicitly sets Cache-Control: no-store headers for all mutable API resources. This ensures better security semantics when using a browser to access Okera resources.

2.7.0 (08/20/2021)

New Features and Enhancements

Registration Improvements

Okera's Data Registration UI has been significantly improved. It now matches the look and feel of the rest of the Okera UI and adds new functionality:

  • Enhanced attribute editing, easing the review and modification of attribute assignments.
  • Edit more table metadata, including table-level attributes, the table name, and table- and column-level comments.

[Image: New Register datasets design]

Reading Data Via Assume Role

When reading data from S3, Okera now supports the ability to assume secondary roles to read data, with different roles for different buckets. For example, you can configure Okera to use role-a when reading data from s3://bucket-a and role-b when reading data from s3://bucket-b.

To configure this capability, use the BUCKET_TO_ROLE_MAP_FILE configuration setting, with the value being a path to a file (e.g., s3://path/to/mapping.json or file:///path/to/mapping.json) that has the following structure:

{
  "version": "v1",
  "buckets": {
    "bucket-a": {
      "role": "arn:aws:iam::<account>:role/role-a"
    },
    "bucket-b": {
      "role": "arn:aws:iam::<account>:role/role-b"
    }
  }
}

Audit and Operational Log Storage

Okera now supports storing audit and operational logs on ADLS Gen2 and Google Cloud Storage.

For ADLS Gen2, configure the WATCHER_AUDIT_LOG_DST_DIR and WATCHER_LOG_DST_DIR settings to a path such as abfs://okera@mycompany.dfs.core.windows.net/logs/audit/.

For Google Cloud Storage, configure the WATCHER_AUDIT_LOG_DST_DIR and WATCHER_LOG_DST_DIR settings to a path such as gs://mycompany/okera/logs/audit/.

DDL Improvements

  • Okera now supports adding table-level attributes when creating a table or view using CREATE TABLE and CREATE VIEW. For example:

CREATE TABLE users (
    id BIGINT,
    name STRING
)
ATTRIBUTE classification.sensitive;

  • Okera now supports adding multiple partitions in a single DDL statement. For example:

ALTER TABLE rs.multi_add_partition_test_table ADD
PARTITION(year='2020') LOCATION
's3://cerebrodata-test-readonly/readonlypartitiontest/year=2020/'
PARTITION(year='2021') LOCATION
's3://cerebrodata-test-readonly/readonlypartitiontest/year=2021/'
PARTITION(year='2022') LOCATION
's3://cerebrodata-test-readonly/readonlypartitiontest/year=2022/';

Databricks 8 Support

Okera now supports Databricks 8.0, 8.1 and 8.2 runtimes for the new native integration.

UI Improvements

  • Okera now supports enabling all supported authentication types (AD/LDAP, OAuth, SAML, and token-based) at the same time, including on the UI login page.
  • It is now possible to filter the list of Connections by both the name of the connection as well as the underlying type (e.g., Snowflake, AWS Redshift, etc.).
  • Okera's Policy Builder now allows editing policies that have granular row filtering permissions while retaining the structured builder UI.
  • The Okera Presto JDBC driver is available to download on the System page.

Bug Fixes and General Improvements

  • Fixed an issue where errors in the permissions table on the Data page were not shown properly.
  • Fixed an issue where permissions with the grant option were not shown correctly in the permissions table on the Data page.
  • Fixed an issue where extra whitespace could cause incorrect token parsing in Authorization headers.
  • Fixed an issue where the directory settings in the Data Registration UI were not being applied.
  • Fixed an issue with OAuth logins, where if originating from a deep link, the OAuth login would fail with an error due to an unrecognized redirect URL.
  • Fixed an issue where conflict computation for new permissions included global policies.
  • Fixed an issue where, while creating or editing a permission in Policy Builder, the access level dropdown could show options that were not actually available.
  • Fixed an issue where the crawler could create a table with an invalid column name, rendering that crawler unusable by the system.
  • Added the ability to control the Tags dropdown when editing a schema using the keyboard.
  • Fixed an issue when renaming tables using ALTER TABLE <table> RENAME TO <new table>.
  • Fixed an issue when creating a table using the Spark CSV provider in Databricks.
  • Fixed an issue when querying struct types using the new native Databricks integration.
  • Fixed an issue when running SHOW CREATE TABLE <table> on a JDBC-backed table.
  • Fixed an issue that caused the same user to be displayed (with different casing) on the Users page.
  • Fixed an issue when accessing a table in a database in cases where a table existed in the default database that had the same name as the original database.

Notable and Incompatible Changes

  • The REST server endpoint no longer hosts the /__status endpoint. Where https://mycompany.okera.com:8083/__status used to return ok, it now returns a 404. If a status endpoint is needed, use /api/health instead; a 200 response indicates the service is functioning.
  • The Data Registration UI does not support renaming tables when using AWS Glue as a catalog. If this functionality is necessary when using Glue, please set FEATURE_UI_ENABLE_LEGACY_REGISTRATION_PAGE: true in your Okera configuration, which will use the prior version of the Data Registration UI.
  • Databases created through the Okera UI can no longer start with an underscore.
  • Okera's Presto proxy will require a minimum TLS version of 1.2 by default.

2.6.8 (1/10/2022)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

Bug Fixes and Improvements

  • Arrays with duplicate names are now unnested successfully.

2.6.7 (12/20/2021)

Bug Fixes and Improvements

  • This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

Notable and Incompatible Changes

  • In past versions, Okera's operational logs did not use a partitioning scheme when uploading. This made it hard to locate the logs you needed and, in some environments, increased the time to list the log files. With this release, a new configuration option, WATCHER_LOG_PARTITIONED_UPLOADS, has been added to the configuration file to enable partitioned log uploads. Valid values are true and false. When enabled (true), operational log files use the ymd=YMD/h=H/component=C partitioning scheme for uploads. By default, this setting is disabled (false) so older clusters are not affected. However, in a future version of Okera, it will be enabled by default, so Okera recommends adopting it in new deployments.

2.6.6 (12/13/2021)

This release contains an upgrade of log4j to resolve the Log4Shell vulnerability.

Bug Fixes and Improvements

  • Fixed an issue that occurred while listing databases when Okera is connected to an external Hive metastore (HMS).
  • Fixed an issue in which crawlers for the Hadoop Distributed File System (HDFS) were not working.

2.6.5 (10/22/2021)

Bug Fixes and Improvements

  • Resolved a crash that occurred when scanning ORC data.

  • Corrected the date offset conversion used for dates before UNIX epoch time. This previously could be off by one for certain time zones.

  • This release resolves incorrect processing of the Create Temporary Table/View statement by the Spark client library. The fix specifically handles cases where the RecordServiceTable is specified with a direct table name and a LAST clause, or where the RecordServiceTable is specified using a SELECT statement.

  • Added a guard that rejects queries larger than a specified byte size. Use the MAX_REQUEST_SIZE_BYTES environment variable to configure the byte size limit. The default is 52751601 bytes, approximately 52MB. Queries smaller than this limit are not guaranteed to succeed; they may still be rejected by other guards.

  • Added support for scanning Avro data files with enum data types that contain default values. In previous versions, only enum data types with no default values were supported.

  • The Java client libraries can now fail fast on misconfiguration-related authentication errors. These errors have always been rejected on the server; checking for them on the client improves error diagnosis. This check is optional but is enabled by default for Databricks. To enable it elsewhere, set the REQUIRE_AUTHENTICATED_CONNECTIONS environment variable to true.

2.6.4 (09/10/2021)

Bug Fixes and Improvements

  • Fixed several issues when scanning nested types that could cause a crash or query failure.
  • Fixed an issue when using Symlink tables where the manifest file contained s3a://... URIs instead of s3://... URIs.
  • Fixed an issue where new partitions were not being discovered for the audit log tables when automatic partition recovery was disabled.
  • Fixed an issue when creating non-Delta Parquet tables in Databricks when connected to Okera.
  • Improved support for creating views in Databricks that contain SQL that cannot be analyzed by Okera.
  • Improved the performance of authorizing very wide tables with many tags on them.
  • Fixed an issue when trying to create tables in Databricks using non-S3 paths (e.g., abfs://, gs://, etc.).
  • Fixed an issue when generating a JWT that would use the server's time zone instead of UTC to calculate the expiry date.

Notable and Incompatible Changes

  • When privacy functions such as mask() and tokenize() are applied to a column of a complex type (e.g., STRUCT) in Databricks, those columns are now converted to null by default. To revert to the previous behavior, set the COMPLEX_TYPES_NULL_ALL_TRANSFORMS configuration setting to false.

2.6.3 (08/22/2021)

Bug Fixes and Improvements

  • Fixed an issue where some permissions appeared twice on the Permissions tab in the Datasets page.
  • Fixed an issue when creating a table using the Spark CSV provider in Databricks.
  • Fixed an issue when querying struct types using the new native Databricks integration.
  • Fixed an issue when autotagging tables that had STRUCT values.
  • Fixed an issue that could cause a crash in the workers when scanning an unnested table.

2.6.2 (08/02/2021)

Bug Fixes and General Improvements

  • Fixed an issue that could prevent startup if a bad regular expression was configured for an autotagging rule.
  • Fixed an issue when creating a crawler and using the "Dataset files are not in separate directories" option.
  • Fixed an issue when cascading tags on nested parts of complex columns to child views when multiple tags are present.
  • Fixed an issue that can occur with MySQL 5.7 as the backing database.
  • Added the ability to configure the timeout for Presidio autotagging using the OKERA_PRESIDIO_TIMEOUT_MS configuration parameter.
  • Okera's Java client libraries will now detect when running in a Domino Data Labs environment and automatically leverage the auto-generated JWT specified in DOMINO_TOKEN_FILE.
  • Improved the behavior of non-alpha characters when using tokenize() in a Databricks environment.
  • Added the ability to blacklist specific tags from Presidio autotagging matches using OKERA_PRESIDIO_TAG_BLACKLIST, e.g., OKERA_PRESIDIO_TAG_BLACKLIST: pii.address.
  • Added the ability to specify that Okera is connecting to a pre-existing HMS/Sentry to avoid any changes to those schemas - enable this by setting the CATALOG_EXISTING_HMS_SCHEMA: true configuration parameter.
  • Fixed an issue when loading tables that have type definitions that contain Okera keywords (this typically only happens with a pre-existing HMS).
  • Fixed an issue when authorizing view access for Databricks when all views are external.
  • Fixed an issue when using the legacy and deprecated input format com.uber.hoodie.hadoop.HoodieInputFormat for Hudi - please switch to using org.apache.hudi.hadoop.HoodieParquetInputFormat.
  • Improved SSL handling for Okera's planner and worker - enable this by setting:
    • RS_SSL_ENABLED: true
    • RS_ARGS: --ssl_enable=true --ssl_private_key=/path/in/pod/to/key --ssl_server_certificate=/path/in/pod/to/cert
  • Added configuration options for clients to connect to planner and worker using SSL:
    • For Hive/Spark: recordservice.planner.connection.ssl, recordservice.worker.connection.ssl
    • For Presto: okera.planner.connection.ssl, okera.worker.connection.ssl
    • Alternatively, setting the RS_SSL_ENABLED variable will auto-set these.
  • Added the ability for nScale workers to also have SSL enabled - in addition to the above parameters, you also need to enable authentication on the client using recordservice.worker.force-auth (for Hive/Spark) and okera.worker.force-auth (for Presto).
  • Fixed an issue when dropping partitions in Databricks.
  • Fixed an issue where server errors would cause an empty database list to be displayed in the Okera UI - the error will now be properly reported.
  • Improved performance of listing datasets in the Okera UI.
  • Added support for using NULL DEFINED AS when creating a table (see the sketch after this list).
  • Added the ability to disable autotagging of complex types by setting the ENABLE_COMPLEX_TYPE_AUTOTAGGING configuration value to false.
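
As a sketch of the NULL DEFINED AS clause (standard Hive-style DDL; the table name, delimiter, and null marker are illustrative, not from these notes):

-- Fields containing '\N' are read as NULL for that column.
CREATE EXTERNAL TABLE events (
    id BIGINT,
    payload STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
NULL DEFINED AS '\N'
LOCATION 's3://mybucket/events/';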

Notable and Incompatible Changes

  • This release contains an upgrade of the Alpine base image from 3.13 to 3.14, which upgrades the embedded Python from 3.8.x to 3.9.x (to address a Python CVE).

2.6.1 (05/25/2021)

Bug Fixes and General Improvements

  • Fixed an issue where the Okera UI would fail to render using Safari.
  • Fixed an issue where previewing data in the Registration page could fail.
  • Fixed an issue where paginating past the first page of crawled tables in the Registration page could fail.

2.6.0 (05/13/2021)

New Features

Safeguard Policies for Catalog-Wide Compliance (Beta)

You can now set up catalog-wide 'Safeguard' policies to ensure certain specified data is always masked or restricted to certain users, regardless of lower-level permissions granting access to it. For example, you could set up a Safeguard policy to mask all data tagged as a social security number across the entire catalog. Read more here.

[Image: Safeguard Policy Dialog masking SSN]

Row-Level Security Expression Builder and User Attribute Policy Building Improvements

A new UI experience makes it easy to create granular row filtering permissions that leverage user attributes. Users can create column and user attribute-based access conditions, or customize the experience using the custom SQL expression option.

Data stewards and admins who create these policies can now manage them in the permissions list or directly on the specific role on the Roles page.

You can read more about user attributes here and how to implement user attribute-based policies here.

[Image: Okera row filter policy result]

Easily Connect and Protect BigQuery Data

You can now create a Google BigQuery connection in Okera and create a crawler to register tables from BigQuery inside the Okera catalog. Read more here.

Additional Support for Azure and Google Cloud

Azure KeyVault Support

Okera now supports using Azure KeyVault as a source for secrets, which can be leveraged when creating connections to other data sources such as Azure Synapse, Azure SQL, Snowflake, and others. You can read more about passing sensitive credentials here.

Google Cloud Secret Manager

Okera now supports using Google Cloud Secret Manager as a source for secrets, which can be leveraged when creating connections to other data sources such as Azure Synapse, Azure SQL, Snowflake, and others. You can read more about passing sensitive credentials here.

Audit Logs in Google Cloud Storage

You can now configure Okera to store the audit logs in Google Cloud Storage. You can read more about this here.

Additional Support for PrestoSQL

Client Library Support for PrestoSQL 350

Okera has added client library support for PrestoSQL version 350.

Using PrestoSQL as the Okera Presto Engine (Beta)

Starting in 2.6.0, it is possible to utilize PrestoSQL as the Presto engine that Okera uses, instead of PrestoDB.

To do this, you can either specify it with okctl during upgrade (you can use either prestodb or prestosql as the values):

okctl upgrade latest --presto-engine=prestosql

or you can use the quay.io/okera/prestosql:2.6.0 image in your Kubernetes manifests.

Note: PrestoSQL has its own JDBC driver that you can download from s3://okera-internal-release/2.6.0/prestosql-jdbc-driver/presto-jdbc-350.jar. In a future Okera release, PrestoSQL (and eventually the Trino-based releases) will become the default engine (with an option to switch back to PrestoDB), as it provides improved performance and capability.

Added Support for Databricks 7.5+

Okera now supports Databricks 7.5 and 7.6.

Bug Fixes and General Improvements

  • Several improvements for Delta tables that were created by Databricks with limited metadata.
  • Fixed an issue in the Databricks client integration when loading tables that only had metadata in the Spark properties.
  • Fixed an issue in the Databricks client integration for partitioned tables.
  • Added support for discovering ORC-based datasets using Data Registration crawlers.
  • Fixed an issue when rewriting queries for Databricks.
  • Added pushdown support for the sets_intersect built-in string function (not applicable for Dremio).
  • Fixed an issue when dropping a partitioned table using a mixed-case name.
  • Added the ability to set the user attribute cache invalidation interval by setting the OKERA_USER_ATTRIBUTES_CACHE_THRESHOLD_MS configuration setting (default is 5 minutes).
  • Added the ability to invalidate the connection pools for JDBC-backed tables using the INVALIDATE CACHED DATACONNECTION DDL (only available for system administrators).
  • Fixed an issue when using multiple Snowflake warehouses in the same Snowflake account in the same Okera deployment.
  • Added datatype-variants for mask() for non-text types (the masked value will be the zero value for that type).
  • Improved the Okera UI cookie security by restricting to the cluster's domain name.
  • Fixed an issue when an error occurs during login - the error will now be displayed instead of an infinite redirect loop.
  • Fixed a bug where Okera was not properly escaping column names in the Policy Builder UI.
  • Improved performance on wide tables with many tags.
  • Improved Policy Builder experience.
  • Improved the experience of editing attributes on columns on the Data page.
  • Policy Builder now shows the number of groups associated with the selected role when granting from the Data page.
  • The Group filter on the Users page is now multi-select.
  • Updated style and usability for how permissions are displayed on the Data pages.
  • Improved copy behavior on table rows - copying a row will now copy CSV instead of JSON.

Notable and Incompatible Changes

  • Users with VIEW_AUDIT on any database or dataset will now automatically have access to the reports page in the UI. They will also need access to the okera_system.reporting_audit_logs view to query the audit logs and load the page correctly.
  • Removed the deprecated Datasets page. This page was deprecated and disabled by default in 2.5.
  • Downgrading from 2.6.0 to 2.5.0, or to any prior version of Okera other than 2.5.1, in the same cluster will cause a UI login failure since the cookies are not compatible.
  • If a cluster uses SSL_FQDN, accessing the UI via IP address will no longer work due to recent cookie security fixes.
  • The api/get-token endpoint now returns 401 instead of 403 when no token is provided.

2.5.10 (1/10/2022)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

2.5.9 (12/20/2021)

Bug Fixes and Improvements

  • This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

Notable and Incompatible Changes

  • In past versions, Okera's operational logs did not use a partitioning scheme when uploading. This made it hard to locate the logs you needed and, in some environments, increased the time to list the log files. With this release, a new configuration option, WATCHER_LOG_PARTITIONED_UPLOADS, has been added to the configuration file to enable partitioned log uploads. Valid values are true and false. When enabled (true), operational log files use the ymd=YMD/h=H/component=C partitioning scheme for uploads. By default, this setting is disabled (false) so older clusters are not affected. However, in a future version of Okera, it will be enabled by default, so Okera recommends adopting it in new deployments.

2.5.8 (12/13/2021)

This release contains an upgrade of log4j to resolve the Log4Shell vulnerability.

2.5.7 (09/03/2021)

Bug Fixes and Improvements

  • Fixed an issue when using Symlink tables where the manifest file contained s3a://... URIs instead of s3://... URIs.

2.5.6 (07/19/2021)

Bug Fixes and Improvements

  • Fixed an issue when using transparent query pushdown for queries that utilize CTEs within inline views.
  • Fixed an issue when using transparent query pushdown for queries that have mixed casing.
  • Added support for specifying a default database in the connection session when using transparent query pushdown.

2.5.5 (07/08/2021)

Bug Fixes and Improvements

  • Fixed an issue when dropping partitions using the Hive/Spark client library.
  • Fixed an issue that prevented some errors in the UI from being displayed on the Databases page.

Notable and Incompatible Changes

  • This release contains an upgrade of the Alpine base image from 3.13 to 3.14, which upgrades the embedded Python from 3.8.x to 3.9.x (to address a Python CVE).

2.5.4 (07/05/2021)

Bug Fixes and Improvements

  • Improved rendering performance in the Okera Workspace when rendering large tables.

Notable and Incompatible Changes

  • This release contains an upgrade of the Alpine base image from 3.12 to 3.13 (and 2.5.5 will include an upgrade to Alpine 3.14), to address a Python CVE.

2.5.3 (06/18/2021)

Bug Fixes and Improvements

  • Fixed an issue when creating a crawler and using the "Dataset files are not in separate directories" option.
  • Fixed an issue when dropping a partitioned table and using mixed casing for the database name.
  • Fixed an issue when cascading tags on nested parts of complex columns to child views when multiple tags are present.
  • Fixed an issue that can occur with MySQL 5.7 as the backing database.

2.5.2 (05/13/2021)

Bug Fixes and Improvements

  • Fixed an issue with logging into Okera when another cookie is present on the domain (e.g., when using a SAML provider).

2.5.1 (05/05/2021)

Bug Fixes and Improvements

  • Added the ability to control the partition split optimization threshold from Hive/Spark using the recordservice.planner.partition-split-threshold parameter.
  • Fixed an issue when rewriting in_set(needle, haystack) when the haystack was NULL.
  • Improved the performance of symlink tables (e.g., Delta and Hudi).
  • Fixed handling of SQL comments in Workspace.
  • Added support for BINARY datatype for ORC files.
  • Added support for IF NOT EXISTS and IF EXISTS for CREATE/DROP DATACONNECTION.
  • Fixed an issue when using MySQL 8 or higher as the backing database.
  • Fixed an issue when updating view lineage.
  • Fixed an issue when dropping some ABAC grants created in prior versions of Okera.
  • Fixed an issue where error messages would not be cleared when re-running a data registration crawler.
  • Improved performance of loading table metadata when many attributes are present.
  • Fixed an issue when initiating a SAML login from the IdP.
  • Improved handling of quoted values in in_set rewrite.
  • Improved Okera cookie security by setting it to httpOnly.
  • Fixed an issue when using multiple Snowflake warehouses in the same Snowflake account in the same Okera deployment.

Notable and Incompatible Changes

The Okera UI has changed its cookie to use httpOnly and is now managed by the server. This approach improves security as no JavaScript code will be able to access information stored in the cookie.

Due to this, downgrading from 2.5.1 to 2.5.0, or to any prior version of Okera, in the same cluster will cause a UI login failure since the cookies are not compatible. To resolve this, you can clear all storage for the site, including cookies.

Snowflake Case-Insensitive Identifiers

By default, Okera will now set QUOTED_IDENTIFIERS_IGNORE_CASE to true when communicating with Snowflake, treating all identifiers as case-insensitive. This behavior can be disabled by setting the OKERA_JDBC_CONNECTION_SNOWFLAKE_CASE_INSENSITIVE_COLUMNS configuration parameter to false.

2.5.0 (03/03/2021)

New Features

New Data Source Connections Experience

New UI experience to easily create connections to data sources. Read more on Connections.

Data registration now supports creating crawlers on any connection source (such as JDBC-backed datasources), not just object storage. Read more on Crawlers.

[Image: UI Connections page]

New Access Levels and Scopes for Data Management Delegation

Okera can now be used to grant granular permissions to delegate access to role, crawler and data connection creation and management. As part of this, the ROLE, CRAWLER and DATACONNECTION object scopes have been added to the permissions model. These new access levels and scopes are now available in the UI Policy Builder. In addition, grants on URI are now also available in the Policy Builder.

These changes impact who can see the Roles page, the Registration page, and the Connections page in the UI. See Access Delegation for more information and for the full set of available access levels.

[Image: Access levels in Policy Builder]

Note: Read Notable and Incompatible Changes for information about the impact this change can have on how you manage permissions.

Dremio Integration (Beta)

Dremio is now a supported JDBC data source type, and you can read more about configuring it here.

User Attributes Improvements

A user's attribute values are now visible to that user from the homepage, including the user attribute source that provided them (e.g., ldap).

It is also now possible to configure a script (or multiple scripts) to source user attributes, which is useful if they need to be sourced from a bespoke location such as a custom REST API or storage.

To enable this, set the USER_ATTRIBUTES_SCRIPT configuration setting to a path where the script is available (or comma-separated paths), e.g., on S3 or a local file, and okctl will ensure those scripts are available inside the running pod. Alternatively, if you are not using okctl, ensure that the value is set to a path (or comma-separated paths) that are available inside the container.
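
For example, a minimal configuration sketch (the script path is illustrative):

USER_ATTRIBUTES_SCRIPT: s3://mycompany-okera/scripts/user_attributes.sh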

You can read more about user attributes here, and specifically about using custom scripts for sourcing them here.

ORC Format Support

Added support for the ORC file format, allowing you to register and query files in this format.

Note: When you use the data registration experience to crawl for data in object storage, ORC data will not be automatically discovered. This will be added in a future release.

Easier Diagnostics Collection

Added the ability to use the EXECUTE DIAGNOSTICS DDL to automatically collect logs and other diagnostics to an S3 path.

The command will return immediately with the location the diagnostics are being written to, and they will be collected in the background.

By default, these diagnostics will be uploaded to a write-only Okera-owned bucket, okera-diagnostics, but this can be overridden:

  1. On a per invocation basis using EXECUTE DIAGNOSTICS LOCATION 's3://...'.
  2. By setting the DIAGNOSTICS_COLLECTION_DEFAULT_LOCATION configuration setting to a desired value (e.g., s3://company-okera-diags/).

Note: Only administrators can run the EXECUTE DIAGNOSTICS command.
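
For example, combining the command with an explicit location (the bucket is the illustrative one from above; the subdirectory is hypothetical):

EXECUTE DIAGNOSTICS LOCATION 's3://company-okera-diags/2021-03-01/';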

Improvements to okctl

  • Added the ability to see the token claims using okctl tokens describe <token>, where <token> is the name of the token file (in the .auth folder by default).
  • Added the ability to refresh a token using okctl tokens refresh <token>, where <token> is the name of the token file (in the .auth folder by default). This will use the generated private key for refreshing, while preserving the groups in the token.
  • Added a --duration flag for the init, refresh and create sub-commands to control the duration that the token will be valid for. The default duration is one year.
  • Added the ability to disable specific validators (which may not be applicable to the environment) in the configuration YAML by setting the validator to false. The full set of validators, shown here with their default values:
    validation:
      disk_space: true
      database: true
      ldap: true
      storage_read: true
      storage_write: true
      dns: true

Bug Fixes and Improvements

  • Added the ability to specify a list of groups during role creation in the Okera UI.
  • Added ability to add/remove multiple groups at the same time in the Okera UI.
  • Added the ability to filter the permissions table by role name in the Okera UI.
  • Fixed an issue where the Policy Builder summary would scroll away in lower resolutions in the Okera UI.
  • Added ability to quarantine specific databases using the QUARANTINED_DATABASES configuration key, which can be set to a comma-separated list of database names.
  • Added ability to specify a list of groups to grant to when creating roles, using CREATE ROLE <role> WITH GROUPS group1 group2 ....
  • Added ability to specify more than one group when granting or revoking a role, using GRANT ROLE <role> TO GROUPS group1 group2 ... and REVOKE ROLE ... FROM GROUPS group1 group2 ....
  • Improved the default timeout for communication between the built-in PrestoDB and the rest of the Okera services.
  • Added ability to configure per-task and per-worker memory limit on nScale workers from various clients.
  • Fixed an issue when using CREATE EXTERNAL TABLE ... LIKE TEXTFILE with a custom delimiter, where that delimiter would not be recognized for inferring the table schema.
  • Added the ability to specify a set of default delimiters to use when inferring schemas in TEXTFILE by setting the OKERA_DEFAULT_FIELD_TERMINATOR configuration setting to a list of characters.
  • Fixed an issue where the underlying connection to a JDBC-backed datasource could leak in some cases when an exception happened.
  • Added ability to control the default fetch size for JDBC-backed datasources by setting the OKERA_JDBC_RECORDS_BATCH_MAX_CAPACITY configuration setting to the desired value.
  • Improved handling for data sources that do not support transparent query pushdown.
  • Performance improvements for planning and execution of queries against JDBC-backed datasources.
  • Fixed an issue when revoking a URI grant with a privilege level other than ALL.
  • Added the ability to enable LDAP/SAML/OAuth UI authentication at the same time if more than one of authentication mode is necessary.
  • Fixed an issue when fully unnesting a table that could produce column names exceeding the metastore's limit of 128 characters. These columns are now auto-truncated to retain as much of the original name as possible while fitting within the character limit. If truncation is not possible, an error is thrown.
  • Fixed an issue when doing transparent query pushdown where it would incorrectly use the cluster-external load balancer (e.g., ELB) rather than the cluster-local service.
  • Fixed an issue where a failed crawler would sometimes not report failure properly.
  • Added the ability to specify the default zstd compression level by adding --zstd_default_compression_level=<level> to RS_ARGS.
  • Fixed an issue in which creating JDBC-backed tables could fail with a permission error.
  • Fixed an issue in PyOkera where it would fail to properly parse the JWT if pyjwt was installed.
  • Fixed an issue in PyOkera where if both db and a filter were passed to list_datasets, it would incorrectly omit the db parameter.
  • Fixed an issue in the Okera UI where the "Use this dataset" sample text would not escape the database and table names.
  • Fixed an issue when using scan_as_pandas where it would incorrectly reset the row index for every batch.
  • Improved the behavior in PyOkera for refreshing the token when it is expired during a scan_as_json or scan_as_pandas invocation using the presto dialect.
  • Fixed an issue where the value of the JWT_TOKEN_EXPIRATION configuration setting would not always be used, instead using the default of 1 day expiration.
  • Improved memory accounting and dynamic batching when computing queries over large rows.
  • Improved performance of JWT signature validation.
  • Added the in_set(<needle>, <comma-separated haystack>) built-in function (see the example after this list).
  • Fixed an issue where the generated SQL could be missing enclosing parentheses when containing multiple predicates.
  • Improved error handling when registering a set of tables using ALTER DB ... LOAD DEFINITIONS(). By default (and changed from prior releases), an error will not be fatal and will continue registering tables. The number of tables added/skipped/failed can be seen by looking at the okera.jdbc.tables-XXX properties on the database. To revert to the previous behavior of aborting on any error, set the configuration value of JDBC_LOAD_DEFINITION_ABORT_ON_ERROR to true, either globally or per database (using DBPROPERTIES).
  • Fixed an issue where the identifier quoting character for MySQL and PostgreSQL would not always be used when doing a query rewrite.
  • Fixed an issue where timestamps that were too low to be represented correctly would cause incorrect values to be returned - the data is now clamped to the start of the Gregorian calendar.
  • Improved handling of concurrent partition fetching for large partition counts.
  • Fixed an issue where Okera could fail to drop a partitioned table when the DROP TABLE DDL referenced the table in non-lowercase form.
  • Fixed an issue where creating a table with an INPUTFORMAT but no ROW FORMAT caused that INPUTFORMAT to be used by default for subsequent table creations that also omitted a ROW FORMAT.
  • Fixed an issue where worker discovery could fail if one of the worker pods was stuck in Pending state.
  • Added the ability to specify the result limit for Presto mode queries as well in the Okera Workspace.
  • Fixed an issue when rendering complex types when running using the Presto mode queries in the Okera Workspace.
  • Fixed a bug where Policy Builder failed to properly update policies when there was a conflict.
  • Improved granularity of error reporting in Data Registration. There is now a distinction between an error during background crawling execution and an error regarding a specific table.
  • Added the ability to search for crawlers by their source in the Okera Data Registration UI.
  • Fixed a bug where crawler names that contained reserved characters were not being escaped.
  • Removed the /__api/log endpoint, which was used by the Okera UI to log errors.
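
As a sketch of the new in_set built-in (the table and haystack values are illustrative):

-- Returns only the rows whose country value appears in the comma-separated list.
SELECT * FROM sales
WHERE in_set(country, 'us,ca,mx');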

Notable and Incompatible Changes

Data Connections

  • Connection names are now validated to only include allowed characters by default ([a-zA-Z_0-9]+). You can disable this behavior by setting the OKERA_ENABLE_CONNECTION_NAME_VALIDATION setting to false.
  • The user and password parameters when creating connections via DDL have been renamed to user_key and password_key, to aid in understanding they do not store the credentials themselves but only the reference to them (e.g., in AWS Secrets Manager or a Kubernetes Secret).
  • In Okera 2.2.x and 2.3.x, when creating JDBC-backed datasources using a property file, Okera would implicitly create a data connection for it. This behavior is now disabled, as all new registration should happen using data connections. These automatically created data connections cannot be used in the Data Registration flow and should ideally be replaced with explicitly created connections.

Permissions

Several permission-related changes were made in this release. These are generally part of the new permission delegation capabilities, but the notable changes include:

  • To create a Crawler, a user now requires either the CREATE_CRAWLER_AS_OWNER or ALL privilege on the CATALOG scope.
  • To use a data connection when creating a table or database, the user must have the USE privilege on that data connection.
  • To grant access to an object that a user has WITH GRANT OPTION on, that user will also need MANAGE_PERMISSIONS on the role they want to grant that permission to. To revert to the old behavior, set the ENABLE_LEGACY_GRANTABLE_ROLES configuration setting to true.
  • Starting in 2.5.0, access to the Okera Workspace will be granted to all users (it is granted by default to okera_public_role). If you wish to limit access to Workspace to specific users:

    1. Revoke access to Workspace by removing it from okera_public_role. You can do this from the Roles UI or by running the DDL:

      REVOKE SELECT ON TABLE okera_system.ui_workspace FROM ROLE okera_public_role;

    2. Edit your cluster configuration to set GRANT_WORKSPACE_TO_PUBLIC_ROLE to false.

    You can then grant the okera_workspace_role to any specific groups or users that you want to have access to the workspace feature.

  • Starting in 2.5.0, access to the following pages is controlled by whether the user has access to the relevant object, as opposed to explicitly granting access to that page. Read more about this here:

    • Roles: access to the Roles page is now available if the user has permission to manage any ROLE object.
    • Tags: access to the Tags page is now available if the user has permission to manage any ATTRIBUTE NAMESPACE object.

Other Updates

  • Database, dataset, and catalog filters have been removed from the Roles page. Permissions now appear on their respective objects on the Data page.
  • If a user has two grants to the data, one on an entire scope (e.g., table, database, or catalog) with no WHERE clause and one with a WHERE clause, the WHERE clause will no longer be applied, because the unrestricted grant provides full access.

SQL Keywords

The following terms are now keywords, starting in 2.5.0:

  • CREATE_CRAWLER_AS_OWNER
  • CREATE_DATACONNECTION_AS_OWNER
  • CREATE_ROLE_AS_OWNER
  • DENY
  • MANAGE_GRANTS
  • MANAGE_GROUPS
  • MANAGE_PERMISSIONS
  • POLICYPROPERTIES

2.3.11 (1/10/2022)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

2.3.10 (12/20/2021)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

2.3.9 (12/13/2021)

This release contains an upgrade of log4j to resolve the Log4Shell vulnerability.

Bug Fixes and Improvements

  • Fixed an issue when dropping a partitioned table and using mixed casing for the database name.
  • Fixed an issue when dropping partitions using the Hive/Spark client library.

2.3.8

Bug Fixes and Improvements

  • ALTER TABLE ADD PARTITION no longer requires specifying the partition values if the location follows the standard path naming convention. Partitions can be added with ALTER TABLE <table> ADD PARTITION <location>.
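
For example (a sketch; the table and location are illustrative, with the location following the standard key=value path naming convention):

ALTER TABLE sales ADD PARTITION 's3://mybucket/sales/year=2021/month=06/';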

2.3.7

Bug Fixes and Improvements

  • Fixed an issue where connecting from Databricks using the Databricks-signed JWT could fail in some query submission modes.

2.3.6

Bug Fixes and Improvements

  • Fixed an issue where connecting from Databricks using the Databricks-signed JWT could fail when a query was run multiple times.
  • Fixed an issue where partitioned symlink tables (e.g., Delta) would fail to plan if the number of partitions was high.

2.3.5

Bug Fixes and Improvements

  • Improved logging in the PrestoDB connector to properly log both the Presto query ID as well as the Okera task IDs when available.
  • Added the ability to set the default quote character (the default is ") for CSV files when using the built-in CSV SerDe. This can be set in the following ways:

    1. On the SERDEPROPERTIES when creating or altering a table (e.g., to disable quote handling by removing the quote character):

      SERDEPROPERTIES('quoteChar'='')
      

      e.g.

      ALTER TABLE mydb.mytable SET SERDEPROPERTIES('quoteChar'='')
      
    2. On the TBLPROPERTIES to set the default value (this can be overwritten with the above SERDEPROPERTIES):

      TBLPROPERTIES('okera.text-table.default-quote-char'='')
      
    3. Change the global default for the cluster by setting TEXT_TABLE_DEFAULT_QUOTE_CHAR to the desired value, e.g., '' to disable the quote character.

  • Fixed an issue with handling of CSV files when split across multiple tasks and running count(*).

  • Upgraded the packaged Snowflake JDBC driver to v3.2.17.

2.3.4

Bug Fixes and Improvements

  • Fixed an issue where, when using AWS Glue, loading a specific database in the UI would take a long time to load.
  • Improved handling of S3 connection errors (e.g., retries, service unavailable), including the ability to set new values via configuration.
  • Increased the default PrestoDB TaskUpdate limit.

2.3.3

Bug Fixes and Improvements

  • Fixed an issue where, if a view and the underlying base table had mismatched types on a column, Okera would produce data that matched the underlying table type and not the view type, causing issues for upstream engines (e.g., Presto). The new behavior is that an implicit cast is added if possible; if not, the query fails.
  • Fixed an issue in the PrestoDB and PrestoSQL client libraries, where if a column name was also a reserved keyword (e.g., database or metadata) AND the column was a complex type (e.g., STRUCT), the client library would produce an invalid planning request.
  • Fixed an issue in the transparent Snowflake access where it would use an external LB (if configured) rather than the cluster-local cerebro-worker service address.
  • Fixed an issue in the transparent Snowflake access where queries that used IF were not properly rewritten.
  • Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for TIMESTAMP columns (they would correctly retain microsecond and nanosecond values if present).

2.3.2

Bug Fixes and Improvements

  • Updated PostgreSQL driver to resolve a security vulnerability
  • Fixed an issue when querying tables that have columns with very large values (e.g., 100KB), where a simple query that references that column would fail due to exhausting the cluster memory. To resolve this, set RS_ARGS to include --batch_check=64 (or another relatively low number). In 2.3.x, this value is set to -1 (i.e., no limit) by default, but in future Okera releases (2.4.x and above) it will be set to a low number by default.
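
    For example, a sketch of the corresponding entry in the cluster configuration file (assuming no other RS_ARGS flags are needed):

      RS_ARGS: "--batch_check=64"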

2.3.1

Bug Fixes and Improvements

  • Added an option in EMR bootstrap to specify a custom image location using --local-worker-image.
  • Fixed an issue where Presto would report an error of Could not compute splits and not specify the underlying Okera error.
  • Improved S3 IO retry handling for improved latency when errors occur.
  • Fixed an issue in collocated workers that would attempt to open a connection to the planner unnecessarily.
  • Added the ability to specify DROP as a privilege for attribute namespaces, databases, tables, and views.
  • Added the ability to control the number of Okera tasks for a query in Presto using the okera.max_tasks Presto session property.
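
    For example, a minimal sketch of capping a query at 16 Okera tasks from within a Presto session (the value is illustrative):

      SET SESSION okera.max_tasks = 16;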

Notable and Incompatible Changes

Automatic Estimated Table Statistics

Okera now automatically collects and stores estimated table statistics. Use of these statistics can optionally be enabled (it is disabled by default), allowing Hive, Spark, and Presto to leverage them for query planning and cost-based optimization.

To enable for Spark and Hive, edit hive-site.xml and add:

<property>
  <name>okera.hms.stats-mode</name>
  <value>HMS_OKERA</value>
</property>

To enable for Presto, you can do either of the following options:

  1. Edit the Okera connector's okera.properties and add okera.task.plan.enable-okera-stats=HMS_OKERA.
  2. Set the okera.stats_mode Presto session property to HMS_OKERA.

Note: These estimated statistics are complementary to the normal Hive metastore statistics, and there is no change in behavior if those statistics are currently being used (they take precedence if set over Okera's estimated statistics).

Okera JDBC Driver Update

Okera has added support for specifying TimeZoneID as a URL property when using Okera's Presto JDBC driver to connect via JDBC clients. For example, the connection property can be set as TimeZoneID:UTC. If this value is not specified, the driver uses the system's current time zone ID.

Valid values for this property are specified in the IANA Time Zone Database. For a complete list of supported time zone IDs, see https://en.wikipedia.org/wiki/List_of_tz_database_time_zones.

Default Docker Repository Changed to quay.io/okera

Okera has changed the Docker repository that images are pushed to from DockerHub to Quay, due to the impact of the newly enforced rate limits in DockerHub.

Okera's images are available with the quay.io/okera prefix (the image names have not changed).

2.3.0

New Features

Okera Collocated Compute (EMR)

You can now run Okera's scalable data plane collocated with your EMR cluster(s), allowing you to transparently (and at zero or marginal cost) scale your Okera secure compute capacity as you provision more EMR capacity, whether by scaling a single cluster or running multiple independent clusters. For supported data sources and queries, secure data access happens on the EMR nodes, benefiting from network and compute locality. This allows you to maintain a much smaller central Okera cluster, dramatically reducing TCO.

EMR clusters running with Okera's collocated compute do not need to have direct S3 access (via IAM), as the collocated data plane gets temporary secure access to the data it needs, thereby reducing the surface area of data access and allowing you to maintain high security, while not sacrificing usability (such as prohibiting SSH access to EMR).

Note: Okera's collocated data plane is supported beyond EMR. To learn how to leverage it in other deployment environments, such as Kubernetes, contact Okera Support.

New UI Databases Page

Okera has a new catalog browsing and management experience, centered around Databases and the Datasets in them. Users can now create and manage Okera databases, as well as permissions and tags at the database level.

To search across all datasets, click on Search all datasets to leverage the new dataset search page.

Click here to learn more about the new functionality.

Transparent Snowflake Access (Beta)

Okera now supports improved access control on Snowflake data sources, pushing down full queries (including joins and aggregations) to Snowflake while enforcing the complete access policy as well as audit log entries.

Users, such as data analysts, can connect their favorite SQL tool (e.g., DBeaver, Tableau, Looker) via Okera’s ODBC/JDBC endpoint, and their queries will be automatically sent to Snowflake, after being authorized and audited by Okera (and if the user does not have permission to access the data they are trying to access, the query will be rejected). With this new capability, you get the benefit of Snowflake's native performance scale and Okera's complete policy and auditing capabilities.

In future releases, more data sources will be supported for transparent access integration as well.

Read more here.

Improved Databricks Integration

Okera has an improved integration with Databricks, enforcing full fidelity policies while maintaining complete compatibility with Spark and Databricks, including Databricks Delta Lake. The new integration is transparent in its execution, and allows Databricks Spark to fully control the data access, thus retaining its performance and functionality.

This new functionality is on by default, and you can read more about how to easily integrate a Databricks cluster (or clusters) with Okera here.

PrestoSQL Support

Okera now supports PrestoSQL (both the open-source and Starburst variants) in addition to PrestoDB. This allows you to connect your existing PrestoSQL clusters to Okera, benefiting from Okera's unified catalog, access control and auditing capabilities.

Note: PrestoSQL 338 is supported.

EMR 6.1 Support

EMR 6.1 is now supported, allowing you to leverage the latest functionality on EMR, such as Spark 3, Hive 3 and PrestoSQL.

You can read more about integrating with EMR 6.1 here.

Note: Integration with EMR 6.1 clusters is only supported with Okera clusters 2.3.0 and higher.

Bug Fixes and Improvements

  • Fixed a UI bug where updating a permission without any changes caused an error and would remove the permission.
  • Added a clear error message when a user that does not have permission to create an attribute namespace tries to create one in the UI.
  • Fixed an issue where a LEFT OUTER JOIN would cause an error when querying two unnested columns.
  • Fixed an issue where in some cases, a user that was granted WITH GRANT OPTION could grant a higher access level on that object.
  • Okera UDFs that are used by external systems (such as Spark) are now registered in the okera_udfs database.
  • Ensure that the automatic Presto tuning generates default task counts which are a power of 2 (as required by Presto).
  • Added a request ID to the audit logs for Presto and Spark queries, making it possible to link together all the audit log entries for a single query.
  • Added the ability to specify a specific password to use for the Presto connection when using PyOkera, to allow for connecting to non-token enabled Presto clusters.
  • Improved autotuning that automatically detects cluster resizing for the Okera client libraries for Presto, Hive and Spark.
  • Fixed an issue in PyOkera where custom user claims were not properly taken into account when using a token_func after a token expired.
  • Improved handling of spaces and periods in database, table, and column names.
  • Fixed an issue when running count(*) on JSON data when multiple splits are generated.
  • Added support for setting database description via DDL:

    ALTER DATABASE <db_name> SET COMMENT '<database comment>'
    
  • Fixed an issue with partitioned Delta tables.

  • Improved handling in CREATE TABLE ... LIKE PARQUET for partitioned tables:
    • A data file will automatically be found inside one of the partitions without needing to be manually specified.
    • The partition scheme can be auto-inferred from the on-storage structure (in a similar manner as data registration crawlers), without needing to explicitly be set.
  • Reject all unparseable view statements when creating or altering the view definition and improve error handling if an unparseable view is already present in the catalog.
  • In PyOkera, scan_as_json and scan_as_pandas now take an optional presto_headers dict keyword argument for custom headers to use when making the Presto request.
  • Improved metadata fetching performance when executing Presto queries, especially ones that reference many catalog objects.
  • Don't automatically populate large table statistics for Spark and Hive if no real statistics are present. The prior behavior can be enabled by setting the okera.hms.auto-populate-stats Hive configuration property to true.
  • Increase the default timeout when creating an Okera connection in the client libraries to 30 seconds (prior value was 10 seconds).
  • Fixed an issue where user attributes were not read correctly if the source system (e.g., LDAP) had them in non-lowercase.
  • Fixed an issue in okctl that did not properly handle validation of parameters that supported multiple path values (e.g., JWT_PUBLIC_KEY: s3://path1,s3://path2).
  • Added the ability to control the timeout for the Kubernetes liveness and readiness probes by setting the OKERA_HEALTHCHECK_TIMEOUT_MS configuration value.
  • Fixed an issue for feature flag toggling for non-catalog administrators.
  • Improved role conflict detection for grants on differing scopes that don't overlap in their ABAC conditions.
  • Improved handling for ALTER DATABASE ... LOAD DEFINITIONS OVERWRITE to not remove tag assignments (at either the table or column level) if they are already present.
  • HAVING ATTRIBUTE conditions are now considered for grants that also contain WHERE filters. The prior behavior can be enabled by setting IGNORE_HAVING_EXPR_ON_FILTER to true.

Notable and Incompatible Changes

Oracle NUMBER Type

In 2.3.0 and higher, the NUMBER type in an Oracle table will be represented as a DECIMAL(38,6) in Okera.

Credential Files for JDBC-Backed Data Sources

In 2.3.0 and higher, when creating a JDBC-backed data source using a credentials file, the creating user must have permissions on that URI (expressed as a URI grant).

For example, suppose your credentials file is located at s3://mycompany/config/redshift.properties and you execute the following command:

CREATE DATABASE IF NOT EXISTS my_redshift_db DBPROPERTIES(
  'credentials.file' = 's3://mycompany/config/redshift.properties',
  ...
);

This will fail if you do not have a URI grant that gives you access to s3://mycompany/config/redshift.properties.

You can create such a grant with:

GRANT ALL ON URI s3://mycompany/config TO ROLE <some role>

Note: You can also grant access to the entire bucket (or any prefix-level you desire).

2.2.2

Bug Fixes and Improvements

  • Fixed an issue in Hive and Spark client libraries when generating planning SQL that contained DATE types.
  • Fixed an issue in scanning partitioned Delta tables.
  • Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for TIMESTAMP columns (they would correctly retain microsecond and nanosecond values if present).
  • Fixed an issue in the PrestoDB client library, where if a column name was also a reserved keyword (e.g., database or metadata) AND the column was a complex type (e.g., STRUCT), the client library would produce an invalid planning request.

Notable and Incompatible Changes

Default Docker Repository Changed to quay.io/okera

Okera has changed the Docker repository that images are pushed to from DockerHub to Quay, due to the impact of the newly enforced rate limits in DockerHub.

Okera's images are available with the quay.io/okera prefix (the image names have not changed).

2.2.1

Bug Fixes and Improvements

  • Fixed an issue in PrestoDB split computation in very large clusters.
  • Removed the restriction on column comments by default (limit was 256 characters).

    Note: This changes the underlying HMS schema, and if connected to a shared HMS, should be disabled by setting the HMS_REMOVE_LENGTH_RESTRICTION configuration value to false.

  • Improved resilience in handling crawling errors.
  • Fixed an issue with WITH GRANT OPTION on non-ALL privileges.
  • Restricted querying of datasets with nested types when policies exist on tags assigned to the nested type.
  • Fixed an issue when paginating in the Datasets page.
  • Fixed an issue for the /api/get-token endpoint.

2.2.0

New Features

JDBC Data Sources

Custom JDBC Driver Support

Okera has added support for specifying custom JDBC data sources beyond those that ship out of the box. If you would like to connect to a custom JDBC data source, please work with Okera Support to define the JDBC connection information appropriately for your driver.

Secure Values for JDBC Properties

Okera has added support for referring to secret values in the JDBC properties file from local secret sources such as Kubernetes secrets, as well as secure Cloud services such as AWS Secrets Manager and AWS SSM Parameter Store.

For example:

driver=mysql
type=mysql
host=...
port=3306
user=awsps:///mysql/username
password=awsps:///mysql/password

This will look up the values for /mysql/username and /mysql/password in AWS SSM Parameter Store. You can similarly use file:// for local files (using Kubernetes mounted secrets) or awssm:// to use AWS Secrets Manager.

Note: If you use AWS SSM Parameter Store or AWS Secrets Manager, you will need to provide the correct IAM credentials to access these values.

Predicate Pushdown Enabled by Default for JDBC-Backed Data Sources

Starting in 2.2.0, predicate pushdown for JDBC-backed data sources is enabled by default (this was previously available as an opt-in property on a per-data source level) and will be used whenever appropriate.

To disable predicate pushdown for a particular JDBC-backed database or table, you can specify 'jdbc.predicates.pushdown.enabled' = 'false' in the DBPROPERTIES or TBLPROPERTIES (you can read more here).
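
For example, a minimal sketch that disables pushdown on a single table, using the same ALTER TABLE ... SET TBLPROPERTIES form shown elsewhere in these notes (the database and table names are illustrative):

ALTER TABLE my_jdbc_db.orders SET TBLPROPERTIES('jdbc.predicates.pushdown.enabled' = 'false')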

BLOB/CLOB Data Type Support

Okera now supports BLOB and CLOB data types for Oracle JDBC Data Sources.

Autotagging for JDBC-Backed Data Sources

When registering JDBC-backed data sources and loading the tables, Okera will now run its autotagger by default when registering.

You can disable this behavior by specifying okera.autotagger.skip=true in your DBPROPERTIES.
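
For example, a hedged sketch of setting this property at registration time, mirroring the CREATE DATABASE ... DBPROPERTIES form used elsewhere in these notes (the database name and other properties are illustrative):

CREATE DATABASE IF NOT EXISTS my_jdbc_db DBPROPERTIES(
  'okera.autotagger.skip' = 'true',
  ...
);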

UI Improvements for Tabular Results

The UI now makes it easy to copy or download results as CSV from tables. This can be used in the Workspace and when previewing a dataset.

Operability Improvements

  • Okera will now generate correlated IDs for the planner and worker tasks to make it easier to correlate the task information in the logs. For example, the planner may have a task of the form 9b45f8b08c76352e:85a51f5579300000, and if N worker tasks were generated, they would be of the form 9b45f8b08c76352e:85a51f5579300001, 9b45f8b08c76352e:85a51f5579300002, and so on.

  • System administrators can now easily access the Planner and Worker debug UIs from the System page in the Okera UI.

  • System administrators can now see how many unique users have accessed data via Okera in the System page in the Okera UI, both all-time and in the last 30 days.

Domino Data Lab Integration

When run in Domino Data Lab environments (starting in Domino Data Lab version 4.3.0), PyOkera now has built-in integration that leverages the automatically generated JWT tokens in the Domino Data Lab environment, enabling transparent authentication between Domino Data Lab environments and Okera deployments.

import os
from okera.integration import domino

ctx = domino.context()
with ctx.connect(host=os.environ['OKERA_HOST'], port=int(os.environ['OKERA_PORT'])) as conn:
    df = conn.scan_as_pandas('drug_xyz.trial_july2020')

PrestoDB Improvements

  • Several internal improvements were made to Okera's PrestoDB connector to increase performance in areas such as data deserialization, asynchronous processing, improved memory allocation, etc.
  • Several improvements were made to auto-tune Okera's built-in PrestoDB cluster to better match its environment's capabilities.
  • When filtering on columns of DATE type, the PrestoDB connector will now push those filters down into the Okera workers.
  • Okera's PrestoDB connector has added support for table statistics if these are set on the table in the Okera catalog. These can be set by setting the numRows table property, e.g.:

    ALTER TABLE mydb.mytable SET TBLPROPERTIES('numRows'='12345')

These table statistics will be considered by Presto's cost-based optimizer (e.g., for JOIN reordering).

User Attributes

Okera added the user_attribute(<attribute>) built-in function, which retrieves attribute values on a per-user basis. These can be used in policy definitions, e.g., to apply dynamic per-user filters.

These attributes can be fetched from AD/LDAP by setting the LDAP_USER_ATTRIBUTES configuration value to a comma-separated list of attributes to fetch from AD/LDAP, e.g.:

LDAP_USER_ATTRIBUTES: region,manager,businessUnit

If the attribute is missing for the user executing it, the value returned will be null.
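
For example, a minimal sketch of a view that applies a dynamic per-user row filter using this function (the database, table, column, and attribute names are illustrative):

-- Each user only sees rows whose region matches their own 'region' attribute.
-- user_attribute() returns null for users missing the attribute, so such users match no rows.
CREATE VIEW mydb.regional_orders AS
SELECT * FROM mydb.orders
WHERE region = user_attribute('region')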

Hudi and Delta Lake Support (Experimental)

Okera has added experimental support for Delta Lake and Apache Hudi tables.

You can create Apache Hudi tables using the CREATE EXTERNAL TABLE DDL, e.g.:

CREATE EXTERNAL TABLE mydb.my_hudi_tbl
LIKE PARQUET 's3://path/to/dataset/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet'
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset';

You can create Delta Lake tables using the CREATE EXTERNAL TABLE DDL, e.g.:

CREATE EXTERNAL TABLE mydb.my_delta_tbl (id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset/';

The following limitations should be kept in mind:

  • In both cases, tables need to be explicitly registered, as crawling will not properly identify these tables as Hudi or Delta Lake.
  • For Apache Hudi, Okera only supports Snapshot Queries on Copy-on-Write tables and Read Optimized Queries on Merge-on-Read tables.

New Privacy Functions

Okera has added several privacy functions typically used in health- and medical-related environments:

  • phi_zip3
  • phi_age
  • phi_date
  • phi_dob

These are compliant with the HIPAA safe-harbor standard.
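
For illustration, a hedged sketch of applying these functions in a query (the table and column names are hypothetical, and single-column signatures are assumed):

-- Hypothetical usage; assumes each function takes the column to anonymize as its only argument.
SELECT phi_zip3(zip_code), phi_age(age), phi_dob(date_of_birth)
FROM mydb.patients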

Nested Type Tagging (Beta)

Okera has added the capability (disabled by default) to tag nested types (specifically, ARRAY and STRUCT types), and have those tags be inherited when used in views that unnest the nested portion.

For example, if you have a table with the following schema:

id  bigint
s1  struct<
  a1:array<struct<
    f1:string,
    f2:string,
    a2:array<string>
  >>
>

You could tag s1.a1.f1, s1.a1.f2 and s1.a1.a2, and when unnested, they will retain their tags.

Additionally, Okera has added the ability to fully unnest a table and inherit the tags on that object into the view - this is done using the SELECT ** operator.

For example, using a table with the schema above, you could create the following view (with tags on the three leaf fields):

CREATE VIEW mydb.unnested_view AS SELECT ** FROM mydb.nested_table

This will create a view which has the following schema:

id  bigint
s1_a1_item_f1   string
s1_a1_item_f2   string
s1_a1_item_a2_item  string

with s1_a1_item_f1, s1_a1_item_f2 and s1_a1_item_a2_item retaining their tags. You can then grant access to this view and use normal attribute-based policies and transformations.

To enable this feature, set the FEATURE_UI_TAGS_COMPLEX_TYPES and ENABLE_COMPLEX_TYPE_TAGS configuration values to true.
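
For example, in the cluster configuration file:

FEATURE_UI_TAGS_COMPLEX_TYPES: true
ENABLE_COMPLEX_TYPE_TAGS: true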

Note: ABAC policies that apply to tags assigned to nested types will not be enforced on the base table, so take care to only give access to unnested views in these cases.

Bug Fixes and Improvements

  • Fixed an issue in the Okera Presto connector where some queries against information_schema could cause an exception and fail.
  • Fixed an issue in the Tags management UI where the number of tagged datasets could be incorrect.
  • Improved handling of not displaying internal databases in the UI.
  • Added the ability to run GRANT for ADD_ATTRIBUTE and REMOVE_ATTRIBUTE at the CATALOG scope.
  • Removed the need to have ALTER permissions on tables and databases to run add/remove attributes (ADD_ATTRIBUTE and REMOVE_ATTRIBUTE are now sufficient).
  • Added an implementation of listTableNamesByFilter in the HMS connector.
  • Added the ability to configure timeouts for internal checks to account for large network latency; this can be set using OKERA_PINGER_TIMEOUT_SEC.
  • Support implicit upcasting for Parquet columns of type INT32 to be represented by BIGINT in the table schema.
  • Improved the experience when previewing JDBC-backed datasets by limiting the amount of data fetched.
  • Added DELETE, UPDATE and INSERT as grantable privileges.
  • Fixed an issue in okctl where it would not report an error and abort if there was an error updating the ports.
  • Improved handling of small files for Parquet-backed datasets.
  • Fixed an issue where the Autotagger would not correctly handle columns with DATE type.
  • Improved handling for JDBC-backed tables where the table name contained . characters.
  • When running a worker load balancer (default for EKS and AKS environments), the built-in Presto cluster will by default use the internal cluster-local load balancer and not the external one.
  • Fixed an issue with pagination on the datasets page where paging to the end of the list and back showed an inaccurate count.
  • Improved diagnostic information available when registering a JDBC-backed table that has unsupported types or invalid characters.
  • Improved filter push down for Oracle tables for columns of DATE and TIMESTAMP type.
  • Improved handling of DECIMAL, NCHAR and FLOAT datatypes for JDBC-backed data sources.
  • Improved inference of BIGINT values in text values (e.g., CSV).
  • Fixed an issue where workers were not generating SCAN_END audit events.
  • Fixed an issue where table/view lineage information could be duplicated.
  • Upgraded Gravity to 6.1.39.
  • Removed the hardcoded fetch_size in PyOkera and added the ability to explicitly set it using the fetch_size keyword argument to exec_task.
  • Fixed an issue where pagination on the Datasets UI could get into an inconsistent state when filtering by tags.
  • When using tokenize, referential integrity will now also be maintained for INTEGER columns.
  • Added IF NOT EXISTS and IF EXISTS modifiers to the GRANT and REVOKE DDLs, respectively (see the sketch after this list).
  • Fixed an issue when doing writes in EMR Spark when metadata bypass was enabled for non-partitioned tables.
  • Added limited support for Avro files with recursive schemas, which will allow a maximum cycle depth of 2.
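
As referenced in the GRANT/REVOKE item above, a hedged sketch of the new modifiers (the placement of the modifier, and the object and role names, are assumptions rather than confirmed syntax):

GRANT IF NOT EXISTS SELECT ON TABLE mydb.mytable TO ROLE analyst_role;
REVOKE IF EXISTS SELECT ON TABLE mydb.mytable FROM ROLE analyst_role;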

Notable and Incompatible Changes

Upgrading From 2.1.x

When upgrading from an Okera 2.1.x version lower than 2.1.10, some functionality may stop working in the 2.1.x deployment if you run the versions side-by-side or downgrade back to 2.1.x. If preserving this behavior is desirable, upgrade to 2.1.10 first or work with Okera Support.

Container User Changed from root

Starting in 2.2.0, the process user inside all the Okera containers (running as Kubernetes pods) is no longer root and is running under dedicated users.

As part of this change, any files that are downloaded into the container (e.g., from S3 for custom certificates) are now placed in /etc/okera and not /etc.

SQL Keywords

The following terms are now keywords, starting in 2.2.0:

  • DELETE
  • HUDIPARQUET
  • INPUTFORMAT
  • OUTPUTFORMAT
  • UPDATE

2.1.10

Bug Fixes and Improvements

  • Fixed a forward-compatibility issue with 2.2.0

2.1.9

Bug Fixes and Improvements

  • Fixed an issue where a user could create external views in any database using Presto's CREATE VIEW DDL, even though they may not have the appropriate grant on that database.

2.1.8

Bug Fixes and Improvements

  • Fixed an issue where schema inference (used in Data Registration and CREATE TABLE LIKE FILE) for JSON-based tables would incorrectly remove leading underscores and double underscores from column names.

2.1.7

Bug Fixes and Improvements

  • Added the ability to specify additional Presto configuration values using the PRESTO_ARGS configuration value, e.g., PRESTO_ARGS: "task.concurrency=16 task.http-response-threads=100". Using this capability should be done in coordination with Okera Support.
  • Fixed an issue where the REST Server pod would not restart quickly enough if a failure happened on startup.
  • Fixed an issue where the error dialog in the Data Registration page could not be closed.
  • Improved Presto behavior on creating and closing connections to the ODAS workers.
  • Changed the default Presto maximum stage count to 400.
  • Fixed an issue where an ABAC policy that included row filters would generate a WHERE clause missing its surrounding parentheses.
  • Fixed an issue where newer Parquet files that included the INT_32 and INT_64 logical types would cause a Parquet read error.

2.1.6

Bug Fixes and Improvements

  • Fixed an issue in Data Registration UI that made pagination behave erratically when using Glue as the backing metastore.
  • Fixed an issue in Data Registration UI where auto-discovered tags would not show up if the column was not editable.
  • Fixed an issue where the in-memory group cache would be overridden with empty groups.
  • Fixed an issue where CSV files that had empty strings would not be automatically converted to NULL values.

2.1.5

Bug Fixes and Improvements

  • Added the ability to increase the REST and UI timeouts to arbitrary values (previously limited to 60 seconds).
  • Removed a restriction when unnesting nested types that did not allow WHERE clauses to be used in those queries.

2.1.4

Bug Fixes and Improvements

  • The HMS length restriction removal will now run at startup for all clusters (unless disabled), not just upgraded clusters.
  • Fixed an issue where keywords were not always escaped in ABAC transforms and filters.
  • Fixed an issue in the UI where the privacy function dropdown in the Visual Policy Builder had the wrong default.
  • Fixed an issue where ODAS errors were not propagating to Presto when creating an external view from Presto.

2.1.3

Bug Fixes and Improvements

  • Updated Presidio to not require any network connectivity in all cases.
  • Fixed an issue where the Datasets UI would render table headers over some dropdowns.
  • Improved the performance of the Datasets page when loading individual datasets.

2.1.2

Bug Fixes and Improvements

  • Fixed an issue when creating a crawler with single-file datasets, causing the registered datasets to use the directory path instead of the file path.
  • Fixed an issue where editing policies in the Policy Builder could in some cases cause an error on saving the edited policy.
  • Fixed an issue where using restricted keywords in Policy Builder would not be escaped properly in some cases.
  • Fixed an issue where using MySQL as the backing database could cause some data types to not be converted correctly via JDBC in some cases, causing exceptions.

2.1.1

Bug Fixes and Improvements

  • Several improvements to handling of S3 errors and failure conditions for very large files.
  • Fixed an issue where in some cases (typically large) Parquet files would cause an error when being queried.
  • Fixed an issue in the Databricks connector where a table would be missing the SerDe path parameter when the table was not cluster local.
  • Fixed an issue in policies where if you had two ABAC policies, one which included a transform and one which did not, they would not compose correctly (this resulted in giving less access than desired in all cases).
  • Fixed an issue when upgrading from 1.5.x where the DB schema upgrade could fail under certain conditions.
  • Fixed an issue in the Presto connector where if a JDBC client issued a query against INFORMATION_SCHEMA with underscores, Presto would error out.

2.1.0

New Features

Extending Attribute-Based Access Control Policies to Support Data Transformation Functions and Row Filtering

Attribute-based access control policies now support data transformation functions and row filtering. This is supported with an extension to the current ABAC grant syntax. Read more here.

This can significantly simplify how policies are managed, reduce or eliminate the need to create views, and make it much easier to manage complex policies. You can easily create these policies in the UI by specifying ABAC access conditions in the policy builder. See examples of the different policies you can create using Okera's Policy Engine here.

Tag Cascading and Inheritance

Attributes assigned on tables/views and their columns will automatically cascade to all descendant views. Read more about this capability here.

View Lineage

Okera now maintains the lineage of datasets created after 2.1. It is now possible to know, for a given dataset (table or view), all the views that descend from it, and for a view, all of its ancestors. This information is also exposed in the UI. Read more here.

Improved Privacy Functions

Okera has a revamped set of privacy-related functions to aid in anonymization with different guarantees. Read more about Okera's privacy and security functions here.

Users Page and Inactivity Report

The web UI now includes a new Users page, where all the users that have authenticated in the system can be viewed, as well as their groups as per the last time they made a request through Okera. This makes it easier to understand if a user should have access to something or not.

The Users page also lets you generate a User Inactivity Report, which shows you all the users who have any level of access on a database but have not queried that database within a given timeframe. This report helps identify users who may no longer need access to data, since they are not utilizing it, thereby improving least privilege.

Enable access to the Users page in the UI by granting a user or group access to the okera_access_review_role.

GRANT ROLE okera_access_review_role TO GROUP marketing_steward_group;

Read more about this capability here.

Access Control for Attribute Namespaces

You can now control access to the management of tags by namespace. ATTRIBUTE NAMESPACE has been added as a new object type, and the CREATE, ADD_ATTRIBUTE, and ALL access levels are supported on it. For example, if you wanted to give a role the ability to create, drop, and assign attributes from a particular attribute namespace, you would use the following:

GRANT ALL on ATTRIBUTE NAMESPACE marketing TO ROLE marketing_steward;

In addition, if you wish to grant access to the Tags page in the UI, so that a user can create and manage tags there, grant okera_tags_role to that user's group.

Note: To assign attributes on data you will still need to have the correct privileges for the data you are trying to assign on. See Controlling Who Can Assign Tags on Objects for more details.

Other Tag Management Updates

  • Only editable tags (the ones a user has CREATE or ALL on) show up on the Tags page.
  • Adding/removing tags from a dataset will ignore tags a user does not have privileges on.
  • Tags page now requires SELECT access on okera_system.ui_tags. The built-in okera_tags_role has this privilege by default.

VIEW_AUDIT Privilege to Control Access to Audit Logs

You can now grant VIEW_AUDIT privilege on data, to enable a user to view audit log information for that object. For example, if the user only had VIEW_AUDIT on two databases, they would only see reports for those two databases in the UI or when querying the okera_system.reporting_audit_logs view. To see the Insights page in the UI, you also need the okera_reports_role. See Access to the Insights Page for more information.
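
For example, a minimal sketch granting audit visibility on a single database (the database and role names are illustrative):

GRANT VIEW_AUDIT ON DATABASE salesdb TO ROLE sales_auditor_role;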

Note: The default privilege required to view audit log records for objects has been changed from SELECT to VIEW_AUDIT.

Presto SQL in Workspace

The workspace now features Presto SQL mode, which allows executing queries against an Okera cluster using Presto. See Workspace for more details.

Creating Views Using Presto

It is now possible to create and delete external views via Presto directly. These views will be stored in the Okera catalog (as external views) and be accessible via Presto.

To do this, execute a DDL like this in Presto (e.g., via the Okera Workspace or an application such as SQL Workbench or DBeaver):

CREATE VIEW some_db.some_view AS SELECT ....

To support this, Okera has added extensions to the CREATE VIEW DDL statement when executed in Okera:

CREATE EXTERNAL VIEW <db>.<view> (
    <col name> <col type>,
    ...
) SKIP_ANALYSIS USING VIEW DATA AS 'SELECT ...'

This DDL requires the user to specify the full set of columns that the view statement produces (including types), as the view statement is not parsed or analyzed.
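
For instance, a hypothetical concrete use of this DDL (the database, view, columns, and embedded query are illustrative):

CREATE EXTERNAL VIEW reporting.daily_counts (
    event_day STRING,
    event_count BIGINT
) SKIP_ANALYSIS USING VIEW DATA AS 'SELECT event_day, count(*) FROM reporting.events GROUP BY event_day'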

Improved JSON File Format Support

  • Starting from 2.1.0, Okera uses simdjson to read JSON file format data.
  • Several improvements to auto-inference of JSON file formats, with support for appropriate data types, validated by extensive testing on a variety of auto-generated JSON files and files from several internet sources.

Oracle Data Source Support

Oracle is now supported as a JDBC data source. The Oracle JDBC driver will need to be configured as a custom driver. Read more on how to configure this here.

More Metadata Available on Dataset Details

There are several improvements to the dataset details view in the UI:

  • Much more detailed technical metadata is included
  • It is now possible to edit the description of a dataset
  • It is now possible to edit column comments in the dataset schema
  • View parent/child lineage information is available for views created in Okera
  • Column-level tags are included in the details view along with table-level tags
  • The dataset schema can be filtered to show columns with column-level tags

Ability to Create Views From the UI

Admin users can now create an internal view based on an existing view or table from the datasets page. Choosing the destination database and view name and selecting the columns to be included in the new view are supported. For more information, see Create a View of Data.

Permission Management Improvements

  • A Permissions tab has been added to the Details tab for a dataset. Like on the Roles page, you'll be able to fully manage permissions associated with the specified dataset. You can read more about this on the Datasets page.
  • Data transforms and row filtering added to Policy Builder UI.
  • Ability to edit existing policies in the UI. To learn more about editing and managing policies, go to Editing Permissions.
  • An admin user can now create a view from a dataset.

Reports Page Improvements

The Reports page has a number of major improvements, including:

  • New reports for Activity overview, Active users over time, Top accessed tags, and Recent queries.
  • SQL used to generate the reports is available in-page and can be run in Workspace.
  • Custom time ranges are available within the last 90 days.
  • Reports queries use human-readable times instead of unix timestamps.
  • Reports can now be filtered by dataset and tag as well as database.
  • Reports filters now allow for multi-selection.
  • Visual updates.

For more details, see the Reports page documentation.

UI Visual and Interaction Updates

  • There are small visual updates and improvements throughout the UI focused on clarity and better use of screen real estate.
  • The output of the workspace has been reworked to better retain a user's context and show history.

Updates to Reporting and Audit Views

  • New audit table and view have been added to okera_system database - analytics_audit_logs and reporting_ui_analytics. These are populated by the cdas-rest-server container and are primarily used to track and analyze usage of the UI. For now, the UI only writes there on page visit. The data is stored in the same logging directory as regular audit logs in its own subfolder.

  • The view used by reports, okera_system.reporting_audit_logs, now includes start_time_utc and end_time_utc columns of type TIMESTAMP_NANOS for better readability.

Improved REST Server Diagnostics Logging

  • Logs now include timestamp and log level.
  • Log level can be set via REST_SERVER_LOG_LEVEL. DEBUG, INFO, WARNING, ERROR, and CRITICAL are all valid.

Bug Fixes and Improvements

  • Upon renaming tables, the attributes from the old table are now carried over to the renamed table.
  • Performance improvements to parallelize queries containing UNION ALL. With this enhancement, such queries leverage multiple Okera tasks across workers, where previously UNION ALL ran as a single task.
  • Performance improvements on dropping tables with a large number of partitions.
  • Performance improvements on DROP DATABASE CASCADE to drop all tables under the database.
  • For JDBC data sources, large numeric/decimal types (precision greater than 38) are now handled. The precision is capped at 38 for larger numeric/decimal types or for an unspecified precision/scale in the source. If the scale is unspecified, it defaults to 6 as of Okera 2.1.0.
  • Fixed handling of negative decimals in JDBC data scans. Rounding of large-scale decimals is treated as HALF_DOWN.
  • For the CREATE VIEW command, if a database is not specified, the default database is used for the view.
  • Fixed parse errors on views with JDBC tables in the view definition (joins between JDBC and non-JDBC tables).
  • Arrays of arrays are now supported in Okera.
  • log4j2 support: Okera now uses log4j2 by default for logging. A backwards-compatibility bridge, as recommended by the Apache project, is used for libraries that still use log4j, such as certain Hadoop/Hive libraries.
  • Support for LIMIT on JDBC data sources. This improves previews of data from JDBC data sources, where the data is limited to 100 rows by default in the Okera Web UI.
  • Better error handling for JDBC data source auto-inference failures on unsupported data types.
  • Fixed a regression in authorization of CTEs (WITH clause) with aggregations in the query.
  • For views involving the Avro file format that have column definitions exceeding 4,000 characters (such as complex structs), the schema from the Avro file is used instead of creating the physical columns in the database.

    Note: DESCRIBE FORMATTED for such tables/views still does not show the column details. DESCRIBE <table/view> shows the correct definitions.

  • Several bug fixes to handle Parquet file format issues gracefully. For example, Parquet files with an unsupported DataPageHeaderV2 would crash the workers; these are now handled with a graceful error message.
  • Reduced the pinger verbosity level from error to warn for the Sentry/Hive pinger. This improves error diagnostics for real catalog exceptions; previously, the logs were flooded with spurious errors.
  • Fixed a bug so that count(*) on a JDBC view returns results instead of failing.
  • Added the ability to specify the Glue AWS region, which can be separate from the cluster's default region.
  • The recordservice catalog in Presto is disabled by default starting from 2.1.0.
  • Additional controls for JDBC (PrestoDB) -> Okera configurations. For example, the RPC timeouts can now be controlled via the OKERA_PRESTO_PLANNER_RPC_MS and OKERA_PRESTO_WORKER_RPC_MS environment settings.
  • SHOW CREATE TABLE output no longer includes SerDe info. Prior to this fix, re-running the output of SHOW CREATE TABLE would error out due to duplication between the SerDe info and the FILE FORMAT info; the output can now be re-run as is.
  • Fixed an Avro file format error for files containing a union with default values.
  • UI: Better row hover state highlighting on grouped table rows.
  • UI Error boundaries introduced for increased stability in JavaScript.
  • Policy Builder layout and formatting improvements.
  • Contextual restrictions in the Policy Builder UI, including conditionally disabled create/edit/delete actions.
  • More nuanced permission conflict reasons.
  • Upgraded node to 12.15.0.
  • The Presto connector has several improvements for performance, utilizing more efficient APIs and serialization/deserialization formats.
  • Several performance improvements for queries over Parquet files and queries with joins.
  • In the Okera Planner/Worker debug UI, the number of queries displayed has been increased to 256.
  • The audit log has a new field added to it, ae_attribute, which captures all attributes accessed as part of this query.
  • Fixed an issue in the /scan API where some Decimal values would not be serialized correctly.
  • Several improvements to schema detection for TEXT-based files (especially CSV).
  • Added support for md5() (based on the Hive UDF).
  • The has_access() built-in function now supports checking against all privilege levels (previously it only supported ALL and SELECT).
  • Fixed an issue where some DDL statements that modified attributes did not check whether the attribute existed.
  • Fixed an issue where the CREATE_AS_OWNER privilege at the catalog level incorrectly gave the SHOW privilege at that scope as well.
  • Improvements to error handling and recovery of metadata operations.
  • Improved default tuning parameters in large memory environments.
  • PyOkera now properly converts all values to JSON-serializable types when scan_as_json is used.
  • Improved admission control when workers are over-subscribed on either active connections or memory metrics.
  • For Gravity-based deploys, Gravity has been upgraded to 6.1.16 LTS.
  • Improved error handling and recovery of the data registration crawler in case of failures.
  • Added the ability to increase the timeout for initializing the catalog on cluster startup by setting the CATALOG_INIT_STARTUP_TIMEOUT configuration value.
  • Fixed an issue where some system tables were not dropped prior to creating them on startup, which can cause an issue on upgrades.
  • Fixed an issue where the audit logs would have incorrect values in case of an error during initialization of an incoming request.
  • Added the ability to specify a column list when executing ALTER VIEW, in the same manner as CREATE VIEW.
  • Improved error message when using non-absolute S3 bucket paths.
  • Improved error handling when parsing a view definition that Okera cannot parse for an external view.
  • Fixed an issue where service discovery would consider Kubernetes objects in a different namespace.
  • Fixed an issue where the system would generate unnecessary baseline queries, creating log noise.
  • Added the ability to specify a privilege level filter for the GetTables and GetDatabases APIs.
  • Fixed an issue in PyOkera when handling the CHAR type when there are null values in the data.
  • Fixed an issue where the ae_role column was not always populated for some role-related DDLs.
  • Improved the logging in the Okera REST Server.
  • Added the ability to configure the Planner and Worker RPC timeouts in Okera's Presto, using the OKERA_PRESTO_PLANNER_RPC_MS and OKERA_PRESTO_WORKER_RPC_MS configuration values respectively. The defaults are 300000ms and 1800000ms respectively.
  • Improved retry handling for retriable S3 errors (such as Server Busy, etc.).
  • Fixed a bug where database names were not escaped when created in the registration UI.
  • The 're-autotag' button on the datasets page now causes the new tags to be fetched upon completion.
  • The UI has several new icons.
  • Workspace now includes an execution timer for queries.
  • Improved the errors reported for bad schemas found during registration.
  • Fixed a bug where the UI allowed users to 'tag' partitioning columns, even though such tags had no effect.
  • All dataset views now show their view string.
  • "Queries by duration of planner request" is no longer part of the Reports page.

Notable and Incompatible Changes

  • Starting from 2.1.0, the published Okera client libraries for PrestoDB support PrestoDB versions 0.234.2 and above.
  • ZooKeeper has been removed as a system component - Okera will now leverage Kubernetes to maintain the worker membership list.
  • The default per-user okera_sandbox database has been removed.
  • When creating Okera views (i.e., internal/secure views), it is now required for the creator to have the ALL privilege on all referenced datasets. This is done to ensure that these tables cannot be incorrectly exposed by users with lesser permissions.
  • Removed the 4000-character limitation on column types.

    Note: This changes the underlying HMS schema, and if connected to a shared HMS, should be disabled by setting the HMS_REMOVE_LENGTH_RESTRICTION configuration value to false. This is only done for new HMS databases - if you have an existing one from a prior installation, contact Okera Support for migration procedures.

  • The default privilege required to view audit log records for objects has been changed from SELECT to VIEW_AUDIT. This means some users may no longer be able to see audit logs for their data (if they previously only had SELECT access to it) and will need to be granted VIEW_AUDIT on data they wish to view audit logs for.
  • ML and decision-tree-based autotagging is now enabled by default.
  • OKERA_REPORTING_TIME_RANGE can no longer be used to restrict the available time range in Okera reports.
  • In 2.1.x, many data correctness issues will now fail queries as opposed to silently ignoring them (e.g., converting data into NULL, etc.) as in previous versions. To revert the behavior, add --abort_on_error=false to RS_ARGS.

SQL Keywords

The following terms are now keywords, starting in 2.1.0:

  • CXNPROPERTIES
  • DATACONNECTION
  • DIAGNOSTICS
  • DO
  • EXCEPT
  • TIMESTAMP_NANOS
  • VIEW_AUDIT
  • VIEW_COMPLETE_METADATA

Known Issues

  • The Okera PrestoDB Connector shipped with this version is compatible with PrestoDB 0.233 and higher. This connector is currently not compatible with any released version of PrestoDB on EMR, as the version of PrestoDB shipped is older than 0.233. This will be fixed in a subsequent 2.1.x maintenance release.

2.0.2

Bug Fixes and Improvements

  • Fixed an issue where many concurrent CREATE TABLE or CREATE VIEW statements could be slowed down waiting on a shared resource.
  • Fixed an issue when authorizing queries on views with complex types.
  • Added an option to use the SYSTEM_TOKEN as the shared HMAC secret for signing and validating tasks (in the Planner and Worker services) rather than using ZooKeeper. This option can be enabled by setting SYSTEM_TOKEN_HMAC: true in the configuration file.
  • Fixed an issue where it was not possible to connect to a Postgres instance that did not have public in the default search_path.
  • Added the ability to specify whether the connection to the database should be done using SSL (this was typically auto-discovered, but in some cases the auto-discovery failed). This can be enabled by setting CATALOG_DB_SSL: true in the configuration file.
  • Fixed an issue where schema upgrades did not work for remote Postgres instances.
  • Fixed an issue where the Workspace UI would scroll beyond the window if there was a long error.

2.0.1

Bug Fixes and Improvements

  • Added the ability to edit dataset and column descriptions in the Okera UI.
  • Fixed an issue in which datasets could not be registered if they had columns with type definitions that exceeded 4,000 characters.
  • Added more control options for LDAP group resolution configuration:
    • GROUP_RESOLVER_LDAP_POSIX_GID_FIELD_NAME
    • GROUP_RESOLVER_LDAP_POSIX_UID_FIELD_NAME
    • GROUP_RESOLVER_LDAP_MEMBEROF_FIELD_NAME
  • Fixed an issue where Avro datasets that had a union type with a single child (e.g., union(int)) would throw an error. These types of unions are now fully supported.
  • Fixed an issue where decimals that were stored as a byte_array in Parquet files were not read correctly.
  • Added a configuration option to control the maximum number of allowed Sentry and HMS connections:
    • CATALOG_HMS_MAX_THREADS
    • CATALOG_SENTRY_MAX_THREADS
  • Fixed an issue in which changing the description of the view (or a column in it) via DDL was not supported.
  • Fixed an issue where columns that contained arrays or maps with embedded null values were not handled correctly in the Java-based clients.
  • Fixed an issue in PyOkera where it would incorrectly decode negative decimal values with a precision higher than 18.
  • Fixed an issue when --allow_nl_in_csv=True was set and the CSV file used a different quote character than " - it would improperly use the " to escape line breaks.
  • Improved the Crawler's ability to automatically use the OpenCSV SerDe when necessary.
  • Fixed issues for handling complex types that had several nested arrays/structs/maps with null values interspersed.
  • Fixed an issue where reserved keywords were not possible to be used (as escaping them wouldn't work) as attribute namespaces and attribute keys (e.g., myns.true).
  • Added the ability to use CREATE TABLE LIKE TEXTFILE, which will automatically deduce the schema from the CSV file (this assumes the headers are on the first line); see the sketch after this list.
  • Improved handling of non-parseable SQL statements when accessing a view that was created outside Okera (e.g., in Hive). This capability is enabled by an environment flag ALLOW_NONPARSEABLE_SQL_IN_VIEWS: true set in the configuration file for the cluster.
  • Fixed an issue where the same tag could appear twice in the UI.
  • Fixed an issue in which dropping an external table referencing a non-existent bucket would fail.
  • Fixed an issue where the crawler Data Registration page for a given crawler would display incorrect "Registered" tables if their path was a simple prefix of the crawler root path.
  • Added support for using a dedicated Postgres server (e.g., on RDS) as the backing metadata database.
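
As referenced in the CREATE TABLE LIKE TEXTFILE item above, a hedged sketch (the EXTERNAL keyword, table name, and path are assumptions modeled on the LIKE PARQUET form shown elsewhere in these notes):

CREATE EXTERNAL TABLE mydb.people LIKE TEXTFILE 's3://mybucket/people/people.csv'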

2.0.0

New Features

Bucketed Tables

ODAS now supports bucketed tables and applying efficient joins to them. You can find more details here.

AWS Glue

ODAS now supports using AWS Glue as the metastore storage, allowing you to connect ODAS to an existing Glue catalog. You can read more about this support and enabling it in the Glue Integration page.

Auto-Tagging Improvements

  • ODAS now employs an ML-based engine for some of the out of the box auto-tagging rules, such as address and phone number detection.

  • You can now create and manage the regular expression-based rules that are used by the auto-tagging engine in the UI. You can read more about this in the Tags page.

  • The number of datasets tagged with a tag is now shown in the UI.

  • ODAS can continuously auto-tag your existing catalog in the background. You can enable this by setting the ENABLE_CATALOG_MAINTENANCE setting in your configuration file.

  • ODAS will now auto-tag the data inside nested complex types and apply the discovered tag(s) at the root column-level.

Azure ADLS Gen2 Support

ODAS now supports ADLS Gen2 data storage for both querying and data crawling. You can register these data sources by specifying a path with either the abfs:// or abfss:// prefixes.

Web UI Updates

  • The ODAS Web UI has been revamped to be easier to use and update the look-and-feel.

  • A Roles page has been added, allowing you to fully manage roles (create/update/delete) and their group and permission assignments. You can read more about this on the Roles page.

  • The 'About' dialog has been replaced by a System page.

JDBC Data Sources

  • Redshift External Tables are now supported for JDBC data sources of type redshift.

ABAC Updates

  • There are now DDL statements to work with tags, namely:

    • DESCRIBE <table>, DESCRIBE FORMATTED <table>, DESCRIBE DATABASE <database> will now output tag assignments.
    • CREATE ATTRIBUTE <attr> and DROP ATTRIBUTE <attr> will create/remove attributes.

      Note: Namespaces will be automatically created if they don't already exist.

    • SHOW ATTRIBUTE will show the list of currently existing attributes.
    • ALTER TABLE and ALTER VIEW now have new operations of ADD ATTRIBUTE <attr>, REMOVE ATTRIBUTE <attr>, ADD COLUMN ATTRIBUTE <col> <attr> and REMOVE COLUMN ATTRIBUTE <col> <attr> to add/remove attributes at the table-/view- and column-levels respectively (see the sketch after this list).
    • ALTER DATABASE now has new operations of ADD ATTRIBUTE <attr>, REMOVE ATTRIBUTE <attr> to add/remove attributes at the database-level.
    • CREATE TABLE and CREATE VIEW can now take an optional set of attributes during table creation. For example:
      CREATE TABLE mydb.mytable (
          col1 int COMMENT "some comment1" ATTRIBUTE myns.myattr1,
          col2 int COMMENT "some comment2" ATTRIBUTE myns.myattr2,
          col3 int COMMENT "some comment3" ATTRIBUTE myns.myattr3
      )
      
  • Rule definitions now accept a "name" field. For backwards compatibility and convenience, the "name" is auto-generated if not specified.
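
As referenced in the ALTER TABLE item above, a sketch of the new attribute operations (the database, table, column, and attribute names are illustrative):

ALTER TABLE mydb.mytable ADD ATTRIBUTE myns.myattr1
ALTER TABLE mydb.mytable ADD COLUMN ATTRIBUTE col1 myns.myattr2
ALTER DATABASE mydb ADD ATTRIBUTE myns.myattr3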

Bug Fixes and Improvements

  • ODAS has updated Docker images that update many dependencies including the base OS, Python, OpenSSL and more.
  • Added a way to configure the structure of the data files the crawler will use while crawling. See Create a Crawler for more.
  • Added crawler search box on the data registration page.
  • Added additional validation for the crawler name and path when creating a new crawler.
  • There is now an ability to re-run the autotagging rules on an individual dataset within the Datasets page, by using the Re-autotag button.
  • Fixed an issue where datasets with complex types that had a MAP embedded in a STRUCT embedded in ARRAY would not be handled correctly.
  • Added the ability to revoke grants on objects that no longer exist.

Incompatible Changes

  • Previously by default users would only see reports for datasets they had ALL access to. Since many stewards may not have ALL access on the data, this has now been changed so they will see reports for all data they have SELECT access to. If necessary, this can be configured back to ALL by editing the view definition of okera_system.steward_audit_logs dataset.
  • Starting from 2.0.0, Okera only supports EMR versions greater than 5.11.0 up to 5.28.0.

    Note: Versions of EMR less than 5.10.0 continue to work but Okera recommends that you upgrade to a recent EMR version for latest ODAS compatibility.

  • The behavior of using REVOKE on permissions (e.g., REVOKE SELECT) has been changed to not cascade by default. For example, in 1.5.x and earlier versions, REVOKE SELECT ON TABLE mytable would also revoke any
  • Starting in 2.0.0, the published Okera client libraries for PrestoDB support PrestoDB versions greater than 0.225 and above. You can use published Okera client libraries from prior Okera versions (which will continue to work against an ODAS 2.0.x and higher cluster) to support earlier PrestoDB versions.
  • The Permissions page has been removed - all links to it (e.g., in bookmarks) will no longer work.
  • Private tags on datasets have been removed. Datasets can no longer be filtered by private tags.

SQL Keywords

The following terms are now keywords, starting in 2.0.0:

  • EXECUTE
  • INHERIT
  • TRANSFORM

Deprecation Notice

  • Starting in 2.0.0, we are deprecating the ocadm and odb CLI utilities. If you wish to continue using odb, the binary from 2.0.x and prior releases should continue to work against newer clusters. However, in future releases we will not ship new binaries of these utilities.