Okera Version 2.2 Release Notes

This topic provides Release Notes for all 2.2 versions of Okera.

2.2.2

Bug Fixes and Improvements

  • Fixed an issue in Hive and Spark client libraries when generating planning SQL that contained DATE types.
  • Fixed an issue in scanning partitioned Delta tables.
  • Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for TIMESTAMP columns (they would correctly retain microsecond and nanosecond values if present).
  • Fixed an issue in the PrestoDB client library, where if a column name was also a reserved keyword (e.g., database or metadata) AND the column was a complex type (e.g., STRUCT), the client library would produce an invalid planning request.

Notable and Incompatible Changes

Default Docker Repository Changed to quay.io/okera

Okera now pushes its Docker images to Quay instead of DockerHub, due to the impact of the newly enforced rate limits in DockerHub.

Okera's images are available with the quay.io/okera prefix (the image names have not changed).

2.2.1

Bug Fixes and Improvements

  • Fixed an issue in PrestoDB split computation in very large clusters.
  • Removed the restriction on column comments by default (limit was 256 characters).

    Note: This changes the underlying HMS schema. If Okera is connected to a shared HMS, disable this change by setting the HMS_REMOVE_LENGTH_RESTRICTION configuration value to false.

  • Improved resilience in handling crawling errors.
  • Fixed an issue with WITH GRANT OPTION on non-ALL privileges.
  • Restricted querying of datasets with nested types when policies apply to tags assigned to the nested type.
  • Fixed an issue when paginating in the Datasets page.
  • Fixed an issue for the /api/get-token endpoint.

2.2.0

New Features

JDBC Data Sources

Custom JDBC Driver Support

Okera has added support for specifying custom JDBC data sources beyond those that ship out of the box. If you would like to connect to a custom JDBC data source, please work with Okera Support to define the JDBC connection information appropriately for your driver.

Secure Values for JDBC Properties

Okera has added support for referring to secret values in the JDBC properties file from local secret sources such as Kubernetes secrets, as well as secure Cloud services such as AWS Secrets Manager and AWS SSM Parameter Store.

For example:

driver=mysql
type=mysql
host=...
port=3306
user=awsps:///mysql/username
password=awsps:///mysql/password

This looks up the values for /mysql/username and /mysql/password in AWS SSM Parameter Store. You can similarly use file:// for local files (using Kubernetes mounted secrets) or awssm:// to use AWS Secrets Manager.

Note: If you use AWS SSM Parameter Store or AWS Secrets Manager, you will need to provide the correct IAM credentials to access these values.
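
The reference format above can be illustrated with a short Python sketch that splits a property value into its scheme and path. The helper name classify_secret_ref and the returned labels are hypothetical and exist only for illustration; this is not Okera's resolver, and it does not contact AWS or read any file.

```python
from urllib.parse import urlparse

# Schemes named in these release notes, mapped to a readable source label.
SECRET_SCHEMES = {
    "awsps": "AWS SSM Parameter Store",
    "awssm": "AWS Secrets Manager",
    "file": "local file",
}


def classify_secret_ref(value):
    """Return (source, path) for a secret reference, or (None, value) for a literal.

    Hypothetical helper: it only inspects the string and resolves nothing.
    """
    parsed = urlparse(value)
    source = SECRET_SCHEMES.get(parsed.scheme)
    if source is None:
        # Not one of the known schemes, so treat the value as a plain literal.
        return None, value
    return source, parsed.path


user_source, user_path = classify_secret_ref("awsps:///mysql/username")
port_source, port_value = classify_secret_ref("3306")
```

With the example properties file, the user value is classified as a Parameter Store reference with path /mysql/username, while port=3306 is passed through as a literal.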

Predicate Pushdown Enabled by Default for JDBC-Backed Data Sources

Starting in 2.2.0, predicate pushdown for JDBC-backed data sources is enabled by default (this was previously available as an opt-in property on a per-data source level) and will be used whenever appropriate.

To disable predicate pushdown for a particular JDBC-backed database or table, you can specify 'jdbc.predicates.pushdown.enabled' = 'false' in the DBPROPERTIES or TBLPROPERTIES (you can read more here).
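
As a rough sketch, the property could be applied to an existing table with the same ALTER TABLE ... SET TBLPROPERTIES form these notes use for numRows. The helper set_tblproperties_sql and the table name mydb.mytable are hypothetical, and whether your deployment accepts this property through ALTER TABLE rather than at table creation is an assumption to verify.

```python
def set_tblproperties_sql(table, properties):
    """Build an ALTER TABLE ... SET TBLPROPERTIES statement from a dict.

    Hypothetical helper; it only formats a string and performs no escaping,
    so use it with trusted property names and values.
    """
    pairs = ", ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES({pairs})"


stmt = set_tblproperties_sql(
    "mydb.mytable",  # hypothetical table name
    {"jdbc.predicates.pushdown.enabled": "false"},
)
print(stmt)
```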

BLOB/CLOB Data Type Support

Okera now supports BLOB and CLOB data types for Oracle JDBC Data Sources.

Autotagging for JDBC-Backed Data Sources

When you register JDBC-backed data sources and load their tables, Okera now runs its autotagger by default.

You can disable this behavior by specifying okera.autotagger.skip=true in your DBPROPERTIES.

UI Improvements for Tabular Results

The UI now makes it easy to copy or download results as CSV from tables. This can be used in the Workspace and when previewing a dataset.

Operability Improvements

  • Okera will now generate correlated IDs for the Policy Engine (planner) and Enforcement Fleet worker tasks to make it easier to correlate the task information in the logs. For example, the Policy Engine may have a task of the form 9b45f8b08c76352e:85a51f5579300000, and if N worker tasks were generated, they would be of the form 9b45f8b08c76352e:85a51f5579300001, 9b45f8b08c76352e:85a51f5579300002, and so on.

  • System administrators can now easily access the Policy Engine (planner) and Enforcement Fleet worker debug UIs from the System page in the Okera UI.

  • System administrators can now see how many unique users have accessed data via Okera in the System page in the Okera UI, both all-time and in the last 30 days.
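
The correlated task IDs described above lend themselves to simple log grouping. The sketch below assumes, based only on the example IDs in these notes, that tasks belonging to the same request share the portion before the colon; the function names are hypothetical.

```python
from collections import defaultdict


def request_prefix(task_id):
    """Return the part of a task ID before the colon (assumed shared per request)."""
    prefix, _, _ = task_id.partition(":")
    return prefix


def group_by_request(task_ids):
    """Group task IDs that share the same prefix, preserving input order."""
    groups = defaultdict(list)
    for task_id in task_ids:
        groups[request_prefix(task_id)].append(task_id)
    return dict(groups)


ids = [
    "9b45f8b08c76352e:85a51f5579300000",  # Policy Engine (planner) task
    "9b45f8b08c76352e:85a51f5579300001",  # worker task
    "9b45f8b08c76352e:85a51f5579300002",  # worker task
]
groups = group_by_request(ids)
```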

Domino Data Lab Integration

When run in Domino Data Lab environments (starting in Domino Data Lab version 4.3.0), PyOkera now has built-in integration that uses the JWT tokens automatically generated in the Domino Data Lab environment, enabling transparent authentication between Domino Data Lab environments and Okera deployments.

import os
from okera.integration import domino

ctx = domino.context()
with ctx.connect(host=os.environ['OKERA_HOST'], port=int(os.environ['OKERA_PORT'])) as conn:
    df = conn.scan_as_pandas('drug_xyz.trial_july2020')

PrestoDB Improvements

  • Several internal improvements were made to Okera's PrestoDB connector to increase performance in areas such as data deserialization, asynchronous processing, improved memory allocation, etc.
  • Several improvements were made to auto-tune Okera's built-in PrestoDB cluster to better match its environment's capabilities.
  • When filtering on columns of DATE type, the PrestoDB connector will now push those filters down into the Okera workers.
  • Okera's PrestoDB connector has added support for table statistics if these are set on the table in the Okera catalog. These can be set by setting the numRows table property, e.g.:

    ALTER TABLE mydb.mytable SET TBLPROPERTIES('numRows'='12345')

These table statistics will be considered by Presto's cost-based optimizer (e.g., for JOIN reordering).

User Attributes

Okera added the user_attribute(<attribute>) built-in function, which retrieves attribute values on a per-user basis. These can be used in policy definitions, e.g., to apply dynamic per-user filters.

These attributes can be fetched from AD/LDAP by setting the LDAP_USER_ATTRIBUTES configuration value to a comma-separated list of attributes to fetch from AD/LDAP, e.g.:

LDAP_USER_ATTRIBUTES: region,manager,businessUnit

If the attribute is missing for the user running the query, the value returned will be null.
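
To make the null behavior concrete, here is a small Python simulation of a per-user filter. It is not Okera code: the USER_ATTRIBUTES dictionary stands in for values fetched from AD/LDAP, the user names are made up, and the explicit user argument exists only because this sketch has no session context.

```python
# Hypothetical in-memory stand-in for attributes fetched from AD/LDAP.
USER_ATTRIBUTES = {
    "alice": {"region": "emea", "manager": "bob"},
    "carol": {"manager": "bob"},  # no "region" attribute
}


def user_attribute(user, attribute):
    """Mimic the documented behavior: return None when the attribute is missing."""
    return USER_ATTRIBUTES.get(user, {}).get(attribute)


def visible_rows(user, rows):
    """Keep only rows whose region equals the user's region attribute."""
    region = user_attribute(user, "region")
    # A missing attribute (None) matches no rows, as a SQL comparison with NULL would.
    return [row for row in rows if region is not None and row["region"] == region]


rows = [{"id": 1, "region": "emea"}, {"id": 2, "region": "apac"}]
alice_rows = visible_rows("alice", rows)
carol_rows = visible_rows("carol", rows)
```

In this simulation a user without the attribute sees no rows, which is the conservative outcome to expect when a filter compares a column against a null value.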

Hudi and Delta Lake Support (Experimental Feature)

Okera has added experimental support for Delta Lake and Apache Hudi tables.

You can create Apache Hudi tables using the CREATE EXTERNAL TABLE DDL. For example:

CREATE EXTERNAL TABLE mydb.my_hudi_tbl
LIKE PARQUET 's3://path/to/dataset/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet'
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset';

You can create Delta Lake tables using the CREATE EXTERNAL TABLE DDL. For example:

CREATE EXTERNAL TABLE mydb.my_delta_tbl (id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset/';

The following limitations should be kept in mind:

  • In both cases, tables need to be explicitly registered, as crawling will not properly identify these tables as Hudi or Delta Lake.
  • For Apache Hudi, Okera only supports Snapshot Queries on Copy-on-Write tables and Read Optimized Queries on Merge-on-Read tables.
  • Okera supports Delta tables as symlink tables, so you must define them as symlink tables and ensure that their manifest is generated properly.

Privacy Functions

Okera has added several privacy functions typically used in health- and medical-related environments:

  • phi_zip3
  • phi_age
  • phi_date
  • phi_dob

These are compliant with the HIPAA safe-harbor standard.
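
For orientation, the Safe Harbor rules that such functions typically follow can be sketched in Python. This is an illustration of the general HIPAA guidance (three-digit ZIP truncation with low-population prefixes replaced by 000, and ages over 89 collapsed into one category), not Okera's implementation; the function names, signatures, and return formats are assumptions, and the restricted prefix used in the example is a placeholder rather than real census data.

```python
def zip3(zip_code, restricted_prefixes):
    """Keep the first three ZIP digits, or '000' for low-population prefixes.

    restricted_prefixes must be supplied by the caller from current census data.
    """
    prefix = zip_code.strip()[:3]
    return "000" if prefix in restricted_prefixes else prefix


def age_bucket(age):
    """Report ages over 89 as a single '90+' category; otherwise return the age."""
    return "90+" if age > 89 else str(age)


restricted = {"123"}  # placeholder value, not real data
masked_zip = zip3("94105", restricted)
masked_restricted_zip = zip3("12345", restricted)
masked_age = age_bucket(93)
```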

Nested Field Tagging (Preview Feature)

Okera has added the capability (disabled by default) to tag nested types (specifically, ARRAY and STRUCT types), and to have those tags inherited by views that unnest the nested portion. In addition, Okera has added the ability to fully unnest a table and inherit the tags on that object into the view, using the SELECT ** operator.

To enable this feature, set the FEATURE_UI_TAGS_COMPLEX_TYPES and ENABLE_COMPLEX_TYPE_TAGS configuration parameters to true.

Note: ABAC policies that apply to tags assigned to nested types will not be enforced on the base table, so take care to only give access to unnested views in these cases.

For more information, see Nested Field Tags.

Bug Fixes and Improvements

  • Fixed an issue in the Okera Presto connector where some queries against information_schema could cause an exception and fail.
  • Fixed an issue in the Tags management UI where the number of tagged datasets could be incorrect.
  • Improved handling of not displaying internal databases in the UI.
  • Added the ability to run GRANT for ADD_ATTRIBUTE and REMOVE_ATTRIBUTE at the CATALOG scope.
  • Removed the need to have ALTER permissions on tables and databases to run add/remove attributes (ADD_ATTRIBUTE and REMOVE_ATTRIBUTE are now sufficient).
  • Added an implementation of listTableNamesByFilter in the HMS connector.
  • Added the ability to configure timeouts for internal checks to account for large network latency. This can be set using OKERA_PINGER_TIMEOUT_SEC.
  • Added support for implicit upcasting of Parquet columns of type INT32 to be represented by BIGINT in the table schema.
  • Improved the experience when previewing JDBC-backed datasets by limiting the amount of data fetched.
  • Added DELETE, UPDATE and INSERT as grantable privileges.
  • Fixed an issue in okctl where it would not report an error and abort if there was an error updating the ports.
  • Improved handling of small files for Parquet-backed datasets.
  • Fixed an issue where the Autotagger would not correctly handle columns with DATE type.
  • Improved handling for JDBC-backed tables where the table name contained . characters.
  • When running a worker load balancer (default for EKS and AKS environments), the built-in Presto cluster will by default use the internal cluster-local load balancer and not the external one.
  • Fixed an issue with pagination on the Datasets page where paging to the end of the list and back showed an inaccurate count.
  • Improved diagnostic information available when registering a JDBC-backed table that has unsupported types or invalid characters.
  • Improved filter push down for Oracle tables for columns of DATE and TIMESTAMP type.
  • Improved handling of DECIMAL, NCHAR and FLOAT datatypes for JDBC-backed data sources.
  • Improved inference of BIGINT values in text values (e.g., CSV).
  • Fixed an issue where workers were not generating SCAN_END audit events.
  • Fixed an issue where table/view lineage information could be duplicated.
  • Upgraded Gravity to 6.1.39.
  • Removed the hardcoded fetch_size in PyOkera and added the ability to explicitly set it using the fetch_size keyword argument to exec_task.
  • Fixed an issue where pagination on the Datasets UI could get into an inconsistent state when filtering by tags.
  • When using tokenize, referential integrity will now also be maintained for INTEGER columns.
  • Added IF NOT EXISTS and IF EXISTS modifiers to GRANT and REVOKE DDLs, respectively.
  • Fixed an issue when doing writes in EMR Spark when metadata bypass was enabled for non-partitioned tables.
  • Added limited support for Avro files with recursive schemas, which will allow a maximum cycle depth of 2.

Notable and Incompatible Changes

Upgrading From 2.1.x

When upgrading from an Okera 2.1.x version lower than 2.1.10, some functionality may stop working in the 2.1.x deployment if you run it side-by-side with 2.2.0 or downgrade back to 2.1.x. If preserving the 2.1.x behavior is desirable, please upgrade to 2.1.10 first or work with Okera Support.

Container User Changed From root

Starting in 2.2.0, the process user inside all the Okera containers (running as Kubernetes pods) is no longer root; processes now run under dedicated users.

As part of this change, any files that are downloaded into the container (e.g., from S3 for custom certificates) are now placed in /etc/okera and not /etc.

SQL Keywords

The following terms are now keywords, starting in 2.2.0:

  • DELETE
  • HUDIPARQUET
  • INPUTFORMAT
  • OUTPUTFORMAT
  • UPDATE