Skip to content

Okera Version 2.3 Release Notes

This topic provides Release Notes for all 2.3 versions of Okera.

2.3.11 (1/10/2022)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

2.3.10 (12/20/2021)

This release contains an additional upgrade of log4j to resolve additional instances of the Log4Shell vulnerability.

2.3.9 (12/13/2021)

This release contains an upgrade of log4j to resolve the Log4Shell vulnerability.

Bug Fixes and Improvements

  • Fixed an issue when dropping a partitioned table and using mixed casing for the database name.
  • Fixed an issue when dropping partitions using the Hive/Spark client library.

2.3.8

Bug Fixes and Improvements

  • ALTER TABLE ADD PARTITION no longer requires specifying the partition values if the location follows the standard path naming convention. Partitions can be added with ALTER TABLE <table> ADD PARTITION <location>.

2.3.7

Bug Fixes and Improvements

  • Fixed an issue when connecting from Databricks and using the Databricks-signed JWT could fail in some query submission modes.

2.3.6

Bug Fixes and Improvements

  • Fixed an issue when connecting from Databricks and using the Databricks-signed JWT could fail when a query was run multiple times.
  • Fixed an issue where partitioned symlink tables (e.g., Delta) would fail to plan if the number of partitions was high.

2.3.5

Bug Fixes and Improvements

  • Improved logging in the PrestoDB connector to properly log both the Presto query ID as well as the Okera task IDs when available.
  • Added the ability to set the default quote character (the default is ") for CSV files when using the built-in CSV SerDe. This can be set in the following ways:

    1. On the SERDEPROPERTIES when creating or altering a table (e.g., to disable quote handling by removing the quote character):

      SERDEPROPERTIES('quoteChar'='')
      

      e.g.

      ALTER TABLE mydb.mytable SET SERDEPROPERTIES('quoteChar'='')
      
    2. On the TBLPROPERTIES to set the default value (this can be overwritten with the above SERDEPROPERTIES):

      TBLPROPERTIES('okera.text-table.default-quote-char'='')
      
    3. Change it global default for the cluster by setting TEXT_TABLE_DEFAULT_QUOTE_CHAR to the desired value, e.g., '' to disable the quote character.

  • Fixed an issue with handling of CSV files when split across multiple tasks and running count(*).

  • Upgraded the packaged Snowflake JDBC driver to v3.2.17.

2.3.4

Bug Fixes and Improvements

  • Fixed an issue where, when using AWS Glue, loading a specific database in the UI would take a long time to load.
  • Improved handling of Amazon S3 connection errors (e.g., retries, service unavailable), including the ability to set new values via configuration.
  • Increased the default PrestoDB TaskUpdate limit.

2.3.3

Bug Fixes and Improvements

  • Fixed an issue where if a view and the underlying base table had mismatched types on a column, Okera would produce data that matched the underlying table type and not the view type, causing an issue for upstream engines (e.g., Presto). The new behavior is that an implicit cast will be added if possible, and if not, the query will be failed.
  • Fixed an issue in the PrestoDB and PrestoSQL client libraries, where if a column name was also a reserved keyword (e.g., database or metadata) AND the column was a complex type (e.g., STRUCT), the client library would produce an invalid planning request.
  • Fixed an issue in the transparent Snowflake access where it would use an external LB (if configured) rather than the cluster-local cerebro-worker service address.
  • Fixed an issue in the transparent Snowflake access where queries that used IF were not properly rewritten.
  • Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for TIMESTAMP columns (they would correctly retain microsecond and nanosecond values if present).

2.3.2

Bug Fixes and Improvements

  • Updated PostgreSQL driver to resolve a security vulnerability
  • Fixed an issue when querying tables that have columns with very large values (e.g., 100KB), where a simple query that references that column would fail due to exhausting the cluster memory. To resolve this, set RS_ARGS to include --batch_check=64 (or another relatively low number). See RS_ARGS Options. In 2.3.x, this value is set to -1 (i.e., no limit) by default, but in future Okera releases (2.4.x and above) it will be set to a low number by default.

2.3.1

Bug Fixes and Improvements

  • Added an option in Amazon EMR bootstrap to specify a custom image location using --local-worker-image.
  • Fixed an issue where Presto would report an error of Could not compute splits and not specify the underlying Okera error.
  • Improved Amazon S3 IO retry handling for improved latency when errors occur.
  • Fixed an issue in collocated Enforcement Fleet workers that would attempt to open a connection to the Okera Policy Engine (planner) unnecessarily.
  • Added the ability to specify DROP as a privilege for attribute namespaces, databases, tables, and views.
  • Added the ability to control the number of Okera tasks for a query in Presto using the okera.max_tasks Presto session property.

Notable and Incompatible Changes

Automatic Estimated Table Statistics

Okera will now automatically collect and store estimated table statistics. These can be optionally enabled (they are disabled by default) and leveraged by Hive, Spark and Presto for query planning and cost-based optimization.

To enable for Spark and Hive, edit hive-site.xml and add:

<property>
  <name>okera.hms.stats-mode</name>
  <value>HMS_OKERA</value>
</property>

To enable for Presto, you can do either of the following options:

  1. Edit the Okera connector's okera.properties and add okera.task.plan.enable-okera-stats=HMS_OKERA.
  2. Set the okera.stats_mode Presto session property to HMS_OKERA.

Note: These estimated statistics are complementary to the normal Hive metastore statistics, and there is no change in behavior if those statistics are currently being used (they take precedence if set over Okera's estimated statistics).

Okera JDBC Driver Update

Okera has added support for specifying TimeZoneID as a URL property when using Okera's Presto JDBC driver to connect via JDBC clients. For example, the connection property can be set as TimeZoneID:UTC. If this value is not specified, the driver would use the system's current time zone ID.

Valid values for this property are specified in the IANA Time zone Database. For a complete list of supported time zone ID, see https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

Default Docker Repository Changed to quay.io/okera

Okera has changed the Docker repository that images are pushed to from DockerHub to Quay, due to the impact of the newly enforced rate limits in DockerHub.

Okera's images are available with the quay.io/okera prefix (the image names have not changed).

2.3.0

New Features

Okera Collocated Compute (Amazon EMR)

You can now run Okera's scalable data plane collocated with your Amazon EMR cluster(s), allowing you to transparently (and with zero or marginal cost) scale your Okera secure compute capacity as you provision more Amazon EMR capacity (both by scaling a single cluster or having multiple independent clusters). For supported data sources and queries, the secure data access will happen on the Amazon EMR nodes, benefiting from network and compute locality, and allowing you to maintain a much smaller central Okera cluster, thus allowing you to dramatically reduce TCO.

Amazon EMR clusters running with Okera's collocated compute do not need to have direct Amazon S3 access (via IAM), as the collocated data plane gets temporary secure access to the data it needs, thereby reducing the surface area of data access and allowing you to maintain high security, while not sacrificing usability (such as prohibiting SSH access to Amazon EMR).

Note: Okera's collocated data plane is supported beyond Amazon EMR. To learn how to leverage it in either deployment environments, such as Kubernetes, contact Okera Support.

New UI Databases Page

Okera has a new catalog browsing and management experience, centered around databases and the datasets in them. Users can now create and manage Okera databases, as well as permissions and tags at the database level.

To search across all datasets, click on Search all datasets to leverage the new dataset search page.

Click here learn more about the new functionality.

Note: The Datasets page has been deprecated. Please use the new Databases tab (accessed by selecting Data on the Okera side menu) instead.

Transparent Snowflake Access (Beta)

Okera now supports improved access control on Snowflake data sources, pushing down full queries (including joins and aggregations) to Snowflake while enforcing the complete access policy as well as audit log entries.

Users, such as data analysts, can connect their favorite SQL tool (e.g., DBeaver, Tableau, Looker) via Okera’s ODBC/JDBC endpoint, and their queries will be automatically sent to Snowflake, after being authorized and audited by Okera (and if the user does not have permission to access the data they are trying to access, the query will be rejected). With this new capability, you get the benefit of Snowflake's native performance scale and Okera's complete policy and auditing capabilities.

In future releases, more data sources will be supported for transparent access integration as well.

Read more here.

Improved Databricks Integration

Okera has an improved integration with Databricks, enforcing full fidelity policies while maintaining complete compatibility with Spark and Databricks, including Databricks Delta Lake. The new integration is transparent in its execution, and allows Databricks Spark to fully control the data access, thus retaining its performance and functionality.

This new functionality is on by default, and you can read more about how to easily integrate a Databricks cluster (or clusters) with Okera here.

PrestoSQL Support

Okera now supports PrestoSQL (both the open-source and Starburst variants) in addition to PrestoDB. This allows you to connect your existing PrestoSQL clusters to Okera, benefiting from Okera's unified catalog, access control and auditing capabilities.

Note: PrestoSQL 338 is supported.

Amazon EMR 6.1 Support

Amazon EMR 6.1 is now supported, allowing you to leverage the latest functionality on Amazon EMR, such as Spark 3, Hive 3 and PrestoSQL.

You can read more about integrating with Amazon EMR 6.1 here.

Note: Integration with Amazon EMR 6.1 clusters is only supported with Okera clusters 2.3.0 and higher.

Bug Fixes and Improvements

  • Fixed a UI bug where updating a permission without any changes caused an error and would remove the permission.
  • Added a clear error message when a user that does not permission to create an attribute namespace tries to create one in the UI.
  • Fixed an issue where a LEFT OUTER JOIN would cause an error when querying two unnested columns.
  • Fixed an issue where in some cases, a user that was granted WITH GRANT OPTION could grant a higher access level on that object.
  • Okera UDFs that are used by external systems (such as Spark) are now registered in the okera_udfs databases.
  • Ensure that the automatic Presto tuning generates default task counts which are a power of 2 (as required by Presto).
  • Added a request ID to the audit logs for Presto and Spark queries, allowing to link together all the audit log entries for a single query.
  • Added the ability to specify a specific password to use for the Presto connection when using PyOkera, to allow for connecting to non-token enabled Presto clusters.
  • Improved autotuning that automatically detects cluster resizing for the Okera client libraries for Presto, Hive and Spark.
  • Fixed an issue in PyOkera that did not properly take custom user claims into account when using a token_func when a token expired.
  • Improved handling of spaces and periods when put in databases, tables, and columns.
  • Fix an issue when running count(*) on JSON data when multiple splits are generated.
  • Added support for setting database description via DDL:

    ALTER DATABASE <db_name> SET COMMENT '<database comment>'
    
  • Fix an issue for partitioned Delta tables.

  • Improved handling in CREATE TABLE ... LIKE PARQUET for partitioned tables:
    • A data file will automatically be found inside one of the partitions without needing to be manually specified.
    • The partition scheme can be auto-inferred from the on-storage structure (in a similar manner as data registration crawlers), without needing to explicitly be set.
  • Reject all unparseable view statements when creating or altering the view definition and improve error handling if an unparseable view is already present in the catalog.
  • In PyOkera, scan_as_json and scan_as_pandas now take an optional presto_headers dict keyword argument for custom headers to use when making the Presto request.
  • Improved metadata fetching performance when executing Presto queries, especially ones that reference many catalog objects.
  • Don't automatically populate large table statistics for Spark and Hive if no real statistics are present. The prior behavior can be enabled by setting the okera.hms.auto-populate-stats Hive configuration property to true.
  • Increase the default timeout when creating an Okera connection in the client libraries to 30 seconds (prior value was 10 seconds).
  • Fixed an issue where user attributes were not read correctly if the source system (e.g., LDAP) had them in non-lowercase.
  • Fixed an issue in okctl that did not properly handle validation of parameters that supported multiple path values (e.g., `JWT_PUBLIC_KEY: s3://path1,s3://path2).
  • Added the ability to control the timeout for the Kubernetes liveness and readiness probes by setting the OKERA_HEALTHCHECK_TIMEOUT_MS configuration value.
  • Fixed an issue for feature flag toggling for non-catalog administrators.
  • Improved role conflict detection for grants on differing scopes that don't overlap in their ABAC conditions.
  • Improved handling for ALTER DATABASE ... LOAD DEFINITIONS OVERWRITE to not remove tag assignments (at either the table or column level) if they are already present.
  • HAVING ATTRIBUTE conditions will now be considered for grants that also contain WHERE filters. The prior behavior can be enabled by setting the IGNORE_HAVING_EXPR_ON_FILTER to true.

Notable and Incompatible Changes

Oracle NUMBER Type

In 2.3.0 and higher, the NUMBER type in an Oracle table will be represented as a DECIMAL(38,6) in Okera.

Credential Files for JDBC-Backed Data Sources

In 2.3.0 and higher, when creating a JDBC-backed data source using a credentials file, the creating user must have permissions on that URI (expressed as a URI grant).

For example, if your credential file is located at s3://mycompany/config/redshift.properties, and you tried executing the following command:

CREATE DATABASE IF NOT EXISTS my_redshift_db DBPROPERTIES(
  'credentials.file' = 's3://mycompany/config/redshift.properties',
  ...
);

This will error if you do not have a URI grant that gives you access to s3://mycompany/config/redshift.properties.

You can create such a grant with:

GRANT ALL ON URI s3://mycompany/config TO ROLE <some role>

Note: You can also grant access to the entire bucket (or any prefix-level you desire).