Release Notes

2.1.9

Bug Fixes and Improvements

  • Fixed an issue where a user could create external views in any database using Presto's CREATE VIEW DDL, even though they may not have the appropriate grant on that database.

2.1.8

Bug Fixes and Improvements

  • Fixed an issue where schema inference (used in Data Registration and CREATE TABLE LIKE FILE) for JSON-based tables would incorrectly remove leading underscores and double underscores from column names.

2.1.7

Bug Fixes and Improvements

  • Added the ability to specify additional Presto configuration values using the PRESTO_ARGS configuration value, e.g. PRESTO_ARGS: "task.concurrency=16 task.http-response-threads=100". Using this capability should be done in coordination with Okera Support.
  • Fixed an issue where the REST Server pod would not restart quickly enough if a failure happened on startup.
  • Fixed an issue where the error dialog in the Data Registration page could be uncloseable.
  • Improved Presto behavior on creating and closing connections to the ODAS workers.
  • Changed the default Presto maximum stage count to 400.
  • Fixed an issue where an ABAC policy that included row filters would generate a WHERE clause that was missing the surrounding parentheses.
  • Fixed an issue where newer Parquet files that included the INT_32 and INT_64 logical types would cause a Parquet read error.

2.1.6

Bug Fixes and Improvements

  • Fixed an issue in Data Registration UI that made pagination behave erratically when using Glue as the backing metastore.
  • Fixed an issue in Data Registration UI where auto-discovered tags would not show up if the column was not editable.
  • Fixed an issue where the in-memory group cache would be overridden with empty groups.
  • Fixed an issue where CSV files that had empty strings would not be automatically converted to NULL values.

2.1.5

Bug Fixes and Improvements

  • Added the ability to increase the REST and UI timeouts to arbitrary values (previously limited to 60 seconds).
  • Removed a restriction that prevented WHERE clauses from being used in queries that unnest nested types.

2.1.4

Bug Fixes and Improvements

  • The HMS length restriction removal will now run on startup for all clusters (unless disabled), not just upgraded clusters.
  • Fixed an issue where keywords were not always escaped in ABAC transforms and filters.
  • Fixed an issue in the UI where the privacy function dropdown in the Visual Policy Builder had the wrong default.
  • Fixed an issue where ODAS errors were not propagating to Presto when creating an external view from Presto.

2.1.3

Bug Fixes and Improvements

  • Updated Presidio so that it no longer requires network connectivity in any case.
  • Fixed an issue where the Datasets UI would render table headers over some dropdowns.
  • Improved the performance of the Datasets page when loading individual datasets.

2.1.2

Bug Fixes and Improvements

  • Fixed an issue where creating a crawler with single-file datasets caused the registered datasets to use the directory path instead of the file path.
  • Fixed an issue where editing policies in the Policy Builder could in some cases cause an error on saving the edited policy.
  • Fixed an issue where restricted keywords used in the Policy Builder were not escaped properly in some cases.
  • Fixed an issue where using MySQL as the backing database could cause some data types to be converted incorrectly via JDBC in some cases, causing exceptions.

2.1.1

Bug Fixes and Improvements

  • Several improvements to handling of S3 errors and failure conditions for very large files.
  • Fixed an issue where in some cases (typically large) Parquet files would cause an error when being queried.
  • Fixed an issue in the Databricks connector where a table would be missing the SerDe path parameter when the table was not cluster local.
  • Fixed an issue in policies where if you had two ABAC policies, one which included a transform and one which did not, they would not compose correctly (this resulted in giving less access than desired in all cases).
  • Fixed an issue when upgrading from 1.5.x where the DB schema upgrade could fail under certain conditions.
  • Fixed an issue in the Presto connector where if a JDBC client issued a query against INFORMATION_SCHEMA with underscores, Presto would error out.

2.1.0

New Features

Extending attribute-based access control policies to support data transformation functions and row filtering

Attribute-based access control policies now support data transformation functions and row filtering. This is supported with an extension to the current ABAC grant syntax. Read more here.

This can significantly simplify how policies can be managed, reduce or eliminate the need to create views and make it much easier to manage complex policies. You can easily create these policies in the UI by specifying ABAC access conditions in the policy builder. See examples of the different policies you can create using Okera's Policy Engine here.
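As an illustrative sketch only (the exact grammar is described in the linked documentation; the database, role, attribute, and function names below are hypothetical), such a policy could combine a grant with a transform and a row filter along these lines:

GRANT SELECT ON DATABASE sales_db
HAVING ATTRIBUTE IN (pii.email)
TRANSFORM pii.email WITH mask()
WHERE region = 'US'
TO ROLE analyst_role;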

Tag Cascading and Inheritance

Attributes assigned on tables/views and their columns will automatically cascade to all descendant views. Read more about this capability here.

View Lineage

Okera will now maintain the lineage of datasets created after 2.1. For a given dataset (table or view), it is now possible to know all the views that descend from it, and for a view, all of its ancestors. This information is also exposed in the UI. Read more here.

Improved Privacy Functions

Okera has a revamped set of privacy-related functions to aid in anonymization with different guarantees. Read more about Okera's privacy and security functions here.

Users page and inactivity report

The web UI now offers a Users page, where all the users that have authenticated to the system can be viewed, along with the groups they belonged to as of the last time they made a request through Okera. This makes it easier to understand whether a user should have access to something.

The Users page also lets you generate a user inactivity report, which shows all the users who have any level of access on a database but have not queried that database within a certain time period. This report helps identify users who may no longer need access to data, since they are not utilizing it, thereby improving adherence to least privilege.

Enable access to the users page in the UI by granting a user or group access to the okera_access_review_role.

GRANT ROLE okera_access_review_role TO GROUP marketing_steward_group;

Read more about this capability here.

Access control for attribute namespaces

You can now control access to the management of tags by namespace. ATTRIBUTE NAMESPACE has been added as a new object type, and the CREATE, ADD ATTRIBUTE, and ALL access levels are supported on it. For example, if you wanted to give a role the ability to create, drop, and assign attributes from a particular attribute namespace, you would use the following:

GRANT ALL on ATTRIBUTE NAMESPACE marketing TO ROLE marketing_steward;

In addition, if you wish to grant access to the Tags page in the UI so that a user can create and manage tags there, grant okera_tags_role to that user's group. Note that in order to assign attributes on data, the user will still need the appropriate privileges on the data they are assigning attributes to. See the docs for more details.
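For example, to let a (hypothetical) data steward group manage tags in the UI:

GRANT ROLE okera_tags_role TO GROUP data_steward_group;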

Other tag management updates

  • Only editable tags (ones a user has CREATE or ALL on) show up on the Tags page.
  • Adding/removing tags from a dataset will ignore tags a user does not have privileges on.
  • The Tags page now requires SELECT access on okera_system.ui_tags. The built-in okera_tags_role has this privilege by default.

VIEW_AUDIT privilege to control access to audit logs

You can now grant the VIEW_AUDIT privilege on data to enable a user to view audit log information for that object. For example, if a user only had VIEW_AUDIT on two databases, they would only see reports for those two databases in the UI or when querying the okera_system.reporting_audit_logs view. Note that to see the Reports page in the UI, the user also needs the okera_reports_role; see the docs for more details.
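For example, to let a (hypothetical) auditor role view audit logs for a single database:

GRANT VIEW_AUDIT ON DATABASE sales_db TO ROLE auditor_role;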

Note

The default privilege required to view audit log records for objects has been changed from SELECT to VIEW_AUDIT.

Presto SQL in workspace

The workspace now features Presto SQL mode, which allows executing queries against an Okera cluster using Presto. See the docs for more details.

Creating Views using Presto

It is now possible to create and delete external views via Presto directly. These views will be stored in the Okera catalog (as external views) and be accessible via Presto.

To do this, execute a DDL like this in Presto (e.g. via the Okera Workspace or an application such as SQL Workbench or DBeaver):

CREATE VIEW some_db.some_view AS SELECT ....

In order to support this, Okera has added extensions to the CREATE VIEW DDL statement when executed in Okera:

CREATE EXTERNAL VIEW <db>.<view> (
    <col name> <col type>,
    ...
) SKIP_ANALYSIS USING VIEW DATA AS 'SELECT ...'

This DDL requires the user to specify the full set of columns that the view statement produces (including types), as the view statement is not parsed or analyzed.
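For example, a complete statement (with hypothetical database, view, and column names) might look like:

CREATE EXTERNAL VIEW reporting.active_users (
    user_id BIGINT,
    user_name STRING
) SKIP_ANALYSIS USING VIEW DATA AS
'SELECT user_id, user_name FROM reporting.users WHERE active = true'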

Improved JSON file format support

  • Starting from 2.1.0, Okera uses simdjson to read JSON file format data.
  • Several improvements to the auto-inference of JSON file formats, with support for appropriate data types. This has been tested extensively against a variety of JSON files, both auto-generated and collected from several internet sources.

Oracle data source support

Oracle is now supported as a JDBC data source. The Oracle JDBC driver will need to be configured as a custom driver. Read more on how to configure this here.

More metadata available on dataset details

There are a number of improvements to the dataset details view in the UI:

  • Much more detailed technical metadata is included
  • It is now possible to edit the description of a dataset
  • It is now possible to edit column comments in the dataset schema
  • View parent/child lineage information is available for views created in Okera
  • Column-level tags are included in the details view along with table-level tags
  • The dataset schema can be filtered by column-level tags

Ability to create views from the UI

Admin users can now create an internal view based on an existing view or table from the datasets page. Choosing the destination database and view name and selecting the columns to be included in the new view are supported. For more see the Datasets page documentation.

Permission management improvements

  • A Permissions tab has been added to all details cards on the Datasets page. Like on the Roles page, you'll be able to fully manage permissions associated with the specified dataset. You can read more about this on the Datasets page.
  • Data transforms and row filtering added to Policy Builder UI.
  • Ability to edit existing policies in the UI. To learn more about editing and managing policies, go to Editing Permissions.
  • An admin user can now create a view from a dataset

Reports page improvements

The Reports page has a number of major improvements, including:

  • New reports for Activity overview, Active users over time, Top accessed tags, and Recent queries.
  • SQL used to generate the reports is available in-page and can be run in Workspace.
  • Custom time ranges are available within the last 90 days.
  • Reports queries use human-readable times instead of unix timestamps.
  • Reports can now be filtered by dataset and tag as well as database.
  • Reports filters now allow for multi-selection.
  • Visual updates.

For more details, see the Reports page documentation.

UI Visual and interaction updates

  • There are small visual updates and improvements throughout the UI, focused on clarity and better use of screen real estate.
  • The output area of the workspace has been reworked to better preserve a user's context and show history.

Updates to reporting and audit views

  • A new audit table and view have been added to the okera_system database: analytics_audit_logs and reporting_ui_analytics. These are populated by the cdas-rest-server container and are primarily used to track and analyze usage of the UI. For now, the UI only writes to them on page visits. The data is stored in the same logging directory as the regular audit logs, in its own subfolder.

  • The view used by reports, okera_system.reporting_audit_logs, now includes start_time_utc and end_time_utc columns of type TIMESTAMP_NANOS for better readability.

Improved REST server diagnostics logging

  • Logs now include timestamp and log level.
  • Log level can be set via REST_SERVER_LOG_LEVEL. DEBUG, INFO, WARNING, ERROR, and CRITICAL are all valid.

Bug Fixes and Improvements

  • When a table is renamed, the attributes from the old table are now carried over to the renamed table.
  • Performance improvements to parallelize queries containing UNION ALL. Such queries now leverage multiple Okera tasks across workers, instead of the single task used prior to this fix.
  • Performance improvements when dropping tables with a large number of partitions.
  • Performance improvements for DROP DATABASE CASCADE when dropping all tables under the database.
  • For JDBC data sources, large numeric/decimal types (precision greater than 38) are now handled. Precision is capped at 38 for larger numerics/decimals, or when precision/scale are unspecified in the source. If the scale is unspecified, it defaults to 6 as of Okera 2.1.0.
  • Fixed handling of negative decimals in JDBC data scans. Large-scale decimals are rounded using HALF_DOWN.
  • For the CREATE VIEW command, if a database is not specified, the default database is used for the view.
  • Fixed parse errors on views with JDBC tables in the view definition (joins between JDBC and non-JDBC tables).
  • Arrays of arrays are now supported in Okera.
  • log4j2 support: Okera now uses log4j2 as the default for logging. A backwards-compatibility bridge, as recommended by the Apache project, is used for libraries that still use log4j, such as certain Hadoop/Hive libraries.
  • Support for LIMIT on JDBC data sources. This improves the preview of data from JDBC data sources, where data is limited to 100 rows by default in the Okera WebUI.
  • Better error handling for JDBC data source auto-inference errors on unsupported data types. More info here.
  • Fix for a regression on authorization on CTEs (WITH Clause) with aggregations in the query.
  • For views involving the Avro file format that have column definitions (such as complex structs) exceeding 4,000 characters, the schema from the Avro file is now used instead of creating the physical columns in the database. Note that DESCRIBE FORMATTED for such tables/views still does not show the column details; DESCRIBE <table/view> shows the correct definitions.
  • Several bug fixes to handle Parquet file format issues gracefully. For example, Parquet files with an unsupported DataPageHeaderV2 would crash the workers; these are now handled with a graceful error message.
  • Reduced the log level of the Sentry/Hive pinger from error to warn. This improves error diagnostics for real catalog exceptions; previously, the logs were flooded with spurious errors.
  • Fixed a bug so that count(*) on a JDBC view returns results instead of failing.
  • Added the ability to specify a Glue AWS region separate from the cluster's default region.
  • The recordservice catalog in Presto is disabled by default starting from 2.1.0.
  • Additional controls for JDBC (PrestoDB) -> Okera configurations. For example, the RPC timeouts can now be controlled via the OKERA_PRESTO_PLANNER_RPC_MS and OKERA_PRESTO_WORKER_RPC_MS environment settings.
  • Minor improvement to remove the SerDe info from the SHOW CREATE TABLE output. Previously, re-running the output of SHOW CREATE TABLE would error out due to the duplication of the SerDe info and the FILE FORMAT info; without the SerDe info, the output can now be re-run as is.
  • Fixed an Avro file format error for schemas that contain a union with default values.
  • UI: Better row hover state highlighting on grouped table rows.
  • UI Error boundaries introduced for increased stability in javascript.
  • Policy Builder layout and formatting improvements.
  • Contextual restrictions on Policy Builder UI including conditional disabled create/edit/delete.
  • More nuanced permission conflict reasons.
  • Upgraded node to 12.15.0.
  • The Presto connector has several improvements for performance, utilizing more efficient APIs and serialization/deserialization formats.
  • Several performance improvements for queries over Parquet files and queries with joins.
  • In the Okera Planner/Worker debug UI, the number of queries displayed has been increased to 256.
  • The audit log has a new field, ae_attribute, which captures all attributes accessed as part of the query.
  • Fixed an issue in the /scan API where some Decimal values would not be serialized correctly.
  • Several improvements to schema detection for TEXT-based files (especially CSV).
  • Added support for md5() (based on the Hive UDF).
  • The has_access() builtin function now supports checking against all privilege levels (previously it only supported ALL and SELECT).
  • Fixed an issue where some DDL statements that modified attributes did not check whether the attribute existed.
  • Fixed an issue where the CREATE_AS_OWNER privilege at the catalog level incorrectly gave the SHOW privilege at that scope as well.
  • Improvements to error handling and recovery of metadata operations.
  • Improved default tuning parameters in large memory environments.
  • PyOkera now properly converts all values to JSON-serializable types when scan_as_json is used.
  • Improved admission control when workers are over-subscribed on either active connections or memory metrics.
  • For Gravity-based deploys, Gravity has been upgraded to 6.1.16 LTS.
  • Improved error handling and recovery of the data registration crawler in case of failures.
  • Added the ability to increase the timeout for initializing the catalog on cluster startup by setting the CATALOG_INIT_STARTUP_TIMEOUT configuration value.
  • Fixed an issue where some system tables were not dropped prior to creating them on startup, which can cause an issue on upgrades.
  • Fixed an issue where the audit logs would have incorrect values in case of an error during initialization of an incoming request.
  • Added the ability to specify a column list when executing ALTER VIEW, similar to CREATE VIEW.
  • Improved error message when using non-absolute S3 bucket paths.
  • Improved error handling when parsing a view definition that Okera cannot parse for an external view.
  • Fixed an issue where service discovery would consider Kubernetes objects in a different namespace.
  • Fixed an issue where the system would generate unnecessary baseline queries, creating log noise.
  • Added the ability to specify a privilege level filter for the GetTables and GetDatabases APIs.
  • Fixed an issue in PyOkera when handling the CHAR type when there are null values in the data.
  • Fixed an issue where the ae_role column was not always populated for some role-related DDLs.
  • Improved the logging in the Okera REST Server.
  • Added the ability to configure the Planner and Worker RPC timeouts in Okera's Presto, using the OKERA_PRESTO_PLANNER_RPC_MS and OKERA_PRESTO_WORKER_RPC_MS configuration values respectively. The defaults are 300000ms and 1800000ms respectively.
  • Improved retry handling for retryable S3 errors (such as Server Busy, etc).
  • Fixed a bug where database names were not escaped when created in the registration UI.
  • The 're-autotag' button on the Datasets page now causes the new tags to be fetched upon completion.
  • The UI has a number of new icons.
  • Workspace now includes an execution timer for queries.
  • Improved errors are now reported for bad schemas found during registration.
  • Fixed a bug where the UI allowed users to 'tag' partitioning columns, even though such tags had no effect.
  • Now all dataset views show their view string.
  • "Queries by duration of planner request" is no longer part of the Reports page.

Notable and Incompatible changes

  • Starting from 2.1.0, the published Okera client libraries for PrestoDB support PrestoDB versions 0.234.2 and above.
  • ZooKeeper has been removed as a system component - Okera will now leverage Kubernetes to maintain the worker membership list.
  • The default per-user okera_sandbox database has been removed.
  • When creating Okera views (i.e. internal/secure views), it is now required for the creator to have the ALL privilege on all referenced datasets. This is done to ensure that these tables cannot be incorrectly exposed by users with lesser permissions.
  • Removed the 4000 character limitation on column types. Note that this changes the underlying HMS schema, and if connected to a shared HMS, should be disabled by setting the HMS_REMOVE_LENGTH_RESTRICTION configuration value to false. This is only done for new HMS databases - if you have an existing one from a prior installation, please contact Okera Support for migration procedures.
  • The default privilege required to view audit log records for objects has been changed from SELECT to VIEW_AUDIT. This means some users may no longer be able to see audit logs for their data (if they previously only had SELECT access to it), and will need to be granted VIEW_AUDIT on data they wish to view audit logs for.
  • ML and decision-tree-based autotagging is now enabled by default.
  • OKERA_REPORTING_TIME_RANGE can no longer be used to restrict the available time range in Okera reports.
  • In 2.1.x, many data correctness issues will now fail queries as opposed to silently ignoring them (e.g. converting data into NULL, etc) as in previous versions. To revert the behavior, add --abort_on_error=false to RS_ARGS.
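For example, a minimal configuration-file sketch that restores the previous behavior (any RS_ARGS values you already set would be merged into the same entry):

RS_ARGS: "--abort_on_error=false"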

SQL Keywords

The following terms are now keywords, starting in 2.1.0:

  • CXNPROPERTIES
  • DATACONNECTION
  • DIAGNOSTICS
  • DO
  • EXCEPT
  • TIMESTAMP_NANOS
  • VIEW_AUDIT
  • VIEW_COMPLETE_METADATA

Known Issues

  • The Okera PrestoDB Connector shipped with this version is compatible with PrestoDB 0.233 and higher. This connector is currently not compatible with any released version of PrestoDB on EMR, as the version of PrestoDB shipped is older than 0.233. This will be fixed in a subsequent 2.1.x maintenance release.

2.0.2

Bug Fixes and Improvements

  • Fixed an issue where many concurrent CREATE TABLE or CREATE VIEW statements could be slowed down waiting on a shared resource.
  • Fixed an issue when authorizing queries on views with complex types.
  • Added an option to use the SYSTEM_TOKEN as the shared HMAC secret for signing and validating tasks (in the Planner and Worker services) rather than using ZooKeeper. This option can be enabled by setting SYSTEM_TOKEN_HMAC: true in the configuration file.
  • Fixed an issue where it was not possible to connect to a Postgres instance that did not have public in the default search_path.
  • Added the ability to specify whether the connection to the database should be done using SSL (this was typically auto-discovered, but in some cases the auto-discovery failed). This can be enabled by setting CATALOG_DB_SSL: true in the configuration file.
  • Fixed an issue where schema upgrades did not work for remote Postgres instances.
  • Fixed an issue where the Workspace UI would scroll beyond the window if there was a long error.

2.0.1

Bug Fixes and Improvements

  • Added the ability to edit dataset and column descriptions in the Okera UI.
  • Fixed an issue where datasets discovered by the crawler that had columns whose type definition exceeded 4,000 characters couldn't be registered.
  • Added more control options for LDAP group resolution configuration:
    • GROUP_RESOLVER_LDAP_POSIX_GID_FIELD_NAME
    • GROUP_RESOLVER_LDAP_POSIX_UID_FIELD_NAME
    • GROUP_RESOLVER_LDAP_MEMBEROF_FIELD_NAME
  • Fixed an issue where Avro datasets that had a union type with a single child (e.g. union(int)) would throw an error. These types of unions are now fully supported.
  • Fixed an issue where decimals that were stored as a byte_array in Parquet files were not read correctly.
  • Added a configuration option to control the maximum number of allowed Sentry and HMS connections:
    • CATALOG_HMS_MAX_THREADS
    • CATALOG_SENTRY_MAX_THREADS
  • Fixed an issue where changing the description of the view (or a column in it) via DDL was not supported.
  • Fixed an issue where columns that contained arrays or maps with embedded null values were not handled correctly in the Java-based clients.
  • Fixed an issue in PyOkera where it would incorrectly decode negative decimal values with a precision higher than 18.
  • Fixed an issue when --allow_nl_in_csv=True was set and the CSV file used a different quote character than " - it would improperly use the " to escape line breaks.
  • Improved the Crawler's ability to automatically use the OpenCSV SerDe when necessary.
  • Fixed issues in handling complex types that had several nested arrays/structs/maps with null values interspersed.
  • Fixed an issue where reserved keywords could not be used as attribute namespaces and attribute keys (e.g. myns.true), as escaping them would not work.
  • Added the ability to use CREATE TABLE LIKE TEXTFILE, which will automatically deduce the schema from the CSV file (this assumes the headers are in the first line).
  • Improved the handling of non-parsable SQL statements when accessing a view that was created outside Okera (e.g. in Hive). This capability is enabled by setting the ALLOW_NONPARSEABLE_SQL_IN_VIEWS: true environment flag in the configuration file for the cluster.
  • Fixed an issue where the same tag could appear twice in the UI.
  • Fixed an issue where dropping an external table referencing a bucket that does not exist would fail.
  • Fixed an issue where the crawler Data Registration page for a given crawler would display incorrect "Registered" tables if their path was a simple prefix of the crawler root path.
  • Added support for using a dedicated Postgres server (e.g. on RDS) as the backing metadata database.

2.0.0

New Features

Bucketed Tables

ODAS now supports bucketed tables and applying efficient joins to them. You can find more details here.

AWS Glue

ODAS now supports using AWS Glue as the metastore storage, allowing you to connect ODAS to an existing Glue catalog. You can read more about this support and enabling it in the Glue Integration page.

Auto-tagging Improvements

  • ODAS now employs an ML-based engine for some of the out of the box auto-tagging rules, such as address and phone number detection.

  • You can now create and manage the regular expression-based rules that are used by the auto-tagging engine in the UI. You can read more about this in the Tags page.

  • The number of datasets tagged with a tag is now shown in the UI.

  • ODAS can continuously auto-tag your existing catalog in the background. You can enable this by setting the ENABLE_CATALOG_MAINTENANCE setting in your configuration file.

  • ODAS will now auto-tag the data inside nested complex types, and apply the discovered tag(s) at the root column-level.

Azure

ODAS now supports ADLS Gen2 data storage for both querying and data crawling. You can register these data sources by specifying a path with either the abfs:// or abfss:// prefixes.
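For example, a registered path for an ADLS Gen2 filesystem follows the standard Azure URI form (the container and storage account names below are hypothetical):

abfss://mycontainer@mystorageaccount.dfs.core.windows.net/path/to/data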

Web UI

  • The ODAS Web UI has been revamped to be easier to use and update the look-and-feel.

  • A Roles page has been added, allowing you to fully manage roles (create/update/delete) and their group and permission assignments. You can read more about this on the Roles page.

  • The 'About' dialog has been replaced by a System page.

JDBC Data Sources

  • Redshift External Tables are now supported for JDBC data sources of type redshift.

ABAC

  • There are now DDL statements to work with tags, namely:

    • DESCRIBE <table>, DESCRIBE FORMATTED <table>, DESCRIBE DATABASE <database> will now output tag assignments.
    • CREATE ATTRIBUTE <attr> and DROP ATTRIBUTE <attr> will create/remove attributes (note that namespaces will be automatically created if they don't already exist).
    • SHOW ATTRIBUTE will show the list of currently existing attributes.
    • ALTER TABLE and ALTER VIEW now have new operations of ADD ATTRIBUTE <attr>, REMOVE ATTRIBUTE <attr>, ADD COLUMN ATTRIBUTE <col> <attr> and REMOVE COLUMN ATTRIBUTE <col> <attr> to add/remove attributes at the table-/view- and column-levels respectively.
    • ALTER DATABASE now has new operations of ADD ATTRIBUTE <attr>, REMOVE ATTRIBUTE <attr> to add/remove attributes at the database-level.
    • CREATE TABLE and CREATE VIEW can now take an optional set of attributes during table creation. For example:
      CREATE TABLE mydb.mytable (
          col1 int COMMENT "some comment1" ATTRIBUTE myns.myattr1,
          col2 int COMMENT "some comment2" ATTRIBUTE myns.myattr2,
          col3 int COMMENT "some comment3" ATTRIBUTE myns.myattr3
      )
      
  • Rule definitions now accept a "name" field. For backwards compatibility and convenience, the "name" is auto-generated if not specified.
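For example, the attribute DDL operations above can be combined as follows (the database, table, column, and attribute names are hypothetical):

CREATE ATTRIBUTE myns.pii_email;
ALTER TABLE mydb.users ADD COLUMN ATTRIBUTE email myns.pii_email;
ALTER TABLE mydb.users ADD ATTRIBUTE myns.sensitive;
ALTER TABLE mydb.users REMOVE COLUMN ATTRIBUTE email myns.pii_email;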

Bug Fixes and Improvements

  • ODAS has updated Docker images that update many dependencies including the base OS, Python, OpenSSL and more.
  • Added a way to configure the structure of the data files the crawler will use while crawling. See the docs on creating a crawler for more.
  • Added crawler search box on the data registration page.
  • Added additional validation for the crawler name and path when creating a new crawler.
  • You can now re-run the auto-tagging rules on an individual dataset within the Datasets page by using the Re-autotag button.
  • Fixed an issue where datasets with complex types that had a MAP embedded in a STRUCT embedded in ARRAY would not be handled correctly.
  • Added the ability to revoke grants on objects that no longer exist.

Incompatible changes

  • Previously by default users would only see reports for datasets they had ALL access to. Since many stewards may not have ALL access on the data, this has now been changed so they will see reports for all data they have SELECT access to. If necessary this can be configured back to ALL by editing the view definition of okera_system.steward_audit_logs dataset.
  • Starting from 2.0.0, Okera will only support EMR versions greater than 5.11.0 and up to 5.28.0. Note that versions of EMR less than 5.10.0 will still continue to work, but upgrading to a recent EMR version is recommended for the latest ODAS compatibility.
  • The behavior of using REVOKE on permissions (e.g. REVOKE SELECT) has been changed to not cascade by default. For example, in 1.5.x and earlier versions, REVOKE SELECT ON TABLE mytable would also revoke any broader privileges that included SELECT; this no longer happens by default.
  • Starting in 2.0.0, the published Okera client libraries for PrestoDB support PrestoDB versions 0.225 and above. You can use the published Okera client libraries from prior Okera versions (which will continue to work against an ODAS 2.0.x and higher cluster) to support earlier PrestoDB versions.
  • The Permissions page has been removed - all links to it (e.g. in bookmarks) will no longer work.
  • Private tags on datasets have been removed. Datasets can no longer be filtered by private tags.

SQL Keywords

The following terms are now keywords, starting in 2.0.0:

  • EXECUTE
  • INHERIT
  • TRANSFORM

Deprecation Notice

  • Starting in 2.0.0, we are deprecating the ocadm and odb CLI utilities. If you wish to continue using odb, the binary from 2.0.x and prior releases should continue to work. However, future releases will not ship new binaries of these utilities.

1.5.16

Bug Fixes and Improvements

  • Fixed an issue where writing to non-partitioned tables from Spark would fail if Spark bypass was enabled.
  • Improved error handling when doing unsupported operations on complex types.
  • Fixed an issue where running count(struct_field.some_value) would fail when run inside views.
  • Fixed an issue where using ORDER BY in an external view could fail an authorization check.
  • Fixed an issue where some decimals were not serialized properly when accessed via the /scan API.
  • Improved some error handling on the node-remover CronJob for Gravity-based clusters.
  • Fixed an issue where CTEs that contained aggregations would fail.
  • Added the ability to disable Zookeeper-based worker membership and instead leverage the Kubernetes metadata. This can be enabled by setting OKERA_KUBERNETES_MEMBERSHIP: true in the configuration file.

1.5.15

Bug Fixes and Improvements

  • Fixed several issues related to access control on tables and views with complex types.
  • Fixed an issue when registering JDBC tables with many columns.
  • Fixed an issue where small decimals would not be returned correctly when queried via the Presto endpoint.

Notable and Incompatible Changes

  • In PyOkera, scan_as_json now defaults strings_as_utf8 to True, matching the behavior prior to 1.5.2.

1.5.14

Bug Fixes and Improvements

  • Fixed an issue in PyOkera where scan_as_json and scan_as_pandas would ignore the tz option supplied on the context object.
  • Fixed an issue in the Presto client library where it did not properly handle null checks on STRUCT columns.

1.5.13

Bug Fixes and Improvements

  • Fixed an issue where queries on views that referenced STRUCT columns could fail when an ABAC permission applied to it.

1.5.12

Bug Fixes and Improvements

  • Fixed an issue where many concurrent CREATE TABLE or CREATE VIEW statements could be slowed down waiting on a shared resource.
  • Fixed an issue when authorizing queries on views with complex types.
  • Fixed an issue where the server was not properly clearing the effective user when different users utilize the same underlying planner connection (this typically only happens in PyOkera scripts that switch between different users, such as tests).
  • Added an option to use the SYSTEM_TOKEN as the shared HMAC secret for signing and validating tasks (in the Planner and Worker services) rather than using ZooKeeper. This option can be enabled by setting SYSTEM_TOKEN_HMAC: true in the configuration file.

1.5.11

Bug Fixes and Improvements

  • Fixed an issue where in certain EKS environments, the CPU scheduler was not properly saturating the CPU capacity.
  • Fixed an issue where scanning Parquet files would fail if their dictionary_offset was after the data_page_offset.
  • Added an improvement for SerDes that use field delimiters, allowing field delimiters to be specified within double quotes.

1.5.10

Bug Fixes and Improvements

  • Fixed an issue in PyOkera where it would incorrectly decode negative decimal values with a precision higher than 18.
  • Fixed an issue where, when --allow_nl_in_csv=True was set and the CSV file used a quote character other than ", the " character was still improperly used to escape line breaks.
  • Improved the Crawler's ability to automatically use the OpenCSV SerDe when necessary.
  • Fixed issues for handling complex types that had several nested arrays/structs/maps with null values interspersed.
  • Fixed an issue where reserved keywords could not be used as attribute namespaces and attribute keys (e.g. myns.true), since escaping them would not work.
  • Added the ability to use CREATE TABLE LIKE TEXTFILE, which will automatically deduce the schema from the CSV file (this assumes the first line contains the headers).

1.5.9

Bug Fixes and Improvements

  • Added the ability to edit dataset and column descriptions in the Okera UI.
  • Fixed an issue where datasets discovered by the crawler that had columns whose type definition exceeded 4,000 characters couldn't be registered.
  • Added more control options for LDAP group resolution configuration:
    • GROUP_RESOLVER_LDAP_POSIX_GID_FIELD_NAME
    • GROUP_RESOLVER_LDAP_POSIX_UID_FIELD_NAME
    • GROUP_RESOLVER_LDAP_MEMBEROF_FIELD_NAME
  • Fixed an issue where Avro datasets that had a union type with a single child (e.g. union(int)) would throw an error. These types of unions are now fully supported.
  • Fixed an issue where decimals that were stored as a byte_array in Parquet files were not read correctly.
  • Added a configuration option to control the maximum number of allowed Sentry and HMS connections:
    • CATALOG_HMS_MAX_THREADS
    • CATALOG_SENTRY_MAX_THREADS
  • Fixed an issue where changing the description of the view (or a column in it) via DDL was not supported.
  • Fixed an issue where columns that contained arrays or maps with embedded null values were not handled correctly in the Java-based clients.

1.5.8

Bug Fixes and Improvements

  • Improved ZooKeeper membership registration and cluster health check capabilities. The cluster can now identify more cases where a node gets incorrectly deregistered and self-heal.
  • Improved handling of non-parsable SQL statements when accessing a view that was created outside Okera (e.g. in Hive). This capability is enabled by an environment flag ALLOW_NONPARSEABLE_SQL_IN_VIEWS: true set in the configuration file for the cluster.

1.5.7

Bug Fixes and Improvements

  • Fixed an issue where Hive/Hue could not load the table listing for a database if it contained a view that Okera could not parse.
  • JWT tokens with a group claim can now have that claim be a simple string denoting the group rather than having it be an array.

1.5.6

Bug Fixes and Improvements

  • Improved performance of attribute access checks on wide views.
  • Fixed an issue where an attribute-based grant on a view with a complex type might not properly omit the complex type column.
  • Added support for CSVs with embedded newlines within records that are enclosed within the quote separator. To enable this, specify --allow_nl_in_csv=true for RS_ARGS in your ODAS configuration.
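The embedded-newline option above is passed through RS_ARGS. A minimal configuration sketch, assuming the same KEY: "value" form used for other *_ARGS settings such as PRESTO_ARGS:

```yaml
# Cluster configuration sketch (assumed form): enable embedded newlines
# within quoted CSV records by passing the flag via RS_ARGS.
RS_ARGS: "--allow_nl_in_csv=true"
```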

1.5.5

Bug Fixes and Improvements

  • Fixed an issue where joining or unioning a dataset with itself could cause an invalid query plan to be generated, preventing that query from being run.
  • Fixed an issue where a column-level grant on a view could allow joining on columns other than those granted.
  • Improved the detection in PyOkera of whether Pandas and NumPy are installed, and if not, still allow usage of all functionality that does not require them.
  • Fixed an issue where an external view in Hive that has both row_number() and an ORDER BY clause could cause the query to fail.
  • Fixed an issue where non-conformant Parquet files that have a mismatch between the number of records specified in the dictionary header vs. the actual batch would cause the file to not be queryable.
  • Added the ability to specify the CATALOG_DB_PASSWORD, LDAP_GROUP_RESOLVER_PASSWORD and LDAP_USER_QUERY_SERVICE_PASSWORD in a Kubernetes secret.
  • Added the ability in okctl to specify CATALOG_DB_PASSWORD, LDAP_GROUP_RESOLVER_PASSWORD and LDAP_USER_QUERY_SERVICE_PASSWORD as file paths in the configuration file.

1.5.4

Bug Fixes and Improvements

  • Fixed an issue for Parquet files where TIMESTAMP and TIMESTAMP_MILLIS columns that were backed by int64 were not supported.
  • Fixed an issue where an invalid plan could cause the worker to crash.
  • Added two new DDLs that allow changing the comment on a table and column:
    • ALTER TABLE <table> CHANGE COMMENT '<comment>'
    • ALTER TABLE <table> CHANGE COLUMN COMMENT <col> '<comment>'
  • Added APIs to get and set the description on a dataset and column:
    • GET/PUT /datasets/<name>/description
    • GET/PUT /datasets/<name>/columns/<column>/description
  • For PyOkera, execute_ddl now takes an optional requesting_user parameter, similar to the plan and scan_as_... functions.
  • Fixed an issue where a column-level grant on a view could allow filtering (but not viewing) on columns other than those granted when executing a query in Workspace.

1.5.3

Bug Fixes and Improvements

  • Fixed an issue where DECIMAL columns in Avro schemas would not get detected properly.
  • Added the ability to provide a default clamp value for DECIMAL columns whose precision exceeds the maximum precision allowed (38). This can be set using the AVRO_SCHEMA_TOO_HIGH_PRECISION_FALLBACK configuration value.
  • Added support for the skip.footer.line.count table property.
  • Performance improvements in the case of many small files in a single partition (NOTE: it is still recommended to avoid having small files).
  • Fixed an issue where some sensitive values would be exposed in the Planner and Worker debug UIs.
  • Added the ability to enable setting X-Frame-Options: DENY for all requests by setting the FRAME_OPTIONS_DENY_ENABLED configuration value.
  • Added the ability to enable the Secure flag on the session cookie using the OKERA_SHARED_COOKIE_SECURE configuration value.
  • Improved default cipher support for TLS1.2.
  • Added the ability to control the duration of the generated JWT when logging in by setting JWT_TOKEN_EXPIRATION to the desired number of seconds (minimum is 300 seconds).
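Taken together, the hardening options above are plain configuration values. A minimal sketch (the values shown are illustrative, not defaults):

```yaml
# Cluster configuration sketch (illustrative values):
FRAME_OPTIONS_DENY_ENABLED: true    # send X-Frame-Options: DENY for all requests
OKERA_SHARED_COOKIE_SECURE: true    # set the Secure flag on the session cookie
JWT_TOKEN_EXPIRATION: 3600          # JWT lifetime in seconds (minimum is 300)
```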

1.5.2

New Features

JDBC Data Sources

  • Added support for Sybase.
  • Added support for filter pushdown.
  • Added support for count(*) for JDBC data sources.
  • Added support for case sensitive column names.
  • Added support for specifying custom SSL CAs to use to validate when making SSL connections to the JDBC data source.

Audit Log Uploads

It is now possible to configure audit logs to be uploaded in an immutable fashion. When enabled, audit logs will be uploaded with a .staging.audit and .staging.reporting suffix until they are finalized, and will then be uploaded without the .staging portion when finalized.

To enable this, set WATCHER_AUDIT_LOG_STAGING_FILES to true or 1.

Additionally, it is possible to force the audit logs to be uploaded after a certain number of seconds have passed, by specifying WATCHER_AUDIT_LOG_MAX_UPLOAD_SEC.
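Combining the two settings above, a minimal configuration sketch (the 600-second value is illustrative):

```yaml
# Upload audit logs immutably: files carry a .staging.audit /
# .staging.reporting suffix until they are finalized.
WATCHER_AUDIT_LOG_STAGING_FILES: true
# Force an upload after at most this many seconds (illustrative value).
WATCHER_AUDIT_LOG_MAX_UPLOAD_SEC: 600
```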

PyOkera

  • PyOkera now has full support for complex types (ARRAY, MAP, STRUCT).
  • context.enable_token_auth now accepts an optional argument called token_func, which can reference a no-argument function that, when called, returns a valid token to be used. Note that this function must be pickle-able (an error will be returned if it isn't), as it will be used across multiprocessing calls.
  • PyOkera now supports running scan_as_json and scan_as_pandas using Presto.
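The token_func contract above can be sketched in plain Python. The function name and token source here are hypothetical; only the enable_token_auth call in the comment comes from these notes:

```python
import os

# Hypothetical token provider for context.enable_token_auth(token_func=...).
# It must be a module-level, no-argument function (not a lambda or closure),
# so that it can be pickled for PyOkera's multiprocessing calls.
def fetch_token():
    # In practice this might re-read a refreshed token from disk or an
    # identity provider; here it falls back to a placeholder value.
    return os.environ.get("OKERA_TOKEN", "dev-token")

# ctx.enable_token_auth(token_func=fetch_token)  # PyOkera usage, per the notes
# A lambda such as token_func=lambda: "t" would be rejected, since
# lambdas cannot be pickled.
```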

Bug Fixes and Improvements

  • Added the ability to ignore LDAPS certificate errors when doing group resolution.
  • Added the ability to set Presto tuning variables, specifically:
    • PRESTO_JVM_MAX_HEAP
    • PRESTO_QUERY_MAX_MEMORY
    • PRESTO_QUERY_MAX_MEMORY_PER_NODE
    • PRESTO_QUERY_MAX_TOTAL_MEMORY_PER_NODE
    • PRESTO_JOIN_DISTRIBUTION_TYPE
  • Improved handling for Date type in JDBC data sources.
  • Improved handling of broadcast joins using cross-task caching.
  • Fixed an issue where JDBC data sources that had USING VIEW AS did not properly handle single quotes in the view.
  • Fixed an issue where JDBC data sources did not close the connection properly when no more events were necessary, causing poor performance.
  • ODAS Web UI will now automatically redirect to the https URL if a user navigates to the http one.
  • Added the ability to control how long the Web UI waits before timing out a request to the server (default is 30000, in milliseconds), by setting the UI_TIMEOUT_MS configuration.
  • ODAS Web UI will now break out the inner portions of ARRAY and MAP complex type columns.
  • Added the ability to configure ODAS to look for user-specified claims in the JWT to determine the user (JWT_USER_CLAIM_KEY, default is sub) and groups (JWT_GROUP_CLAIM_KEY, default is groups).
  • Added support for partitioning schemes on S3 that do not contain the partition column name in the folder, e.g. s3://company/dataset/2019 vs s3://company/dataset/year=2019. This can be enabled by setting okera.hms.allow-no-name-partitions to true in hive-site.xml.
  • Fixed an issue where array and map indexing in an external view definition would cause ODAS to fail to parse.
  • Added support to specify strings_as_utf8=True when using scan_as_json in PyOkera.
  • Fixed an issue in PyOkera when converting a CHAR column to UTF-8.
  • Upgraded several dependencies.
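The no-name partition layout mentioned above is enabled in hive-site.xml; the property name comes straight from the notes:

```xml
<!-- hive-site.xml: allow partition folders that omit the column name,
     e.g. s3://company/dataset/2019 instead of .../year=2019 -->
<property>
  <name>okera.hms.allow-no-name-partitions</name>
  <value>true</value>
</property>
```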

Notable and Incompatible Changes

  • The bundled Presto service now exposes an additional "catalog" (in Presto terms) called okera, in addition to the existing recordservice one. These are identical and contain the same datasets. The recordservice catalog is now deprecated and will be removed in a future version; all clients should shift usage to the okera one.

  • Removed the default from deserializer column comment that would appear for Parquet and Avro files when created using CREATE TABLE LIKE FILE.

  • In PyOkera, when using scan_as_json, date columns are now serialized to millisecond precision without the corresponding timezone, to match output of other APIs.

  • The redshift driver type is now required to connect to Redshift, and the postgresql type will no longer work. This change was made because the drivers have diverged and were updated for security and performance reasons.

1.5.1

New Features

SAML Support

It is now possible to configure authentication to ODAS with SAML providers.

JDBC Data Sources

  • Added support for MS SQL Server.
  • Added support for Redshift External Tables.

LDAP Authentication

It is now possible to configure LDAP authentication to do two-step authentication (DN lookup followed by authentication).

Bug Fixes and Improvements

  • Data Registration Crawler improvements:
    • Increased performance on large partitioned tables.
    • Improved filetype classification.
    • Avro schema comment fields (i.e. description) will now be inherited by ODAS when registered.
  • Azure improvements and fixes:
    • Added support for Azure MySQL connections where SSL is required.
    • Fixed an issue where CREATE TABLE LIKE FILE was not properly loading Avro schema files from ADLS.
  • Fixed a bug where ODAS was caching UDFs when a pattern was set in a call to SHOW FUNCTIONS.
  • Added the table property to control whether automatic partition recovery is enabled for a particular table: 'okera.auto_partition_recovery.disable'='true'.
  • Improved handling of DROP DATABASE CASCADE on a database that does not exist.
  • ODAS will now respect the LOCATION field set on a database.
  • Kubernetes liveness and readiness probes have been tuned to cause less load on the system.
  • Added tables in okera_system to expose role and group information.
  • Fixed an issue in the Hive SerDe to properly initialize the header skip flag.
  • Fixed an issue where the compiler was generating invalid CPU instructions for Decimal types due to bad memory alignment.
  • Respect the value of OKERA_WORKER_LOAD_BALANCER if it is passed in.
  • Disabled an optimization when performing a join where the second table is larger than 128MB.
  • Fixed an issue in the Avro parser that did not allow for default values of empty arrays and maps.
  • Fixed an issue where partition names were not properly escaped in Hive.
  • Fixed an issue in the Kubernetes resource files for Presto to reference the correct version.
  • Improved system availability when registering a high number of partitions.

Notable and Incompatible Changes

  • Previously, changes to the CATALOG_ADMINS setting would not get fully reflected on a cluster that had previously configured these. In this release, users and groups referred to by CATALOG_ADMINS will be automatically granted admin_role on startup. If you have users that you no longer want to be admins, you should remove them from CATALOG_ADMINS.

1.5.0

New Features

Policy builder

Added a new interactive policy builder in the Okera Portal. Table access policies and fine-grained permissions can now be granted through the UI.

Attribute-based access control updates

Updated syntax and other improvements to attribute-based access control (ABAC).

See the ABAC docs for more.

Other improvements

  • Azure: added experimental support for ADLS Gen2 - users can now CREATE EXTERNAL TABLE on data that is stored in Gen2 storage, and query that data.
  • Added IF EXISTS to DROP ROLE, so you can now do DROP ROLE IF EXISTS <role>.
  • Changed how we deploy ZooKeeper on Kubernetes to better handle node failures.
  • Updated the underlying Thrift library to version 11 to stay more current. This should have no user-visible impact.
  • Improvements to ALTER TABLE <table> RECOVER PARTITIONS to improve its runtime. There is more work planned for future releases.
  • Added a new table property that allows CREATE TABLE <name> LIKE <FILETYPE> to handle cases where a partition column and data column exist with the same name.
  • Improved handling of automatic file type detection in crawlers for Avro and JSON files.
  • The mask() UDF is now always available.
  • Permission model now supports CREATE_AS_OWNER, which lets users create objects in the catalog and be given owner (i.e. ALL) privileges on the new object. This can be used to create per user (staging) tables or to support distributed stewardship.
  • Fixed a bug where it was not possible to override the database name used for the CATALOG_DB_OKERA_DB database.
  • Fixed a bug where you could create grants that were invalid and would fail downstream - we now fail them at the point of creation.
  • Added a num_results_read column to okera_system.audit_logs, denoting the number of records read during a particular operation.
  • Support for special characters in column names. Okera now expands the special characters supported in column names to be on par with the ANSI-SQL specification. Characters that are still not supported in column names are ., `, :, and !. Special characters in a column name can be escaped with backticks. For example, a column named Special Chars (name) can be declared as CREATE TABLE special_chars.sample (`Special Chars (name)` STRING).
  • The cerebro-web Kubernetes service was removed. All functionality is now consolidated into the cdas-rest-server service. Note: on using the Deployment Manager to upgrade from previous versions to 1.5.0, the cerebro-web service will continue to exist after the upgrade. The service is vestigial, however, and should not be used. If there is need to remove this service entirely, please open a support ticket.
  • Improved robustness of service discovery in several places.
  • Added CEREBRO_EXTERNAL_PLANNER_HOST and CEREBRO_EXTERNAL_PLANNER_PORT, which can be set to override the planner's external host/port shown in the UI.

Incompatible changes

  • Any external tooling checking for the existence of the cerebro-web service will no longer function. These tools should be updated to point at the cdas-rest-server service, which now encompasses the functionality.
  • Removed okera_system.weekly_audit_logs and okera_system.monthly_audit_logs views, since the UI preview was not functioning properly for them.
  • OKERA_PORT_CONFIGURATION, set in env.sh for Deployment Manager installs, no longer recognizes the cerebro_web:webui port. Please change this value to cdas_rest_server:webui for new clusters.

1.4.1

New Features

Improved cluster deployment

Okera clusters can now be created without using the Deployment Manager.

Support for granting column access to views

In previous Okera versions, it was not possible to grant column-level access on views, only tables. It is now possible to grant on columns in views as well.

See the docs for more.

LDAP group resolution

Okera can now issue an ldapsearch to retrieve the groups associated with the username contained in a JWT if no groups are embedded in the JWT.

See the docs for more.

Other Improvements

  • Added a new way to set up automatic multi-tenant authentication for EMR and CDH integrations.
  • Added an ability to create one-node quickstart clusters that have out-of-the-box configuration including SSL, JWT, user/group settings.
  • Improved automatic service discovery for inter-service communication, allowing us to increase resiliency in the case of node failures.
  • Improved handling of unsupported or invalid views, typically inherited from an existing metastore. The view metadata can now be returned (but such views remain unqueryable).
  • Okera now supports HMS-escaped partition paths. Additional characters that were not escaped previously can now be used in the partition path, for example spaces and hyphens: timestamp-partition/time_val=2019-06-11 00:00:00. Note that partition paths containing '=' or '/' are not yet supported.
  • Full support for complex map types in Parquet data.
  • Added support for complex types of map<string, array<string>>.
  • Added a new builtin function, current_date, which is like current_timestamp but just returns the date portion.
  • Enabled selecting current_date and current_timestamp as columns, e.g. select current_timestamp vs select current_timestamp().
  • Upgraded kube-prometheus to 0.1.0 (latest at time of publishing).
  • Added support for timestamps outside of typical data ranges. While we don't expect a lot of user data from the dark ages, sentinel values in those ranges as well as year 0 are valid. They will be passed through without transformation so that the data values can be read.
  • Added better support for Hue when some fields are null.
  • REPORTING_TIME_RANGE can now be set directly in env.sh.
  • Reduced number of retries and yield time for HDFS connection attempts.
  • Okera now escapes partition columns to support keywords as partition column names.
  • Fixed a bug where data registration crawlers were treating hidden files as possible dataset files.
  • Fixed a bad error message in the UI when a database was not found on the permissions page. The error is clearer now.
  • Fixed a security bug where the REST Server was returning a wildcard hostname in its CORS headers. This has been fixed by removing CORS headers from the REST Server entirely.
  • Fixed a bug where if a view had any constant-time expressions such as decode we would not do any access checks.
  • Fixed a bug where in some cases, select count(*) did not work if a user only had column-level access.
  • Fixed views to skip the table format check.
  • Fixed the storage descriptor path for Databricks based on the Spark provider.
  • Fixed column access check for count(*) on views.
  • Fixed an issue with spark and presto clients where select * queries returned incorrect results for users with partial access to views.
  • If the defaultdb property is not provided, JDBC connections will now use jdbc.db.name as the default database for connecting.

1.4.0

New features

Tags

Tags can now be assigned to datasets or columns to mark the type of data they contain. For example, a ‘Sensitive’ tag can be created and assigned to any columns containing sensitive data. The Datasets page can be filtered by these tags to view only datasets or columns with certain attributes. Complex-type columns can be tagged, but not nested elements within a complex type.

Tags may only be created and assigned by users in admin roles and will be visible to all users. Admin users may also give other roles the ability to assign tags in the Workspace page.

Any user may still create Private Tags for their own use.

  • See the docs for more details.

Auto-tagger

In order to reduce the manual work of tagging, an Auto-Tagger can be configured to detect when a column is likely to contain a certain type of formatted data, such as a phone number or Social Security number, and will apply the relevant tag to that column. This occurs when a new dataset is discovered on the Data Registration page.

Attribute-Based Access Grants (ABAC)

Admin users can now grant access to tables based on tags. For example, an admin may grant users access to all data tagged as ‘Sales’ inside a particular database. This allows access grants to be based on data attributes instead of only on technical metadata (e.g. database name or dataset name). Please note that ABAC grants are currently only fully supported on tables, not views. ABAC grants on views will only be enforced when tags are at the view level, but not at the column level. For ABAC grants on tables, both table-level and column-level grants are fully supported. Full support for views is coming soon. All existing RBAC grants remain unaffected and you can still create RBAC grants. ABAC and RBAC grants are additive, which means that if either grant gives the user access, the user will be able to see that table.

JDBC Support

Added a JDBC endpoint and native Presto support. A new cluster type, STANDALONE_JDBC_CLUSTER, is now available. Specifying STANDALONE_JDBC_CLUSTER will bring up a cluster that includes Presto and exposes a JDBC endpoint for use with Tableau and other JDBC-enabled analytics clients.

JSON file format

  • JSON file formats are now supported by ODAS.
  • All data types supported for Avro and Parquet are supported, with the exception of maps, since maps can already be represented as a valid JSON structure.
  • JSON tables can be created via auto-inference or stored-as-json syntax.
  • See the docs for more details.
  • JSON files are now supported in the data registration wizard.

DATE type

  • DATE type is now supported.
  • See the docs for more details.

AWS CloudTrail Integration

  • Okera can consume AWS CloudTrail API event logs to more accurately determine when it is appropriate to perform maintenance operations. For example, the automatic discovery of new datasets and dataset partitions can occur faster and more efficiently when Okera receives direct notifications from AWS regarding S3 write operations. Without CloudTrail event consumption, Okera will fall back onto a polling model for detection of dataset changes. Refer to the Quick Start Guide: AWS CloudTrail Integration document for details.

Performance Improvements

  • These improvements include specific optimizations for partition metadata handling, to improve performance when scanning data with partition filters.
  • Introduced a new compression method (zstd) for efficient transfer between the ODAS cluster and clients like Spark and Hive. The default compression is now zstd.
  • Introduced Okera SQL Extensions in our Spark client. This is an extension capability provided by Spark through which we can augment the Spark plan to pass additional information to ODAS. It is primarily used for two optimizations at this point:
    • Pushing down functions that are supported by ODAS, such as CAST, UPPER, LOWER, and UNIX_TIMESTAMP.
    • A metadata-only optimization for queries that aggregate on just partition columns. This is inspired by Spark's own version of this optimization.
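As an illustration of the metadata-only optimization, queries like the following (the table and partition column are hypothetical) aggregate only over a partition column, so they can be answered from partition metadata without scanning any data files:

```sql
-- sales is a hypothetical table partitioned by year; both queries
-- touch only the partition column, so partition metadata suffices.
SELECT DISTINCT year FROM marketing.sales;
SELECT max(year) FROM marketing.sales;
```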

Other Improvements

  • AWS Athena can be registered and used as a JDBC data source. See docs
  • New CREATE_AS_OWNER privilege that grants ability to create a database and automatically receive ALL privileges on that database. Note: CREATE_AS_OWNER does not cascade to all tables. You will not be able to create tables inside databases you have not created with this privilege.
  • Cluster name may be customized and will display in the navigation bar.
  • Crawlers may now be deleted on the Data Registration page.
  • Crawlers can now discover JSON data types on the Data Registration page.
  • The Permission page now displays the full list of permissions for the column, dataset, database, and server scopes affecting a given database. For example, if there is a group that only has access to the selected database, then that group will appear in the full list.
  • The Permission page indicates any Attribute Based Access Control expressions granting a group's level of access.
  • Improved error messaging throughout the Okera Web UI, specifically in the Workspace page and Dataset Preview.
  • Decimal types in i32 and i64 storage formats are supported with the latest versions of Parquet, instead of just fixed_length_byte_array. Starting with 1.4.0, ODAS supports handling these additional i32 and i64 decimal storage formats along with fixed_length_byte_array.
  • ODAS can share existing HMS instances that contain ORC tables created by Hive; previously, the metadata load would fail in such cases. With 1.4.0, ODAS supports the ORC file format for metadata loads. Note that scans will still fail for ORC files with an 'ORC files are not currently supported.' error.
  • Extended support for MAP complex types for PARQUET file formats. It is now possible to use MAP<STRING, STRUCT> and MAP<STRING, ARRAY>. Note, this is still not available for AVRO types.

Incompatible changes

  • The default resolution of Parquet schemas has been changed to be by name; that is, the default_parquet_resolve_by_name flag is now set to true by default. Prior to 1.4 the default was by ordinal (position).
  • The way access is controlled for the Workspace and Reports features in the UI has changed. Current users may need their access updated as a result:
    • Where before a user needed ALL or SELECT access on any dataset in the Okera catalog to access Workspace, that user now needs SELECT access on okera_system.ui_workspace. See docs for more info.
    • Where before a user needed ALL or SELECT access on okera_system.reporting_audit_logs to access Reports, that user now needs SELECT access on okera_system.ui_reports. See docs for more info.

Known Issues

  • The following Okera configurations cannot be set directly in env.sh and must instead be listed in the SERVICE_ENVIRONMENT_CONFIGS environment variable in env.sh:
    • OKERA_CLOUDTRAIL_SERVICE_CONFIGURATION
    • OKERA_REPORTING_TIME_RANGE

Example env.sh for this case:

export SERVICE_ENVIRONMENT_CONFIGS="$SERVICE_ENVIRONMENT_CONFIGS;OKERA_CLOUDTRAIL_SERVICE_CONFIGURATION=s3://my_bucket/my_cloudtrail.conf;OKERA_REPORTING_TIME_RANGE=5days,3weeks"
  • ABAC grants on views will only be enforced when tags are at the view level, but not at the column level. That is, if you assign tags to columns of a view and then create a grant on that view for only those columns, the grant will not be enforced. If, however, the tag is at the view level, the grant will be enforced. For ABAC grants on tables, both table-level and column-level grants are fully supported.

1.3.4 (Mar 2019)

This release contains the following changes:

  • Enhanced the ALTER TABLE statement to allow partition location changes.
  • Added support for scanning alternate partition locations outside the table base path.
  • Adjusted health-check frequency to accommodate longer cluster start times.
  • Optimized concurrent loading of metadata in workers, to prevent overloading the catalog with calls.
  • Reduced log noise from in-memory cache management and from repeated log entries for custom UDF errors.
  • Sped up UI preview for large tables with many partitions to avoid timeouts. Preview will show results from the last partition.
  • Controlled Docker log size in containers with log size restrictions and a log rotation policy.
  • Added graceful handling of unsupported complex type fields in text-format data.
  • Fixed a memory leak that occurred in the REST container when a query invoked via Workspace timed out.
  • Fixed an env variable that controls the number of PyOkera worker processes in the REST container.
  • Increased the number of Gunicorn worker processes in the REST container from 4 to 8.
  • Added support for EMR 5.20. ODAS now handles backward-compatibility-breaking changes in the Presto SPI.

Known issues

  • Hive does not support scanning partitions where the partition name and the physical location do not match. For example, scanning via Hive is not supported if the partition is year=2010,month=2,date=29 and the partition location is s3://foo/year=2012/month=4/date=21/ or s3://a/b/.
  • Hive does not support scanning partitions located outside the table base directory, e.g. if the table base directory is s3://foo/loc1/ and the partition is at s3://foo/loc2. For both of the above cases, you may use Spark, Databricks, PyOkera, or the Workspace instead.
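The first limitation can be checked mechanically: Hive expects a partition's physical location to end with the path derived from the partition name (key=value segments in order). A minimal sketch of that check, with hypothetical paths:

```python
# Check whether a partition's physical location matches its name-derived
# path, the condition Hive requires per the known issue above.
# The spec and S3 paths here are illustrative examples.

def location_matches_spec(location, partition_spec):
    """partition_spec: ordered (key, value) pairs, e.g. [("year", "2010"), ...]."""
    expected = "/".join("%s=%s" % (k, v) for k, v in partition_spec)
    return location.rstrip("/").endswith(expected)

spec = [("year", "2010"), ("month", "2"), ("date", "29")]
assert location_matches_spec("s3://foo/year=2010/month=2/date=29/", spec)
assert not location_matches_spec("s3://foo/year=2012/month=4/date=21/", spec)
```

Partitions failing this check (or living outside the table base path) must be read via Spark, Databricks, PyOkera, or the Workspace, as noted above.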

1.3.1 (Feb 2019)

  • This release contains a hot fix for the large-partitions optimization introduced in the 1.3.0 release. Due to this issue, filters on partition columns could, in certain cases, trigger a full table scan and return incorrect results.

1.3.0 (January 2019)

1.3.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Dataset Registration

Datasets can now be registered in bulk through the Okera Web Portal. Choose an S3 path to crawl, and ODAS will inspect all files in that S3 path, finding possible datasets. Those datasets can be verified, modified, and registered one at a time or in bulk as needed. See docs for details.

Monitoring

ODAS clusters leverage Grafana for monitoring, which has been updated substantially in this release. Metrics are now backed by Prometheus, and the out-of-the-box monitoring dashboards have been improved.

Support for Parquet formats

Parquet formats are now fully supported, with the exception of the Map complex type. For full details, see the docs.

Other improvements

  • AWS EMR support has been extended; the supported versions are listed at https://docs.okera.com/support-versions.

  • PyOkera is now supported on Python 3.6 and 3.7.
    We recommend all clients update to the 1.3.0 PyOkera version.

  • Added support for DESCRIBE DATABASE <db_name>. See describe database

  • Partitioned column information in DESCRIBE FORMATTED <view_name> command output. See describe statement

  • Base tables referenced in the view in DESCRIBE FORMATTED <view_name> command output. See describe statement

  • Global UDF support.
    User-defined functions can now be created and shared across databases, and accessed without needing to qualify them every time. More details here.

  • Planner UDF caching.
    Hadoop clients (for example Hive and Spark running on EMR) will load the UDFs on startup. For large catalogs (specifically number of databases), this can impact client startup time. In this release, the registered UDFs are cached on the planner, with a default time to live (TTL) of 30 seconds. This should significantly speedup client start up time in these cases.
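The caching behavior described above can be sketched as a simple TTL cache: entries are served from memory until they are older than the TTL (30 seconds by default), at which point they are reloaded from the catalog. The class and names below are illustrative, not Okera's actual implementation:

```python
import time

class TTLCache:
    """A minimal sketch of planner-side caching with a time-to-live,
    in the spirit of the 30-second UDF cache described above."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, expiry_timestamp)

    def get(self, key, loader):
        """Return the cached value for key, calling loader() only when
        the entry is missing or older than the TTL."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = loader()  # e.g. fetch the registered UDFs from the catalog
        self._entries[key] = (value, now + self.ttl)
        return value
```

With a 30-second TTL, repeated client startups within the window are served from the planner's memory instead of hitting the catalog each time, which is what speeds up Hive/Spark startup for large catalogs.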

  • Added support for SHOW GRANT GROUP and SHOW GRANT USER.
    These provide convenient ways to list all the grants for a group or user, in addition to SHOW GRANT ROLE.

  • It is now possible to create a table against a fully qualified path.
    Previously, tables (and partitions) had to be created over a directory. It is now possible to create a table over a single file, simply using the full path as the LOCATION. For more details, see supported sql.

  • ODAS clusters will now default to starting up with multiple planners.
    For clusters larger than one node, the default number of planners will be greater than one. This offers better availability and load balancing. This value can still be controlled as before, by specifying the --numPlanners option when creating the cluster.

  • ocadm now supports restarting a single service in a running cluster.
    Previously, it was only possible to restart the entire cluster (all services). See ocadm clusters restart help for details.

  • Idle clients now timeout after 120 seconds.
    A client is considered idle if it has no active requests for more than the configured time. Idle clients will now timeout after 120 seconds and the queries associated with that client will be cancelled. Note that requests that take a long time and keep the server busy are not considered idle. This config can be controlled by the idle_session_timeout service config.
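The idle rule above can be sketched as follows: a session is idle only when it has no in-flight requests and its last activity is older than the timeout. This is illustrative, not Okera's implementation; the 120-second default is from the text:

```python
import time

IDLE_SESSION_TIMEOUT_S = 120  # default from the idle_session_timeout config

class Session:
    """Sketch of idle-session tracking as described above."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.active_requests = 0
        self.last_activity = clock()

    def begin_request(self):
        self.active_requests += 1
        self.last_activity = self.clock()

    def end_request(self):
        self.active_requests -= 1
        self.last_activity = self.clock()

    def is_idle(self, timeout_s=IDLE_SESSION_TIMEOUT_S):
        # A long-running request keeps the server busy, so a session is
        # never considered idle while requests are in flight.
        if self.active_requests > 0:
            return False
        return self.clock() - self.last_activity > timeout_s
```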

  • (beta) ODAS workers cache bytes from storage
    ODAS workers now support a variant of LRU caching which automatically caches bytes from the storage system. This is only supported for file system data sources: S3 and HDFS. The cache is enabled by default but defaults to a small size (1GB per worker). The cache size can be configured via the worker config io_cache_size, which controls the size in bytes; setting it to a value <= 0 disables the cache.
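The described behavior can be sketched as an LRU cache bounded by a total byte budget, with a budget of <= 0 disabling caching entirely. The class below is a minimal illustration keyed by hypothetical (path, offset) pairs, not the actual worker implementation:

```python
from collections import OrderedDict

class IoByteCache:
    """Sketch of an LRU byte cache bounded by a byte budget, in the
    spirit of the io_cache_size worker config described above."""

    def __init__(self, io_cache_size=1 << 30):  # default roughly 1GB per worker
        self.capacity = io_cache_size
        self._lru = OrderedDict()  # (path, offset) -> bytes
        self._used = 0

    def get(self, key):
        data = self._lru.get(key)
        if data is not None:
            self._lru.move_to_end(key)  # mark as most recently used
        return data

    def put(self, key, data):
        if self.capacity <= 0 or len(data) > self.capacity:
            return  # cache disabled, or block larger than the whole budget
        if key in self._lru:
            self._used -= len(self._lru.pop(key))
        self._lru[key] = data
        self._used += len(data)
        while self._used > self.capacity:
            _, evicted = self._lru.popitem(last=False)  # evict least recent
            self._used -= len(evicted)
```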

  • Performance enhancements for heavily partitioned tables.
    This release has significant performance improvements for operations on partitioned tables. The automatic partition recovery that scans for new folders added on S3 is optimized to run faster than before. Similarly, operations like ALTER TABLE ADD PARTITION and ALTER TABLE RECOVER PARTITIONS are optimized by scanning changes on S3 buckets more effectively and by managing HMS partitions with more parallelism. On the scan side, if the number of partitions is greater than 200, the partition metadata is now loaded in the workers instead of the planner; loading metadata for heavily partitioned tables in the planner was causing queries to time out. The planner now round-robins the partitions across the workers, and each worker loads the partition metadata only for the partitions it has to fetch records from. This ensures queries do not time out in the planning phase, and the overall execution time for queries on partitioned tables is faster.
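The round-robin distribution above can be sketched in a few lines. The 200-partition threshold is from the text; the function and names are illustrative, not the planner's actual code:

```python
PLANNER_LOAD_THRESHOLD = 200  # above this, workers load partition metadata

def assign_partitions(partitions, workers):
    """Deal partitions out round-robin across workers, returning
    {worker: [partitions it will load metadata for]}, or None when the
    planner can load the metadata itself (few partitions)."""
    if len(partitions) <= PLANNER_LOAD_THRESHOLD:
        return None
    assignment = {w: [] for w in workers}
    for i, part in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(part)
    return assignment
```

Each worker then fetches metadata only for its own slice, so no single process has to load all partitions during planning.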

  • JDBC queries can now run in parallel.
    JDBC queries are now run in parallel, provided a suitable numeric field is specified via the mapred.jdbc.scan.key table property on the catalog table. See scan records in parallel.
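Conceptually, a numeric scan key enables parallelism by splitting the key's value range into contiguous chunks, each read by an independent query with its own WHERE predicate. A sketch of that range splitting, with a hypothetical column name (the mapred.jdbc.scan.key property itself is from the text):

```python
def split_scan_ranges(scan_key, min_val, max_val, num_tasks):
    """Return WHERE predicates covering [min_val, max_val] contiguously,
    one per parallel task (illustrative, not Okera's planner code)."""
    total = max_val - min_val + 1
    chunk = (total + num_tasks - 1) // num_tasks  # ceiling division
    preds = []
    lo = min_val
    while lo <= max_val:
        hi = min(lo + chunk - 1, max_val)
        preds.append("%s BETWEEN %d AND %d" % (scan_key, lo, hi))
        lo = hi + 1
    return preds
```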

Incompatible changes

  • Package path for Java client has been renamed to com.okera.*.
    This should not impact typical use cases, as backward-compatible classes have been added. For example, there exist two copies of RecordServiceHiveInputFormat, in the old and new namespaces. Clients that were developed directly against the Java client library will need to be updated.

  • The default LDAP port changed from 389 to 636.
    Previously, if unspecified, ODAS defaulted to an SSL-enabled connection to the LDAP server on port 389. This configuration is atypical, as 389 is the standard non-SSL port. This release changes the default to the standard SSL port (636), since SSL is enabled by default. Users who explicitly specify this configuration (LDAP_PORT) are unaffected.

1.2.3 (December 2018)

1.2.3 is a point release with some fixes to critical issues.

Bug Fixes

  • Fix web UI's 'Preview Dataset' by making scans with record limits much faster for partitioned datasets, significantly reducing the likelihood of timeouts. In the event of a timeout, a more accurate error message is now shown.

  • Significantly improve the performance of the web UI's Dataset List when the total number of datasets is large (1000+).

  • The machines in the ODAS cluster will now install Java 8 if Java is not already installed. ODAS has always required Java 8, but some newer Linux distributions have updated the default Java version to Java 11, which is not compatible. The version is now properly pinned to Java 8.

  • ODAS clusters will by default start up with multiple planners. This previously could be optionally specified when creating the cluster but defaulted to a single planner. As part of this change, a client now has sticky sessions meaning clients will be pinned to a planner for some duration, allowing APIs such as scan_paged to work correctly.

  • Fixed issue with idle session expiry. Previously some idle sessions were not tracked correctly and did not expire as promptly as expected.

  • Fixed client side issue scanning some complex schemas with a particular combination of nested structs.

1.2.2 (October 2018)

1.2.2 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed 'ocadm agent start-minion' to aid in manual cluster repair

  • Properly return an error message for queries that contain a LEFT ANTI JOIN

  • Idle sessions now timeout by default with a timeout of 120 seconds. A session is considered idle if the client did not make any request in that time window. This config can be controlled via the planner or workers idle_session_timeout config.

  • A fix to optimize processing of datasets with large numbers of partitions.

  • Fix web UI's 'Preview Dataset' so that it relies on LAST PARTITION

1.2.1 (September 2018)

1.2.1 is a point release with some fixes to critical issues.

Bug fixes

  • Fixed a critical issue with scanning nested collections.

  • Added support for AVRO schema files specified using an HTTPS URI.

  • Fixed some error handling in PyOkera.

  • Increased the default connection limit to 512.

1.2.0 (September 2018)

1.2.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Data usage and reporting

Using the Okera Portal, users can now understand how the datasets in the system are being used. This can be useful for system administrators and data owners to understand which datasets are used most often, by whom, and with which applications. The reporting insights are built on the audit logs and automatically capture system activity. For more details, see here.

Support for data sources using JDBC

Okera Data Access Platform (ODAS) now supports data sources connected via JDBC, typically relational databases. These datasets can be registered in the Okera Catalog and then read and managed like any other Okera dataset. For more details on how to register and configure these sources, see here.

Improved access level granularity

ODAS now supports richer access levels, in addition to SELECT and ALL. It is now possible, for example, to grant users only the ability to find and view metadata, or only to alter dataset properties. We've also added the concept of a public role, which can simplify permission management. For details and best practices, see here.

Access Control Builtins

ODAS now supports a family of access control builtins. These are intended to be used in view definitions and can dramatically simplify implementing fine grained access control. See this document for more details.

Improvements to LAST SQL clause

ODAS supports the LAST PARTITION clause to facilitate sampling large datasets. In this release, this support was extended to support LAST N PARTITIONS and LAST N FILES. In addition, it is now possible to set this as metadata on the catalog object, to prevent queries trying to read all partitions. See here for best practices.

Improvements to Workspace

Workspace can now run multiple queries at once, and supports monospace-formatted output and datetime queries.

PyOkera

  • PyCerebro has been renamed PyOkera. The API is effectively unchanged except that now instead of importing 'from cerebro' you will need to import 'from okera'.

  • Parallel Execution of Tasks. PyOkera will now schedule and execute worker tasks in parallel to minimize network latency. The scan_as_pandas() and scan_as_json() API calls will by default spawn worker processes to execute tasks concurrently where possible. The default number of local worker processes is 2 times the number of CPU cores on the machine running PyOkera. This has demonstrated reduced query run times, by minimizing the latency involved in establishing network connections with the Okera worker nodes.
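The default pool sizing described above (2x the CPU core count) can be sketched as follows; the function name is illustrative, not PyOkera's API:

```python
import multiprocessing

def default_worker_processes(cpu_count=None):
    """Compute the default number of local worker processes described
    above: twice the number of CPU cores on this machine."""
    cores = cpu_count if cpu_count is not None else multiprocessing.cpu_count()
    return 2 * cores
```

On an 8-core machine this yields 16 local worker processes fanning out scan tasks to the Okera worker nodes concurrently.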

Performance

  • Improved planner task generation. One of the responsibilities of the planner is to break up the files that need to be read into tasks. In this release, we've implemented a new cost-based algorithm which should result in tasks that are more even. This should lead to less execution skew across tasks and overall reduction in job completion times.

  • Improved planning time for queries on tables with a large number of partitions. The planner now loads metadata more lazily, deferring as much as possible until after partition pruning. This can result in significantly better latency for queries that scan only a few partitions of a heavily partitioned table.

  • Improved expression handling in the planner. The planner will fold constant expressions trees and reorder expressions. Queries that push complex expressions to ODAS should see improvements.

  • Dramatically improved planner and worker RPC handling. Server side RPC handling is much more robust to slow clients or if there are transient slowdowns in dependent authentication services.

  • Worker fetch rpc now does keep alive for clients, eliminating the need to set high client RPC timeouts. Users previously worked around this by setting a very high value for recordservice.worker.rpc.timeoutMs.

  • Support for caching authenticated JWT tokens and increasing the timeout.

Other improvements

  • Added support for ALTER DATABASE <db_name> SET DBPROPERTIES

  • Added support for 'ALTER TABLE RENAME TO' in Hive.

  • EMR support now extends through EMR 5.16.0.

  • Support for MySQL 5.7 as the backing database for the Okera catalog.

  • Support for the Avro bytes data type. This is treated by ODAS as STRING.

  • EXPLAIN is now supported as a DDL command and can be run via odb or from the web portal.

  • Support for the UNION ALL operator. Note that UNION DISTINCT is not supported.

  • Updated Kubernetes to 1.10.4 and the Kubernetes dashboard to 1.8.3.

  • Users can now use the keyword CATALOG as an alias for SERVER to grant access to the entire catalog. For example, GRANT SHOW ON CATALOG TO ROLE common_role would enable metadata browsing of the entire catalog.

  • (Beta) DeploymentManager now supports deploying ODAS clusters on externally managed Kubernetes clusters. In this case, the DeploymentManager deploys only the ODAS services, without managing machines or the Kubernetes cluster.

  • The Okera portal can now store token credentials using cookies, allowing users to share credentials across web applications.

  • Added support for naming the sentryDB when creating a cluster. The name can be specified with the "--sentryDbName" flag when using ocadm.

  • Partitioned tables can now have their partitions automatically recovered.

  • Diagnostic Bundler now captures network route and firewall details

  • Support for IF NOT EXISTS in CREATE ROLE and SHOW ROLES LIKE SQL statements.

Changes in behavior

  • View usage for Python now uses the PyOkera client instead of the REST API.

  • Okera views now inherit stats from the base tables and views they are created from. These can be overridden using ALTER TABLE, but the inherited stats provide better behavior for the vast majority of use cases.

  • Okera portal no longer estimates the total number of datasets as this can cause performance issues with very large catalogs.

  • Workspace will no longer render more than 500 rows per query.

  • The Workspace terminal will drop older queries if the terminal output exceeds 750 rows in total. This will improve Workspace rendering speeds.

  • The post_scan request now uses the PyOkera client.

Bug fixes

  • Improved error handling for invalid dataset metadata, for example, an Avro schema path that is no longer valid.

  • When using field-name schema resolution for Parquet files, the field comparison is now case-insensitive. This matches the behavior of the Apache Parquet Java implementation (parquet-mr).
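The matching rule above can be sketched as a lowercase lookup from table schema columns to file field names; the function and column names are illustrative:

```python
def resolve_fields(table_columns, file_fields):
    """Map each table column to the matching file field by
    case-insensitive name, or None when no field matches."""
    by_lower = {f.lower(): f for f in file_fields}
    return {col: by_lower.get(col.lower()) for col in table_columns}
```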

  • The double data type now returns as many digits as possible when scanned via the REST API. Previously, this would round or return scientific notation for some ranges of values.

  • Allow revoking URIs to non-existent (typically deleted) S3 buckets. This would previously error out.

  • Fixed issues with creating SELECT * views in some cases. Previously, this could fail when layers of views were used.

  • The table property 'skip.header.line.count' is now properly respected.

Incompatible and Breaking Changes

  • Deprecated the CEREBRO_INSTALL_DIR and DEPLOYMENT_MANAGER_INSTALL_DIR environment variables. OKERA_INSTALL_DIR should be used.

Known issues

  • Partitioned tables where the user only has SHOW access will mistakenly show the user as having access to the partition columns in the UI. The UI properly shows that the other columns are inaccessible.
  • While multiple ODAS clusters can be configured to share their HMS and Sentry databases, datasets created by a 1.2.0 ODAS cluster cannot be read by ODAS clusters running earlier versions (1.1.x or earlier).

1.1.0 (June 2018)

The 1.1.0 release introduces two major items but no significant alterations to existing features or functionality. It includes all of the fixes from 1.0.1.

Support for Array and Map collection data types

This completes the complex types support started in 0.9.0, when the struct type was introduced. This release adds support for Arrays and Maps. As with struct support, only data stored in Avro or Parquet is supported. See the docs for more details.

Migration from company rename

In 1.0.0, we renamed the product but maintained backwards compatibility. For example, system paths that contained the product or company name were not changed. In 1.1.0, we have completed the product renaming, and users upgrading will need to migrate.

Incompatible and Breaking Changes

  • AWS AutoScalingGroup launch scripts (the parameter passed to the --clusterLaunchScript flag when creating an ODAS environment) should accept the --numHosts parameter. This is a breaking change from 1.0.0 when the flag was introduced as --hosts.

  • Okera Portal (UI) Workspace page no longer accepts queries in the URL. Bookmarks and links from previous versions that included a query will still go to the page, but query arguments will not populate the input box.

  • PyOkera (formerly PyCerebro) no longer returns a byte array to represent strings from calls to scan_as_json. Instead, it returns a UTF-8 encoded Python string, which is automatically serializable to JSON.
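A brief illustration of why this change matters: Python's json module serializes str values directly but raises TypeError for bytes, so the old byte-array representation forced callers to decode every string field themselves. The record below is a hypothetical example, not actual PyOkera output:

```python
import json

# New behavior: strings are UTF-8 Python str, so results serialize directly.
record_new = {"name": "Grace"}
assert json.dumps(record_new) == '{"name": "Grace"}'

# Old behavior: bytes values made json.dumps raise TypeError.
record_old = {"name": b"Grace"}
try:
    json.dumps(record_old)
    raise AssertionError("expected TypeError for bytes")
except TypeError:
    pass
```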

Known issues

  • Support for AWS AutoScalingGroups (ASGs) is in beta and not recommended for production use. There may be issues scaling an ASG cluster down, depending on which EC2 VM is terminated.

1.0.1 (May 2018)

1.0.1 is a patch release that fixes some critical issues in 1.0.0. We recommend all 1.0.0 users switch to this version for both the server and Java client libraries. The Java client library for this release is also versioned 1.0.1.

Fixes

  • Fixed an issue with health checking the planner and workers when Kerberos is enabled. The health checks were failing continuously, causing cluster stability issues.

  • Fixed the diagnostic bundler when some of the collected log files are very large or corrupt. In some cases, log files on the host OS in /var/log could be corrupt or very large; this used to cause the bundle to fail. Such files are now skipped and the issue is logged.

  • Fixed an issue when dropping favorited datasets in the UI. In some cases, favorited datasets that no longer exist or are no longer accessible were displayed incorrectly.

  • Enabled the small row-group optimization when reading Parquet. Parquet files with small row groups had very poor performance in some cases, particularly if the table had many columns. The implementation for how these kinds of files are read has been changed to handle this case better.

  • Fixed some queries when scanning partitioned tables from Hive with filters. Some queries using filters on partitioned tables would result in the client library generating an invalid planner request. This issue was specific to some very particular Hive queries and has been resolved in this version.

1.0.0 (May 2018)

The 1.0.0 release is a major version, introducing significant new functionality and improvements across the platform.

Name Change Notice

With the 1.0.0 generally available (GA) release, our company and product names change as well. Cerebro Data, Inc. is now Okera, Inc., and the Cerebro Data Access Platform is now the Okera Active Data Access Platform. Component names have mostly been updated, and the documentation reflects the current state of all component names. In this release, we have maintained backwards compatibility, and existing automation will continue to work. The binary paths continue to use the Cerebro name.

Deprecations Notice

  • Environment variables that begin with CEREBRO_ will continue to be supported only until version 1.2.0 is released. For standard environment variables established during installation, CEREBRO will alias OKERA.

Upgrading from prior versions

It is not possible to upgrade an existing ODAS cluster to 1.0.0 (i.e. ocadm clusters upgrade will not work). Instead, a new ODAS cluster must be created. Note that this applies only to the ODAS cluster itself; the catalog metadata from older clusters can be read with no issues.

Diagnosability and Monitoring

  • Support for collecting logs across all services and machines for diagnostics. This bundle can be sent to Okera or used by users to improve the troubleshooting experience. Cluster administrators can use the CLI to generate this support bundle across cluster services with one command.

  • Support for the Kubernetes dashboard and Grafana for UI-based administration and monitoring. ODAS now installs and enables the Kubernetes dashboard, which helps cluster admins manage an ODAS cluster (such as restarting a service, inspecting configs, etc.), and Grafana, for looking at metrics across the cluster (CPU usage, memory usage, etc.). These can now be optionally enabled at cluster creation time.

Robustness and Stability

  • Resolved issue where occasionally multiple workers can be assigned to the same VM, causing load skew and cluster membership issues.

  • Improved healthchecking to be able to detect service availability, triggering repair of individual containers more reliably. The healthcheck now more closely emulates what a client would do. If the healthcheck detects an issue, the individual container is restarted.

  • Beta support for Amazon Web Services Auto Scaling Groups (ASG). Users can now provide an ASG launch script instead of the instance launch script and the ODAS cluster will be built using the ASG. This should improve cluster launch times at scale, make it easier to manage the VMs (a single ODAS cluster is one ASG) as well as handle node failures. When the ASG relaunches a machine for the failed nodes, the Deployment Manager will automatically detect this, remove the failed nodes and add the new ones.

Docs

In this release, the product documentation has been significantly improved. Search is now supported throughout the documentation. Content has been added to cover FAQs, product overview and many other topics.

UI improvements

  • Workspace is now enabled for admin users. The workspace provides an easy interface for data producers and data stewards to issue DDL and metadata queries against the system, without the need to use another tool. Workspace provides basic query capabilities but is not intended to be used for analytics. Features include: scan queries, DDL queries, and query history.

  • A Permissions page has been added to help users understand how permissions have been configured. For data stewards, it provides the ability to look up which users have been granted access and how. For data consumers, it allows them to look up the information needed to acquire access to more datasets.

  • Users can now tag and favorite datasets to be able to easily work with them again in the future. These tags and favorites are currently unique per user.

  • Performance and scale improvements. The UI scales much better with larger catalogs.

Performance and scalability

  • Significant performance improvements for metadata operations on large catalogs. Numerous improvements were made to the responsiveness of metadata operations (DDL) against large catalogs and tables with large number of partitions. This includes operations such as show partitions and show tables as well as the planning phase at the start of each scan request.

  • Improvements to task generation in the planner. The planner is responsible for generating tasks run by the workers. The tasks generated by the planner should now be more even across more cases - for example, skew in partition sizes, data file sizes, etc.

  • Improvements to worker scalability with high concurrency. The workers are now more efficient under high concurrency (200+ clients per worker).

  • Better batch size handling in workers. Workers produce results in batches which are returned to the client. Improvements were made to manage the batch size and memory usage better across a wider range of schemas.

  • The default replication factor of some of the services (e.g. odas-rest-server) has been increased to improve fault tolerance and scalability.

EMR integration improvements

  • Spark: SparkSQL can be run directly against tables in the catalog with no need to create a temporary view. While this worked previously, the integration is improved and now performs as well as the temporary-view usage pattern. For example, SparkSQL users can simply run spark.sql('SELECT * FROM okera_sample.users').

  • Hive: Improvements to handling of partitioned tables to improve planning performance. Queries that scan a few partitions in a dataset with many partitions should see significant improvements.

  • Deprecating single-tenant install support. While it is still supported in this release, the single-tenant EMR integration is being deprecated. Instead it is recommended to use the multi-tenant install but only bootstrap a single (typically hadoop) user. Improvements were made to the user bootstrapping and setup scripts. See the [emr docs][emr-integration].

  • EMR support extended to include 5.1 and 5.2.

Miscellaneous

  • Added support for BINARY and REAL data types. See the data types docs.

  • Support for registering bucketed tables and views using lateral views. Note that since ODAS does not support these, the user must have all access on these tables and views and direct access to these files in the file system.

  • Extended ODAS SQL to add DROP_TABLE_OR_VIEW. See supported-sql for more details.

  • Audit logs now include the Okera client version as part of the application field. For example, instead of presto, new clients will identify as presto (1.0.0). Note that this is the version of the Okera client, not the version of Presto.

Incompatible and Breaking Changes

REST server default timeout increased to 60 seconds from 30

This is mostly to accommodate long DDL commands, such as alter table recover partitions.

New log filename format

The log files that are uploaded to S3 have a new naming scheme. The old naming scheme was <service>-<pid>-<id>-<timestamp>-<guid>.log. The new naming scheme is: <timestamp>-<service>-<ip>-<id>-<guid>.log.

This makes it more efficient to find logs from a given time window.
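Putting the timestamp first means a plain lexicographic sort (or an S3 prefix listing) orders log files chronologically, which is what makes time-window lookups efficient. A sketch with hypothetical filenames built from the new scheme above:

```python
def new_log_name(timestamp, service, ip, node_id, guid):
    """Build a filename in the new <timestamp>-<service>-<ip>-<id>-<guid>.log
    scheme described above (values here are illustrative)."""
    return "%s-%s-%s-%s-%s.log" % (timestamp, service, ip, node_id, guid)

names = [
    new_log_name("20180601T120000", "worker", "10.0.0.2", "n1", "abc1"),
    new_log_name("20180531T090000", "planner", "10.0.0.1", "n0", "def2"),
]
# Lexicographic order is now chronological order:
assert sorted(names)[0].startswith("20180531")
```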

ODAS REST Service returns decimal data as string, instead of double

Decimal data is converted to string when accessed via the REST APIs. The JSON result set now returns decimal values as strings to prevent any precision loss. Clients can now control the rounding behavior in the application and cast the type as needed.
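A minimal sketch of this encoding on the client side, using Python's decimal and json modules (the field name and value are hypothetical):

```python
import json
from decimal import Decimal

def decimal_safe(value):
    """Emit decimals as strings so no precision is lost, per the
    REST behavior described above."""
    return str(value) if isinstance(value, Decimal) else value

row = {"price": Decimal("19.990000000000001")}
encoded = json.dumps({k: decimal_safe(v) for k, v in row.items()})
assert encoded == '{"price": "19.990000000000001"}'

# The client controls rounding by casting the string itself:
assert round(float(json.loads(encoded)["price"]), 2) == 19.99
```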

The Deployment Manager now requires Java 8.

Known issues

Using AWS autoscaling groups

If an ODAS cluster is created using a clusterLaunchScript, it will instantiate an autoscaling group of the specified size in AWS. Scaling an ODAS cluster that is running on an ASG is not supported in 1.0.0. Specifically, scaling down is known to have issues. This will be remedied in the next release.

Earlier Releases

Release notes for 0.9.0 (April 2018) or earlier.