Release Notes

1.2.3 (November 2018 - In Development)

1.2.3 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed the web UI’s ‘Preview Dataset’ by removing its reliance on LAST PARTITION, which did not work for views.

  • Fixed the web UI’s ‘Preview Dataset’ by making scans with record limits much faster for partitioned datasets, significantly reducing the likelihood of timeouts. In the event of a timeout, a more accurate error message is now shown.

  • Significantly improved the performance of the web UI’s Dataset List when the total number of datasets is large (1000+).

  • The machines in the ODAS cluster now install Java 8 if Java is not already installed. ODAS has always required Java 8, but some newer Linux distributions have made Java 11 the default, which is not compatible. The installed version is now properly pinned to Java 8.

  • ODAS clusters now start up with multiple planners by default. Previously, this could be specified optionally when creating the cluster but defaulted to a single planner. As part of this change, clients now use sticky sessions, meaning a client is pinned to a planner for some duration, which allows APIs such as scan_paged to work correctly.

  • Fixed an issue with idle session expiry. Previously, some idle sessions were not tracked correctly and did not expire as promptly as expected.

  • Fixed a client-side issue when scanning some complex schemas with a particular combination of nested structs.

1.2.2 (October 2018)

1.2.2 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed ‘ocadm agent start-minion’ to aid in manual cluster repair.

  • An error message is now properly returned for queries that contain a LEFT ANTI JOIN.

  • Idle sessions now time out by default after 120 seconds. A session is considered idle if the client has not made any request within that window. This can be controlled via the planner’s or workers’ idle_session_timeout configuration.

  • Optimized processing of datasets with large numbers of partitions.

  • JDBC data source scans can now be run in parallel. To enable this mode, the JDBC table should have the ‘mapred.jdbc.scan.key’ property defined in its catalog tblproperties, set to a valid numeric column (see the sketch after this list).

  • Fixed the web UI’s ‘Preview Dataset’ so that it relies on LAST PARTITION.
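
For example, a minimal PyOkera sketch of enabling parallel JDBC scans: the database, table, and column names are placeholders, and the connect/execute_ddl calls assume PyOkera’s client API and the default planner port (authentication setup is omitted).

      # Hypothetical example: point 'mapred.jdbc.scan.key' at a numeric column
      # so the planner can split the JDBC scan into parallel tasks.
      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          conn.execute_ddl(
              "ALTER TABLE jdbc_db.orders "
              "SET TBLPROPERTIES('mapred.jdbc.scan.key'='order_id')")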

1.2.1 (September 2018)

1.2.1 is a point release with some fixes to critical issues.

Bug Fixes

  • Fixed a critical issue with scanning nested collections.

  • Added support for AVRO schema files specified using an HTTPS URI (see the sketch after this list).

  • Fixed some error handling in PyOkera.

  • Increased the default connection limit to 512.
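
A minimal sketch of registering such a dataset, assuming the schema is referenced through the Hive-style ‘avro.schema.url’ table property and PyOkera’s execute_ddl API; the table name, S3 location, and schema URL are placeholders.

      # Hypothetical example: the Avro schema file is served over HTTPS.
      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          conn.execute_ddl("""
              CREATE EXTERNAL TABLE demo.users_avro
              STORED AS AVRO
              LOCATION 's3://your-bucket/users/'
              TBLPROPERTIES('avro.schema.url'='https://example.com/schemas/users.avsc')
          """)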

1.2.0 (September 2018)

1.2.0 is the next major Okera release with significant new functionality and improvements throughout the platform.

Major Features

Data usage and reporting

Using the Okera Portal, users can now understand how the datasets in the system are being used. This can be useful for system administrators and data owners to understand which datasets are being used most often, by whom, and with which applications. The reporting insights are built on the audit logs and automatically capture system activity. For more details, see here.

Support for data sources using JDBC

Okera Data Access Platform (ODAS) now supports data sources connected via JDBC, typically relational databases. These datasets can now be registered in the Okera Catalog and then read and managed like any other Okera dataset. For more details on how to register and configure these sources, see here.

Improved access level granularity

ODAS now supports richer access levels in addition to SELECT and ALL. It is now possible, for example, to grant users only the ability to find and look at metadata, or only to alter dataset properties. We’ve also added the concept of a public role, which can simplify permission management. For details and best practices, see here.
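
For example, a minimal sketch of the finer-grained levels: the role and database names are hypothetical, the SHOW and ALTER levels follow the description above, and the execute_ddl calls assume PyOkera’s client API.

      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          # Let a role discover datasets and browse metadata only.
          conn.execute_ddl("GRANT SHOW ON DATABASE sales TO ROLE metadata_browsers")
          # Let a role alter dataset properties only.
          conn.execute_ddl("GRANT ALTER ON DATABASE sales TO ROLE dataset_stewards")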

Access Control Builtins

ODAS now supports a family of access control builtins. These are intended to be used in view definitions and can dramatically simplify implementing fine-grained access control. See this document for more details.
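
A minimal sketch of the intended usage pattern: the has_roles() builtin name and the table, column, and role names are illustrative assumptions (see the builtins document for the actual function names), and the execute_ddl call assumes PyOkera’s client API.

      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          # Members of the (hypothetical) pii_readers role see the raw email;
          # everyone else sees NULL.
          conn.execute_ddl("""
              CREATE VIEW sales.transactions_masked AS
              SELECT txn_id,
                     CASE WHEN has_roles('pii_readers') THEN email ELSE NULL END AS email
              FROM sales.transactions
          """)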

Improvements to LAST SQL clause

ODAS supports the LAST PARTITION clause to facilitate sampling large datasets. In this release, this support was extended to LAST N PARTITIONS and LAST N FILES. In addition, it is now possible to set this as metadata on the catalog object, to prevent queries from trying to read all partitions. See here for best practices.
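
A minimal sketch of sampling with the extended clause, assuming PyOkera’s scan_as_json API; the table name is a placeholder, and the clause placement follows the existing LAST PARTITION form.

      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          # Read only the two most recent partitions rather than the full table.
          sample = conn.scan_as_json('SELECT * FROM clickstream.events LAST 2 PARTITIONS')
          # LAST N FILES works the same way, limiting the scan to the last N files.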

Improvements to Workspace

Workspace can now run multiple queries at once and supports monospace-formatted output and datetime queries.

PyOkera

  • PyCerebro has been renamed PyOkera. The API is effectively unchanged, except that instead of importing ‘from cerebro’ you now import ‘from okera’ (see the sketch after this list).

  • Parallel execution of tasks. PyOkera now schedules and executes worker tasks in parallel to minimize network latency. The scan_as_pandas() and scan_as_json() API calls by default spawn worker processes to execute tasks concurrently where possible. The default number of local worker processes is two times the number of CPU cores on the client machine. This has been shown to reduce query run time by minimizing the latency involved in establishing network connections with the Okera worker nodes.
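
A minimal sketch of the renamed import and the parallelized scan calls: the planner host and port are placeholders, authentication setup is omitted, and okera_sample.users is the sample dataset referenced elsewhere in these notes.

      from okera import context   # previously: from cerebro import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          # Both calls now fan work out across local worker processes
          # (default: 2x the CPU cores of the client machine).
          df = conn.scan_as_pandas('okera_sample.users')
          rows = conn.scan_as_json('okera_sample.users')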

Performance

  • Improved planner task generation. One of the responsibilities of the planner is to break up the files that need to be read into tasks. In this release, we’ve implemented a new cost-based algorithm that should produce more even tasks. This should lead to less execution skew across tasks and an overall reduction in job completion times.

  • Improved planning time for queries on tables with a large number of partitions. The planner now loads metadata more lazily, deferring as much as possible until after partition pruning. This can result in significantly better latency for queries that scan only a few partitions of a heavily partitioned table.

  • Improved expression handling in the planner. The planner now folds constant expression trees and reorders expressions. Queries that push complex expressions to ODAS should see improvements.

  • Dramatically improved planner and worker RPC handling. Server-side RPC handling is now much more robust to slow clients and to transient slowdowns in dependent authentication services.

  • The worker fetch RPC now performs keep-alive for clients, eliminating the need to set high client RPC timeouts. Users previously worked around this by setting a very high value for recordservice.worker.rpc.timeoutMs.

  • Support for caching authenticated JWT tokens and increasing the timeout. See the advanced install documentation for more details.

Other improvements

  • Added support for ALTER DATABASE <db_name> SET DBPROPERTIES (a combined example of several of the new SQL statements appears after this list).

  • Added support for ‘ALTER TABLE RENAME TO’ in Hive.

  • EMR support now extends through EMR 5.16.0.

  • Support for MySQL 5.7 as the backing database for the Okera catalog.

  • Added support for the Avro bytes data type, which ODAS treats as STRING.

  • EXPLAIN is now supported as a DDL command and can be run via odb or from the web portal.

  • Support for the UNION ALL operator. Note that UNION DISTINCT is not supported.

  • Updated Kubernetes to 1.10.4 and the Kubernetes dashboard to 1.8.3.

  • Users can now use the keyword CATALOG as an alias for SERVER to grant access to the entire catalog. For example, GRANT SHOW ON CATALOG TO ROLE common_role would enable metadata browsing of the entire catalog.

  • (Beta) DeploymentManager now supports deploying ODAS clusters on externally managed Kubernetes clusters. In this case, the DeploymentManager just deploys ODAS services without managing machines or the Kubernetes install.

  • The Okera portal can now store token credentials using cookies, allowing users to share credentials across web applications. See here for configuration details.

  • Added support for naming the sentryDB when creating a cluster. The name can be specified with the “--sentryDbName” flag when using ocadm.

  • Partitioned tables can now have their partitions automatically recovered. See the FAQ for more details.

  • The Diagnostic Bundler now captures network route and firewall details.

  • Support for IF NOT EXISTS in CREATE ROLE, and for the SHOW ROLES LIKE SQL statement.
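
A combined sketch of several of the statements listed above, assuming PyOkera’s execute_ddl API; the database and property names are placeholders, the GRANT example is the one quoted above, and the SHOW ROLES LIKE pattern syntax is assumed.

      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          # Set database-level properties.
          conn.execute_ddl("ALTER DATABASE marketing SET DBPROPERTIES('owner'='data-eng')")
          # Inspect how a statement would be executed.
          conn.execute_ddl("EXPLAIN SELECT * FROM okera_sample.users")
          # Create a role idempotently, grant catalog-wide metadata browsing, and list it.
          conn.execute_ddl("CREATE ROLE IF NOT EXISTS common_role")
          conn.execute_ddl("GRANT SHOW ON CATALOG TO ROLE common_role")
          conn.execute_ddl("SHOW ROLES LIKE 'common*'")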

Changes in behavior

  • View usage for Python now uses the PyOkera client instead of the REST API.

  • Okera views now inherit statistics from the base tables and views they are created from. These can be overridden using ALTER TABLE, but the inherited values provide better behavior for the vast majority of use cases.

  • The Okera portal no longer estimates the total number of datasets, as this can cause performance issues with very large catalogs.

  • Workspace will no longer render more than 500 rows per query.

  • The Workspace terminal will drop older queries if the terminal output exceeds 750 rows of query output in total. This improves Workspace rendering speed.

  • The post_scan request now uses the PyOkera client.

Bug Fixes

  • Improved error handling for invalid dataset metadata, for example an Avro schema path that is no longer valid.

  • When using field-name schema resolution for Parquet files, the field comparison is now case-insensitive. This matches the behavior of the Apache Parquet Java implementation (parquet-mr).

  • The double data type now returns as many digits as possible when scanned via the REST API. Previously, this would round or return scientific notation for some ranges of values.

  • Allow revoking URIs to non-existent (typically deleted) S3 buckets. This would previously error out.

  • Fixed issues with creating SELECT * views. Previously, in some cases, this could fail when layers of views were used.

  • The table property ‘skip.header.line.count’ is now properly respected (see the sketch after this list).
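
A minimal sketch of setting the header-skip property on a CSV-backed table, assuming PyOkera’s execute_ddl API; the table name is a placeholder.

      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          # Skip the single header line at the top of each data file.
          conn.execute_ddl(
              "ALTER TABLE staging.daily_report "
              "SET TBLPROPERTIES('skip.header.line.count'='1')")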

Incompatible and Breaking Changes

  • Deprecated the CEREBRO_INSTALL_DIR and DEPLOYMENT_MANAGER_INSTALL_DIR environment variables. OKERA_INSTALL_DIR should be used instead.

Known issues

  • Partitioned tables where the user only has SHOW access will mistakenly show the user as having access to the partition columns in the UI. The UI properly shows that the other columns are inaccessible.

  • While multiple ODAS clusters can be configured to share their HMS and Sentry databases, datasets created by a 1.2.0 ODAS cluster cannot be read by ODAS clusters running earlier versions (1.1.x or earlier).

1.1.0 (June 2018)

The 1.1.0 release introduces two major items but no significant alterations to existing features or functionality. It includes all of the fixes from 1.0.1.

Support for Array and Map collection data types

This completes the complex types support started in 0.9.0, when the struct type was introduced. This release adds support for Arrays and Maps. As with struct support, only data stored in Avro or Parquet is supported. See the docs for more details.

Migration from company rename

In 1.0.0, we renamed the product but maintained backwards compatibility. For example, system paths that contained the product or company name were not changed. In 1.1.0, we have completed the product renaming, and users upgrading will need to migrate. See the migration guide for details on the changes and their implications.

Incompatible and Breaking Changes

  • AWS AutoScalingGroup launch scripts (the parameter passed to the --clusterLaunchScript flag when creating an ODAS environment) should accept the --numHosts parameter. This is a breaking change from 1.0.0 when the flag was introduced as --hosts.

  • Okera Portal (UI) Workspace page no longer accepts queries in the URL. Bookmarks and links from previous versions that included a query will still go to the page, but query arguments will not populate the input box.

  • PyOkera (formerly PyCerebro) no longer returns a byte array to represent strings from calls to scan_as_json. Instead, it returns a UTF-8 encoded Python string, which is automatically serializable to JSON (see the sketch below).
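
A minimal sketch of the new behavior, assuming PyOkera’s connect and scan_as_json API; the planner host and port are placeholders, and okera_sample.users is the sample dataset referenced elsewhere in these notes.

      import json
      from okera import context

      ctx = context()
      with ctx.connect(host='your-planner-host', port=12050) as conn:
          rows = conn.scan_as_json('okera_sample.users')
          # String columns arrive as native UTF-8 Python strings, so the result
          # serializes directly to JSON without decoding byte arrays first.
          print(json.dumps(rows[:5]))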

Known issues

  • Support for AWS AutoScalingGroups (ASGs) is in beta and not recommended for production use. There may be issues scaling an ASG cluster down, depending on which EC2 VM is terminated.

1.0.1 (May 2018)

1.0.1 is a patch release that fixes some critical issues in 1.0.0. We recommend all 1.0.0 users switch to this version for both the server and Java client libraries. The Java client library for this release is also 1.0.1.

Fixes

  • Fixed an issue with health checking the planner and workers when Kerberos is enabled. The health checks were failing continuously, causing cluster stability issues.

  • Fixed the diagnostic bundler when some of the collected log files are very large or corrupt. In some cases, log files on the host OS in /var/log could be corrupt or very large, which used to cause the bundle to fail. These files are now skipped and the issue is logged.

  • Fixed an issue when dropping favorited datasets in the UI. In some cases, favorited datasets that no longer exist or are no longer accessible were displayed incorrectly.

  • Enabled the small row-group optimization when reading Parquet. Parquet files with small row groups had very poor performance in some cases, particularly if the table had many columns. The implementation for how these kinds of files are read has been changed to handle this case better.

  • Fixed some queries when scanning partitioned tables from Hive with filters. Some queries using filters on partitioned tables would result in the client library generating an invalid planner request. This issue was specific to some very particular Hive queries but has been resolved in this version.

1.0.0 (May 2018)

The 1.0.0 release is a major version, introducing significant new functionality and improvements across the platform.

Name Change Notice

With the 1.0.0 generally available (GA) release, our company and product names change as well. Cerebro Data, Inc. is now Okera, Inc., and the Cerebro Data Access Platform is now the Okera Active Data Access Platform. Component names have mostly been updated, and the documentation should reflect the current state of all component names. In this release, we have maintained backwards compatibility, and existing automation will continue to work. The binary paths continue to use the Cerebro name.

Deprecation Notice

  • Environment variables that begin with CEREBRO_ will continue to be supported only until version 1.2.0 is released. For standard environment variables established during installation, the CEREBRO_ variables alias their OKERA_ equivalents.

Upgrading from prior versions

It is not possible to upgrade an existing ODAS cluster to 1.0.0 (i.e., ocadm clusters upgrade will not work). Instead, a new ODAS cluster must be created. Note that this applies only to the ODAS cluster itself; the catalog metadata from older clusters can be read with no issues.

Diagnosability and Monitoring

  • Support for collecting logs across all services and machines for diagnostics. This bundle can be sent to Okera or used directly by users to improve the troubleshooting experience. Cluster administrators can use the CLI to generate this support bundle across cluster services with one command. For more details, see here.

  • Support for the Kubernetes dashboard and Grafana for UI-based administration and monitoring. ODAS now installs and enables the Kubernetes dashboard, which helps cluster admins manage an ODAS cluster (restarting a service, inspecting configs, etc.), and Grafana, for looking at metrics across the cluster (CPU usage, memory usage, etc.). These can be optionally enabled at cluster creation time. For more details, see here.

Robustness and Stability

  • Resolved an issue where occasionally multiple workers could be assigned to the same VM, causing load skew and cluster membership issues.

  • Improved healthchecking to be able to detect service availability, triggering repair of individual containers more reliably. The healthcheck now more closely emulates what a client would do. If the healthcheck detects an issue, the individual container is restarted.

  • Beta support for Amazon Web Services Auto Scaling Groups (ASGs). Users can now provide an ASG launch script instead of the instance launch script, and the ODAS cluster will be built using the ASG. This should improve cluster launch times at scale, make it easier to manage the VMs (a single ODAS cluster is one ASG), and better handle node failures. When the ASG relaunches machines for failed nodes, the Deployment Manager will automatically detect this, remove the failed nodes, and add the new ones. See here.

Docs

In this release, the product documentation has been significantly improved. Search is now supported throughout the documentation. Content has been added to cover FAQs, product overview and many other topics.

UI improvements

  • Workspace is now enabled for admin users. The workspace provides an easy interface for data producers and data stewards to issue DDL and metadata queries against the system, without the need to use another tool. Workspace provides basic query capabilities but is not intended to be used for analytics. Features include: scan queries, DDL queries, and query history.

  • A Permissions page has been added to help users understand how permissions have been configured. For data stewards, it provides the ability to look up which users have been granted access and how. For data consumers, it allows them to look up the information needed to acquire access to more datasets.

  • Users can now tag and favorite datasets to be able to easily work with them again in the future. These tags and favorites are currently unique per user.

  • Performance and scale improvements. The UI scales much better with larger catalogs.

Performance and scalability

  • Significant performance improvements for metadata operations on large catalogs. Numerous improvements were made to the responsiveness of metadata operations (DDL) against large catalogs and tables with a large number of partitions. This includes operations such as SHOW PARTITIONS and SHOW TABLES, as well as the planning phase at the start of each scan request.

  • Improvements to task generation in the planner. The planner is responsible for generating tasks run by the workers. The tasks generated by the planner should now be more even across more cases - for example, skew in partition sizes, data file sizes, etc.

  • Improvements to worker scalability with high concurrency. The workers are now more efficient under high concurrency (200+ clients per worker).

  • Better batch size handling in workers. Workers produce results in batches which are returned to the client. Improvements were made to manage the batch size and memory usage better across a wider range of schemas.

  • The default replication factor of some of the services (e.g. odas-rest-server) has been increased to improve fault tolerance and scalability.

EMR integration improvements

  • Spark: SparkSQL can be run directly against tables in the catalog with no need to create a temporary view. While this worked previously, the integration has been improved and now performs as well as the temporary-view usage pattern. For example, SparkSQL users can just run spark.sql('SELECT * FROM okera_sample.users') (see the sketch after this list).

  • Hive: Improvements to handling of partitioned tables to improve planning performance. Queries that scan a few partitions in a dataset with many partitions should see significant improvements.

  • Deprecating single-tenant install support. While it is still supported in this release, the single-tenant EMR integration is being deprecated. Instead, it is recommended to use the multi-tenant install but only bootstrap a single (typically hadoop) user. Improvements were made to the user bootstrapping and setup scripts. See the EMR integration docs.

  • EMR support extended to include 5.1 and 5.2.
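
A minimal PySpark sketch of the direct-query pattern described above; it assumes an EMR cluster that has already been configured with the Okera client integration.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName('odas-example').getOrCreate()
      # No temporary view is needed; the catalog table is queried directly.
      df = spark.sql('SELECT * FROM okera_sample.users')
      df.show(10)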

Miscellaneous

  • Added support for BINARY and REAL data types. See the data types docs.

  • Support for registering bucketed tables and views using lateral views. Note that since ODAS does not support these, the user must have ALL access on these tables and views and direct access to the underlying files in the file system.

  • Extended ODAS SQL to add DROP_TABLE_OR_VIEW. See the supported SQL docs for more details.

  • Audit logs now include the Okera client version as part of the application field. For example, instead of presto, new clients will identify as presto (1.0.0). Note that this is the version of the Okera client, not the version of Presto.

Incompatible and Breaking Changes

REST server default timeout increased from 30 to 60 seconds

This is mostly to accommodate long DDL commands, such as ALTER TABLE RECOVER PARTITIONS.

New log filename format

The log files that are uploaded to S3 have a new naming scheme. The old naming scheme was <service>-<pid>-<id>-<timestamp>-<guid>.log. The new naming scheme is: <timestamp>-<service>-<ip>-<id>-<guid>.log.

This makes it more efficient to find logs from a given time window.

ODAS REST Service returns decimal data as string, instead of double

Decimal data is converted to string when accessed via the REST APIs. The JSON result set now returns decimal values as strings to prevent any precision loss. Clients can now control the rounding behavior in the application and cast the type as needed.

Deployment Manager now requires Java 8

Known issues

Using AWS autoscaling groups

If an ODAS cluster is created using a clusterLaunchScript, it will instantiate an autoscaling group of the specified size in AWS. Scaling an ODAS cluster that is running on an ASG is not supported in 1.0.0. Specifically, scaling down is known to have issues. This will be remedied in the next release.

Earlier Releases

Release notes for 0.9.0 (April 2018) or earlier.