Okera Version 2.2 Release Notes¶
This topic provides Release Notes for all 2.2 versions of Okera.
2.2.2¶
Bug Fixes and Improvements¶
- Fixed an issue in the Hive and Spark client libraries when generating planning SQL that contained `DATE` types.
- Fixed an issue in scanning partitioned Delta tables.
- Fixed an issue in the Spark and Hive client libraries where they would not properly maintain millisecond values for `TIMESTAMP` columns (microsecond and nanosecond values, if present, were correctly retained).
- Fixed an issue in the PrestoDB client library where, if a column name was also a reserved keyword (e.g., `database` or `metadata`) AND the column was a complex type (e.g., `STRUCT`), the client library would produce an invalid planning request.
Notable and Incompatible Changes¶
Default Docker Repository Changed to quay.io/okera¶
Okera has changed the Docker repository that images are pushed to from DockerHub to Quay, due to the impact of the newly enforced rate limits on DockerHub.
Okera's images are available with the `quay.io/okera` prefix (the image names have not changed).
2.2.1¶
Bug Fixes and Improvements¶
- Fixed an issue in PrestoDB split computation in very large clusters.
- Removed the default restriction on column comment length (the previous limit was 256 characters).
  Note: This changes the underlying HMS schema; if connected to a shared HMS, this behavior should be disabled by setting the `HMS_REMOVE_LENGTH_RESTRICTION` configuration value to `false`.
- Improved resilience in handling crawling errors.
- Fixed an issue with `WITH GRANT OPTION` on non-`ALL` privileges.
- Restricted querying of datasets with nested types when policies apply to tags on the nested type.
- Fixed an issue when paginating in the Datasets page.
- Fixed an issue with the `/api/get-token` endpoint.
2.2.0¶
New Features¶
JDBC Data Sources¶
Custom JDBC Driver Support¶
Okera has added support for specifying custom JDBC data sources beyond those that ship out of the box. If you would like to connect to a custom JDBC data source, please work with Okera Support to define the JDBC connection information appropriately for your driver.
Secure Values for JDBC Properties¶
Okera has added support for referring to secret values in the JDBC properties file from local secret sources such as Kubernetes secrets, as well as secure Cloud services such as AWS Secrets Manager and AWS SSM Parameter Store.
For example:
```properties
driver=mysql
type=mysql
host=...
port=3306
user=awsps:///mysql/username
password=awsps:///mysql/password
```
This looks up the values for `/mysql/username` and `/mysql/password` in AWS SSM Parameter Store.
You can similarly use the `file://` prefix for local files (using Kubernetes mounted secrets) or `awssm://` to use AWS Secrets Manager.
Note: If you use AWS SSM Parameter Store or AWS Secrets Manager, you will need to provide the correct IAM credentials to access these values.
Predicate Pushdown Enabled by Default for JDBC-Backed Data Sources¶
Starting in 2.2.0, predicate pushdown for JDBC-backed data sources is enabled by default (this was previously available as an opt-in property at the per-data-source level) and will be used whenever appropriate.
To disable predicate pushdown for a particular JDBC-backed database or table, you can specify `'jdbc.predicates.pushdown.enabled' = 'false'` in the `DBPROPERTIES` or `TBLPROPERTIES` (you can read more here).
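As a sketch, the property could be applied to an individual table with an `ALTER TABLE` statement; the table name below is illustrative:

```sql
-- Disable predicate pushdown for one hypothetical JDBC-backed table.
ALTER TABLE mydb.mytable
SET TBLPROPERTIES('jdbc.predicates.pushdown.enabled' = 'false');
```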
BLOB/CLOB Data Type Support¶
Okera now supports the `BLOB` and `CLOB` data types for Oracle JDBC data sources.
Autotagging for JDBC-Backed Data Sources¶
When registering JDBC-backed data sources and loading their tables, Okera now runs its autotagger by default.
You can disable this behavior by specifying `okera.autotagger.skip=true` in your `DBPROPERTIES`.
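For example, assuming Hive-style `ALTER DATABASE ... SET DBPROPERTIES` syntax is supported for the registered database (the database name here is illustrative), the property could be set like this:

```sql
-- Skip autotagging for a hypothetical JDBC-backed database.
ALTER DATABASE jdbc_sales SET DBPROPERTIES('okera.autotagger.skip' = 'true');
```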
UI Improvements for Tabular Results¶
The UI now makes it easy to copy or download results as CSV from tables. This can be used in the Workspace and when previewing a dataset.
Operability Improvements¶
- Okera now generates correlated IDs for the Policy Engine (planner) and Enforcement Fleet worker tasks to make it easier to correlate task information in the logs. For example, the Policy Engine may have a task of the form `9b45f8b08c76352e:85a51f5579300000`, and if N worker tasks were generated, they would be of the form `9b45f8b08c76352e:85a51f5579300001`, `9b45f8b08c76352e:85a51f5579300002`, and so on.
- System administrators can now easily access the Policy Engine (planner) and Enforcement Fleet worker debug UIs from the System page in the Okera UI.
- System administrators can now see how many unique users have accessed data via Okera on the System page in the Okera UI, both all-time and in the last 30 days.
Domino Data Lab Integration¶
When run in Domino Data Lab environments (starting in Domino version 4.3.0), PyOkera now has built-in integration that leverages the automatically generated JWT tokens in the Domino environment, enabling transparent authentication between Domino environments and Okera deployments.
```python
import os
from okera.integration import domino

ctx = domino.context()
with ctx.connect(host=os.environ['OKERA_HOST'], port=int(os.environ['OKERA_PORT'])) as conn:
    df = conn.scan_as_pandas('drug_xyz.trial_july2020')
```
PrestoDB Improvements¶
- Several internal improvements were made to Okera's PrestoDB connector to increase performance in areas such as data deserialization, asynchronous processing, and memory allocation.
- Several improvements were made to auto-tune Okera's built-in PrestoDB cluster to better match its environment's capabilities.
- When filtering on columns of `DATE` type, the PrestoDB connector now pushes those filters down into the Okera workers.
- Okera's PrestoDB connector has added support for table statistics if these are set on the table in the Okera catalog. They can be set via the `numRows` table property, e.g.:
  `ALTER TABLE mydb.mytable SET TBLPROPERTIES('numRows'='12345')`
  These table statistics will be considered by Presto's cost-based optimizer (e.g., for `JOIN` reordering).
User Attributes¶
Okera added the `user_attribute(<attribute>)` built-in function, which retrieves attribute values on a per-user basis.
These can be used in policy definitions, e.g., to apply dynamic per-user filters.
These attributes can be fetched from AD/LDAP by setting the `LDAP_USER_ATTRIBUTES` configuration value to a comma-separated list of attributes to fetch, e.g.:
`LDAP_USER_ATTRIBUTES: region,manager,businessUnit`
If the attribute is missing for the executing user, the value returned is `null`.
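As a sketch of a dynamic per-user filter, a view could compare a column against the fetched LDAP attribute; the table, view, and column names below are illustrative:

```sql
-- Each user sees only rows matching their own `region` attribute,
-- assuming `region` was configured in LDAP_USER_ATTRIBUTES.
CREATE VIEW sales.txns_by_region AS
SELECT * FROM sales.txns
WHERE region = user_attribute('region');
```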
Hudi and Delta Lake Support (Experimental Feature)¶
Okera has added experimental support for Delta Lake and Apache Hudi tables.
You can create Apache Hudi tables using the `CREATE EXTERNAL TABLE` DDL. For example:
```sql
CREATE EXTERNAL TABLE mydb.my_hudi_tbl
LIKE PARQUET 's3://path/to/dataset/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet'
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset';
```
You can create Delta Lake tables using the `CREATE EXTERNAL TABLE` DDL. For example:
```sql
CREATE EXTERNAL TABLE mydb.my_delta_tbl (id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset/';
```
The following limitations should be kept in mind:
- In both cases, tables need to be explicitly registered, as crawling will not properly identify these tables as Hudi or Delta Lake.
- For Apache Hudi, Okera only supports Snapshot Queries on Copy-on-Write tables and Read Optimized Queries on Merge-on-Read tables.
- Delta tables are supported as symlink tables, so you will need to define them as such and ensure that their manifests are generated properly.
PHI-Related Privacy Functions¶
Okera has added several privacy functions typically used in health- and medical-related environments:
- `phi_zip3`
- `phi_age`
- `phi_date`
- `phi_dob`
These functions are compliant with the HIPAA Safe Harbor standard.
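As a sketch of how these might be applied in a de-identifying view: the table and column names below are illustrative, and the single-argument signatures are an assumption rather than documented here.

```sql
-- Hypothetical de-identified view over a patient table; each PHI
-- function is assumed to take the raw column as its only argument.
CREATE VIEW trials.patients_deid AS
SELECT phi_zip3(zip_code)     AS zip3,
       phi_age(age)           AS age,
       phi_dob(date_of_birth) AS dob
FROM trials.patients;
```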
Nested Field Tagging (Preview Feature)¶
Okera has added the capability (disabled by default) to tag nested types (specifically, `ARRAY` and `STRUCT` types) and have those tags be inherited when used in views that unnest the nested portion. In addition, Okera has added the ability to fully unnest a table and inherit the tags on that object into the view; this is done using the `SELECT **` operator.
To enable this feature, set the `FEATURE_UI_TAGS_COMPLEX_TYPES` and `ENABLE_COMPLEX_TYPE_TAGS` configuration parameters to `true`.
Note: ABAC policies that apply to tags assigned to nested types will not be enforced on the base table, so take care to only give access to unnested views in these cases.
For more information, see Nested Field Tags.
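As a sketch of the unnesting operator, a fully flattened view over a nested table might look like this (table and view names are illustrative):

```sql
-- Fully unnest a table with ARRAY/STRUCT columns into a flat view;
-- tags on the nested fields are inherited by the view's columns.
CREATE VIEW mydb.events_flat AS
SELECT ** FROM mydb.events;
```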
Bug Fixes and Improvements¶
- Fixed an issue in the Okera Presto connector where some queries against `information_schema` could cause an exception and fail.
- Fixed an issue in the Tags management UI where the number of tagged datasets could be incorrect.
- Improved handling so internal databases are not displayed in the UI.
- Added the ability to run `GRANT` for `ADD_ATTRIBUTE` and `REMOVE_ATTRIBUTE` at the `CATALOG` scope.
- Removed the need to have `ALTER` permissions on tables and databases to run add/remove attributes (`ADD_ATTRIBUTE` and `REMOVE_ATTRIBUTE` are now sufficient).
- Added an implementation of `listTableNamesByFilter` in the HMS connector.
- Added the ability to configure timeouts for internal checks to account for large network latency; this can be set using `OKERA_PINGER_TIMEOUT_SEC`.
- Support implicit upcasting for Parquet columns of type `INT32` to be represented by `BIGINT` in the table schema.
- Improved the experience when previewing JDBC-backed datasets by limiting the amount of data fetched.
- Added `DELETE`, `UPDATE`, and `INSERT` as grantable privileges.
- Fixed an issue in `okctl` where it would not report an error and abort if there was an error updating the ports.
- Improved handling of small files for Parquet-backed datasets.
- Fixed an issue where the Autotagger would not correctly handle columns with `DATE` type.
- Improved handling for JDBC-backed tables where the table name contained `.` characters.
- When running a worker load balancer (the default for EKS and AKS environments), the built-in Presto cluster now by default uses the internal cluster-local load balancer and not the external one.
- Fixed an issue with pagination on the Datasets page where paging to the end of the list and back showed an inaccurate count.
- Improved diagnostic information available when registering a JDBC-backed table that has unsupported types or invalid characters.
- Improved filter pushdown for Oracle tables for columns of `DATE` and `TIMESTAMP` type.
- Improved handling of `DECIMAL`, `NCHAR`, and `FLOAT` data types for JDBC-backed data sources.
- Improved inference of `BIGINT` values in text values (e.g., CSV).
- Fixed an issue where workers were not generating `SCAN_END` audit events.
- Fixed an issue where table/view lineage information could be duplicated.
- Upgraded Gravity to 6.1.39.
- Removed the hardcoded `fetch_size` in PyOkera and added the ability to explicitly set it using the `fetch_size` keyword argument to `exec_task`.
- Fixed an issue where pagination on the Datasets UI could get into an inconsistent state when filtering by tags.
- When using `tokenize`, referential integrity is now also maintained for `INTEGER` columns.
- Added `IF NOT EXISTS` and `IF EXISTS` modifiers to the `GRANT` and `REVOKE` DDLs, respectively.
- Fixed an issue when doing writes in Amazon EMR Spark when metadata bypass was enabled for non-partitioned tables.
- Added limited support for Avro files with recursive schemas, which allows a maximum cycle depth of 2.
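As a sketch of the new `GRANT`/`REVOKE` modifiers mentioned above (role, privilege, and database names are illustrative):

```sql
-- Grant only if an identical grant does not already exist; revoke only
-- if the grant exists, so neither statement errors on a repeat run.
GRANT IF NOT EXISTS SELECT ON DATABASE sales TO ROLE analyst_role;
REVOKE IF EXISTS SELECT ON DATABASE sales FROM ROLE analyst_role;
```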
Notable and Incompatible Changes¶
Upgrading From 2.1.x¶
When upgrading from an Okera 2.1.x version lower than 2.1.10, some functionality may stop working in the 2.1.x deployment if running side by side or downgrading back to 2.1.x. If preserving this behavior is desirable, please upgrade to 2.1.10 first or work with Okera Support.
Container User Changed From root¶
Starting in 2.2.0, the process user inside all the Okera containers (running as Kubernetes pods) is no longer `root`; the containers now run under dedicated users.
As part of this change, any files that are downloaded into the container (e.g., from Amazon S3 for custom certificates) are now placed in `/etc/okera` and not `/etc`.
SQL Keywords¶
The following terms are now keywords, starting in 2.2.0:
- `DELETE`
- `HUDIPARQUET`
- `INPUTFORMAT`
- `OUTPUTFORMAT`
- `UPDATE`