Object Storage Connections
Okera supports the following cloud object storage types.
- Amazon S3
- Azure: ADLS Gen2
- Google Cloud Storage
- HDFS
Note: Make sure Okera already has read access to your files, as configured in your cluster. You cannot create a crawler on a path for which Okera does not have read access.
Supported File Formats
Okera supports registering data in these file formats:
- Apache Avro
- CSV (comma-separated values)
- Delta Lake
- Apache Hudi
- JSON
- Apache Parquet
- ORC (Optimized Row Columnar)
- TEXT
Register Data From Object Storage
See Create and Run a Crawler and Register Datasets to learn how to register data from object storage.
Amazon S3 Bucket Role Mapping Support
See Amazon S3 Assume Secondary Role Support to learn how to assume secondary roles to read Amazon S3 data, with different roles for different buckets.
Apache Hudi Table (Preview Feature)
Okera provides preview support for Apache Hudi tables.
You can create Apache Hudi tables using the CREATE EXTERNAL TABLE DDL. For example:
CREATE EXTERNAL TABLE mydb.my_hudi_tbl
LIKE PARQUET 's3://path/to/dataset/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet'
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset';
This support has the following limitations:
- Tables must be explicitly registered, because Okera connection crawling does not properly identify Apache Hudi tables.
- Okera supports only snapshot queries on copy-on-write tables and read-optimized queries on merge-on-read tables.
Databricks Delta Lake Table Support (Preview Feature)
Okera supports Delta Lake tables using one of two methods:
- Native support (preview feature)
- Manifest-required support
Okera recommends using its native support for Delta Lake tables.
Native Support
This is the recommended method of integrating Okera with Delta Lake tables.
To select native support for a specific Delta Lake table or database, specify okera.delta.native-support=true as a table or database property.
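For a table or database that is already registered, the property could be applied with statements along the following lines. This is a minimal sketch: it assumes that Hive-style ALTER TABLE ... SET TBLPROPERTIES and ALTER DATABASE ... SET DBPROPERTIES statements are accepted by your Okera version, and it reuses the mydb and my_delta_tbl names purely for illustration.

```sql
-- Enable native Delta Lake support for one existing table.
-- Assumption: ALTER TABLE ... SET TBLPROPERTIES is accepted for this table.
ALTER TABLE mydb.my_delta_tbl
SET TBLPROPERTIES ('okera.delta.native-support'='true');

-- Enable native Delta Lake support for all Delta Lake tables in a database.
-- Assumption: ALTER DATABASE ... SET DBPROPERTIES is available.
ALTER DATABASE mydb
SET DBPROPERTIES ('okera.delta.native-support'='true');
```

Setting the property at the table level keeps the choice explicit for that table regardless of the database or cluster setting.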
To select native support for the entire Okera cluster, add the DELTA_TABLE_NATIVE_SUPPORT configuration parameter to the Okera configuration file and set it to true. The default for this parameter is currently false (use the original manifest-required support), but will change to true in a future release.
You can create Delta Lake tables using the CREATE EXTERNAL TABLE DDL. For example:
CREATE EXTERNAL TABLE mydb.my_delta_tbl (i INT)
STORED AS PARQUET
LOCATION 's3a://path/to/dataset'
TBLPROPERTIES(
'spark.sql.sources.provider'='DELTA',
'okera.delta.native-support'='true'
);
Note: Okera currently supports querying only the latest snapshot of a Delta Lake table.
Manifest-Required Support
Okera recommends using its native support for Delta Lake tables, not its manifest-required support.
However, if you want to continue to use Okera's manifest-required support for Delta Lake tables, specify okera.delta.native-support=false as a table or database property.
To select manifest-required support for the entire Okera cluster, add the DELTA_TABLE_NATIVE_SUPPORT configuration parameter to the Okera configuration file and set it to false. The default for this parameter is currently false (use the original manifest-required support), but will change to true (use native support) in a future release.
Important
Okera recommends that you explicitly set okera.delta.native-support=false for the Delta Lake tables and databases for which you want to continue using Okera's manifest-required support. Although false is currently the cluster default, the cluster default will change to true (use native support) in a future release.
With manifest-required support, Delta Lake tables are supported as symlink tables, so you must define them as symlink tables and ensure that the manifest for them is generated properly. Tables must be explicitly registered, as Okera connection crawling does not properly identify Delta Lake tables.
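As an illustration of the symlink approach, the sketch below first generates the manifest on the Delta Lake side and then registers a symlink table over it. The GENERATE statement and the SymlinkTextInputFormat table definition follow the pattern that Delta Lake documents for manifest-based readers. Whether Okera expects exactly this form is an assumption, and the table name my_delta_symlink_tbl, the single column, and the path are placeholders.

```sql
-- Step 1 (run in Spark or Databricks, not in Okera):
-- generate the symlink manifest for the Delta Lake table.
GENERATE symlink_format_manifest FOR TABLE delta.`s3a://path/to/dataset`;

-- Step 2 (run in Okera): register a symlink table that reads the manifest.
-- Assumption: the manifest lives in the default _symlink_format_manifest
-- directory under the table location.
CREATE EXTERNAL TABLE mydb.my_delta_symlink_tbl (i INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3a://path/to/dataset/_symlink_format_manifest/'
TBLPROPERTIES(
'okera.delta.native-support'='false'
);
```

The manifest is a point-in-time list of data files, so it typically needs to be regenerated after the Delta Lake table changes for readers to see the new data.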