Object Storage Connections

Okera supports the following cloud object storage types.

  • AWS S3
  • Azure: ADLS Gen2
  • Google Cloud Storage
  • HDFS

Note: Make sure Okera already has read access to your files, as configured in your cluster. You cannot create a crawler on a path for which Okera does not have read access.

Supported File Formats

Okera supports registering data in these file formats:

  • Apache Avro
  • CSV (comma-separated values)
  • Delta Lake
  • Apache Hudi
  • JSON
  • Apache Parquet
  • ORC (Optimized Row Columnar)
  • TEXT

Register Data From Object Storage

See Create and Run a Crawler and Register Datasets to learn how to register data from object storage.

Amazon S3 Bucket Role Mapping Support

See Amazon S3 Assume Secondary Role Support to learn how to assume secondary roles to read S3 data, with different roles for different buckets.

Apache Hudi Table (Preview Feature)

Okera provides preview support for Apache Hudi tables.

You can create Apache Hudi tables using the CREATE EXTERNAL TABLE DDL. For example:

CREATE EXTERNAL TABLE mydb.my_hudi_tbl
LIKE PARQUET 's3://path/to/dataset/year=2015/month=03/day=16/5f541af5-ca07-4329-ad8c-40fa9b353f35-0_2-103-391_20200210090618.parquet'
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://path/to/dataset';

This support has the following limitations:

  • Tables must be explicitly registered, as Okera connection crawling does not properly identify Apache Hudi tables.

  • Okera supports only snapshot queries on copy-on-write tables and read-optimized queries on merge-on-read tables.

Databricks Delta Lake Table Support (Preview Feature)

Okera supports Delta Lake tables using one of two methods:

  • Native support (preview feature)
  • Manifest-required support

Okera recommends using its native support for Delta Lake tables.

Native Support

This is the recommended method of integrating Okera with Delta Lake tables.

To select native support for a specific Delta Lake table or database, specify the okera.delta.native-support=true property as a table or database property.
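As a minimal sketch, the property could be set on an existing table or database as follows. This assumes Okera accepts Hive-style ALTER TABLE ... SET TBLPROPERTIES and ALTER DATABASE ... SET DBPROPERTIES statements for this property. The names mydb and my_delta_tbl are placeholders.

```sql
-- Assumption: Hive-style property DDL is accepted for this setting.
-- mydb and my_delta_tbl are placeholder names.

-- Enable native Delta Lake support for a single table.
ALTER TABLE mydb.my_delta_tbl
SET TBLPROPERTIES ('okera.delta.native-support'='true');

-- Enable native Delta Lake support for a whole database.
ALTER DATABASE mydb
SET DBPROPERTIES ('okera.delta.native-support'='true');
```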

To select native support for the entire Okera cluster, add the DELTA_TABLE_NATIVE_SUPPORT configuration parameter to the Okera configuration file and set it to true. The parameter currently defaults to false (use the original manifest-required support), but the default will change to true in a future release.

You can create Delta Lake tables using the CREATE EXTERNAL TABLE DDL. For example:

CREATE EXTERNAL TABLE mydb.my_delta_tbl (i INT)
STORED AS PARQUET
LOCATION 's3a://path/to/dataset'
TBLPROPERTIES(
  'spark.sql.sources.provider'='DELTA',
  'okera.delta.native-support'='true'
);

Note: Okera currently supports querying only the latest snapshot of a Delta Lake table.

Manifest-Required Support

Okera recommends using its native support for Delta Lake tables, not its manifest-required support.

However, if you want to continue to use Okera's manifest-required support for Delta Lake tables, specify okera.delta.native-support=false as a table or database property.

To select manifest-required support for the entire Okera cluster, add the DELTA_TABLE_NATIVE_SUPPORT configuration parameter to the Okera configuration file and set it to false. The parameter currently defaults to false (use the original manifest-required support), but the default will change to true (use native support) in a future release.

Important

Okera recommends that you explicitly set okera.delta.native-support=false for the Delta Lake tables and databases for which you want to continue using Okera's manifest-required support. Although false is currently the cluster default, the default will change to true (use native support) in a future release.

With manifest-required support, Delta Lake tables are supported as symlink tables, so you must define them as symlink tables and ensure that the manifest for them is generated properly. Tables must be explicitly registered, as Okera connection crawling does not properly identify Delta Lake tables.
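The following sketch shows one way such a symlink table might be defined. It follows the pattern Delta Lake documents for Presto and Athena integration (a manifest generated in Spark, then a table that uses SymlinkTextInputFormat and points at the _symlink_format_manifest directory). Whether Okera expects exactly this DDL is an assumption to verify for your release. The table name, column, and paths are placeholders.

```sql
-- Step 1 (run in Spark with Delta Lake, not in Okera): generate the manifest.
-- GENERATE symlink_format_manifest FOR TABLE delta.`s3a://path/to/dataset`;

-- Step 2: register the table as a symlink table over the manifest directory.
-- Assumption: placeholder table name, column, and path; adjust to your dataset.
CREATE EXTERNAL TABLE mydb.my_delta_symlink_tbl (i INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3a://path/to/dataset/_symlink_format_manifest/'
TBLPROPERTIES (
  -- Pin this table to manifest-required support regardless of the cluster default.
  'okera.delta.native-support'='false'
);
```

The manifest reflects the table only as of the time it was generated, so it needs to be regenerated after the Delta Lake table changes unless automatic manifest updates are enabled on the Delta side.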