Complex Data Types¶
This document details the support for complex data types in Okera. The supported complex data types include:
- structs
- arrays
- maps (limited support; see below for details)
- unions (limited support; see below for details)
There is no plan to support unions or enums in Okera, except in the special cases for Avro described below.
File Format Support¶
Complex type support applies to data stored in Avro, Parquet and JSON formats.
Note: Maps are not supported for JSON file formats.
Struct Data Types¶
Okera supports struct data types with up to 100 levels of nesting. Struct types are useful for logs, event data, and other records that use JSON.
Several examples in this document are based on the following schema to illustrate how Okera supports structs.
{
"uid": 100,
"user": "alice",
"address": {
"city": "san francisco",
"state": "ca"
},
"age": 25
}
Using a CREATE SQL Command With Avro-Formatted JSON Data¶
Okera supports HiveQL syntax for table creation. To create a table using our example schema (assuming Avro format):
CREATE EXTERNAL TABLE users(
uid BIGINT,
user STRING,
address STRUCT<
city: STRING,
state: STRING>,
age INT
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'
A description of this table will return:
> describe users;
uid bigint
user string
address struct<
city:string,
state:string
>
age int
Creating a DDL With File-Based Avro Schema¶
If a schema is stored in an .avsc file, the CREATE statement can use that file in lieu of an explicit schema definition. Given a users.avsc file, for example, you could write:
CREATE EXTERNAL TABLE users
LIKE AVRO 's3://path/to/schemas/users.avsc'
LOCATION 's3://okera/rs-complex-users/'
This approach offers several advantages:
- It is less error-prone than typing or a cut-and-paste operation.
- It is the only option for large, detailed schemas, in particular any schema that contains a struct whose serialized values exceed 4kb. Any attempt to paste such a schema into a CREATE statement would fail and report that the field is too long.
- If a field name is typed incorrectly in HiveQL, the engine can only infer that a new field is being declared, resulting in a schema change instead of an error. Reading from a schema file avoids this potential confusion.
- Some valid Avro artifacts, such as an empty struct, have no analog in HiveQL. The Okera engine can, however, handle these cases transparently when they are read from an Avro schema file.
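For illustration, a users.avsc matching the example record might look like the following. The record and namespace names here are assumptions, not a schema shipped with Okera:

```json
{
  "type": "record",
  "name": "users",
  "fields": [
    {"name": "uid", "type": "long"},
    {"name": "user", "type": "string"},
    {"name": "address", "type": {
      "type": "record",
      "name": "address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"}
      ]
    }},
    {"name": "age", "type": "int"}
  ]
}
```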
Using a Nonstandard Avro Schema File¶
Okera assumes the given Avro path will point to a well-formed Avro JSON schema. The Okera engine can also read an Avro schema that has been embedded in a JSON object.
Say there is an avro-json.json file that includes the following elements:
{
  "owner" : "user",
  "custom" : {
    "key" : "value",
    "path" : [avro schema as escaped json]
  }
}
You can write a CREATE TABLE statement to read the avro-json.json file and extract the Avro schema by setting the avro.schema.field.path table property as follows:
CREATE EXTERNAL TABLE example
LIKE AVRO 's3://data/schemas/avro-json.json'
STORED AS AVRO LOCATION 's3://data/rs-complex-users/'
TBLPROPERTIES('avro.schema.field.path'='custom.path')
Okera locates the schema using the JSON identifier custom.path.
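The dotted-path lookup can be sketched in Python. This is only an illustration of how a path such as custom.path selects the embedded schema from the wrapping JSON object; it is not Okera's implementation:

```python
import json

def extract_by_field_path(doc, field_path):
    """Walk a dotted field path (e.g. 'custom.path') through a parsed JSON object."""
    node = doc
    for key in field_path.split("."):
        node = node[key]
    return node

# Wrapping object analogous to avro-json.json; here the embedded schema
# is stored as an escaped JSON string, as in the example above.
wrapper = {
    "owner": "user",
    "custom": {
        "key": "value",
        "path": json.dumps({"type": "record", "name": "users", "fields": []}),
    },
}

schema = json.loads(extract_by_field_path(wrapper, "custom.path"))
print(schema["name"])  # -> users
```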
Using CREATE VIEW With Complex Type Subfields¶
You can create views using just a subfield of a complex type. Because column names in a view cannot contain a dot (.), you must give such columns an alias:
CREATE VIEW user_city AS SELECT user, address.city AS cityname FROM users
Omitting the alias results in an error:
CREATE VIEW user_city AS SELECT user, address.city FROM users
Could not execute ddl., detail:AnalysisException: Invalid column/field name: address.city
Scanning Data With the Okera REST API¶
The REST API supports JSON by default. Invoking the curl utility against the users table endpoint returns:
> curl REST_SERVER:PORT/api/scan/users
[
{
"uid": 100,
"user": "alice",
"address": {
"city": "san francisco",
"state": "ca"
},
"age": 25
},
{
"uid": 101,
"user": "bob",
"address": {
"city": "seattle",
"state": "wa"
},
"age": 25
}
]
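Because the REST API returns plain JSON, client code can traverse the nested struct directly. A minimal Python sketch using the response shown above (the records are inlined here rather than fetched over HTTP):

```python
import json

# Response body as returned by the /api/scan/users endpoint above.
response_body = """
[
  {"uid": 100, "user": "alice",
   "address": {"city": "san francisco", "state": "ca"}, "age": 25},
  {"uid": 101, "user": "bob",
   "address": {"city": "seattle", "state": "wa"}, "age": 25}
]
"""

records = json.loads(response_body)
# Struct columns arrive as nested JSON objects, so subfield access is
# ordinary dictionary access.
cities = [r["address"]["city"] for r in records]
print(cities)  # -> ['san francisco', 'seattle']
```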
Note: This example omits the authentication headers for brevity. See REST API OAuth for details.
Scanning a Table via SparkSQL¶
Okera support for struct types aligns with Spark's record type. Some representative queries on the users table include:
- Creating an ephemeral table:

  spark.sql(s"""
    |CREATE TEMPORARY TABLE users
    |USING com.cerebro.recordservice.spark.DefaultSource
    |OPTIONS (RecordServiceTable 'users')
    """.stripMargin)

  Note: In Amazon EMR (or any ODAP configuration that uses an external Hive Metastore), you can omit this step and just write queries against db.table.

- Returning a record count:

  spark.sql("SELECT count(*) from users").show()

- Selecting the first column (recall that the uid field is of type BIGINT):

  spark.sql("SELECT uid from users").show()

- Selecting the first column and a subfield of the struct. This request flattens the nested type:

  spark.sql("SELECT uid, address.city from users").show()

- Selecting the first column and the column containing a struct:

  spark.sql("SELECT uid, address from users").show()

- Selecting all fields from the table:

  val row = spark.sql("SELECT * from users").collect()(0) // Row for 'alice'
  assert(row.get(0) == 100)
  assert(row.get(1) == "alice")
  assert(row.get(2).asInstanceOf[Row](0) == "san francisco")
  assert(row.get(2).asInstanceOf[Row](1) == "ca")
  assert(row.get(3) == 25)

The top-level record consists of four fields: [BIGINT, STRING, STRUCT, INT].
Scanning via Apache Hive¶
The Avro struct type maps to Hive's struct type.
For example, describing the users
schema:
hive> describe users;
OK
uid bigint
user string
address struct<city:string,state:string>
age int
A few more examples of Hive DDL-based scans:
Query
hive> select * from users;
OK
100 alice {"city":"san francisco","state":"ca"} 25
101 bob {"city":"seattle","state":"wa"} 25
Query
hive> select uid, user from users;
OK
100 alice
101 bob
Query
hive> select uid, address from users;
OK
100 {"city":"san francisco","state":"ca"}
101 {"city":"seattle","state":"wa"}
Query
hive> select uid, address.city from users;
OK
100 san francisco
101 seattle
Query
hive> select * from users where uid = 100;
OK
100 alice {"city":"san francisco","state":"ca"} 25
Query
hive> select address.city, count(*) from users group by address.city;
OK
seattle 1
san francisco 1
Query
hive> select * from users where address.state = 'wa';
OK
101 bob {"city":"seattle","state":"wa"} 25
Scanning Table Data With Presto¶
Presto supports struct types. The examples below reflect the expected Presto experience.
Query
presto> describe recordservice.default.users;
Column | Type | Extra | Comment
---------+----------------------------------+-------+---------
uid | bigint | |
user | varchar | |
address | row(city varchar, state varchar) | |
age | integer | |
Some additional examples:
Query
presto> select * from recordservice.default.users;
uid | user | address | age
-----+-------+--------------------------------+-----
100 | alice | {city=san francisco, state=ca} | 25
101 | bob | {city=seattle, state=wa} | 25
(2 rows)
Query
presto> select uid, user from recordservice.default.users;
uid | user
-----+-------
100 | alice
101 | bob
(2 rows)
Query
presto> select uid, address from recordservice.default.users;
uid | address
-----+--------------------------------
100 | {city=san francisco, state=ca}
101 | {city=seattle, state=wa}
(2 rows)
Query
presto> select uid, address.city from recordservice.default.users;
uid | city
-----+---------------
100 | san francisco
101 | seattle
(2 rows)
Query
presto> select * from recordservice.default.users where uid = 100;
uid | user | address | age
-----+-------+--------------------------------+-----
100 | alice | {city=san francisco, state=ca} | 25
Query
presto> select address.city, count(*) from recordservice.default.users group by address.city;
city | _col1
---------------+-------
san francisco | 1
seattle | 1
Query
presto> select * from recordservice.default.users where address.state = 'wa';
uid | user | address | age
-----+------+--------------------------+-----
101 | bob | {city=seattle, state=wa} | 25
(1 row)
Array Data Types¶
Note: PostgreSQL and Oracle array and array(array) data types are not supported in Okera.
The DDL for creating tables that include array types is standard HiveQL:
CREATE EXTERNAL TABLE user_phone_numbers(
phone_numbers struct<
uid: BIGINT,
numbers: ARRAY<STRING>,
user: STRING
>
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'
The table description should appear as follows:
> describe user_phone_numbers;
uid bigint
numbers array<string>
user string
Scanning With the REST API¶
The REST API returns arrays as a JSON ordered list. A curl on the endpoint for the user_phone_numbers table returns:
> curl REST_SERVER:PORT/api/scan/user_phone_numbers?records=1
[
{
"phone_numbers": {
"uid": 123,
"numbers": [
"111-222-3333",
"444-555-6666"
],
"user": "john"
}
}
]
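As with structs, client code can consume the array column directly from the JSON response. A short Python sketch using the response shown above (inlined rather than fetched over HTTP):

```python
import json

# Response body as returned by the /api/scan/user_phone_numbers endpoint above.
response_body = """
[
  {"phone_numbers": {"uid": 123,
    "numbers": ["111-222-3333", "444-555-6666"], "user": "john"}}
]
"""

record = json.loads(response_body)[0]
# An array column arrives as a JSON ordered list, i.e. a Python list.
numbers = record["phone_numbers"]["numbers"]
print(numbers[0], len(numbers))  # -> 111-222-3333 2
```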
Scanning With SparkSQL¶
Okera support for array types aligns with Spark's array type. To create an ephemeral view:
spark.sql(s"""
|CREATE TEMPORARY VIEW user_phone_numbers
|USING com.cerebro.recordservice.spark.DefaultSource
|OPTIONS (RecordServiceTable 'rs_complex.user_phone_numbers')
""".stripMargin)
A SELECT on all fields, including the array type, returns:
spark.sql("SELECT * FROM user_phone_numbers").show()
+--------------------+
| phone_numbers|
+--------------------+
|[123,WrappedArray...|
|[234,WrappedArray...|
|[345,WrappedArray...|
+--------------------+
A SELECT on just the array subfield returns:
spark.sql("select phone_numbers.numbers from user_phone_numbers").show()
+--------------------+
| numbers|
+--------------------+
|[111-222-3333, 44...|
|[222-333-4444, 55...|
| [111-222-5555]|
+--------------------+
Scanning Table Data With Apache Hive¶
Support for array types also aligns with Hive's array type.
A SELECT on all fields including an array type, and one on the array type alone, are shown below:
hive> SELECT * FROM user_phone_numbers
{"uid":123,"numbers":["111-222-3333","444-555-6666"],"user":"john"}
{"uid":234,"numbers":["222-333-4444","555-666-7777"],"user":"jim"}
{"uid":345,"numbers":["111-222-5555"],"user":"jack"}
hive> select phone_numbers.numbers from user_phone_numbers
["111-222-3333","444-555-6666"]
["222-333-4444","555-666-7777"]
["111-222-5555"]
Scanning Table Data With Presto¶
Support for array types also aligns with Presto's array type.
A SELECT on all the fields including an array type, and one on the array type by itself, are shown below:
presto> select * from recordservice.rs_complex.user_phone_numbers;
phone_numbers
------------------------------------------------------------
{uid=123, numbers=[111-222-3333, 444-555-6666], user=john}
{uid=234, numbers=[222-333-4444, 555-666-7777], user=jim}
{uid=345, numbers=[111-222-5555], user=jack}
presto> select phone_numbers.numbers from recordservice.rs_complex.user_phone_numbers;
numbers
------------------------------
[111-222-3333, 444-555-6666]
[222-333-4444, 555-666-7777]
[111-222-5555]
JSON file format¶
JSON files are supported by Okera starting with release 1.4. JSON files can be specified with STORED AS JSON as part of the CREATE TABLE statement. For example:
CREATE EXTERNAL TABLE user_phone_numbers(
phone_numbers struct<
uid: BIGINT,
numbers: ARRAY<STRING>,
user: STRING
>
)
STORED AS JSON
LOCATION 's3://okera/rs-complex-users-json/'
The returned records are consistent with the results from the other file formats.
Map Data Types¶
Scope of Support¶
Maps are supported for the Avro and Parquet file formats. As noted above, maps are not supported for the JSON file format.
Note: PostgreSQL and Oracle map data types are not supported by Okera.
Using Map Types¶
The DDL for creating tables with a map type corresponds to standard HiveQL:
CREATE EXTERNAL TABLE user_phone_numbers_map(
uid BIGINT,
username STRING,
contact_numbers MAP<STRING, STRING>
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-map-userphones'
A description of this table should return:
> describe user_phone_numbers_map;
uid bigint
username string
contact_numbers map<string,string>
Scanning With the REST API¶
The REST API returns map types as a JSON object in the form of name-value pairs. A curl on the endpoint for the user_phone_numbers_map table returns (the URL is quoted because it contains an ampersand):
> curl "REST_SERVER:PORT/api/scan/user_phone_numbers_map?records=1&format=1"
[
{
"uid": 1,
"username": "Alex",
"contact_numbers": {
"work": "444-555-6666",
"home": "111-222-3333"
}
}
]
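A map column can likewise be consumed directly from the JSON response. A short Python sketch using the response shown above (inlined rather than fetched over HTTP):

```python
import json

# Response body as returned by the /api/scan/user_phone_numbers_map endpoint above.
response_body = """
[
  {"uid": 1, "username": "Alex",
   "contact_numbers": {"work": "444-555-6666", "home": "111-222-3333"}}
]
"""

record = json.loads(response_body)[0]
# A map column arrives as a JSON object of name-value pairs, i.e. a
# Python dict keyed by the map keys.
print(record["contact_numbers"]["home"])  # -> 111-222-3333
```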
Scanning Table Data via SparkSQL¶
Okera support for map types aligns with Spark's map type. A SELECT on all fields including a map type returns:
scala> spark.sql("select * from rs_complex.user_phone_numbers_map").show()
+---+--------+--------------------+
|uid|username| contact_numbers|
+---+--------+--------------------+
| 1| Alex|[work -> 444-555-...|
| 2| Melinda|[work -> 444-555-...|
+---+--------+--------------------+
Scanning Table Data With Apache Hive¶
Support for map types also aligns with Hive's map type. A SELECT on all fields including a map type returns:
hive> SELECT * FROM rs_complex.user_phone_numbers_map;
OK
1 Alex {"work":"444-555-6666","home":"111-222-3333"}
2 Melinda {"work":"444-555-8888","home":"111-222-7777"}
Limited Support for Unions in Avro¶
The Avro specification uses unions only to incorporate the NULL type into all other types, thus making all Avro types nullable. All types in Okera are also nullable. In this context, the specific case for Avro unions is supported. Any other use case, such as accepting a union between a string and int type, is not supported.
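For reference, the supported shape is the standard Avro nullable-field idiom, a two-branch union with null. A field declared this way maps to an ordinary nullable STRING column:

```json
{"name": "user", "type": ["null", "string"], "default": null}
```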
Support for the enum Type in Avro¶
The Avro specification defines enums as a complex type; Okera does not. A table or view in Okera cannot define a column with the ENUM type.
The Okera engine will, however, convert any Avro field of enum type to a string if the corresponding table column is declared as STRING. This behavior matches what Hive does under the same conditions.
Configuring Parquet File Resolution Types¶
The table property parquet.resolve-by.type can be used to configure how a Parquet data file is resolved against the table schema. Valid values are ordinal (positional resolution) and name (name resolution).
For example:
ALTER TABLE nation SET TBLPROPERTIES('parquet.resolve-by.type'='name')
ALTER TABLE nation SET TBLPROPERTIES('parquet.resolve-by.type'='ordinal')
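The practical difference between the two modes can be sketched in Python. This is an illustration of the concept with a hypothetical helper, not Okera code: with ordinal resolution, table columns are paired with file columns by position, so a file written in a different column order maps data to the wrong columns; with name resolution, columns are matched by name regardless of position.

```python
def resolve_columns(file_columns, table_columns, mode):
    """Illustrative resolution of Parquet file columns against table columns."""
    if mode == "ordinal":
        # Pair table columns with file columns by position.
        return list(zip(table_columns, file_columns))
    if mode == "name":
        # Pair columns by name, regardless of their position in the file.
        return [(c, c) for c in table_columns if c in file_columns]
    raise ValueError("unknown mode: " + mode)

# A file written with its columns in a different order than the table declares.
file_cols = ["n_name", "n_nationkey", "n_comment"]
table_cols = ["n_nationkey", "n_name", "n_comment"]

# Ordinal resolution mismatches the first two columns; name resolution does not.
print(resolve_columns(file_cols, table_cols, "ordinal"))
print(resolve_columns(file_cols, table_cols, "name"))
```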