Complex Data Types

This document details the support for complex types in ODAS.

The supported types include structs, arrays, and maps.

There is no plan to support unions or enums in ODAS, except in special cases for Avro. See the details below.

File format support

Complex type support applies to data stored in Avro, Parquet and JSON formats.

Note: Maps are not supported for the JSON file format.

Structs

ODAS supports struct types with up to 100 levels of nesting. Struct types are useful for logs, event data, and other records that use JSON.

Several examples below are based on the following schema to illustrate how ODAS supports structs.

{
  "uid": 100,
  "user": "alice",
  "address": {
    "city": "san francisco",
    "state": "ca"
  },
  "age": 25
}

Using a CREATE DDL statement with Avro-formatted JSON data

ODAS supports the HiveQL DDL for table creation. To create a table using our example schema (assuming Avro format):

CREATE EXTERNAL TABLE users(
  uid BIGINT,
  user STRING,
  address STRUCT<
    city: STRING,
    state: STRING>,
  age INT
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'

A description of this table will return:

> describe users;
uid	bigint
user	string
address	struct<
  city:string,
  state:string
>
age	int
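
Structs can also contain other structs. The following is a minimal sketch of a table with two levels of nesting; the table, field, and location names are illustrative and not part of the sample dataset:

CREATE EXTERNAL TABLE events(
  id BIGINT,
  payload STRUCT<
    source: STRING,
    location: STRUCT<     -- struct nested inside a struct
      city: STRING,
      state: STRING>>
) STORED AS AVRO
LOCATION 's3://path/to/events/'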

Creating a table with a file-based Avro schema

If a schema is stored in an .avsc file, the CREATE statement can use that file in lieu of an explicit schema definition. Given a users.avsc file, for example, you could write:

CREATE EXTERNAL TABLE users
LIKE AVRO 's3://path/to/schemas/users.avsc'
LOCATION 's3://okera/rs-complex-users/'

This approach offers several advantages:

  • It is less error-prone than typing or a cut-and-paste operation.
  • It is the only option for large, detailed schemas, in particular any schema that contains a struct whose serialized values exceed 4 KB. Any attempt to paste such a schema into a CREATE statement would fail, reporting that the field is too long.
  • If a field name is typed incorrectly in HiveQL, the engine can only infer a new field is being declared, resulting in a schema change instead of an error. Reading from a schema file will avoid this potential confusion.
  • Some valid Avro artifacts, such as an empty struct, have no analog in HiveQL. The ODAS engine can however handle these cases transparently when they are read from an Avro schema file.

Using a non-standard Avro schema file

ODAS assumes the given Avro path will point to a well-formed Avro JSON schema. The ODAS engine can also read an Avro schema that has been embedded in a JSON object.

Say there is an avro-json.json file that includes the following elements:

{
    "owner" : "user",
    "custom" : {
        "key" : "value",
        "path" : [avro schema as escaped json]
    }
}

You can write a CREATE TABLE statement to read the avro-json.json file and extract the Avro schema by setting the avro.schema.field.path table property as follows:

CREATE EXTERNAL TABLE example
LIKE AVRO 's3://data/schemas/avro-json.json'
STORED AS AVRO LOCATION 's3://data/rs-complex-users/'
TBLPROPERTIES('avro.schema.field.path'='custom.path')

ODAS will locate the schema using the JSON identifier custom.path.

Using CREATE VIEW with complex type subfields

You can create views using just a subfield of a complex type. Column names in a view cannot contain a dot (.), so it’s necessary to give them an alias:

CREATE VIEW user_city AS SELECT user, address.city AS cityname FROM users

Omitting the alias will throw an error:

CREATE VIEW user_city AS Select user, address.city from users

Could not execute ddl., detail:AnalysisException: Invalid column/field name: address.city
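
Once created, the view can be queried like any other table. The following is a sketch of the expected result, assuming the two sample records (alice and bob) shown in the REST example below:

> select user, cityname from user_city;
alice	san francisco
bob	seattle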

Scanning data with the ODAS REST API

The REST API supports JSON by default. Invoking the curl utility against the users table endpoint returns:

> curl REST_SERVER:PORT/api/scan/users
[
    {
        "uid": 100,
        "user": "alice",
        "address": {
            "city": "san francisco",
            "state": "ca"
        },
        "age": 25
    },
    {
        "uid": 101,
        "user": "bob",
        "address": {
            "city": "seattle",
            "state": "wa"
        },
        "age": 25
    }
]

Note: This example omits the authentication headers for brevity. See [REST API OAuth](catalog-rest-api#authentication) for details.

Scanning a table via SparkSQL

ODAS support for struct types aligns with Spark’s record type. Some representative queries on the users table include:

  • Creating an ephemeral table:
       spark.sql(s"""
            |CREATE TEMPORARY TABLE users
            |USING com.cerebro.recordservice.spark.DefaultSource
            |OPTIONS (RecordServiceTable 'users')
       """.stripMargin)
    

    Note: In EMR (or any ODAS configuration that uses an external Hive Metastore), you can omit this step and write queries against db.table directly.

  • Returning a record count:
     spark.sql("SELECT count(*) from users").show()
    
  • Selecting the first column:
     spark.sql("SELECT uid from users").show()
    

    Recall the type of the uid field is BIGINT.

  • Selecting the first column and a subfield of the struct. This request flattens the nested type:
     spark.sql("SELECT uid, address.city from users").show()
    
  • Selecting the first column and the column containing a struct:
     spark.sql("SELECT uid, address from users").show()
    
  • Selecting all fields from the table:
     val row = spark.sql("SELECT * from users").collect()(0)  // Row for 'alice'
     assert(row.get(0) == 100)
     assert(row.get(1) == "alice")
     assert(row.get(2).asInstanceOf[Row](0) == "san francisco")
     assert(row.get(2).asInstanceOf[Row](1) == "ca")
     assert(row.get(3) == 25)
    

    The top-level record consists of four fields [BIGINT, STRING, STRUCT, INT].

Scanning via Apache Hive

The Avro struct type maps to Hive’s struct type. For example, describing the users schema:

   hive> describe users;
   OK
   uid                 	bigint
   user                	string
   address             	struct<city:string,state:string>
   age                 	int

A few more examples of Hive DDL-based scans:

Query

hive> select * from users;

Results

OK
100	alice	{"city":"san francisco","state":"ca"}	25
101	bob	{"city":"seattle","state":"wa"}	25

Query

hive> select uid, user from users;

Results

OK
100	alice
101	bob

Query

hive> select uid, address from users;

Results

OK
100	{"city":"san francisco","state":"ca"}
101	{"city":"seattle","state":"wa"}

Query

hive> select uid, address.city from users;

Results

OK
100	san francisco
101	seattle

Query

hive> select * from users where uid = 100;

Results

OK
100	alice	{"city":"san francisco","state":"ca"}	25

Query

hive> select address.city, count(*) from users group by address.city;

Results

OK
seattle	1
san francisco	1

Query

hive> select * from users where address.state = 'wa';

Results

OK
101	bob	{"city":"seattle","state":"wa"}	25

Scanning table data with Presto

Presto supports struct types. The examples below reflect the expected Presto experience.

presto> describe recordservice.default.users;

Results

 Column  |               Type               | Extra | Comment
---------+----------------------------------+-------+---------
 uid     | bigint                           |       |
 user    | varchar                          |       |
 address | row(city varchar, state varchar) |       |
 age     | integer                          |       |

Some additional examples:

presto> select * from recordservice.default.users;

Results

 uid | user  |            address             | age
-----+-------+--------------------------------+-----
 100 | alice | {city=san francisco, state=ca} |  25
 101 | bob   | {city=seattle, state=wa}       |  25
(2 rows)
presto> select uid, user from recordservice.default.users;

Results

 uid | user
-----+-------
 100 | alice
 101 | bob
(2 rows)
presto> select uid, address from recordservice.default.users;

Results

 uid |            address
-----+--------------------------------
 100 | {city=san francisco, state=ca}
 101 | {city=seattle, state=wa}
(2 rows)
presto> select uid, address.city from recordservice.default.users;

Results

 uid |     city
-----+---------------
 100 | san francisco
 101 | seattle
(2 rows)
presto> select * from recordservice.default.users where uid = 100;

Results

 uid | user  |            address             | age
-----+-------+--------------------------------+-----
 100 | alice | {city=san francisco, state=ca} |  25
presto> select address.city, count(*) from recordservice.default.users group by address.city;

Results

     city      | _col1
---------------+-------
 san francisco |     1
 seattle       |     1
presto> select * from recordservice.default.users where address.state = 'wa';

Results

 uid | user |         address          | age
-----+------+--------------------------+-----
 101 | bob  | {city=seattle, state=wa} |  25
(1 row)

Array

The DDL for creating tables that include array types is standard HiveQL:

CREATE EXTERNAL TABLE user_phone_numbers(
   phone_numbers struct<
    uid: BIGINT,
    numbers: ARRAY<STRING>,
    user: STRING
  >
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'

The table description should appear as follows:

> describe user_phone_numbers;
phone_numbers	struct<
  uid:bigint,
  numbers:array<string>,
  user:string
>

Scanning with the REST API

The REST API returns arrays as a JSON ordered list. A curl on the endpoint for the user_phone_numbers table would return:

> curl REST_SERVER:PORT/api/scan/user_phone_numbers?records=1
[
    {
        "phone_numbers": {
            "uid": 123,
            "numbers": [
                "111-222-3333",
                "444-555-6666"
            ],
            "user": "john"
        }
    }
]

Scanning with SparkSQL

ODAS support for array types aligns with Spark’s array type. To create an ephemeral view:

spark.sql(s"""
    |CREATE TEMPORARY VIEW user_phone_numbers
    |USING com.cerebro.recordservice.spark.DefaultSource
    |OPTIONS (RecordServiceTable 'rs_complex.user_phone_numbers')
""".stripMargin)

A SELECT on all fields, including the array type, will return:

spark.sql("SELECT * FROM user_phone_numbers").show()
+--------------------+
|       phone_numbers|
+--------------------+
|[123,WrappedArray...|
|[234,WrappedArray...|
|[345,WrappedArray...|
+--------------------+
scala> spark.sql("select phone_numbers.numbers from user_phone_numbers").show()
+--------------------+
|             numbers|
+--------------------+
|[111-222-3333, 44...|
|[222-333-4444, 55...|
|      [111-222-5555]|
+--------------------+

Scanning table data with Apache Hive

Support for array types also aligns with Hive’s array type.

A SELECT on all fields, including the array type, and one on the array type alone are shown below:

hive> SELECT * FROM user_phone_numbers;
{"uid":123,"numbers":["111-222-3333","444-555-6666"],"user":"john"}
{"uid":234,"numbers":["222-333-4444","555-666-7777"],"user":"jim"}
{"uid":345,"numbers":["111-222-5555"],"user":"jack"}
hive> select phone_numbers.numbers from user_phone_numbers;
["111-222-3333","444-555-6666"]
["222-333-4444","555-666-7777"]
["111-222-5555"]

Scanning table data with Presto

Support for array types also aligns with Presto’s array type.

A SELECT on all the fields including an array type and the array type by itself are shown below:

presto> select * from recordservice.rs_complex.user_phone_numbers;
                       phone_numbers
------------------------------------------------------------
 {uid=123, numbers=[111-222-3333, 444-555-6666], user=john}
 {uid=234, numbers=[222-333-4444, 555-666-7777], user=jim}
 {uid=345, numbers=[111-222-5555], user=jack}
presto> select phone_numbers.numbers from recordservice.rs_complex.user_phone_numbers;
           numbers
------------------------------
 [111-222-3333, 444-555-6666]
 [222-333-4444, 555-666-7777]
 [111-222-5555]

JSON file format

ODAS supports JSON files starting with the 1.4 release. Specify STORED AS JSON as part of the CREATE TABLE statement. For example:

CREATE EXTERNAL TABLE user_phone_numbers(
   phone_numbers struct<
    uid: BIGINT,
    numbers: ARRAY<STRING>,
    user: STRING
  >
)
STORED AS JSON
LOCATION 's3://okera/rs-complex-users-json/'

Refer to supported-sql for other table creation options.

The returned records are consistent with the results from other file formats.
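
For instance, scanning the JSON-backed table with Hive should return the same records as the Avro-backed example shown earlier. A sketch, assuming the JSON files contain the same sample data:

hive> SELECT * FROM user_phone_numbers;
{"uid":123,"numbers":["111-222-3333","444-555-6666"],"user":"john"}
{"uid":234,"numbers":["222-333-4444","555-666-7777"],"user":"jim"}
{"uid":345,"numbers":["111-222-5555"],"user":"jack"}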

Map

Scope of support

Maps are supported for map<string, primitiveType> implementations. Mapping to a complex type (e.g., map<string, complexType>) is supported for the Parquet file format but not yet for the Avro file format. Additionally, maps are not supported for the JSON file format.
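
As an illustration of the Parquet case, the following is a minimal DDL sketch using a map whose values are a complex type; the table name, fields, and location are hypothetical:

CREATE EXTERNAL TABLE user_addresses_map(
   uid BIGINT,
   addresses MAP<STRING, STRUCT<   -- map with a complex (struct) value type
     city: STRING,
     state: STRING>>
) STORED AS PARQUET
LOCATION 's3://okera/rs-complex-map-addresses/'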

Using Map types

The DDL for creating tables with a map type corresponds to standard HiveQL:

CREATE EXTERNAL TABLE user_phone_numbers_map(
   uid BIGINT,
   username STRING,
   contact_numbers MAP<STRING, STRING>
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-map-userphones'

A description of this table should return:

> describe user_phone_numbers_map;
uid	bigint
username string
contact_numbers map<string,string>

Scanning with the REST API

The REST API returns Map types as a JSON object in the form of name-value pairs. A curl on the endpoint for the user_phone_numbers_map table should return:

> curl "REST_SERVER:PORT/api/scan/user_phone_numbers_map?records=1&format=1"
[
    {
        "uid": 1,
        "username": "Alex",
        "contact_numbers": {
            "work": "444-555-6666",
            "home": "111-222-3333"
        }
    }
]

Scanning table data via SparkSQL

ODAS support for map types aligns with Spark’s map type. A SELECT on all fields, including the map type, returns:

scala> spark.sql("select * from rs_complex.user_phone_numbers_map").show()
+---+--------+--------------------+
|uid|username|     contact_numbers|
+---+--------+--------------------+
|  1|    Alex|[work -> 444-555-...|
|  2| Melinda|[work -> 444-555-...|
+---+--------+--------------------+

Scanning table data with Apache Hive

Support for map types also aligns with Hive’s map type. A SELECT on all fields including a map type will return:

hive> SELECT * FROM rs_complex.user_phone_numbers_map;
OK
1	Alex	{"work":"444-555-6666","home":"111-222-3333"}
2	Melinda	{"work":"444-555-8888","home":"111-222-7777"}

Limited support for unions in Avro

Avro uses a union with the NULL type to make a field nullable, and all types in ODAS are also nullable. In this context, that specific use of Avro unions is supported: a field declared as a union of null and a single other type (for example, a nullable string) is handled as that type. Any other use case, such as a union between a string and an int type, is not supported.

Support for the enum type in Avro

The Avro specification defines enums as a complex type; ODAS does not. A table or view in ODAS cannot define a column with the ENUM type.

The ODAS engine will, however, convert any Avro field of enum type to a string if the corresponding table column is declared as STRING. This behavior matches what Hive does under the same conditions.
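
As an illustration, suppose an Avro schema defines a field named color as an enum. A minimal sketch of the corresponding DDL follows; the table name, fields, and location are hypothetical:

CREATE EXTERNAL TABLE shirts(
  id BIGINT,
  color STRING  -- the underlying Avro field is an enum; ODAS reads it as a string
) STORED AS AVRO
LOCATION 's3://path/to/shirts/'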