Complex Data Types

This document describes ODAS’s support for complex types. Complex types typically include:

  • structs
  • arrays
  • maps
  • unions

ODAS currently supports structs, arrays, and maps. There is no plan to support unions (see the exception for Avro below).

File format support

Complex types are only supported when the underlying data files are in Avro format.

Struct

ODAS supports struct types with up to 100 levels of nesting. Struct types are useful to model nested data, such as logs or event data that originated as JSON.

As a running example, let’s use this JSON data.

{
  "uid": 100,
  "user": "alice",
  "address": {
    "city": "san francisco",
    "state": "ca"
  },
  "age": 25
}

CREATE DDL

The syntax for creating the table is compatible with HiveQL, using the STRUCT type and listing the fields inside. For the schema above:

CREATE EXTERNAL TABLE users(
  uid BIGINT,
  user STRING,
  address STRUCT<
    city: STRING,
    state: STRING>,
  age INT
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'

Describing the table should look like:

> describe users;
uid	bigint
user	string
address	struct<
  city:string,
  state:string
>
age	int

Create DDL for Avro

If the schemas are available in Avro, we recommend creating the table using the schema file directly, rather than replicating it in the DDL statement.

For example, in the users example above, if the Avro schema file is available (typically a .avsc file, which is human-readable JSON), you could do:

CREATE EXTERNAL TABLE users
LIKE AVRO 's3://path/to/schemas/users.avsc'
LOCATION 's3://okera/rs-complex-users/'

In this case, ODAS will parse the schema file and automatically populate the expected schema.
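For reference, a users.avsc for the running example might look like the following (a sketch; use the schema file your data was actually written with):

{
  "type": "record",
  "name": "users",
  "fields": [
    {"name": "uid", "type": "long"},
    {"name": "user", "type": "string"},
    {"name": "address", "type": {
      "type": "record",
      "name": "address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"}
      ]
    }},
    {"name": "age", "type": "int"}
  ]
}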

This approach alleviates a few issues:

  • Schemas can be long and, for nested types in particular, error prone to replicate.
  • Avro resolves schemas by name, meaning the engine cannot distinguish a typo from a new field (schema evolution), which can be difficult to track down.
  • Some Avro constructs don’t map to a SQL data model, and our engine will automatically handle those. For example, Avro allows records that have no fields, but it is not possible to create a table with an empty struct. This conversion is handled automatically.
  • If the schema is very big (contains structs that serialize to over 4K characters), specifying it as a file is the only option. If such a schema is explicitly typed in the DDL, the statement fails with an error saying the field is over 4K characters.

CREATE VIEW

A view can be created on subfields of complex type fields. Since view column names may not contain a dot (.), the subfields must be aliased.

Example:

CREATE VIEW user_city AS SELECT user, address.city AS cityname FROM users

If the alias is not specified, as in

CREATE VIEW user_city AS SELECT user, address.city FROM users

ODAS returns an error like: Could not execute ddl., detail:AnalysisException: Invalid column/field name: address.city
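Once created with the alias, the view can be queried like any other table. A minimal sketch using the view above:

SELECT user, cityname FROM user_city

For the sample data, this returns alice with san francisco and bob with seattle.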

Scanning via REST API

The REST API returns JSON, so nested types are fully supported. For example, curling this endpoint returns:

> curl REST_SERVER:PORT/api/scan/users
[
    {
        "uid": 100,
        "user": "alice",
        "address": {
            "city": "san francisco",
            "state": "ca"
        },
        "age": 25
    },
    {
        "uid": 101,
        "user": "bob",
        "address": {
            "city": "seattle",
            "state": "wa"
        },
        "age": 25
    }
]

Note: Correct auth headers are required. See [REST API OAuth](catalog-rest-api#authentication) for details.
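For example, assuming a bearer token is stored in $TOKEN (the exact header depends on your authentication setup; see the link above):

> curl -H "Authorization: Bearer $TOKEN" REST_SERVER:PORT/api/scan/users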

Scanning via SparkSQL

Support for struct types behaves as expected in Spark, integrating with Spark’s struct (Row) type. For example, valid queries on this table include:

  • Create an ephemeral table. Note: in EMR (or any Hive Metastore-integrated Hadoop install), this step can be omitted and queries can be issued directly against db.table.
       spark.sql(s"""
            |CREATE TEMPORARY TABLE users
            |USING com.cerebro.recordservice.spark.DefaultSource
            |OPTIONS (RecordServiceTable 'users')
       """.stripMargin)
    
  • Just return the count:
     spark.sql("SELECT count(*) from users").show()
    
  • Select just the first column:
     spark.sql("SELECT uid from users").show()
    

    The type of this is BIGINT.

  • Select the first column and a field in the struct. This “flattens” the nested type:
     spark.sql("SELECT uid, address.city from users").show()
    
  • Select the first column and the column with the struct:
     spark.sql("SELECT uid, address from users").show()
    
  • Select all the fields from the data:
     val row = spark.sql("SELECT * from users").collect()(0)  // Row for 'alice'
     assert(row.get(0) == 100)
     assert(row.get(1) == "alice")
     assert(row.get(2).asInstanceOf[Row](0) == "san francisco")
     assert(row.get(2).asInstanceOf[Row](1) == "ca")
     assert(row.get(3) == 25)
    

    The top level record consists of 4 fields [BIGINT, STRING, STRUCT, INT].
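To inspect how the struct maps into Spark’s schema, printSchema can be used. The output below is a sketch for this table; exact nullability flags may differ:

     spark.sql("SELECT * from users").printSchema()
     // root
     //  |-- uid: long (nullable = true)
     //  |-- user: string (nullable = true)
     //  |-- address: struct (nullable = true)
     //  |    |-- city: string (nullable = true)
     //  |    |-- state: string (nullable = true)
     //  |-- age: integer (nullable = true)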

Scanning via Hive

The struct type maps to Hive’s struct type. For example, to describe the schema:

   hive> describe users
   OK
   uid                 	bigint
   user                	string
   address             	struct<city:string,state:string>
   age                 	int

Here are a few more examples of scans:

Query

hive> select * from users;

Results

OK
100	alice	{"city":"san francisco","state":"ca"}	25
101	bob	{"city":"seattle","state":"wa"}	25

Query

hive> select uid, `user` from users;

Results

OK
100	alice
101	bob

Query

hive> select uid, address from users;

Results

OK
100	{"city":"san francisco","state":"ca"}
101	{"city":"seattle","state":"wa"}

Query

hive> select uid, address.city from users;

Results

OK
100	san francisco
101	seattle

Query

hive> select * from users where uid = 100;

Results

OK
100	alice	{"city":"san francisco","state":"ca"}	25

Query

hive> select address.city, count(*) from users group by address.city;

Results

OK
seattle	1
san francisco	1

Query

hive> select * from users where address.state = 'wa';

Results

OK
101	bob	{"city":"seattle","state":"wa"}	25

Scanning via Presto

Presto supports struct types, which map to Presto’s row type; queries behave as they do against native Presto tables.

presto> describe recordservice.default.users;

Results

 Column  |               Type               | Extra | Comment
---------+----------------------------------+-------+---------
 uid     | bigint                           |       |
 user    | varchar                          |       |
 address | row(city varchar, state varchar) |       |
 age     | integer                          |       |

Here are some example scans.

presto> select * from recordservice.default.users;

Results

 uid | user  |            address             | age
-----+-------+--------------------------------+-----
 100 | alice | {city=san francisco, state=ca} |  25
 101 | bob   | {city=seattle, state=wa}       |  25
(2 rows)
presto> select uid, user from recordservice.default.users;

Results

 uid | user
-----+-------
 100 | alice
 101 | bob
(2 rows)
presto> select uid, address from recordservice.default.users;

Results

 uid |            address
-----+--------------------------------
 100 | {city=san francisco, state=ca}
 101 | {city=seattle, state=wa}
(2 rows)
presto> select uid, address.city from recordservice.default.users;

Results

 uid |     city
-----+---------------
 100 | san francisco
 101 | seattle
(2 rows)
presto> select * from recordservice.default.users where uid = 100;

Results

 uid | user  |            address             | age
-----+-------+--------------------------------+-----
 100 | alice | {city=san francisco, state=ca} |  25
(1 row)
presto> select address.city, count(*) from recordservice.default.users group by address.city;

Results

     city      | _col1
---------------+-------
 san francisco |     1
 seattle       |     1
(2 rows)
presto> select * from recordservice.default.users where address.state = 'wa';

Results

 uid | user |         address          | age
-----+------+--------------------------+-----
 101 | bob  | {city=seattle, state=wa} |  25
(1 row)

Array

The DDL for creating tables with an array type is identical to standard HiveQL. For example,

CREATE EXTERNAL TABLE user_phone_numbers(
  phone_numbers STRUCT<
    uid: BIGINT,
    numbers: ARRAY<STRING>,
    user: STRING
  >
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'

Describing the table should look like:

> describe user_phone_numbers;
phone_numbers	struct<
  uid:bigint,
  numbers:array<string>,
  user:string
>

Limitations

Only arrays of primitive types are currently supported. It is not possible, for example, to have arrays of structs or arrays of arrays.
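For example, a DDL like the following (a hypothetical table, shown only to illustrate the limitation) would be rejected:

-- Not supported: the array's element type is a struct
CREATE EXTERNAL TABLE contacts(
  entries ARRAY<STRUCT<name: STRING, phone: STRING>>
) STORED AS AVRO
LOCATION 's3://bucket/path/'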

Scanning via REST API

The REST API returns arrays as a JSON list. For example, curling this endpoint returns:

> curl REST_SERVER:PORT/api/scan/user_phone_numbers?records=1
[
    {
        "phone_numbers": {
            "uid": 123,
            "numbers": [
                "111-222-3333",
                "444-555-6666"
            ],
            "user": "john"
        }
    }
]

Scanning via SparkSQL

Support for array types behaves as expected in Spark, integrating with Spark’s array type.

Selecting all fields, and selecting just the array field, are shown below.

spark.sql(s"""
    |CREATE TEMPORARY VIEW user_phone_numbers
    |USING com.cerebro.recordservice.spark.DefaultSource
    |OPTIONS (RecordServiceTable 'rs_complex.user_phone_numbers')
""".stripMargin)
spark.sql("SELECT * FROM user_phone_numbers").show()
+--------------------+
|       phone_numbers|
+--------------------+
|[123,WrappedArray...|
|[234,WrappedArray...|
|[345,WrappedArray...|
+--------------------+
scala> spark.sql("select phone_numbers.numbers from user_phone_numbers").show()
+--------------------+
|             numbers|
+--------------------+
|[111-222-3333, 44...|
|[222-333-4444, 55...|
|      [111-222-5555]|
+--------------------+
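Standard Spark SQL array operations, such as indexing, also work on the scanned data. A sketch with the sample data (the indexing is evaluated by Spark after the scan):

scala> spark.sql("SELECT phone_numbers.user, phone_numbers.numbers[0] AS first_number FROM user_phone_numbers").show()
+----+------------+
|user|first_number|
+----+------------+
|john|111-222-3333|
| jim|222-333-4444|
|jack|111-222-5555|
+----+------------+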

Scanning via Hive

Support for array types integrates with Hive’s array type.

Selecting all fields, and selecting just the array field, are shown below.

hive> SELECT * FROM user_phone_numbers;
{"uid":123,"numbers":["111-222-3333","444-555-6666"],"user":"john"}
{"uid":234,"numbers":["222-333-4444","555-666-7777"],"user":"jim"}
{"uid":345,"numbers":["111-222-5555"],"user":"jack"}
hive> select phone_numbers.numbers from user_phone_numbers;
["111-222-3333","444-555-6666"]
["222-333-4444","555-666-7777"]
["111-222-5555"]

Scanning via Presto

Support for array types integrates with Presto’s array type.

Selecting all fields, and selecting just the array field:

presto> select * from recordservice.rs_complex.user_phone_numbers;
                       phone_numbers
------------------------------------------------------------
 {uid=123, numbers=[111-222-3333, 444-555-6666], user=john}
 {uid=234, numbers=[222-333-4444, 555-666-7777], user=jim}
 {uid=345, numbers=[111-222-5555], user=jack}
presto> select phone_numbers.numbers from recordservice.rs_complex.user_phone_numbers;
           numbers
------------------------------
 [111-222-3333, 444-555-6666]
 [222-333-4444, 555-666-7777]
 [111-222-5555]
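Presto’s array functions, such as cardinality(), also apply. A sketch with the sample data:

presto> select cardinality(phone_numbers.numbers) from recordservice.rs_complex.user_phone_numbers;
 _col0
-------
     2
     2
     1
(3 rows)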

Union (Avro)

Avro uses unions to represent nullable types, expressing them as a union of the null type and the actual type. All types in ODAS are nullable, and this specific use of Avro unions is supported. Other uses, such as a union of a string and an int, are not supported.
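For example, a nullable int field in an Avro schema is written as a union with null, and maps to a nullable INT column:

{"name": "age", "type": ["null", "int"]}

A field declared as, say, ["string", "int"] has no SQL equivalent and cannot be scanned.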

Enums (Avro)

Enum is an Avro-specific type that Avro classifies as a complex type. ODAS does not support enum as a catalog type (i.e. it is not possible to create a table and specify a column as ENUM). Instead, create the column with the type STRING and the enum values will be transparently converted. This is identical to Hive’s behavior.
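For example, a field declared in the Avro schema as an enum (names here are illustrative):

{"name": "state", "type": {"type": "enum", "name": "State", "symbols": ["ca", "wa"]}}

would be declared as state STRING in the CREATE TABLE statement, and the enum symbols are returned as their string values.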