Complex Data Types¶
This document details the support for complex data types in Okera. The supported complex data types include:
- structs
- arrays
- maps (limited support; see below for details)
- unions (limited support; see below for details)
There is no plan to support unions or enums in Okera, except in the special cases for Avro described below.
File Format Support¶
Complex type support applies to data stored in Avro, Parquet and JSON formats.
Note: Maps are not supported for JSON file formats.
Struct Data Types¶
Okera supports struct data types with up to 100 levels of nesting. Struct types are useful for logs, event data, and other records that use JSON.
Several examples in this document are based on the following schema to illustrate how Okera supports structs.
{
"uid": 100,
"user": "alice",
"address": {
"city": "san francisco",
"state": "ca"
},
"age": 25
}
Using a CREATE SQL Command With Avro-Formatted JSON Data¶
Okera supports HiveQL syntax for table creation. To create a table using our example schema (assuming Avro format):
CREATE EXTERNAL TABLE users(
uid BIGINT,
user STRING,
address STRUCT<
city: STRING,
state: STRING>,
age INT
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'
A description of this table will return:
> describe users;
uid bigint
user string
address struct<
city:string,
state:string
>
age int
Creating a DDL With File-Based Avro Schema¶
If a schema is stored in an .avsc file, the CREATE statement can use that file in lieu of an explicit schema definition. Given a users.avsc file, for example, you could write:
CREATE EXTERNAL TABLE users
LIKE AVRO 's3://path/to/schemas/users.avsc'
LOCATION 's3://okera/rs-complex-users/'
This approach offers several advantages:
- It is less error-prone than typing or a cut-and-paste operation.
- It is the only option for large, detailed schemas, in particular any schema that contains a struct whose serialized values exceed 4kb. Any attempt to paste such a schema into a CREATE statement would fail and report that the field is too long.
- If a field name is typed incorrectly in HiveQL, the engine can only infer that a new field is being declared, resulting in a schema change instead of an error. Reading from a schema file avoids this potential confusion.
- Some valid Avro artifacts, such as an empty struct, have no analog in HiveQL. The Okera engine can, however, handle these cases transparently when they are read from an Avro schema file.
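For illustration, a users.avsc matching the example record might look like the following. The record and namespace names here are assumptions, not a schema shipped with Okera:

```json
{
  "type": "record",
  "name": "users",
  "fields": [
    {"name": "uid", "type": "long"},
    {"name": "user", "type": "string"},
    {"name": "address", "type": {
      "type": "record",
      "name": "address",
      "fields": [
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"}
      ]
    }},
    {"name": "age", "type": "int"}
  ]
}
```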
Using a Nonstandard Avro Schema File¶
Okera assumes the given Avro path will point to a well-formed Avro JSON schema. The Okera engine can also read an Avro schema that has been embedded in a JSON object.
Say there is an avro-json.json file that includes the following elements:
{
  "owner" : "user",
  "custom" : {
    "key" : "value",
    "path" : [avro schema as escaped json]
  }
}
You can write a CREATE TABLE statement to read the avro-json.json file and extract the Avro schema by setting the avro.schema.field.path table property as follows:
CREATE EXTERNAL TABLE example
LIKE AVRO 's3://data/schemas/avro-json.json'
STORED AS AVRO LOCATION 's3://data/rs-complex-users/'
TBLPROPERTIES('avro.schema.field.path'='custom.path')
Okera locates the schema using the JSON identifier custom.path.
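The dotted-path lookup can be sketched in Python. This is only an illustration of how a path such as custom.path selects the embedded schema from the wrapping JSON object; it is not Okera's implementation:

```python
import json

def extract_by_field_path(doc, field_path):
    """Walk a dotted field path (e.g. 'custom.path') through a parsed JSON object."""
    node = doc
    for key in field_path.split("."):
        node = node[key]
    return node

# Wrapping object analogous to avro-json.json; here the embedded schema
# is stored as an escaped JSON string, as in the example above.
wrapper = {
    "owner": "user",
    "custom": {
        "key": "value",
        "path": json.dumps({"type": "record", "name": "users", "fields": []}),
    },
}

schema = json.loads(extract_by_field_path(wrapper, "custom.path"))
print(schema["name"])  # -> users
```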
Using CREATE VIEW With Complex Type Subfields¶
You can create views using just a subfield of a complex type. Because column names in a view cannot contain a dot (.), you must give such columns an alias:
CREATE VIEW user_city AS SELECT user, address.city AS cityname FROM users
Omitting the alias results in an error:
CREATE VIEW user_city AS SELECT user, address.city FROM users
Could not execute ddl., detail:AnalysisException: Invalid column/field name: address.city
Scanning Data With the Okera REST API¶
The REST API supports JSON by default. Invoking the curl utility against the users table endpoint returns:
> curl REST_SERVER:PORT/api/scan/users
[
{
"uid": 100,
"user": "alice",
"address": {
"city": "san francisco",
"state": "ca"
},
"age": 25
},
{
"uid": 101,
"user": "bob",
"address": {
"city": "seattle",
"state": "wa"
},
"age": 25
}
]
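Because the REST API returns plain JSON, client code can traverse the nested struct directly. A minimal Python sketch using the response shown above (the records are inlined here rather than fetched over HTTP):

```python
import json

# Response body as returned by the /api/scan/users endpoint above.
response_body = """
[
  {"uid": 100, "user": "alice",
   "address": {"city": "san francisco", "state": "ca"}, "age": 25},
  {"uid": 101, "user": "bob",
   "address": {"city": "seattle", "state": "wa"}, "age": 25}
]
"""

records = json.loads(response_body)
# Struct columns arrive as nested JSON objects, so subfield access is
# ordinary dictionary access.
cities = [r["address"]["city"] for r in records]
print(cities)  # -> ['san francisco', 'seattle']
```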
Note: This example omits the authentication headers for brevity. See REST API OAuth for details.
Scanning a Table via SparkSQL¶
Okera support for struct types aligns with Spark's record type. Some representative queries on the users table include:
- Creating an ephemeral table:

  spark.sql(s"""
    |CREATE TEMPORARY TABLE users
    |USING com.cerebro.recordservice.spark.DefaultSource
    |OPTIONS (RecordServiceTable 'users')
    """.stripMargin)

  Note: In Amazon EMR (or any ODAP configuration that uses an external Hive Metastore), you can omit this step and just write queries against db.table.

- Returning a record count:

  spark.sql("SELECT count(*) from users").show()

- Selecting the first column (recall that the uid field is of type BIGINT):

  spark.sql("SELECT uid from users").show()

- Selecting the first column and a subfield of the struct. This request flattens the nested type:

  spark.sql("SELECT uid, address.city from users").show()

- Selecting the first column and the column containing a struct:

  spark.sql("SELECT uid, address from users").show()

- Selecting all fields from the table:

  val row = spark.sql("SELECT * from users").collect()(0) // Row for 'alice'
  assert(row.get(0) == 100)
  assert(row.get(1) == "alice")
  assert(row.get(2).asInstanceOf[Row](0) == "san francisco")
  assert(row.get(2).asInstanceOf[Row](1) == "ca")
  assert(row.get(3) == 25)

The top-level record consists of four fields: [BIGINT, STRING, STRUCT, INT].
Scanning via Apache Hive¶
The Avro struct type maps to Hive's struct type.
For example, describing the users
schema:
hive> describe users;
OK
uid bigint
user string
address struct<city:string,state:string>
age int
A few more examples of Hive DDL-based scans:
Query
hive> select * from users;
OK
100 alice {"city":"san francisco","state":"ca"} 25
101 bob {"city":"seattle","state":"wa"} 25
Query
hive> select uid, user from users;
OK
100 alice
101 bob
Query
hive> select uid, address from users;
OK
100 {"city":"san francisco","state":"ca"}
101 {"city":"seattle","state":"wa"}
Query
hive> select uid, address.city from users;
OK
100 san francisco
101 seattle
Query
hive> select * from users where uid = 100;
OK
100 alice {"city":"san francisco","state":"ca"} 25
Query
hive> select address.city, count(*) from users group by address.city;
OK
seattle 1
san francisco 1
Query
hive> select * from users where address.state = 'wa';
OK
101 bob {"city":"seattle","state":"wa"} 25
Scanning Table Data With Presto¶
Presto supports struct types. The examples below reflect the expected Presto experience.
Query
presto> describe recordservice.default.users;
Column | Type | Extra | Comment
---------+----------------------------------+-------+---------
uid | bigint | |
user | varchar | |
address | row(city varchar, state varchar) | |
age | integer | |
Some additional examples:
Query
presto> select * from recordservice.default.users;
uid | user | address | age
-----+-------+--------------------------------+-----
100 | alice | {city=san francisco, state=ca} | 25
101 | bob | {city=seattle, state=wa} | 25
(2 rows)
Query
presto> select uid, user from recordservice.default.users;
uid | user
-----+-------
100 | alice
101 | bob
(2 rows)
Query
presto> select uid, address from recordservice.default.users;
uid | address
-----+--------------------------------
100 | {city=san francisco, state=ca}
101 | {city=seattle, state=wa}
(2 rows)
Query
presto> select uid, address.city from recordservice.default.users;
uid | city
-----+---------------
100 | san francisco
101 | seattle
(2 rows)
Query
presto> select * from recordservice.default.users where uid = 100;
uid | user | address | age
-----+-------+--------------------------------+-----
100 | alice | {city=san francisco, state=ca} | 25
Query
presto> select address.city, count(*) from recordservice.default.users group by address.city;
city | _col1
---------------+-------
san francisco | 1
seattle | 1
Query
presto> select * from recordservice.default.users where address.state = 'wa';
uid | user | address | age
-----+------+--------------------------+-----
101 | bob | {city=seattle, state=wa} | 25
(1 row)
Array Data Types¶
Note: PostgreSQL and Oracle array and array(array) data types are not supported in Okera.
The DDL for creating tables that include array types is standard HiveQL:
CREATE EXTERNAL TABLE user_phone_numbers(
phone_numbers struct<
uid: BIGINT,
numbers: ARRAY<STRING>,
user: STRING
>
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-users/'
The table description should appear as follows:
> describe user_phone_numbers;
uid bigint
numbers array<string>
user string
Scanning With the REST API¶
The REST API returns arrays as a JSON ordered list. A curl on the endpoint for the user_phone_numbers table returns:
> curl REST_SERVER:PORT/api/scan/user_phone_numbers?records=1
[
{
"phone_numbers": {
"uid": 123,
"numbers": [
"111-222-3333",
"444-555-6666"
],
"user": "john"
}
}
]
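As with structs, client code can consume the array column directly from the JSON response. A short Python sketch using the response shown above (inlined rather than fetched over HTTP):

```python
import json

# Response body as returned by the /api/scan/user_phone_numbers endpoint above.
response_body = """
[
  {"phone_numbers": {"uid": 123,
    "numbers": ["111-222-3333", "444-555-6666"], "user": "john"}}
]
"""

record = json.loads(response_body)[0]
# An array column arrives as a JSON ordered list, i.e. a Python list.
numbers = record["phone_numbers"]["numbers"]
print(numbers[0], len(numbers))  # -> 111-222-3333 2
```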
Scanning With SparkSQL¶
Okera support for array types aligns with Spark's array type. To create an ephemeral view:
spark.sql(s"""
|CREATE TEMPORARY VIEW user_phone_numbers
|USING com.cerebro.recordservice.spark.DefaultSource
|OPTIONS (RecordServiceTable 'rs_complex.user_phone_numbers')
""".stripMargin)
A SELECT on all fields, including the array type, returns:
spark.sql("SELECT * FROM user_phone_numbers").show()
+--------------------+
| phone_numbers|
+--------------------+
|[123,WrappedArray...|
|[234,WrappedArray...|
|[345,WrappedArray...|
+--------------------+
A SELECT on just the array subfield returns:
spark.sql("select phone_numbers.numbers from user_phone_numbers").show()
+--------------------+
| numbers|
+--------------------+
|[111-222-3333, 44...|
|[222-333-4444, 55...|
| [111-222-5555]|
+--------------------+
Scanning Table Data With Apache Hive¶
Support for array types also aligns with Hive's array type.
A SELECT on all fields including an array type, and one on the array type alone, are shown below:
hive> SELECT * FROM user_phone_numbers
{"uid":123,"numbers":["111-222-3333","444-555-6666"],"user":"john"}
{"uid":234,"numbers":["222-333-4444","555-666-7777"],"user":"jim"}
{"uid":345,"numbers":["111-222-5555"],"user":"jack"}
hive> select phone_numbers.numbers from user_phone_numbers
["111-222-3333","444-555-6666"]
["222-333-4444","555-666-7777"]
["111-222-5555"]
Scanning Table Data With Presto¶
Support for array types also aligns with Presto's array type.
A SELECT on all the fields including an array type, and one on the array type by itself, are shown below:
presto> select * from recordservice.rs_complex.user_phone_numbers;
phone_numbers
------------------------------------------------------------
{uid=123, numbers=[111-222-3333, 444-555-6666], user=john}
{uid=234, numbers=[222-333-4444, 555-666-7777], user=jim}
{uid=345, numbers=[111-222-5555], user=jack}
presto> select phone_numbers.numbers from recordservice.rs_complex.user_phone_numbers;
numbers
------------------------------
[111-222-3333, 444-555-6666]
[222-333-4444, 555-666-7777]
[111-222-5555]
JSON file format¶
JSON files are supported by Okera starting with release 1.4. JSON files can be specified with STORED AS JSON as part of the CREATE TABLE statement. For example:
CREATE EXTERNAL TABLE user_phone_numbers(
phone_numbers struct<
uid: BIGINT,
numbers: ARRAY<STRING>,
user: STRING
>
)
STORED AS JSON
LOCATION 's3://okera/rs-complex-users-json/'
The returned records are consistent with the results from the other file formats.
Map Data Types¶
Scope of Support¶
Maps are supported for the Avro and Parquet file formats. As noted above, maps are not supported for the JSON file format.
Note: PostgreSQL and Oracle map data types are not supported by Okera.
Using Map Types¶
The DDL for creating tables with a map type corresponds to standard HiveQL:
CREATE EXTERNAL TABLE user_phone_numbers_map(
uid BIGINT,
username STRING,
contact_numbers MAP<STRING, STRING>
) STORED AS AVRO
LOCATION 's3://okera/rs-complex-map-userphones'
A description of this table should return:
> describe user_phone_numbers_map;
uid bigint
username string
contact_numbers map<string,string>
Scanning With the REST API¶
The REST API returns map types as a JSON object in the form of name-value pairs. A curl on the endpoint for the user_phone_numbers_map table returns (the URL is quoted because it contains an ampersand):
> curl "REST_SERVER:PORT/api/scan/user_phone_numbers_map?records=1&format=1"
[
{
"uid": 1,
"username": "Alex",
"contact_numbers": {
"work": "444-555-6666",
"home": "111-222-3333"
}
}
]
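A map column can likewise be consumed directly from the JSON response. A short Python sketch using the response shown above (inlined rather than fetched over HTTP):

```python
import json

# Response body as returned by the /api/scan/user_phone_numbers_map endpoint above.
response_body = """
[
  {"uid": 1, "username": "Alex",
   "contact_numbers": {"work": "444-555-6666", "home": "111-222-3333"}}
]
"""

record = json.loads(response_body)[0]
# A map column arrives as a JSON object of name-value pairs, i.e. a
# Python dict keyed by the map keys.
print(record["contact_numbers"]["home"])  # -> 111-222-3333
```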
Scanning Table Data via SparkSQL¶
Okera support for map types aligns with Spark's map type. A SELECT on all fields including a map type returns:
scala> spark.sql("select * from rs_complex.user_phone_numbers_map").show()
+---+--------+--------------------+
|uid|username| contact_numbers|
+---+--------+--------------------+
| 1| Alex|[work -> 444-555-...|
| 2| Melinda|[work -> 444-555-...|
+---+--------+--------------------+
Scanning Table Data With Apache Hive¶
Support for map types also aligns with Hive's map type. A SELECT on all fields including a map type returns:
hive> SELECT * FROM rs_complex.user_phone_numbers_map;
OK
1 Alex {"work":"444-555-6666","home":"111-222-3333"}
2 Melinda {"work":"444-555-8888","home":"111-222-7777"}
Limited Support for Unions in Avro¶
The Avro specification uses unions only to incorporate the NULL type into all other types, thus making all Avro types nullable. All types in Okera are also nullable. In this context, the specific case for Avro unions is supported. Any other use case, such as accepting a union between a string and int type, is not supported.
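For reference, the supported shape is the standard Avro nullable-field idiom, a two-branch union with null. A field declared this way maps to an ordinary nullable STRING column:

```json
{"name": "user", "type": ["null", "string"], "default": null}
```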
Support for the enum Type in Avro¶
The Avro specification defines enums as a complex type; Okera does not. A table or view in Okera cannot define a column with the ENUM type.
The Okera engine will, however, convert any Avro field of enum type to a string if the corresponding table column is declared as STRING. This behavior matches what Hive does under the same conditions.
Configuring Parquet File Resolution Types¶
The table property parquet.resolve-by.type can be used to configure how a Parquet data file is resolved against the table schema. Valid values are ordinal (positional resolution) and name (name resolution).
For example:
ALTER TABLE nation SET TBLPROPERTIES('parquet.resolve-by.type'='name')
ALTER TABLE nation SET TBLPROPERTIES('parquet.resolve-by.type'='ordinal')
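The practical difference between the two modes can be sketched in Python. This is an illustration of the concept with a hypothetical helper, not Okera code: with ordinal resolution, table columns are paired with file columns by position, so a file written in a different column order maps data to the wrong columns; with name resolution, columns are matched by name regardless of position.

```python
def resolve_columns(file_columns, table_columns, mode):
    """Illustrative resolution of Parquet file columns against table columns."""
    if mode == "ordinal":
        # Pair table columns with file columns by position.
        return list(zip(table_columns, file_columns))
    if mode == "name":
        # Pair columns by name, regardless of their position in the file.
        return [(c, c) for c in table_columns if c in file_columns]
    raise ValueError("unknown mode: " + mode)

# A file written with its columns in a different order than the table declares.
file_cols = ["n_name", "n_nationkey", "n_comment"]
table_cols = ["n_nationkey", "n_name", "n_comment"]

# Ordinal resolution mismatches the first two columns; name resolution does not.
print(resolve_columns(file_cols, table_cols, "ordinal"))
print(resolve_columns(file_cols, table_cols, "name"))
```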