Data Types

This document describes how Okera handles data types and values. Data types are used when specifying schemas (for example, during a ‘create table’ call), while values are the data stored in a given row of a table.

Currently Supported Data Types

  • BOOL
  • TINYINT
  • SMALLINT
  • INT
  • BIGINT
  • FLOAT
  • DOUBLE
  • STRING
  • VARCHAR
  • CHAR
  • DECIMAL
  • TIMESTAMP
  • BINARY
  • REAL
  • STRUCT

See the Notes section at the bottom of this page for more information on these types.

Conversions

In some situations, Okera must convert both values and data types, depending on the storage format and the compute engine being used. Some platforms do not support the full range of types that ODAS does.

Parquet and Spark DataFrames

These are the conversions that occur when working with Parquet data or Spark DataFrames.

Datatype   Parquet type          Spark DataFrame type  Avro type
boolean    boolean               BooleanType           boolean
tinyint    int32                 IntegerType           int
smallint   int32                 IntegerType           int
int        int32                 IntegerType           int
bigint     int64                 LongType              long
float      float                 FloatType             float
double     double                DoubleType            double
timestamp  int96                 TimestampType         string
string     byte_array            StringType            string
binary     byte_array            NA                    bytes
decimal    fixed_len_byte_array  createDecimalType()   bytes
binary     byte_array            BinaryType            bytes
real       double                DoubleType            double
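
One consequence of the table above is that several Okera types share a single storage type, so the narrower type cannot be recovered from the Parquet type alone. The following Python sketch illustrates this. It simply copies the rows of the table into a list; the names Conversion, CONVERSIONS, and okera_types_stored_as are hypothetical and are not part of any Okera API.

```python
from typing import List, NamedTuple


class Conversion(NamedTuple):
    """One row of the conversion table (hypothetical helper type)."""
    okera: str
    parquet: str
    spark: str
    avro: str


# Rows copied verbatim from the table above, including both binary rows.
CONVERSIONS = [
    Conversion("boolean", "boolean", "BooleanType", "boolean"),
    Conversion("tinyint", "int32", "IntegerType", "int"),
    Conversion("smallint", "int32", "IntegerType", "int"),
    Conversion("int", "int32", "IntegerType", "int"),
    Conversion("bigint", "int64", "LongType", "long"),
    Conversion("float", "float", "FloatType", "float"),
    Conversion("double", "double", "DoubleType", "double"),
    Conversion("timestamp", "int96", "TimestampType", "string"),
    Conversion("string", "byte_array", "StringType", "string"),
    Conversion("binary", "byte_array", "NA", "bytes"),
    Conversion("decimal", "fixed_len_byte_array", "createDecimalType()", "bytes"),
    Conversion("binary", "byte_array", "BinaryType", "bytes"),
    Conversion("real", "double", "DoubleType", "double"),
]


def okera_types_stored_as(parquet_type: str) -> List[str]:
    """Return the distinct Okera types that map to the given Parquet type."""
    seen: List[str] = []
    for row in CONVERSIONS:
        if row.parquet == parquet_type and row.okera not in seen:
            seen.append(row.okera)
    return seen


print(okera_types_stored_as("int32"))  # ['tinyint', 'smallint', 'int']
```

Here tinyint, smallint, and int all come back as the 32-bit type, which matches the three int32 rows in the table.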

Notes

  • The string and binary data types are stored as a binary blob and not interpreted in any way.
  • The REAL type is now supported in ODAS. Since Hive does not support the REAL data type, odb may be used to create a field with the REAL data type. The DOUBLE type can be used as an alias for REAL.
  • For complex data types, refer to complex types.
  • The decimal type is returned as a string in the JSON result set when the client connects to the ODAS REST server. The REST server client may choose to convert it back to a decimal type as needed. Note that most compute engines and applications connect to the ODAS planner directly, and they support and retrieve the decimal type directly.
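
As a minimal sketch of the client-side conversion mentioned in the last note, the Python snippet below parses a JSON result set and restores decimal columns with the standard library's decimal.Decimal. The payload shape, the column name price, and the helper restore_decimals are assumptions made for illustration; the actual REST response format may differ.

```python
import json
from decimal import Decimal

# Hypothetical response body; the real REST payload shape may differ.
payload = '[{"id": 1, "price": "19.99"}, {"id": 2, "price": "0.10"}]'

# Assumed to be known by the client from the table schema.
DECIMAL_COLUMNS = {"price"}


def restore_decimals(rows, decimal_columns):
    """Convert the string values of the named columns back to Decimal."""
    return [
        {
            key: Decimal(value) if key in decimal_columns and value is not None else value
            for key, value in row.items()
        }
        for row in rows
    ]


rows = restore_decimals(json.loads(payload), DECIMAL_COLUMNS)
total = sum(row["price"] for row in rows)
print(total)  # 20.09
```

Parsing the string with Decimal rather than float keeps the exact digits that were returned, so arithmetic on the restored values does not pick up binary floating-point rounding.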