Skip to content

Data Types

This document describes how Okera handles data types and values. We differentiate between the two in that data types are used when specifying schemas (for example, during a 'create table' call) and values are the data that exists in a given row within a table.

Currently Supported Data Types

  • BOOL
  • TINYINT
  • SMALLINT
  • INT
  • BIGINT
  • FLOAT
  • DOUBLE
  • STRING
  • VARCHAR
  • CHAR
  • DECIMAL
  • DATE
  • TIMESTAMP
  • BINARY
  • REAL
  • STRUCT

See the NOTES section at the bottom of this page for more information on types.

Conversions

Okera must convert both values as well as data types in some situations, based on the storage format and the compute engine being used. Some platforms do not not support the full range of types that ODAS does.

Parquet and Spark DataFrames

These are the conversions that occur when working with Parquet data or Spark DataFrames values.

Datatype Parquet type Spark Data frame type Avro type
boolean boolean BooleanType boolean
tinyint int32 IntegerType int
smallint int32 IntegerType int
int int32 IntegerType int
bigint int64 LongType long
float float FloatType float
double double DoubleType double
date int32 IntegerType int
timestamp int96 TimestampType string
string byte_array StringType string
binary byte_array NA bytes
decimal fixed_len_byte_array createDecimalType() bytes
binary byte_array BinaryType bytes
real double DoubleType double

Spark notes

Filter pushdown on date partition, and hence partition pruning, is supported in spark. For this to work, spark requires DATE function to be used, to cast to date type.

Example: This will push the date partition filter to ODAS.

select * from datedb.part_test where d = DATE('2019-01-01')

This will NOT push the date partition filter to ODAS.

select * from datedb.part_test where d = '2019-01-01'

Notes

  • The string and binary data types are stored as a binary blob and not interpreted in any way.
  • DATE type is now supported. The display format is "YYYY-MM-DD". It is stored internally as an int number of days since unix epoch. Note that timestamp literal is accepted in the filters. The time portion is ignored.
  • REAL type is now supported in ODAS. Since, Hive does not support REAL data type, odb may be used to create a field with REAL datatype. DOUBLE type can be used as an alias for REAL.
  • For complex datatypes, refer to complex types
  • Decimal type is returned as a string in the json resultset when the client connects to odas rest server The rest server client may choose to convert it back to decimal type as needed. Note that most compute engines/applications connect to ODAS planner directly and support and retrieve decimal type directly.
  • ODAS supports the above mentioned data types for JSON file formats. Since everything is stored as a string, there is no tight mapping and the users need to select the appropriate data type in the table creation or use the auto-inference by providing sample JSON file.

JDBC Data Types Mapping

The following table shows the mapping between Okera and Relational/JDBC source data types.

Okera MySQL PostgreSQL Snowflake Redshift AWS Athena SQL Server Sybase Oracle
boolean BOOL, BOOLEAN BOOL, BOOLEAN, BIT BOOL, BOOLEAN BOOL, BOOLEAN BOOLEAN BOOLEAN
tinyint TINYINT TINYINT TINYINT TINYINT
smallint SMALLINT SMALLINT SMALLINT, INT2 SMALLINT SMALLINT SMALLINT
int MEDIUMINT, INT INT, SERIAL INTEGER, INT, INT4 INT INT INT
bigint BIGINT BIGINT, BIGSERIAL SMALLINT, INT, INTEGER, BIGINT, NUMBER BIGINT, INT8 BIGINT BIGINT BIGINT TINYINT, SMALLINT, INT, INTEGER, BIGINT, NUMBER
double DOUBLE DOUBLE, MONEY, REAL DOUBLE, REAL DOUBLE, FLOAT8, FLOAT, REAL, FLOAT4 DOUBLE REAL REAL REAL
decimal(p,s) FLOAT(p,s), decimal(p,s) NUMERIC(p,s), DECIMAL(p,s) NUMERIC(p,s), DECIMAL(p,s) DECIMAL(p,s), NUMERIC DECIMAL(p,s) DECIMAL(p,s) DECIMAL(p,s) DECIMAL(p,s), NUMERIC(p,s)
char(N) CHAR(N) CHAR(N) CHAR, CHARACTER, NCHAR, BPCHAR CHAR CHAR(N) CHAR(N) char(N)
char(2) ENUM
char(6) SET
varchar(N) VARCHAR(N) VARCHAR(N) CHAR(N), VARCHAR(N) CHARACTER VARYING, NVARCHAR, TEXT VARCHAR VARCHAR(N) VARCHAR(N) VARCHAR(N)
varchar(255) TINYTEXT STRING
varchar(65355) TEXT
string MEDIUMTEXT, LONGTEXT, BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, BINARY(N), VARBINARY(N) TEXT STRING, TEXT, BINARY(N), VARBINARY(N) TEXT, TIMESTAMP, BINARY(N), VARBINARY(N) TEXT, TIMESTAMP, BINARY(N), VARBINARY(N)
timestamp TIMESTAMP, TIME, DATETIME TIMESTAMP TIMESTAMP TIMESTAMP TIMESTAMP DATETIME DATETIME TIMESTAMP
date DATE DATE DATE DATE DATE DATE DATE DATE

Starting from Okera version 2.1.0, Decimals/Numeric types without a p/s or > 38 precision are capped at 38(p) at Okera. If scale is not specified in the source JDBC data source, a default of 6(s) is considered to avoid scale loss

Unsupported JDBC Data types

  • MySQL: YEAR
  • PostgreSQL: JSON, TIME, UUID and XML
  • Snowflake: TIME, VARIANT, OBJECT and ARRAY