Skip to content

Data Types

This document describes how Okera handles data types and values. We differentiate between the two in that data types are used when specifying schemas (for example, during a CREATE TABLE call) and values are the data that exists in a given row within a table.

Currently Supported Okera Data Types

  • bigint
  • binary
  • bool
  • char
  • date
  • decimal
  • double
  • float
  • int
  • real
  • smallint
  • string
  • struct
  • timestamp
  • tinyint
  • varchar

See Notes at the bottom of this page for more information on these types.

Conversions

Okera must convert both values as well as data types in some situations, based on the storage format and the compute engine being used. Some platforms do not not support the full range of types that Okera does.

Parquet and Spark DataFrames

These are the conversions that occur when working with Parquet data or Spark DataFrames values.

Data Type Parquet Type Spark Data Frame Type Avro Type
boolean boolean BooleanType boolean
tinyint int32 IntegerType int
smallint int32 IntegerType int
int int32 IntegerType int
bigint int64 LongType long
float float FloatType float
double double DoubleType double
date int32 IntegerType int
timestamp int96 TimestampType string
string byte_array StringType string
binary byte_array NA bytes
decimal fixed_len_byte_array DecimalType() bytes
binary byte_array BinaryType bytes
real double DoubleType double

Spark Notes

Filter pushdown on a date partition, and hence partition pruning, is supported in Spark.

  • In Spark 2 environments, by default, Spark requires the DATE function to be used to cast to the date data type. This is controlled by the spark.recordservice.spark.parser configuration setting. By default, this setting is set to PrunedFilteredScan.

    In the following example, the DATE function pushes the date partition filter to Okera.

    select * from datedb.part_test where d = DATE('2019-01-01')
    

    But in the following example, the DATE function is not used, so this will NOT push the date partition filter to Okera.

    select * from datedb.part_test where d = '2019-01-01'
    

    However, pushdown of date filters can be performed without the DATE function if you use SparkConf to change the spark.recordservice.spark.parser setting to CatalystScan.

  • In Spark 3, date filters are pushed down by default.

Notes

  • The string and binary data types are stored as a binary blob and not interpreted in any way.
  • The date data type is supported. The display format is "YYYY-MM-DD". It is stored internally as an int number of days since Unix epoch. Note that a timestamp literal is accepted in the filters. The time portion is ignored.
  • The real data type is supported in Okera. The double data type can be used as an alias for the real data type.
  • For complex data types, refer to complex types.
  • The decimal type is returned as a string in the JSON result set when the client connects to the Okera REST server. The REST server client can convert it back to decimal type as needed. Note that most compute engines and applications connect to the Okera Policy Engine (planner) directly and support and retrieve decimal types directly.
  • Okera supports the data types mentioned above for JSON file formats. Since everything is stored as a string, there is no tight mapping and you need to select the appropriate data type in the table creation or use the auto-inference by providing a sample JSON file.

JDBC Data Types Mapping

The following table maps Okera data types and relational/JDBC data source data types. (Scroll to the right to see SQL Server, Sybase, and Oracle mappings.)

Okera MySQL PostgreSQL Snowflake Redshift AWS Athena SQL Server Sybase Oracle
bigint bigint bigint, bigserial, int8 smallint, int, integer, bigint, number bigint, int8 bigint bigint bigint tinyint, smallint, decimal, decimal(p,s), int, integer, number(p,s), numeric, numeric(p,s)
boolean bool, boolean bool, boolean, bit bool, boolean bool, boolean boolean boolean
char(n) char(n) char(n) char, character, nchar, bpchar char char(n) char(n) char, nchar
char(2) enum
char(6) set
date date date date date date date date date
decimal(p,s) float(p,s), decimal(p,s) numeric(p,s), decimal(p,s) numeric(p,s), decimal(p,s) decimal(p,s), numeric decimal(p,s) decimal(p,s) decimal(p,s) decimal(p,s), number, number(p,s), numeric(p,s)
double double double, money, real, float8 double, real double, float8, float, real, float4 double real real real
float float, double precision, real
int mediumint, int int, serial, int4, integer integer, int, int4 int int int
smallint smallint smallint, int2 smallint, int2 smallint smallint smallint
string mediumtext, longtext, blob, tinyblob, mediumblob, longblob, binary(n), uuid, varbinary(n) text, BYTEA, uuid string, text, binary(n), varbinary(n) text, timestamp, binary(n), uuid, varbinary(n) text, timestamp, binary(n), varbinary(n) blob, clob, long
timestamp timestamp, time, datetime timestamp timestamp timestamp timestamp datetime datetime timestamp
timestamp_nanos timestamp without time zone, timestamptz date, timestamp
tinyint tinyint tinyint tinyint tinyint
varchar(n) varchar(n) varchar(n) char(n), varchar(n) character varying, nvarchar, text varchar varchar(n) varchar(n) varchar(n), varchar2, nvarchar2
varchar(255) tinytext string
varchar(65355) text

Decimal and numeric data types without a p/s or > 38 precision are capped at 38(p). If the scale is not specified in the JDBC data source, a default of 6(s) is considered to avoid scale loss.

Unsupported JDBC Data Types

The following relational/JDBC data source data types are not supported in Okera:

  • MySQL: year
  • Oracle: array, array(array), map(string,string), timestamp with time zone, and timestamp with local time zone
  • PostgreSQL: array, array(array), json, map(string,string), time, uuid, and xml
  • Snowflake: array, object, time, and variant