Data Types¶
This document describes how Okera handles data types and values. We differentiate between the two in that data types are used when specifying schemas (for example, during a CREATE TABLE call) and values are the data that exists in a given row within a table.
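For example, data types appear in DDL such as the following sketch (the `sales.orders` table and its columns are hypothetical), while values are what each inserted row carries in those columns:

```sql
-- A minimal sketch; sales.orders and its columns are hypothetical.
CREATE TABLE sales.orders (
  id BIGINT,
  amount DECIMAL(10, 2),
  order_date DATE,
  note STRING
)
```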
Currently Supported Okera Data Types¶
- bigint
- binary
- boolean
- char
- date
- decimal
- double
- float
- int
- real
- smallint
- string
- struct
- timestamp
- tinyint
- varchar
See Notes at the bottom of this page for more information on these types.
Conversions¶
Okera must convert both values and data types in some situations, depending on the storage format and the compute engine being used. Some platforms do not support the full range of types that Okera does.
Parquet and Spark DataFrames¶
These are the conversions that occur when working with Parquet data or Spark DataFrame values.
Data Type | Parquet Type | Spark DataFrame Type | Avro Type |
---|---|---|---|
boolean | boolean | BooleanType | boolean |
tinyint | int32 | IntegerType | int |
smallint | int32 | IntegerType | int |
int | int32 | IntegerType | int |
bigint | int64 | LongType | long |
float | float | FloatType | float |
double | double | DoubleType | double |
date | int32 | IntegerType | int |
timestamp | int96 | TimestampType | string |
string | byte_array | StringType | string |
binary | byte_array | BinaryType | bytes |
decimal | fixed_len_byte_array | DecimalType() | bytes |
real | double | DoubleType | double |
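As one concrete reading of the table above: a column declared as `date` is stored in Parquet as an `int32` and surfaces in a Spark DataFrame as `IntegerType`. A sketch (the `demo.events` table is hypothetical):

```sql
-- demo.events is a hypothetical table with a date column.
CREATE TABLE demo.events (
  id BIGINT,
  event_day DATE
)
-- Per the conversion table above: event_day is stored in Parquet as int32
-- (days since the Unix epoch) and appears in a Spark DataFrame as IntegerType.
```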
Spark Notes¶
Filter pushdown on a date partition, and hence partition pruning, is supported in Spark.
- In Spark 2 environments, by default, Spark requires the DATE function to be used to cast to the `date` data type. This is controlled by the `spark.recordservice.spark.parser` configuration setting, which defaults to `PrunedFilteredScan`. In the following example, the DATE function pushes the date partition filter down to Okera:

      select * from datedb.part_test where d = DATE('2019-01-01')

  But in the following example, the DATE function is not used, so the date partition filter is NOT pushed down to Okera:

      select * from datedb.part_test where d = '2019-01-01'

  However, date filters can be pushed down without the DATE function if you use SparkConf to change the `spark.recordservice.spark.parser` setting to `CatalystScan`.
- In Spark 3, date filters are pushed down by default.
Notes¶
- The `string` and `binary` data types are stored as a binary blob and not interpreted in any way.
- The `date` data type is supported. The display format is "YYYY-MM-DD". It is stored internally as an `int` number of days since the Unix epoch. Note that a timestamp literal is accepted in filters; the time portion is ignored (see the example after this list).
- The `real` data type is supported in Okera. The `double` data type can be used as an alias for the `real` data type.
- For complex data types, refer to complex types.
- The `decimal` type is returned as a `string` in the JSON result set when the client connects to the Okera REST server. The REST server client can convert it back to the `decimal` type as needed. Note that most compute engines and applications connect directly to the Okera Policy Engine (planner) and retrieve `decimal` types natively.
- Okera supports the data types mentioned above for JSON file formats. Since everything in JSON is stored as a `string`, there is no tight mapping; you need to select the appropriate data type at table creation or use auto-inference by providing a sample JSON file.
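To illustrate the `date` note above: a timestamp literal is accepted in a filter on a `date` column and its time portion is ignored, so both of the following queries (reusing the `datedb.part_test` example table from the Spark notes) select the same date:

```sql
-- Both filters resolve to the date 2019-01-01; the time portion
-- of the timestamp literal in the second query is ignored.
select * from datedb.part_test where d = '2019-01-01'
select * from datedb.part_test where d = '2019-01-01 10:30:00'
```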
JDBC Data Types Mapping¶
The following table maps Okera data types to relational/JDBC data source data types. (Scroll to the right to see SQL Server, Sybase, and Oracle mappings.)
Okera | MySQL | PostgreSQL | Snowflake | Redshift | AWS Athena | SQL Server | Sybase | Oracle |
---|---|---|---|---|---|---|---|---|
bigint | bigint | bigint, bigserial, int8 | smallint, int, integer, bigint, number | bigint, int8 | bigint | bigint | bigint | tinyint, smallint, decimal, decimal(p,s), int, integer, number(p,s), numeric, numeric(p,s) |
boolean | bool, boolean | bool, boolean, bit | bool, boolean | bool, boolean | boolean | boolean | ||
char(n) | char(n) | char(n) | char, character, nchar, bpchar | char | char(n) | char(n) | char, nchar | |
char(2) | enum | |||||||
char(6) | set | |||||||
date | date | date | date | date | date | date | date | date |
decimal(p,s) | float(p,s), decimal(p,s) | numeric(p,s), decimal(p,s) | numeric(p,s), decimal(p,s) | decimal(p,s), numeric | decimal(p,s) | decimal(p,s) | decimal(p,s) | decimal(p,s), number, number(p,s), numeric(p,s) |
double | double | double, money, real, float8 | double, real | double, float8, float, real, float4 | double | real | real | real |
float | float, double precision, real | |||||||
int | mediumint, int | int, serial, int4, integer | integer, int, int4 | int | int | int | ||
smallint | smallint | smallint, int2 | smallint, int2 | smallint | smallint | smallint | ||
string | mediumtext, longtext, blob, tinyblob, mediumblob, longblob, binary(n), uuid, varbinary(n) | text, BYTEA, uuid | string, text, binary(n), varbinary(n) | text, timestamp, binary(n), uuid, varbinary(n) | text, timestamp, binary(n), varbinary(n) | blob, clob, long | ||
timestamp | timestamp, time, datetime | timestamp | timestamp | timestamp | timestamp | datetime | datetime | timestamp |
timestamp_nanos | timestamp without time zone, timestamptz | date, timestamp | ||||||
tinyint | tinyint | tinyint | tinyint | tinyint | ||||
varchar(n) | varchar(n) | varchar(n) | char(n), varchar(n) | character varying, nvarchar, text | varchar | varchar(n) | varchar(n) | varchar(n), varchar2, nvarchar2 |
varchar(255) | tinytext | string | ||||||
varchar(65535) | text |
Decimal and numeric data types without a specified precision and scale (p,s), or with a precision greater than 38, are capped at a precision of 38. If the scale is not specified in the JDBC data source, a default scale of 6 is used to avoid scale loss.
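For example, under these rules, a hypothetical PostgreSQL column declared as `numeric` with no precision or scale would surface in Okera as follows:

```sql
-- Hypothetical PostgreSQL source column, declared without precision/scale:
--   total numeric
-- Resulting Okera type after the capping and default rules are applied:
--   total decimal(38, 6)   -- precision capped at 38, default scale of 6
```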
Unsupported JDBC Data Types¶
The following relational/JDBC data source data types are not supported in Okera:
- MySQL: year
- Oracle: array, array(array), map(string,string), timestamp with time zone, and timestamp with local time zone
- PostgreSQL: array, array(array), json, map(string,string), time, uuid, and xml
- Snowflake: array, object, time, and variant