Data Types¶
This document describes how Okera handles data types and values. We differentiate between the two in that data types are used when specifying schemas (for example, during a CREATE TABLE call) and values are the data that exists in a given row within a table.
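For example, data types appear in DDL such as the following sketch (the `sales.orders` table and its columns are hypothetical), while values are what each inserted row carries in those columns:

```sql
-- A minimal sketch; sales.orders and its columns are hypothetical.
CREATE TABLE sales.orders (
  id BIGINT,
  amount DECIMAL(10, 2),
  order_date DATE,
  note STRING
)
```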
Currently Supported Okera Data Types¶
- bigint
- binary
- boolean
- char
- date
- decimal
- double
- float
- int
- real
- smallint
- string
- struct
- timestamp
- tinyint
- varchar
See Notes at the bottom of this page for more information on these types.
Conversions¶
Okera must convert both values and data types in some situations, depending on the storage format and the compute engine being used. Some platforms do not support the full range of types that Okera does.
Parquet and Spark DataFrames¶
These are the conversions that occur when working with Parquet data or Spark DataFrame values.
Data Type | Parquet Type | Spark DataFrame Type | Avro Type |
---|---|---|---|
boolean | boolean | BooleanType | boolean |
tinyint | int32 | IntegerType | int |
smallint | int32 | IntegerType | int |
int | int32 | IntegerType | int |
bigint | int64 | LongType | long |
float | float | FloatType | float |
double | double | DoubleType | double |
date | int32 | IntegerType | int |
timestamp | int96 | TimestampType | string |
string | byte_array | StringType | string |
binary | byte_array | BinaryType | bytes |
decimal | fixed_len_byte_array | DecimalType() | bytes |
real | double | DoubleType | double |
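As one concrete reading of the table above: a column declared as `date` is stored in Parquet as an `int32` and surfaces in a Spark DataFrame as `IntegerType`. A sketch (the `demo.events` table is hypothetical):

```sql
-- demo.events is a hypothetical table with a date column.
CREATE TABLE demo.events (
  id BIGINT,
  event_day DATE
)
-- Per the conversion table above: event_day is stored in Parquet as int32
-- (days since the Unix epoch) and appears in a Spark DataFrame as IntegerType.
```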
Spark Notes¶
Filter pushdown on a date partition, and hence partition pruning, is supported in Spark.
- In Spark 2 environments, by default, Spark requires the DATE function to be used to cast to the `date` data type. This is controlled by the `spark.recordservice.spark.parser` configuration setting, which defaults to `PrunedFilteredScan`. In the following example, the DATE function pushes the date partition filter down to Okera:

      select * from datedb.part_test where d = DATE('2019-01-01')

  But in the following example, the DATE function is not used, so the date partition filter is NOT pushed down to Okera:

      select * from datedb.part_test where d = '2019-01-01'

  However, date filters can be pushed down without the DATE function if you use SparkConf to change the `spark.recordservice.spark.parser` setting to `CatalystScan`.
- In Spark 3, date filters are pushed down by default.
Notes¶
- The `string` and `binary` data types are stored as a binary blob and not interpreted in any way.
- The `date` data type is supported. The display format is "YYYY-MM-DD". It is stored internally as an `int` number of days since the Unix epoch. Note that a timestamp literal is accepted in filters; the time portion is ignored (see the example after this list).
- The `real` data type is supported in Okera. The `double` data type can be used as an alias for the `real` data type.
- For complex data types, refer to complex types.
- The `decimal` type is returned as a `string` in the JSON result set when the client connects to the Okera REST server. The REST server client can convert it back to the `decimal` type as needed. Note that most compute engines and applications connect directly to the Okera Policy Engine (planner) and retrieve `decimal` types natively.
- Okera supports the data types mentioned above for JSON file formats. Since everything in JSON is stored as a `string`, there is no tight mapping; you need to select the appropriate data type at table creation or use auto-inference by providing a sample JSON file.
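To illustrate the `date` note above: a timestamp literal is accepted in a filter on a `date` column and its time portion is ignored, so both of the following queries (reusing the `datedb.part_test` example table from the Spark notes) select the same date:

```sql
-- Both filters resolve to the date 2019-01-01; the time portion
-- of the timestamp literal in the second query is ignored.
select * from datedb.part_test where d = '2019-01-01'
select * from datedb.part_test where d = '2019-01-01 10:30:00'
```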
JDBC Data Types Mapping¶
The following table maps Okera data types to relational/JDBC data source data types. (Scroll to the right to see SQL Server, Sybase, and Oracle mappings.)
Okera | MySQL | PostgreSQL | Snowflake | Redshift | AWS Athena | SQL Server | Sybase | Oracle |
---|---|---|---|---|---|---|---|---|
bigint | bigint | bigint, bigserial, int8 | smallint, int, integer, bigint, number | bigint, int8 | bigint | bigint | bigint | tinyint, smallint, decimal, decimal(p,s), int, integer, number(p,s), numeric, numeric(p,s) |
boolean | bool, boolean | bool, boolean, bit | bool, boolean | bool, boolean | boolean | boolean | ||
char(n) | char(n) | char(n) | char, character, nchar, bpchar | char | char(n) | char(n) | char, nchar | |
char(2) | enum | |||||||
char(6) | set | |||||||
date | date | date | date | date | date | date | date | date |
decimal(p,s) | float(p,s), decimal(p,s) | numeric(p,s), decimal(p,s) | numeric(p,s), decimal(p,s) | decimal(p,s), numeric | decimal(p,s) | decimal(p,s) | decimal(p,s) | decimal(p,s), number, number(p,s), numeric(p,s) |
double | double | double, money, real, float8 | double, real | double, float8, float, real, float4 | double | real | real | real |
float | float, double precision, real | |||||||
int | mediumint, int | int, serial, int4, integer | integer, int, int4 | int | int | int | ||
smallint | smallint | smallint, int2 | smallint, int2 | smallint | smallint | smallint | ||
string | mediumtext, longtext, blob, tinyblob, mediumblob, longblob, binary(n), uuid, varbinary(n) | text, BYTEA, uuid | string, text, binary(n), varbinary(n) | text, timestamp, binary(n), uuid, varbinary(n) | text, timestamp, binary(n), varbinary(n) | blob, clob, long | ||
timestamp | timestamp, time, datetime | timestamp | timestamp | timestamp | timestamp | datetime | datetime | timestamp |
timestamp_nanos | timestamp without time zone, timestamptz | date, timestamp | ||||||
tinyint | tinyint | tinyint | tinyint | tinyint | ||||
varchar(n) | varchar(n) | varchar(n) | char(n), varchar(n) | character varying, nvarchar, text | varchar | varchar(n) | varchar(n) | varchar(n), varchar2, nvarchar2 |
varchar(255) | tinytext | string | ||||||
varchar(65535) | text |
Decimal and numeric data types without a specified precision and scale (p,s), or with a precision greater than 38, are capped at a precision of 38. If the scale is not specified in the JDBC data source, a default scale of 6 is used to avoid scale loss.
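For example, under these rules, a hypothetical PostgreSQL column declared as `numeric` with no precision or scale would surface in Okera as follows:

```sql
-- Hypothetical PostgreSQL source column, declared without precision/scale:
--   total numeric
-- Resulting Okera type after the capping and default rules are applied:
--   total decimal(38, 6)   -- precision capped at 38, default scale of 6
```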
Unsupported JDBC Data Types¶
The following relational/JDBC data source data types are not supported in Okera:
- MySQL: year
- Oracle: array, array(array), map(string,string), timestamp with time zone, and timestamp with local time zone
- PostgreSQL: array, array(array), json, map(string,string), time, uuid, and xml
- Snowflake: array, object, time, and variant