JDBC Data Source

This document outlines ODAS support for JDBC data sources.

Introduction

JDBC support makes data stored in relational database systems available through the Okera Platform. The following JDBC data sources are supported out of the box:

  • MySQL (5.6, 5.7)
  • PostgreSQL (9.6, 10.6)
  • Snowflake
  • Amazon Redshift
  • AWS Athena (Simba JDBC driver)
  • Microsoft SQL Server
  • Sybase

ODAS uses the drivers provided for JDBC sources to connect to the underlying database. The driver is identified by the driver property specified as part of the connection properties. Although different drivers are used, the functionality remains the same: users can perform the same DDL and scan operations across all supported JDBC data sources.

The JDBC driver versions supported are:

  • MySQL : 8.0.11
  • PostgreSQL : 42.2.8
  • Snowflake : 3.6.8
  • AWS Athena (Simba JDBC driver) : 2.0.2
  • Redshift : 1.2.34.1058
  • SQL Server : 7.4.1.jre8
  • jTDS : 1.3.1

Getting Started

Configuration

Connection properties for a database can be provided in a file located in a secure place, for example an S3 folder that is locked down but accessible from the application. A sample properties file, say at the S3 location s3://okera/mysql-prod.properties, would be:

driver=mysql
host=<A cloud url or host accessible from the application>
port=<Port to connect to the DB>
user=<username>
password=<password>
jdbc.db.name=<Default catalog to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used. MySQL does not support schemas>
defaultdb=<The default database to connect to. Typically same as jdbc.db.name>
fetchSize=<Fetchsize to batch and retrieve data from JDBC data source. Default 100000>

Note

The jdbc.schema.name parameter is not required for MySQL, but it is required for Snowflake, PostgreSQL and Redshift.

fetchSize: Use this parameter to control the batch size when retrieving data from the JDBC data source.

A sample properties file for PostgreSQL would be:

driver=postgresql
host=<A cloud url or host accessible from the application>
port=<Port to connect to the DB>
user=<username>
password=<password>
jdbc.db.name=<Default catalog to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Specify the schema name>
defaultdb=<The default database to connect to. Typically same as jdbc.db.name>
fetchSize=100000
ssl=<Optional. Valid values: true|false>
sslrootcert=<Required if ssl=true. A cloud location of the root CA certificate. For AWS RDS PostgreSQL, this value would be s3://rds-downloads/rds-ca-2015-root.pem>

A sample properties file for Snowflake would be:

driver=snowflake
host=<snowflake account>.snowflakecomputing.com
port=<n/a for snowflake>
db=SNOWFLAKE_SAMPLE_DATA
user=<username>
password=<password>
account=<Required. Snowflake account>
jdbc.db.name=<Default catalog to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used>
defaultdb=<The default database to connect to. Typically same as jdbc.db.name>

Note

For Snowflake, the 'db' parameter is mandatory. It should be set to the default database for the connection.

A sample properties file for AWS Redshift would be:

driver=redshift
host=[InstanceName].[Region].redshift.amazonaws.com
port=<port number>
user=<username>
password=<password>
jdbc.db.name=<Database to connect to>
defaultdb=<Default database to connect to>
jdbc.schema.name=public

A sample properties file for AWS Athena would be:

driver=awsathena
host=athena.[Region].amazonaws.com
port=443
db=<defaultdb>
user=<awsAccessKey>
password=<awsSecretKey>
jdbc.db.name=<Default catalog to connect to. Note, skipping this value can result in errors. The default is 'AwsDataCatalog'>
jdbc.schema.name=<Optional. Specify the schema name if used. This is referred to as the database in AWS Athena>
S3OutputLocation=<The s3 default output path, can be found in awsathena settings>
defaultdb=<The default database to connect to. Typically same as jdbc.db.name>

Note

For AWS Athena, the jdbc.db.name setting is the catalog name, which typically defaults to AwsDataCatalog, and jdbc.schema.name is the actual Athena database name.

A sample properties file for jTDS driver based JDBC data sources would be:

driver=jtds:<sybase/sqlserver>
host=<host>
port=<port>
defaultdb=<defaultDb>
user=<userName>
password=<password>
fetchSize=<fetchSize>
jdbc.db.name=<jdbcDbName>
jdbc.schema.name=<schema>

Initializing Metastore Information

Note

The following commands are currently supported only from the Okera UI Workspace.

ODAS supports special SQL commands to initialize metadata from JDBC data source tables into its metastore. The following SQL statements show how to get started with the LOAD DEFINITIONS command. Using these commands also ensures that the data types of the JDBC data source are mapped correctly to Okera Schema Registry data types.

The CREATE DATABASE command supports specifying properties using the DBPROPERTIES keyword. DBPROPERTIES takes a map of key/value pairs, similar to TBLPROPERTIES at the table level. Using this map, users can specify the file containing the connection credentials needed to connect to the JDBC source. Setting the properties at the database level is required in order to pull metadata for multiple tables, as explained in the subsequent commands. The matching ALTER DATABASE command, sketched after the example below, supports adding and modifying DBPROPERTIES entries.

Example: Create a database object with JDBC related properties

CREATE DATABASE marketingdb
DBPROPERTIES (
 'credentials.file' = 's3://<bucket>/<object>',
 'jdbc.db.name' = 'marketingdb')
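
Example: Modify the JDBC related properties on an existing database object. The following is a minimal sketch that assumes Hive-style SET DBPROPERTIES syntax; the new credentials file location is a placeholder.

ALTER DATABASE marketingdb SET DBPROPERTIES (
 'credentials.file' = 's3://<bucket>/<new-object>')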

The next command is used to bulk load table definitions from the JDBC source database. The command can be used to:

  • Load table definitions for the first time
  • Add table definitions to existing metadata in the ODAS catalog.
  • Optionally overwrite entire table definitions from the JDBC data source in the ODAS catalog.

Example: Load JDBC table definitions into the ODAS registry. The command takes a list of JDBC data source table names.

ALTER DATABASE marketingdb LOAD DEFINITIONS (transactions_schemaed, users);

The above command loads the table definitions for the tables transactions_schemaed and users from the JDBC source into the Okera Schema Registry. This can be verified with the DESCRIBE command.

Example: Verify that the JDBC source table has been registered to the ODAS registry

DESCRIBE marketingdb.users;
  userid    int
  fname varchar(50)
  lname varchar(50)
  fullname  varchar(101)
  emailid   varchar(150)

Running the same command again skips the existing table definitions. However, if you wish to refresh the entire table definition from the JDBC source, use the OVERWRITE option.

Example: Overwrite the existing definition for a table

ALTER DATABASE marketingdb LOAD DEFINITIONS (users) OVERWRITE;

Not specifying any table names in the command loads all table definitions from the JDBC data source. Example:

ALTER DATABASE marketingdb LOAD DEFINITIONS () OVERWRITE;

Recover Column Definitions

The JDBC table definition can be changed at the column level by using the ALTER TABLE <tableName> RECOVER COLUMNS command. This is especially useful for adding to or replacing the existing definition of a table at the column level.

Example: Add a new column to the existing definition

ALTER TABLE marketingdb.users RECOVER COLUMNS (country) ADD;
DESCRIBE marketingdb.users;
  userid    int
  fname varchar(50)
  lname varchar(50)
  fullname  varchar(101)
  emailid   varchar(150)
  country varchar(100)

To remove existing columns and add specific columns, use the REPLACE option.

Example: Replace an existing column definition with a selective new one

ALTER TABLE marketingdb.users RECOVER COLUMNS (userid) REPLACE;
DESCRIBE marketingdb.users;
  userid    int

Passing an empty list of columns behaves similarly to ALTER DATABASE LOAD DEFINITIONS OVERWRITE for that specific table.

Example: Replace existing columns with the ones available at the source

ALTER TABLE marketingdb.users RECOVER COLUMNS () REPLACE;
DESCRIBE marketingdb.users;
  userid    int
  fname varchar(50)
  lname varchar(50)
  fullname  varchar(101)
  emailid   varchar(150)
  country varchar(100)

Scan Data

ODAS supports scanning JDBC data much like other data sources.

Example: Scan a table using the REST server

$ curl REST_SERVER:PORT/api/scan/marketingdb.users

[
    {
        "userid": 4,
        "fname": "Kevin",
        "lname": "Kenneth",
        "fullname": "Kevin Kenneth",
        "emailid": "Kevin.Kenneth@sample.com"
    },
    {
        "userid": 5,
        "fname": "Marie",
        "lname": "Donald",
        "fullname": "Marie Donald",
        "emailid": "Marie.Donald@sample.com"
    },
.....
]

Filter Pushdown to the JDBC Data Source

Okera also supports pushing filters down to the JDBC data source. End users still use the SQL dialect supported by Okera; however, if the filters are supported in ANSI SQL, they are pushed down to the data source. For example, consider the following EXPLAIN statement:

EXPLAIN select * from jdbc_filter_push_scan_test.scan_key_test
WHERE ((scankey = 1 and scankeytestdata = 'One') OR (scankey = 2 and scankeytestdata = 'Two')) AND scankey > 3

When executed, the output would contain something like:

00:SCAN HDFS [jdbc_filter_push_scan_test.scan_key_test]
     partitions=1/1 files=1 size=1B
     jdbc predicates: ((scanKey = 1 AND scanKeyTestData = 'One') OR (scanKey = 2 AND scanKeyTestData = 'Two')), scanKey > 3

This indicates that predicates supported by the JDBC data source will be pushed down; that is, the query sent to the JDBC data source will have a WHERE clause with these filters. This speeds up query execution considerably and allows end users to leverage the JDBC data source as a SQL evaluation engine, while still using Okera as the access engine and joining these results with results from other data sources.
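
For illustration only, the query issued against the JDBC source would then look roughly like the following sketch; the exact SQL generated by ODAS may differ:

SELECT *
FROM scan_key_test
WHERE ((scanKey = 1 AND scanKeyTestData = 'One')
    OR (scanKey = 2 AND scanKeyTestData = 'Two'))
  AND scanKey > 3;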

Scan records in parallel

Data from JDBC data sources is scanned in parallel, provided a valid numeric field is specified in the TBLPROPERTIES under the key mapred.jdbc.scan.key. Note that if the JDBC source table has a single numeric field as its primary key, the mapred.jdbc.scan.key property is auto-populated with that field when the catalog table is created. However, users can override this value by specifying another column of the table that is of a numeric type. Internally, the query to the JDBC data source is split into queries over ranges of 100K values of the mapred.jdbc.scan.key column. The range size (fetch size) can be controlled by the fetchSize property in the properties file.
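
Example: Override the scan key for a table. The following is a minimal sketch that assumes Hive-style SET TBLPROPERTIES syntax; the column userid is only an illustration and must be a numeric column of the table.

ALTER TABLE marketingdb.users SET TBLPROPERTIES ('mapred.jdbc.scan.key' = 'userid');

With the default fetchSize of 100000, ODAS would then issue range queries against the source roughly of the form SELECT ... FROM users WHERE userid >= 0 AND userid < 100000, allowing the ranges to be scanned in parallel.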

Note

The scan key concept might not apply to all data sources, for example AWS Athena.

Other JDBC Operations

Besides regular scans, you can also include JDBC-based datasets in more complex SQL queries.

Example: Query that joins data between two JDBC tables

SELECT t.*, u.*
FROM marketingdb.transactions_schemaed t
LEFT JOIN marketingdb.users u ON (u.userid = t.userid)
WHERE t.txnid = 304;

304|10/26/2016 08:59 PM|sku20|34225|100.000000|9821-5981-3318-9164|819.427.617.534|AU|34225|Melissa|Kenneth|Melissa Kenneth|Melissa.Kenneth@sample.com

Example: Combine two tables from the same JDBC source

SELECT *
FROM marketingdb.transactions_schemaed
WHERE txnid = 304
UNION ALL
  SELECT *
  FROM marketingdb.transactions_schemaed_demo
  WHERE txnid = 376;

304|10/26/2016 08:59 PM|sku20|34225|100.000000|9821-5981-3318-9164|819.427.617.534|AU
376|12/18/2016 08:39 AM|sku19|51145|150.000000|3360-2156-3415-5155|224.872.343.772|FR

ODAS also supports UNION ALL operations between JDBC and S3 data sources. For example, if the data is spread across JDBC and S3, we can run a UNION ALL query to retrieve it from the two sources in the same combined result set.

Example: Combine two datasets from different storage systems

SELECT j.txnid, j.dt_time, j.sku, j.userid, j.price, j.creditcard, j.ip
FROM marketingdb.transactions_schemaed_demo j
WHERE j.txnid = 304
UNION ALL
  SELECT *
  FROM demo_test.transactions s3t
  WHERE s3t.txnid = 98314894;

304|10/26/2016 08:59 PM|sku20|34225|100.000000|9821-5981-3318-9164|819.427.617.534
98314894|03/13/2016 10:41 PM|sku47|22505|960.000000|9018-5233-8275-5887|105.424.939.543

Advanced Usage

Users often want to run complex queries on JDBC data sources. One way to run such queries repeatedly is to persist the SQL as a view in the database and refer to it as a view in ODAS. However, this could create many bloated views in the JDBC data source, and users who only have SELECT permissions on the database would not be able to create views for this purpose. Instead, ODAS supports persisting these INLINE VIEWs as part of the catalog table definitions. Users can create tables with a backing inline view, in the form of a SQL string, as part of the CREATE TABLE command. During scan operations, ODAS automatically replaces the table name with the SQL string specified during table creation. The SQL string can be written in a form that is specific to the database; for example, if the underlying database is MySQL, the SQL string can be any SQL statement compliant with the supported MySQL version. The table name provided as part of this command is used as the alias of the resulting inline view of the converted SQL. Additionally, this removes the need to load all table definitions into the catalog.

Note: This feature is ONLY available for JDBC data source tables.

Example: Create a catalog table backed by a SQL string (inline view)

CREATE EXTERNAL TABLE marketingdb.table_using_jdbc_complex_query
STORED AS JDBC
TBLPROPERTIES (
 'credentials.file' = 's3://<bucket>/<object>',
 'driver' = 'mysql')
USING VIEW AS
'SELECT t.*, u.fname, lname, fullname, emailid FROM marketingdb.transactions_schemaed t
LEFT JOIN marketingdb.users u ON (u.userid = t.userid)'

Note: The DESCRIBE operation on the table displays the full definition of the resulting schema.

DESCRIBE marketingdb.table_using_jdbc_complex_query;
  txnid  bigint
  dt_time  varchar(30)
  sku  varchar(40)
  userid  int
  price  double
  creditcard  varchar(40)
  ip  varchar(40)
  region  varchar(2)
  fname  varchar(50)
  lname  varchar(50)
  fullname  varchar(101)
  emailid  varchar(150)

The extended DESCRIBE FORMATTED command also shows, as part of the TBLPROPERTIES key/value list, the actual query that will be used during query execution.

Example: Retrieve the extended definition for a dataset

DESCRIBE FORMATTED marketingdb.table_using_jdbc_complex_query;

<Table Column info>
...
Table Parameters:   NULL    NULL
    EXTERNAL                TRUE
    credentials.file        s3://<bucket>/<object>
    driver                  mysql
    jdbc.query              SELECT t.*, u.fname, lname, fullname, emailid FROM jdbc_demo.transactions_schemaed t
                                LEFT JOIN jdbc_demo.users u ON (u.userid = t.userid)
    mapred.input.table.name marketingdb.table_using_jdbc_complex_query

Scan Datasets with Inline Views

Scan operations are similar to those on regular tables; WHERE clauses can be applied to the table as usual.

Example: Scan a dataset with an inline view

SELECT *
FROM marketingdb.table_using_jdbc_complex_query
WHERE txnid = 304;

Internally, the actual SQL is converted into the following:

SELECT *
FROM (
  SELECT t.*, u.fname, lname, fullname, emailid
  FROM marketingdb.transactions_schemaed t
  LEFT JOIN marketingdb.users u ON (u.userid = t.userid)
  ) AS table_using_jdbc_complex_query
WHERE txnid = 304;

The table can be used for complex joins with other tables as well.

Note: This is for illustration purposes only; we join to the same users table to demonstrate the capability.

Example: Perform a complex join on datasets with inline views

SELECT n.*
FROM marketingdb.table_using_jdbc_complex_query n
JOIN marketingdb.users u ON (u.userid = n.userid)
WHERE n.txnid = 376;

376|12/18/2016 08:39 AM|sku19|51145|150.000000|3360-2156-3415-5155|224.872.343.772|FR|Stephanie|Maria|Stephanie Maria|Stephanie.Maria@sample.com

Data Types Mapping

The following table shows the mapping between Okera and JDBC source data types.

Okera | MySQL | PostgreSQL | Snowflake | Redshift | AWS Athena | SQL Server | Sybase
------|-------|------------|-----------|----------|------------|------------|-------
boolean | BOOL, BOOLEAN | BOOL, BOOLEAN, BIT | BOOL, BOOLEAN | BOOL, BOOLEAN | BOOLEAN | |
tinyint | TINYINT | | | | TINYINT | TINYINT | TINYINT
smallint | SMALLINT | SMALLINT | | SMALLINT, INT2 | SMALLINT | SMALLINT | SMALLINT
int | MEDIUMINT, INT | INT, SERIAL | | INTEGER, INT, INT4 | INT | INT | INT
bigint | BIGINT | BIGINT, BIGSERIAL | SMALLINT, INT, INTEGER, BIGINT, NUMBER | BIGINT, INT8 | BIGINT | BIGINT | BIGINT
double | DOUBLE | DOUBLE, MONEY, REAL | DOUBLE, REAL | DOUBLE, FLOAT8, FLOAT, REAL, FLOAT4 | DOUBLE | REAL | REAL
decimal(p,s) | FLOAT(p,s), DECIMAL(p,s) | NUMERIC(p,s), DECIMAL(p,s) | NUMERIC(p,s), DECIMAL(p,s) | DECIMAL(p,s), NUMERIC | DECIMAL(p,s) | DECIMAL(p,s) | DECIMAL(p,s)
char(N) | CHAR(N) | CHAR(N) | | CHAR, CHARACTER, NCHAR, BPCHAR | CHAR | CHAR(N) | CHAR(N)
char(2) | ENUM | | | | | |
char(6) | SET | | | | | |
varchar(N) | VARCHAR(N) | VARCHAR(N) | CHAR(N), VARCHAR(N) | CHARACTER VARYING, NVARCHAR, TEXT | VARCHAR | VARCHAR(N) | VARCHAR(N)
varchar(255) | TINYTEXT | | | | STRING | |
varchar(65355) | TEXT | | | | | |
string | MEDIUMTEXT, LONGTEXT, BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, BINARY(N), VARBINARY(N) | TEXT | STRING, TEXT, BINARY(N), VARBINARY(N) | | | TEXT, TIMESTAMP, BINARY(N), VARBINARY(N) | TEXT, TIMESTAMP, BINARY(N), VARBINARY(N)
timestamp | TIMESTAMP, TIME, DATETIME | TIMESTAMP | TIMESTAMP | TIMESTAMP | TIMESTAMP | DATETIME | DATETIME
date | DATE | DATE | DATE | DATE | DATE | DATE | DATE

Unsupported MySQL types: YEAR

Unsupported PostgreSQL types: JSON, TIME, UUID and XML

Unsupported Snowflake types: TIME, VARIANT, OBJECT and ARRAY