JDBC Data Source

This document outlines the ODAS support for JDBC data sources.

Introduction

JDBC support makes data stored in relational database systems available through the Okera Platform. The JDBC data sources supported out of the box are:

  • MySQL (5.6, 5.7)
  • PostgreSQL (9.6)
  • Snowflake
  • Amazon Redshift (using the PostgreSQL driver)

ODAS utilizes the drivers provided for JDBC sources to connect to the underlying database. The driver is identified by the driver property specified as part of the connection properties. Although different drivers are used, the functionality remains the same: users can perform the same DDL and scan operations across all supported JDBC data sources.

The JDBC driver versions supported are:

  • MySQL : 8.0.11
  • PostgreSQL : 9.0-801.jdbc4
  • Snowflake : 3.6.8

Note Amazon Redshift is currently supported via the PostgreSQL driver. Amazon Redshift uses PostgreSQL 8.0.2 in OLAP mode, and hence the PostgreSQL driver works for Redshift as well. Reference: https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html

Getting Started

Configuration

Connection properties for a database can be provided in a file located in a secure place, for example, an S3 folder that is secured for access and accessible from the application. A sample properties file, s3://okera/mysql-prod.properties, would be:

driver=mysql
host=<A cloud url or host accessible from the application>
port=<Port to connect to the DB>
user=<username>
password=<password>
jdbc.db.name=<Default DB to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used. MySQL does not support schemas>
table=<Default table to query from>
fetchSize=100000

Note The jdbc.schema.name property is not required for MySQL, but it is a required field for Snowflake and PostgreSQL.

fetchSize Use this parameter to control the number of rows retrieved per batch.

A sample properties file for PostgreSQL/Redshift would be:

driver=postgresql
host=<A cloud url or host accessible from the application>
port=<Port to connect to the DB>
user=<username>
password=<password>
jdbc.db.name=<Default DB to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used>
table=<Default table to query from>
fetchSize=100000

A sample properties file for Snowflake would be:

driver=snowflake
host=<snowflake account>.snowflakecomputing.com
port=<n/a for snowflake>
db=SNOWFLAKE_SAMPLE_DATA
user=<username>
password=<password>
account=<snowflake account>
jdbc.db.name=<Default DB to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used>
table=<Default table to query from>

Note For Snowflake, the db parameter is mandatory. It should be set to the default database for the connection.

Initializing Metastore Information

ODAS supports special SQL commands to transfer the metadata for JDBC database and table objects into its metastore. The following SQL statements can be used to get started quickly in an automated way. Using these commands also ensures that the data types of the JDBC source are mapped correctly to Okera Schema Registry data types.

The CREATE DATABASE command supports specifying properties using the DBPROPERTIES keyword. For DBPROPERTIES you specify a map of key/value pairs, similar to TBLPROPERTIES at the table level. Using this map, users can specify the file containing the connection credentials needed to connect to the JDBC source. Setting the properties at the database level is required in order to pull the metadata for multiple tables, as explained in the subsequent commands. The matching ALTER DATABASE command supports adding and modifying DBPROPERTIES entries, as sketched after the CREATE DATABASE example below.

Example: Create a database object with JDBC related properties

CREATE DATABASE marketingdb
DBPROPERTIES (
 'driver' = 'mysql',
 'credentials.file' = 's3://<bucket>/<object>',
 'jdbc.db.name' = 'marketingdb')
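
To later add or modify these entries, use the matching ALTER DATABASE command. The following is a minimal sketch assuming Hive-style SET DBPROPERTIES syntax; the new credentials path is a placeholder:

Example: Modify the JDBC properties of an existing database object

ALTER DATABASE marketingdb
SET DBPROPERTIES (
 'credentials.file' = 's3://<bucket>/<new-object>')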

The next command is used to bulk-load table definitions from the JDBC source database. The command can be used to:

  • Load table definitions for the first time
  • Add additional table definitions to an existing schema
  • Optionally overwrite entire table definitions

Example: Load JDBC table schemas into the ODAS registry

ALTER DATABASE marketingdb LOAD DEFINITIONS (transactions_schemaed, users);

The above command loads table definitions from the JDBC source into the Okera Schema Registry. This can be verified with the DESCRIBE command.

Example: Verify that the JDBC schema has been added to the ODAS registry

DESCRIBE marketingdb.users;
  userid	int
  fname	varchar(50)
  lname	varchar(50)
  fullname	varchar(101)
  emailid	varchar(150)

Running the same command again skips the existing table definitions. However, if you wish to refresh the entire table definition from the database source, use the OVERWRITE option.

Example: Overwrite the existing definition for a table

ALTER DATABASE marketingdb LOAD DEFINITIONS (users) OVERWRITE;

Recover Column Definitions

The JDBC table definition can be changed at the column level by using the ALTER TABLE <tableName> RECOVER COLUMNS command. This is especially useful for adding columns to or replacing existing definitions on a per-table basis.

Example: Add a new column to the existing definition

ALTER TABLE marketingdb.users RECOVER COLUMNS (country) ADD;
DESCRIBE marketingdb.users;
  userid	int
  fname	varchar(50)
  lname	varchar(50)
  fullname	varchar(101)
  emailid	varchar(150)
  country	varchar(100)

To remove existing columns and add specific columns, use the REPLACE option.

Example: Replace an existing column definition with a selective new one

ALTER TABLE marketingdb.users RECOVER COLUMNS (userid) REPLACE;
DESCRIBE marketingdb.users;
  userid	int

Passing an empty list of columns behaves similarly to ALTER DATABASE LOAD DEFINITIONS OVERWRITE for that specific table.

Example: Replace existing columns with the ones available at the source

ALTER TABLE marketingdb.users RECOVER COLUMNS () REPLACE;
DESCRIBE marketingdb.users;
  userid	int
  fname	varchar(50)
  lname	varchar(50)
  fullname	varchar(101)
  emailid	varchar(150)
  country	varchar(100)

Scan Data

ODAS supports scanning JDBC data much like other data sources.

Example: Scan a table using the REST server

$ curl REST_SERVER:PORT/api/scan/marketingdb.users

[
    {
        "userid": 4,
        "fname": "Kevin",
        "lname": "Kenneth",
        "fullname": "Kevin Kenneth",
        "emailid": "Kevin.Kenneth@sample.com"
    },
    {
        "userid": 5,
        "fname": "Marie",
        "lname": "Donald",
        "fullname": "Marie Donald",
        "emailid": "Marie.Donald@sample.com"
    },
...
]
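
JDBC-backed datasets can also be scanned with SQL like any other registered table. A minimal sketch, reusing the users table from the earlier examples (the filter value is taken from the sample data above):

Example: Scan a table using SQL

SELECT * FROM marketingdb.users WHERE userid = 4;

4|Kevin|Kenneth|Kevin Kenneth|Kevin.Kenneth@sample.com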

Other JDBC Operations

Besides regular scans, you can also include JDBC-based datasets in more complex SQL queries.

Example: Query that joins data between two JDBC tables

SELECT t.*, u.*
FROM marketingdb.transactions_schemaed t
LEFT JOIN marketingdb.users u ON (u.userid = t.userid)
WHERE t.txnid = 304;

304|10/26/2016 08:59 PM|sku20|34225|100.000000|9821-5981-3318-9164|819.427.617.534|AU|34225|Melissa|Kenneth|Melissa Kenneth|Melissa.Kenneth@sample.com

Example: Combine two tables from the same JDBC source

SELECT *
FROM marketingdb.transactions_schemaed
WHERE txnid = 304
UNION ALL
  SELECT *
  FROM marketingdb.transactions_schemaed_demo
  WHERE txnid = 376;

304|10/26/2016 08:59 PM|sku20|34225|100.000000|9821-5981-3318-9164|819.427.617.534|AU
376|12/18/2016 08:39 AM|sku19|51145|150.000000|3360-2156-3415-5155|224.872.343.772|FR

ODAS also supports UNION ALL operations between JDBC and S3 data sources. For example, if the data is spread across JDBC and S3, a UNION ALL query can retrieve it from both sources in the same combined result set.

Example: Combine two datasets from different storage systems

SELECT j.txnid, j.dt_time, j.sku, j.userid, j.price, j.creditcard, j.ip
FROM marketingdb.transactions_schemaed_demo j
WHERE j.txnid = 304
UNION ALL
  SELECT *
  FROM demo_test.transactions s3t
  WHERE s3t.txnid = 98314894;

304|10/26/2016 08:59 PM|sku20|34225|100.000000|9821-5981-3318-9164|819.427.617.534
98314894|03/13/2016 10:41 PM|sku47|22505|960.000000|9018-5233-8275-5887|105.424.939.543

Advanced Usage

Users often want to run complex queries against JDBC data sources. One way to run such repeated queries is to persist the SQL as a view in the database and refer to it as a view in ODAS. However, this can create many bloated views in the JDBC data source, and users who only have SELECT permissions on the database cannot create views there for this purpose.

ODAS therefore supports persisting these INLINE VIEWs as part of the catalog table definitions. Users can create tables with a backing inline view in the form of a SQL string, as part of the CREATE TABLE command. During scan operations, ODAS automatically replaces the table name with the SQL string specified during table creation. The SQL string can be written in a form that is specific to the underlying database; for example, if the underlying database is MySQL, the SQL string can be any statement compliant with the supported MySQL version. The table name provided as part of this command is used as the alias of the resulting inline view of the converted SQL. Additionally, this removes the need to load all table definitions into the catalog.

Note: This feature is ONLY available for JDBC data source tables.

Example: Create a catalog table backed by a SQL string (inline view)

CREATE EXTERNAL TABLE marketingdb.table_using_jdbc_complex_query
STORED AS JDBC
TBLPROPERTIES (
 'credentials.file' = 's3://<bucket>/<object>',
 'driver' = 'mysql')
USING VIEW AS
'SELECT t.*, u.fname, lname, fullname, emailid FROM marketingdb.transactions_schemaed t
LEFT JOIN marketingdb.users u ON (u.userid = t.userid)'

Note: The DESCRIBE operation on the table displays the full definition of the resulting schema.

DESCRIBE marketingdb.table_using_jdbc_complex_query;
  txnid  bigint
  dt_time  varchar(30)
  sku  varchar(40)
  userid  int
  price  double
  creditcard  varchar(40)
  ip  varchar(40)
  region  varchar(2)
  fname  varchar(50)
  lname  varchar(50)
  fullname  varchar(101)
  emailid  varchar(150)

The extended DESCRIBE FORMATTED command also shows, as part of the TBLPROPERTIES key/value list, the actual query that is used during query execution.

Example: Retrieve the extended definition for a dataset

DESCRIBE FORMATTED marketingdb.table_using_jdbc_complex_query;

<Table Column info>
...
Table Parameters:	NULL	NULL
  	EXTERNAL            	TRUE
  	credentials.file    	s3://<bucket>/<object>
  	driver              	mysql
  	jdbc.query          	SELECT t.*, u.fname, lname, fullname, emailid FROM marketingdb.transactions_schemaed t
                                LEFT JOIN marketingdb.users u ON (u.userid = t.userid)
  	mapred.input.table.name	marketingdb.table_using_jdbc_complex_query

Scan Datasets with Inline Views

Scan operations are similar to those on regular tables. WHERE clauses can be applied to the table as usual.

Example: Scanning a dataset with an inline view

SELECT *
FROM marketingdb.table_using_jdbc_complex_query
WHERE txnid = 304;

Internally, the actual SQL is converted into the following:

SELECT *
FROM (
  SELECT t.*, u.fname, lname, fullname, emailid
  FROM marketingdb.transactions_schemaed t
  LEFT JOIN marketingdb.users u ON (u.userid = t.userid)
  ) AS table_using_jdbc_complex_query
WHERE txnid = 304;

The table can be used for complex joins with other tables as well.

Note: This example is for illustration purposes only; we join against the same users table to demonstrate the capability.

Example: Perform a complex join on datasets with inline views

SELECT n.*
FROM marketingdb.table_using_jdbc_complex_query n
JOIN marketingdb.users u ON (u.userid = n.userid)
WHERE n.txnid = 376;

376|12/18/2016 08:39 AM|sku19|51145|150.000000|3360-2156-3415-5155|224.872.343.772|FR|Stephanie|Maria|Stephanie Maria|Stephanie.Maria@sample.com

Data Type Mapping

The following table shows the mapping between Okera and JDBC source data types.

Okera          | MySQL                                    | PostgreSQL                 | Snowflake                              | Redshift (using PostgreSQL driver)
boolean        | BOOL, BOOLEAN                            | BOOL, BOOLEAN, BIT         | BOOL, BOOLEAN                          | BOOL, BOOLEAN
tinyint        | TINYINT                                  |                            |                                        |
smallint       | SMALLINT                                 | SMALLINT                   |                                        | SMALLINT, INT2
int            | MEDIUMINT, INT                           | INT, SERIAL                |                                        | INTEGER, INT, INT4
bigint         | BIGINT                                   | BIGINT, BIGSERIAL          | SMALLINT, INT, INTEGER, BIGINT, NUMBER | BIGINT, INT8
double         | DOUBLE                                   | DOUBLE, MONEY, REAL        | DOUBLE, REAL                           | DOUBLE, FLOAT8, FLOAT, REAL, FLOAT4
decimal(p,s)   | FLOAT(p,s), DECIMAL(p,s)                 | NUMERIC(p,s), DECIMAL(p,s) | NUMERIC(p,s), DECIMAL(p,s)             | DECIMAL(p,s), NUMERIC
char(N)        | CHAR(N)                                  | CHAR(N)                    |                                        | CHAR, CHARACTER, NCHAR, BPCHAR
char(2)        | ENUM                                     |                            |                                        |
char(6)        | SET                                      |                            |                                        |
varchar(N)     | VARCHAR(N)                               | VARCHAR(N)                 | CHAR(N), VARCHAR(N)                    | CHARACTER VARYING, NVARCHAR, TEXT
varchar(255)   | TINYTEXT                                 |                            |                                        |
varchar(65535) | TEXT                                     |                            |                                        |
string         | MEDIUMTEXT, LONGTEXT, BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB, BINARY(N), VARBINARY(N) | TEXT | STRING, TEXT, BINARY(N), VARBINARY(N) |
timestamp      | TIMESTAMP, TIME, DATETIME                | TIMESTAMP                  | TIMESTAMP                              | TIMESTAMP

Unsupported MySQL types: DATE, YEAR

Unsupported PostgreSQL types: DATE, JSON, TIME, UUID and XML

Unsupported Snowflake types: DATE, TIME, VARIANT, OBJECT and ARRAY
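
For illustration, consider a hypothetical MySQL source table; the table and column names here are assumptions for the example, not part of the product:

CREATE TABLE orders (      -- executed on the MySQL source
  id MEDIUMINT,            -- surfaces in ODAS as int
  note TINYTEXT,           -- surfaces in ODAS as varchar(255)
  created DATETIME         -- surfaces in ODAS as timestamp
);

After running ALTER DATABASE marketingdb LOAD DEFINITIONS (orders), describing the table in ODAS would show the mapped Okera types:

DESCRIBE marketingdb.orders;
  id	int
  note	varchar(255)
  created	timestamp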