Getting started with JDBC Data Sources

This document outlines Okera's support for JDBC Data Sources and how to get started with connecting to data.

Okera supports connecting to several JDBC data sources out of the box, including MySQL, PostgreSQL, Snowflake, Redshift, and AWS Athena.

You can also connect to a custom JDBC data source by providing your own JDBC driver.

Getting Started

There are three main steps to get started with JDBC data sources (a compact end-to-end sketch follows this list):

  1. Create a connection to the data source
  2. Link that connection to an Okera virtual database
  3. Load tables from that Data Source into the Okera database
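
As a quick orientation, here is a minimal end-to-end sketch of the three steps, using the example connection and database that the sections below cover in detail (all names and property values are illustrative):

-- Step 1: create a connection to the data source
CREATE DATACONNECTION snowflake_connection CXNPROPERTIES
(
  'connection_type' = 'JDBC',
  'driver' = 'snowflake',
  'host' = 'my-host.com',
  'port' = '3306',
  'user' = 'awsps:///my-username',
  'password' = 'awsps:///my-password',
  'account' = 'okera123', -- Snowflake only
  'defaultdb' = 'OKERA_DEMO'
);

-- Step 2: link that connection to an Okera virtual database
CREATE DATABASE snowflake_db DBPROPERTIES
(
  'okera.connection.name' = 'snowflake_connection'
);

-- Step 3: load all table definitions from the data source
ALTER DATABASE snowflake_db LOAD DEFINITIONS ();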

Connecting to a Data Source

Create a connection

Note

Only catalog admins have access to create connections.

Create a DATACONNECTION object with your connection properties. For the specific properties each database requires, see the page for that database.

Example: creating a connection to a Snowflake database

CREATE DATACONNECTION snowflake_connection CXNPROPERTIES
(
  'connection_type' = 'JDBC',
  'driver' = 'snowflake',
  'host' = 'my-host.com',
  'port' = '3306',
  'user' = 'awsps:///my-username',
  'password' = 'awsps:///my-password',
  'account' = 'okera123', -- Snowflake only
  'defaultdb' = 'OKERA_DEMO',
  'default_schema' = 'marketing' -- optional
);

defaultdb is the source JDBC database to which you want to establish a connection.

Note

Some data sources are case-sensitive, so specify your database and schema names using the correct case.

Providing secure credentials

Sensitive credentials cannot be provided in plain text. Instead, they must be supplied from a secret store: either a local source such as Kubernetes secrets, or a secure cloud service such as AWS Secrets Manager or AWS Systems Manager Parameter Store. For auditability, it is recommended that you create a new system user for Okera in your underlying database and use that user's credentials in your Okera connection. Note that this system user needs read access to your data.
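
For illustration, here is a minimal sketch of creating such a read-only system user in MySQL (the user name, host pattern, and database name are illustrative; adapt this to your own database engine):

-- Create a dedicated system user for Okera (illustrative names)
CREATE USER 'okera_system'@'%' IDENTIFIED BY '<strong-password>';
-- Grant read-only access on the database Okera will connect to
GRANT SELECT ON OKERA_DEMO.* TO 'okera_system'@'%';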

Supported secure credential stores:

  • Kubernetes secrets
  • AWS Secrets Manager
  • AWS Systems Manager Parameter Store

Note

If using AWS Systems Manager Parameter Store or AWS Secrets Manager, you will need to provide the correct IAM credentials to access your secrets.

Editing your connection

Warning

Altering a connection will impact access to all tables that have been registered via that connection, and should be done carefully.

Example: altering the default database for a connection. You must specify all the parameters required to create the connection, amending only the ones you want to change.

ALTER DATACONNECTION snowflake_connection CXNPROPERTIES
(
  'connection_type' = 'JDBC',
  'driver' = 'snowflake',
  'host' = 'my-host.com',
  'port' = '3306',
  'user' = 'awsps:///my-username',
  'password' = 'awsps:///my-password',
  'account' = 'okera123', -- Snowflake only
  'defaultdb' = 'NYTAXI',
  'default_schema' = 'marketing' -- optional
);

Dropping your connection

Note

Dropping a connection impacts the accessibility of any tables registered from that connection. For this reason, a connection can only be dropped if it has no associated tables.
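
A minimal sketch of the drop statement, assuming the DROP form mirrors the CREATE and ALTER statements above:

DROP DATACONNECTION snowflake_connection;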

Properties file (old way)

Create a file with the connection properties for your database and store it in a secure location. A sample properties file, for example at the S3 location s3://okera/mysql-prod.properties, would be:

driver=mysql
host=<A cloud URL or host accessible from the application>
port=<Port to connect to the DB>
user=awsps:///my-username
password=awsps:///my-password
jdbc.db.name=<Default catalog to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used. MySQL does not support schemas>
defaultdb=<The default database to connect to. Typically the same as jdbc.db.name>
fetchSize=<Fetch size used to batch and retrieve data from the JDBC data source. Default 100000>

Note

The jdbc.schema.name property is not required for MySQL, but it is required for Snowflake, PostgreSQL, and Redshift.

Create a new Okera database

CREATE DATABASE snowflake_db
DBPROPERTIES (
  'okera.connection.name' = 'snowflake_connection'
);

Connect an existing Okera database

ALTER DATABASE snowflake_db SET DBPROPERTIES
(
  'okera.connection.name' = 'snowflake_connection'
);

Note

To run the commands above, you need permission to create a database and alter its properties.

Register data from your connection

After creating the database, you need to load the table definitions for the tables and views you want to bring over from your underlying database connection.

There are a few DDL commands that make it easy to register and alter metadata from database connections:

  • Load table definitions for the first time.
  • Add table definitions to existing metadata in the ODAS catalog.
  • Overwrite existing table definitions in the ODAS catalog with those from the JDBC data source.

Specify the specific tables

Note

These tables must exist in the database specified in your connection. The connection uses your default schema unless specified otherwise.

Example: using the default schema. The command takes a list of JDBC data source table names.

ALTER DATABASE snowflake_db LOAD DEFINITIONS (TRANSACTIONS, CUSTOMERS);

Note

Remember that some databases treat table names as case-sensitive.

Example: loading definitions from another schema. Use this if no default schema was specified in the connection, or if you wish to load tables from a different schema in the same database.

Syntax

ALTER DATABASE <okera_db> LOAD DEFINITIONS (<table_name>)
FROM JDBC DATABASE <underlying_database_name> SCHEMA <underlying_schema_name>

Example

ALTER DATABASE snowflake_db LOAD DEFINITIONS (CUSTOMER)
FROM JDBC DATABASE OKERA_DEMO SCHEMA CUSTOMER_INFO

Load all tables from connection

ALTER DATABASE snowflake_db LOAD DEFINITIONS ();

Override existing table definitions

Running the above commands repeatedly skips table definitions that already exist. If you wish to refresh entire table definitions from the JDBC source, use the OVERWRITE option.

ALTER DATABASE snowflake_db LOAD DEFINITIONS () OVERWRITE;

Automatically refresh a schema

You can refresh your table definition from the underlying data source by using the ALTER TABLE <tableName> RECOVER COLUMNS command. This is especially useful for adding new columns or replacing existing ones.

Example: Add a new column to the existing definition

-- automatically refresh schema and add column
ALTER TABLE marketingdb.users RECOVER COLUMNS (country) ADD;
-- new column appears in table definition
DESCRIBE marketingdb.users;
  userid    int
  fname     varchar(50)
  lname     varchar(50)
  fullname  varchar(101)
  emailid   varchar(150)
  country   varchar(100)

To remove existing columns and add specific columns, use the REPLACE option.

Example: Replace an existing column definition with a selective new one

ALTER TABLE marketingdb.users RECOVER COLUMNS (userid) REPLACE;
DESCRIBE marketingdb.users;
  userid    int

Passing an empty list of columns behaves similarly to ALTER DATABASE LOAD DEFINITIONS OVERWRITE for that specific table.

Example: Replace existing columns with the ones available at the source

ALTER TABLE marketingdb.users RECOVER COLUMNS () REPLACE;
DESCRIBE marketingdb.users;
  userid    int
  fname     varchar(50)
  lname     varchar(50)
  fullname  varchar(101)
  emailid   varchar(150)
  country   varchar(100)

Errors while registering data

When loading table definitions, any columns that error out are skipped by default, and the table is still registered. The skipped columns are annotated with the prefix jdbc.unsupported.column- and appear in the DESCRIBE FORMATTED <table> output, in the Table Properties section.

For example,

jdbc.unsupported.column-NCHAR_COL   Unsupported JDBC type NCHAR
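
To retrieve this output for a registered table (table name is illustrative):

DESCRIBE FORMATTED snowflake_db.transactions;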

In the Okera WebUI, a table that had such an auto-inference error shows the warning message Unsupported type error(s).

Predicate pushdown

Okera pushes down predicates to JDBC-backed data sources by default. To disable predicate pushdown for a particular JDBC-backed database or table, specify 'jdbc.predicates.pushdown.enabled' = 'false' in the DBPROPERTIES or TBLPROPERTIES. Users still write queries in the SQL dialect supported by Okera; filters that are supported by ANSI SQL are pushed down to the data source.
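
For example, using the ALTER DATABASE SET DBPROPERTIES syntax shown earlier, pushdown could be disabled for an entire database (database name is illustrative):

ALTER DATABASE snowflake_db SET DBPROPERTIES
(
  'jdbc.predicates.pushdown.enabled' = 'false'
);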

Scan records in parallel

Data from JDBC data sources is scanned in parallel, provided a valid numeric field is specified in the TBLPROPERTIES under the key mapred.jdbc.scan.key. If the JDBC source table has a single numeric field as its primary key, the mapred.jdbc.scan.key property is auto-populated with that field when the catalog table is created. However, users can override this value by specifying another numeric column in the table. Internally, the query to the JDBC data source is split into queries over ranges of 100,000 values of the mapred.jdbc.scan.key column. The range size (fetch size) can be controlled by the fetchSize property in the properties file.
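
As a hedged sketch, assuming Hive-style ALTER TABLE ... SET TBLPROPERTIES is available, a different numeric column could be designated as the scan key like so (table and column names are illustrative):

ALTER TABLE snowflake_db.transactions SET TBLPROPERTIES
(
  'mapred.jdbc.scan.key' = 'TXN_ID'
);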

Note

The scan key concept might not apply to all data sources (for example, AWS Athena).

Querying data from JDBC Data Sources

Data from JDBC data sources can be queried regularly, like data from all other data sources. For more information on this topic, as well as advanced concepts such as in-line SQL views, read here.
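
For example, a table registered earlier can be queried like any other Okera table (the column name is illustrative):

SELECT country, COUNT(*) AS num_customers
FROM snowflake_db.customers
GROUP BY country;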