Getting started with JDBC Data Sources¶
This document outlines Okera's support for JDBC Data Sources and how to get started with connecting to data.
Okera supports connecting to these JDBC data sources out of the box:
- MySQL (5.6, 5.7, 8.0)
- PostgreSQL (9.6, 10.6)
- Amazon Redshift
- AWS Athena (Simba JDBC driver)
- Microsoft SQL Server
You can also connect to a custom JDBC data source by providing your own JDBC driver.
There are three main steps to get started with JDBC data sources:
- Create a connection to the data source
- Link that connection to an Okera virtual database
- Load tables from that data source into the Okera database
Connecting to a Data Source¶
Create a connection¶
Only catalog admins have access to create connections.
To create a connection, define a `DATACONNECTION` object with your connection properties.
For examples of the specific properties for each database, click on the link for your specific database.
Example: creating a connection to a Snowflake database

```sql
CREATE DATACONNECTION snowflake_connection CXNPROPERTIES (
    'connection_type' = 'JDBC',
    'driver' = 'snowflake',
    'host' = 'my-host.com',
    'port' = '3306',
    'user' = 'awsps:///my-username',
    'password' = 'awsps:///my-password',
    'account' = 'okera123',        -- Snowflake only
    'defaultdb' = 'OKERA_DEMO',
    'default_schema' = 'marketing' -- optional
)
```
`defaultdb` is the source JDBC database you want to establish a connection to.
Some data sources are case-sensitive, so specify your database and schema names in the correct case.
Providing secure credentials¶
Sensitive credentials cannot be provided in plain text. Instead, they must come from a secure secret source: either local secret sources such as Kubernetes secrets, or secure cloud services such as AWS Secrets Manager and AWS Systems Manager Parameter Store. For auditability, it is recommended that you create a dedicated system user for Okera in your underlying database and use those credentials in your Okera connection. Note that this system user needs read access to your data.
Supported secure credential stores:
- `awsps://` - AWS Systems Manager Parameter Store
- `awssm://` - AWS Secrets Manager
- `file://` - local files (using Kubernetes mounted secrets)
If using AWS Systems Manager Parameter Store or AWS Secrets Manager, you will need to provide the correct IAM credentials to access your secrets.
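As a sketch, the credentials referenced as `awsps:///my-username` and `awsps:///my-password` in the connection example could be created in AWS Systems Manager Parameter Store with the AWS CLI (the parameter names and values here are illustrative; adjust them to your own naming scheme):

```shell
# Store the database username as a plain parameter
aws ssm put-parameter --name "my-username" --value "okera_reader" --type String

# Store the database password encrypted at rest
aws ssm put-parameter --name "my-password" --value "<your-password>" --type SecureString
```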
Editing your connection¶
Altering a connection will impact access to all tables that have been registered via that connection, and should be done carefully.
Example: altering the default catalog for a connection. You will need to specify all the parameters necessary to create a connection, amending only the ones you want to change.

```sql
ALTER DATACONNECTION snowflake_connection CXNPROPERTIES (
    'connection_type' = 'JDBC',
    'driver' = 'snowflake',
    'host' = 'my-host.com',
    'port' = '3306',
    'user' = 'awsps:///my-username',
    'password' = 'awsps:///my-password',
    'account' = 'okera123',        -- Snowflake only
    'defaultdb' = 'NYTAXI',
    'default_schema' = 'marketing' -- optional
)
```
Dropping your connection¶
Note that dropping a connection impacts the accessibility of any tables registered from that connection. Therefore, a connection can only be dropped if there are no associated tables.
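Once all associated tables have been removed, dropping the connection might look like this (assuming `DROP DATACONNECTION` mirrors the `CREATE DATACONNECTION` syntax above; the connection name is from the earlier example):

```sql
DROP DATACONNECTION snowflake_connection;
```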
Properties file (old way)¶
Alternatively, you can create a file with the connection properties for your database and store it in a secure location.
A sample properties file, let's say at the S3 location
`s3://okera/mysql-prod.properties`, would be:

```
driver=mysql
host=<A cloud URL or host accessible from the application>
port=<Port to connect to the DB>
user=awsps:///my-username
password=awsps:///my-password
jdbc.db.name=<Default catalog to connect to. Note, skipping this value can result in errors>
jdbc.schema.name=<Optional. Specify the schema name if used. MySQL does not support schemas>
defaultdb=<The default database to connect to. Typically the same as jdbc.db.name>
fetchSize=<Fetch size used to batch and retrieve data from the JDBC data source. Default 100000>
```
`jdbc.schema.name` is not required for MySQL, but it is a required parameter for Snowflake, PostgreSQL, and Redshift.
Link connection to Okera database¶
Create a new Okera database¶
```sql
CREATE DATABASE snowflake_db DBPROPERTIES (
    'okera.connection.name' = 'snowflake_connection'
);
```
Connect an existing Okera database¶
```sql
ALTER DATABASE snowflake_db SET DBPROPERTIES (
    'okera.connection.name' = 'snowflake_connection'
);
```
You need permissions to create a database and alter its properties to do the above.
Register data from your connection¶
After creating the database, you will need to actually load the table definitions for the tables and views you want to bring over from your underlying database connection.
There are a few different DDL commands that make it easy to register and alter metadata from database connections:

- Load table definitions for the first time
- Add additional table definitions to existing metadata in the ODAS catalog
- Overwrite entire table definitions from the JDBC data source in the ODAS catalog
Specify individual tables¶
These tables should be in the database specified in your connection. The connection will leverage your default schema unless specified otherwise.
Example - using the default schema. The command takes a list of JDBC data source table names.

```sql
ALTER DATABASE snowflake_db LOAD DEFINITIONS (TRANSACTIONS, CUSTOMERS);
```
Remember that some databases treat table names as case-sensitive.
Example - load definitions from another schema This is used if no default schema was specified in the connection, or if you wish to load tables from a different schema in the same database.
```sql
ALTER DATABASE <okera_db> LOAD DEFINITIONS (<table_name>)
FROM JDBC DATABASE <underlying_database_name> SCHEMA <underlying_schema_name>
```

```sql
ALTER DATABASE snowflake_db LOAD DEFINITIONS (CUSTOMER)
FROM JDBC DATABASE OKERA_DEMO SCHEMA CUSTOMER_INFO
```
Load all tables from connection¶
```sql
ALTER DATABASE snowflake_db LOAD DEFINITIONS ();
```
Override existing table definitions¶
Running the above commands repeatedly skips existing table definitions.
However, if you wish to refresh the entire table definitions from the JDBC source, use the `OVERWRITE` clause:

```sql
ALTER DATABASE snowflake_db LOAD DEFINITIONS () OVERWRITE;
```
Automatically refresh a schema¶
You can refresh a table definition from the underlying data source by using the `ALTER TABLE <tableName> RECOVER COLUMNS` command.
This is specifically useful to add or replace existing columns.
Example: add a new column to the existing definition

```sql
-- automatically refresh schema and add column
ALTER TABLE marketingdb.users RECOVER COLUMNS (country) ADD;

-- new column appears in table definition
DESCRIBE marketingdb.users;
```

```
userid    int
fname     varchar(50)
lname     varchar(50)
fullname  varchar(101)
emailid   varchar(150)
country   varchar(100)
```
To remove existing columns and add specific columns, use the `REPLACE` clause.

Example: replace an existing column definition with a selective new one

```sql
ALTER TABLE marketingdb.users RECOVER COLUMNS (userid) REPLACE;
DESCRIBE marketingdb.users;
```

```
userid    int
```
Passing an empty list of columns behaves like `ALTER DATABASE LOAD DEFINITIONS OVERWRITE` for that specific table.
Example: replace existing columns with the ones available at the source

```sql
ALTER TABLE marketingdb.users RECOVER COLUMNS () REPLACE;
DESCRIBE marketingdb.users;
```

```
userid    int
fname     varchar(50)
lname     varchar(50)
fullname  varchar(101)
emailid   varchar(150)
country   varchar(100)
```
Errors while registering data¶
When loading table definitions, any columns that error out are skipped by default and the table is still registered.
The skipped/error columns are annotated with the prefix `jdbc.unsupported.column-` and can be seen in the `DESCRIBE FORMATTED <table>` output, in the Table Properties section. For example:

```
jdbc.unsupported.column-NCHAR_COL    Unsupported JDBC type NCHAR
```
In the Okera WebUI, a table that had such an auto-inference error shows the warning message `Unsupport type error(s)`.
Okera pushes down predicates to JDBC-backed data sources by default. To disable predicate pushdown for a particular JDBC-backed database or table, set `'jdbc.predicates.pushdown.enabled' = 'false'` in the DBPROPERTIES or TBLPROPERTIES. Users still write queries in the SQL dialect supported by Okera; filters that are expressible in ANSI SQL are pushed down to the data source.
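As an illustration, disabling predicate pushdown for a single table might look like this (assuming Hive-style `ALTER TABLE ... SET TBLPROPERTIES` syntax; the table name is hypothetical):

```sql
-- Disable predicate pushdown for one table only
ALTER TABLE snowflake_db.transactions
SET TBLPROPERTIES ('jdbc.predicates.pushdown.enabled' = 'false');
```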
Scan records in parallel¶
Data from JDBC data sources is scanned in parallel, provided a valid numeric field is specified in the TBLPROPERTIES under the key `mapred.jdbc.scan.key`. Note, if the JDBC source table has a single numeric field as its primary key, `mapred.jdbc.scan.key` is auto-populated with that field when the catalog table is created. However, users can override this value by specifying another numeric column in the table. Internally, the query to the JDBC data source is split into queries over ranges of 100K values of the scan key. The range size (fetch size) can be controlled with the `fetchSize` property in the properties file.
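For example, overriding the scan key on a catalog table might look like this (assuming Hive-style `ALTER TABLE ... SET TBLPROPERTIES` syntax; the table and column names are hypothetical):

```sql
-- Use the numeric transaction_id column to split parallel scans
ALTER TABLE snowflake_db.transactions
SET TBLPROPERTIES ('mapred.jdbc.scan.key' = 'transaction_id');
```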
The scan key concept might not apply to all data sources, for example AWS Athena.
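To illustrate the range splitting described above, here is a minimal Python sketch (illustrative only, not Okera's implementation) of how a numeric scan key range can be divided into fixed-size batches, each of which would map to one parallel JDBC query:

```python
def split_scan_ranges(min_key, max_key, fetch_size=100_000):
    """Split [min_key, max_key] into half-open [lo, hi) ranges of fetch_size."""
    ranges = []
    lo = min_key
    while lo <= max_key:
        hi = min(lo + fetch_size, max_key + 1)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Each range maps to a query like:
#   SELECT ... WHERE scan_key >= lo AND scan_key < hi
print(split_scan_ranges(0, 250_000))
# → [(0, 100000), (100000, 200000), (200000, 250001)]
```

A larger `fetch_size` produces fewer, bigger batches; a smaller one produces more, finer-grained parallel scans.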
Querying data from JDBC Data Sources¶
Data from JDBC data sources can be queried regularly, like all data sources. For more information on this topic, as well as on advanced concepts such as in-line SQL views, read here.
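For example, once tables are registered, they can be queried through the Okera database like any other table (database and table names are from the earlier examples):

```sql
SELECT * FROM snowflake_db.TRANSACTIONS LIMIT 10;
```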