Supported SQL¶
Okera enables the use of SQL for defining datasets. This is done by creating base datasets and then defining views. Okera is not a full massively parallel processing (MPP) analytics engine, as only a subset of SQL is permitted. This document describes the supported SQL subset.
Okera in general supports the identical data model as Apache Hive and is generally compatible with HiveQL. Supported functionality may differ depending on the client.
Data Definition Language (DDL) Statements¶
Okera generally supports and is compatible with HiveQL SQL commands.
In some cases, Okera is not compatible.
In other cases, the supported SQL has been extended for Okera-specific capabilities -- including all statements that modify the catalog and do not read any data (for example, CREATE
, DROP
, ALTER
).
MSCK REPAIR Statement¶
Okera does not support the Hive MSCK REPAIR TABLE <table_name>
.
Instead it supports the alternative, ALTER TABLE <table_name> RECOVER PARTITIONS
.
This statement otherwise behaves identically, automatically adding partitions to the table based on the storage directory structure.
Extensions¶
Apply the WITH GRANT OPTION to the Okera admin_role
¶
If the Okera configuration parameter INIT_WITH_GRANT_OPTION
has not been set or is set to false
, the admin user and any other user assigned the admin_role
in Okera cannot edit permissions. To override the setting in your local environment, you can specify the following DDL in the Workspace:
GRANT ALL ON CATALOG TO ROLE ADMIN_ROLE WITH GRANT OPTION
Drop a Database and its Datasets or Views¶
You can drop a database using the SQL command DROP DATABASE <database-name>
. However, you cannot drop a database if it contains datasets or internal views. To drop a dataset from a database, issue the SQL command DROP TABLE <database-name>.<dataset-name>
. To drop an internal view from a database, issue the SQL command DROP VIEW <database-name>.<view-name>
. SQL commands can be issued in the Workspace.
For example, the following statements drop a view named cancerward
and a dataset named cancerpatients
from a database named hospitaldb
.
DROP VIEW hospitaldb.cancerward
DROP TABLE hospitaldb.cancerpatients
hospitaldb
database:
DROP DATABASE hospitaldb
Optionally Drop Permissions¶
There are use cases where it is valid to retain or drop permissions when the corresponding catalog object (database, table, or view) is dropped.
Okera extends the DROP DATABASE
, DROP TABLE
, and DROP VIEW
statements to optionally specify whether the associated permissions should also be dropped.
DROP DATABASE [IF EXISTS] db [CASCADE] [(INCLUDING | EXCLUDING) PERMISSIONS];
DROP TABLE [IF EXISTS] [db.]tbl [(INCLUDING | EXCLUDING) PERMISSIONS];
DROP VIEW [IF EXISTS] [db.]v [(INCLUDING | EXCLUDING) PERMISSIONS];
If INCLUDING PERMISSIONS
is specified, the corresponding permissions are also dropped.
Otherwise, permissions will not be dropped, and they will be applied to future catalog objects with that name.
If CASCADE
is specified, then all permissions on the tables and views in the database are dropped.
It is recommended that users default to the INCLUDING PERMISSIONS
behavior and update existing workflows so as to not rely on permissions being retained longer than the object for which they are created.
Example: DROP directives with INCLUDING PERMISSIONS
DROP DATABASE IF EXISTS database1 CASCADE INCLUDING PERMISSIONS;
DROP TABLE IF EXISTS database2.table1 INCLUDING PERMISSIONS;
DROP VIEW IF EXISTS database2.view1 INCLUDING PERMISSIONS;
For users to drop the permissions, they must have grant permissions on the catalog object.
For example, to be able to drop a database and its permissions, the user must be able to issue GRANT
/REVOKE
statements on the database.
The user needs to be a catalog admin or having been granted grant permissions.
Create Table As Select (CTAS)¶
Okera supports tables created using this method with the following restrictions:
-
The table must be created from Hive in Amazon EMR.
-
The Hive warehouse must be configured to use Amazon S3 (as opposed to using the Amazon EMR-local HDFS cluster).
Create Table Over a Single File¶
Traditionally, Hive tables (and partitions) are defined over a directory of files. In Okera, you can define a table to a single file by specifying the full path to the file as the location. This allows creating tables without first having to create the parent directory and moving the data files.
Example: CREATE
for a table to a single file
CREATE EXTERNAL TABLE tbl(s string)
LOCATION 's3://bucket/path/file.parquet'
In this example, when this table is scanned only the specified path is read, specifically, no other files in the s3://bucket/path/
directory are read.
Note that tables created this way may not read correctly if the client is bypassing Okera. For example, Hive clients that are directly reading this table without Okera, may fail.
Create Table With Storage File Format¶
Okera supports specifying the underlying storage type file format when creating a table.
The storage type can be specified using the STORED AS <format-type>
statement.
The Valid storage types are:
- AVRO
- PARQUET
- TEXTFILE
- JSON
- JDBC
- ORC
Example: CREATE
a table with STORED AS JSON
format
CREATE EXTERNAL TABLE tbl(s string, value int)
STORED AS JSON
LOCATION 's3://bucket/path_to_dataset_as_json/'
This command can be used with the file formats listed above.
Create Table With Automatic Inference of Data Types¶
Okera supports creating tables for which field data types are auto-inferred by specifying a sample data file.
The CREATE
statement can be run with an option LIKE <path_to_sample_file>
.
The auto-inference is currently supported for:
- AVRO
- PARQUET
- TEXTFILE
- JSON
Example: CREATE
a table with LIKE <json_file>
CREATE EXTERNAL TABLE tbl(s string, value int)
LIKE JSON 's3://bucket/path_to_dataset_as_json/sample_json.json'
STORED AS JSON
LOCATION 's3://bucket/path_to_dataset_as_json/'
This statement can be used with the file formats listed above.
Note: Auto-inference for JSON is entirely based on the data provided in the JSON file and based on the sample data. The field types can vary. Okera recommends that you review the table and amend the data types as necessary. For example, if the sample data has values for a column named
volume
as 300000, 400000, the data type for that column would beINT
. If the data type is actuallyBIGINT
, you will need to amend it post-table creation.
NULL DEFINED AS Support¶
The Hive NULL DEFINED AS parameter, used to define the specific value for NULL, is supported.
Change Partition Location¶
Okera allows you to change the file location of any particular partition in an external table using SQL, after the external table has been created. An example is shown below.
Usage:
ALTER TABLE tbl PARTITION (part1="partition_name") SET LOCATION 's3://bucket/path_to_dataset'
Example: Change the location for partition snap_dt="20170809"
ALTER TABLE hp_test.cerebro_hr_emp PARTITION (snap_dt="20170809") SET LOCATION
's3://qhrcntrl/hr/appl/dde/cln/worker_non_api_cln.dat/snap_dt=20170809'
Registering Hive Serialization/Deserialization (SerDe) Libraries¶
See Extending Okera for the DDL grammar and other SerDe considerations.
Creating User-Defined Functions (UDFs)¶
See Extending Okera for the DDL grammar and other UDF considerations.
Internal Versus External Views¶
Okera views can be defined as either internal or external. This distinction defines how Okera evaluates data at runtime. It can have a profound effect when evaluating joins between tables and views.
In both internal and external cases, data resides in its source systems. Okera-managed data continues to be managed in Okera. External views continue to be managed by their non-Okera source.
The primary difference between internal and external views is that external data is not evaluated during an Okera query. External data does not have fine-grained access control, UDFs, and other features that Okera provides to managed datasets.
It is because of this property that joins are handled differently in internal views verses external views.
Okera manages the join of views defined in Okera created from two tables or views, internal or external, at the query level. Okera evaluates the join prior to it being sent to the analytics/compute engine for further processing. This allows for fine-grained access control and UDF functionality to be applied to the entire view, regardless of where the source data resides.
Note: Okera is not a compute engine. Full SQL functionality is not available through the Okera SQL interface. The use of a compute engine for full analytics functionality is required. For a list of known SQL incompatibilities, refer to the Known Incompatibilities section in this document.
External views created of two tables or views, internal or external, are evaluated in a slightly different way. Data managed by Okera continues to be evaluated within the Okera cluster. But the join between the two tables or views occurs in the analytics/compute engine. The advantage of this approach is that Okera continues to provide fine-grained access control and UDFs on Okera-managed data, while allowing the sometimes heavy compute of a join to be done outside the Okera system.
External views require an analytics engine such as Hive or Spark to process aggregate requests and complete the join prior to execution.
Creating External Views¶
This section provides a number of example common EXTERNAL
view uses.
Example: Creating an external view that requires no evaluation in Okera
CREATE EXTERNAL VIEW random_user_subset AS SELECT *
FROM all_users WHERE rand() % 10 = 0
Note: Views on aggregate functions need to be created as
EXTERNAL
views, since the aggregates are calculated in compute applications like Hive or Spark.
Example: Creating external views with aggregate functions
CREATE EXTERNAL VIEW revenue_cal
AS SELECT min(revenue) as minRevenue,
max(revenue) as maxRevenue
FROM sales_transactions
WHERE region = 'california'
Since compute applications do not accept the "EXTERNAL
view" syntax, it can be executed using the Okera Portal UI.
By default, views without EXTERNAL
are evaluated in Okera, maintaining backwards compatibility with all previous Okera functions.
Last Partitions/Last Files¶
Okera extends the SQL grammar to easily restrict a table scan to just the most recent parts of a dataset. This can be very useful for development or ad hoc queries that only need to scan the most recent data, for example data from the last day or week. These clauses can be used during development, switching to the complete dataset once the application is developed.
OKERA supports LAST N PARTITIONS
, LAST N FILES
and LAST PARTITION
.
LAST PARTITION
is just an alias for LAST 1 PARTITIONS
.
For example, if the data is partitioned by day, LAST PARTITION
can be used to return the results for the last day.
LAST 30 PARTITIONS
will return the last month and LAST 100 FILES
will return the most recent 100 files worth of data.
These clauses can be added after the table name as an extension of the FROM
clause.
Example: Returning the last days results using LAST N PARTITIONS
SELECT * from part_tbl;
SELECT * from part_tbl(LAST PARTITION);
SELECT * from part_tbl(LAST 10 PARTITIONS);
SELECT * from fact_tbl(LAST 20 FILES);
For the last partition clauses, the partitions returned refer to the newest partitions of the table after sorting by partition. Note that this may not be the partition from when the data was most recently added. It is not the last-modified partition.
As an example, if the table is partitioned by year, month, and day, it will always return the latest dates regardless of when the partitions were added or when the data in the partitions changed.
These LAST
clause computations occur after partition pruning is done.
Example: Filtering the window for which to select the last partition
SELECT * from part_tbl(LAST PARTITION);
SELECT * from part_tbl(LAST PARTITION) where year = 2010;
SELECT * from part_tbl(LAST 3 PARTITIONS) where month = 6;
Since the LAST
clauses are specified per table, it is possible to specify it for each table in a query independently.
For example:
SELECT * from t1(LAST 10 PARTITIONS)
JOIN t2 ...
JOIN t3(LAST PARTITION)
LAST Clause on Views¶
If the LAST
clause is specified on a view, then it is propagated to each base table (recursively) in the view.
It is invalid to specify the LAST
clause on the view and on the LAST
clause on an object in the view definition as this can be ambiguous.
Example:
CREATE VIEW v1 AS SELECT * FROM base_tbl;
-- This automatically propagates last to the base table and will only scan 1 partition.
SELECT * FROM v1(LAST PARTITION);
-- If the view is a UNION ALL, it will do so on each base table.
CREATE VIEW v2 AS SELECT * FROM base_tbl1 UNION ALL SELECT * FROM base_tbl2;
-- This will select 1 partition from each table
SELECT * FROM v2(LAST PARTITION);
-- Views can be created ontop of these with the last clause
CREATE VIEW v3 AS SELECT * FROM v1(LAST 10 PARTITIONS);
-- This will select up to 10 partitions each.
SELECT * FROM v3;
-- Both of these are invalid as it is not possible to have multiple LAST clauses
SELECT * FROM v3(LAST PARTITION); -- INVALID: last from v3 clause and SELECT clause
CREATE VIEW v3 AS SELECT * FROM v1(LAST 10 FILES); -- INVALID for same reason.
LAST Clause as Catalog Metadata¶
In the previous example, the last clause would need to be explicitly specified, either as part of the query, or as explicit views that users use.
While the explicit view definition (e.g. sales_data.transactions_dev_sample
, which is a view on top of sales_data.transactions
) has benefits, it may be easier to just set it as a property on the base tables or views.
This can be done by setting the okera.default-last-n-partitions
property on the Okera catalog object.
The value should be an integer, which specifies a default LAST N PARTITIONS
clause.
For example,
-- Returns all partitions
SELECT * FROM sales_data.transactions;
-- Set catalog metadata so that queries by default will only scan 3 partitions
ALTER TABLE sales_data.transactions SET TBLPROPERTIES('okera.default-last-n-partitions', 3);
-- Now the same query will automatically scan only the last 3 partitions.
SELECT * FROM sales_data.transactions;
The default value in the metadata is only used if
* No predicates are specified on the partition columns.
If a query explicitly specified a partitioning column, this value is ignored.
* No explicit LAST
clause is specified.
This property can be set on base tables and views.
The expected use of this feature is to set this property on the largest datasets to implement automatic sampling. This can dramatically improve BI or ad hoc use cases and end users can get sampling transparently. As this is intended to be transparent to the end user, it may be difficult for users to know this is happening and this should be reserved for tables where sampling is appropriate.
Limitations¶
-
LAST PARTITIONS
can only be used in internal views as other compute engines may not be able to support it.Example: Improper query with no view
SELECT * from (SELECT * from t) (LAST PARTITION)
Example: Recommended alternative with view creation
CREATE VIEW t_last_partition AS SELECT * FROM t (LAST PARTITION); SELECT * FROM t_last_partition;
-
Catalog metadata configuration only supports partitions, not files.
Adding or Dropping Multiple Partitions in a Single ALTER TABLE Statement¶
Okera supports the addition of multiple partitions in a single ALTER TABLE statement. For example, this is a valid construct:
ALTER TABLE page_view ADD
PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
Okera also supports dropping multiple partitions in a single ALTER TABLE statement. For example, this is a valid construct:
ALTER TABLE page_view DROP
PARTITION (dt='2008-08-08', country='us')
PARTITION (dt='2008-08-09', country='us');
Dropping Partial or Nested Partitions in an ALTER TABLE Statement¶
Okera supports dropping partitions by specifying only part of the partition specification. It will also drop nested partitions within a parent partition when you drop the parent partition.
For example, suppose the following table (my_table
) is partitioned on year and month (year=INT/month=INT
):
year | month | #Rows | #Files | Size | Format |
---|---|---|---|---|---|
2016 | 1 | -1 | 0 | 0B | TEXT |
2016 | 2 | -1 | 0 | 0B | TEXT |
2017 | 3 | -1 | 0 | 0B | TEXT |
2017 | 4 | -1 | 0 | 0B | TEXT |
The following ALTER TABLE statement will drop all partitions that have year
set to 2016
, including all nested partitions contained within the 2016
partition.
ALTER TABLE my_table DROP PARTITION(year=2016)
The resulting table would look like this:
year | month | #Rows | #Files | Size | Format |
---|---|---|---|---|---|
2017 | 3 | -1 | 0 | 0B | TEXT |
2017 | 4 | -1 | 0 | 0B | TEXT |
Support for Referencing Dynamic Parameters in a URI¶
When the ENABLE_PARAMETRIZED_URI_GRANTS
and the ENABLE_PARAMETRIZED_GRANTS
configuration settings are set to true
, you can reference dynamic parameters in a URI when doing a permission grant. For example, you can create a grant that allows every user to have ALL
access to a URI that includes their username:
GRANT ALL ON URI s3://bucket/${user} TO GENERIC_ROLE
Note: The only currently supported parameter is
${user}
.
Support for Referencing Dynamic Parameters in Database and Table GRANTs¶
When the ENABLE_PARAMETRIZED_GRANTS
configuration setting is set to true
, you can reference dynamic parameters in GRANTs on databases and tables. For example, For example, the following GRANTs will work:
GRANT ALL ON DATABASE user_${user} TO ROLE public_role
GRANT ALL ON DATABASE shared.${user}_tbl TO ROLE public_role
GRANT ALL ON DATABASE ${user}.${user}_tbl TO ROLE public_role
Note: The only currently supported dynamic parameter is
${user}
.
Table-Level Attributes With Table or View Creation¶
Okera supports adding table-level attributes when creating a table or views using CREATE TABLE
and CREATE VIEW
. For example:
```
CREATE TABLE users (
id BIGINT,
name STRING
)
ATTRIBUTE classification.sensitive
```
DROP_TABLE_OR_VIEW Statement¶
Okera supports DROP_TABLE_OR_VIEW
which can be used instead of DROP TABLE
or DROP VIEW
.
This SQL command drops the dataset regardless of whether the dataset is a table or a view.
Since this extends the SQL grammar, this is only available from tools that directly interact with Okera, such as the Okera Portal UI workspace.
For example, this is not available in the Hive CLI.
Known Incompatibilities¶
Stricter Type Promotion¶
Hive/HiveQL is very permissive in type promotions, allowing implicit conversions between most types.
In Okera, only lossless type promotion is implicit (for example, INT -> BIGINT).
Explicit cast()
functions may need to be added for existing SQL statements.
Disallowing Explicit Partitioning Clause when Creating Views¶
Hive/HiveQL is used for creating views with an explicit partitioning clause.
Example: Creating views with an explicit partitioning clause
CREATE VIEW v as SELECT ... FROM base_tbl
PARTITIONED BY c1
Okera does not allow partitioning to be specified for views. Partitioning is instead inferred based on the VIEW statement and base table. The partitioning on the base table is preserved for the view.
This is disallowed. It is unclear what the semantics are, if the partitioning specified in the view is different from the base table, and what the resulting performance implications might be.
ALTER DATABASE SET DBPROPERTIES Statement¶
Users can add additional properties to a catalog database using the ALTER DATABASE SET DBPROPERTIES
statement.
Example:
ALTER DATABASE <db_name> SET DBPROPERTIES ('description'='Marketing database')
DESCRIBE Statement¶
DESCRIBE [FORMATTED] table_name|view_name
This statement displays the definition of the table or view. If the FORMATTED keyword is specified, additional details of the definition are displayed.
Example usage:
DESCRIBE tpch.nation
Example usage on view:
DESCRIBE FORMATTED partdb.ymdata_view
When used on a view, the DESCRIBE FORMATTED
statement provides two important pieces of information about the view definition: the base tables used in the view and the information about partitioned columns in the view. In the following example, the view partdb.ymdata_view
is a view on top of the table partdb.ymdata
(sample year-month data). The partitioned columns on the partdb.ymdata
table are year,month and that information is displayed as part of view definition as below.
Example output in the 'FORMATTED' display for a view with partitioned columns:
partitioned_cols year,month
tables partdb.ymdata
DESCRIBE DATABASE Statement¶
The DESCRIBE DATABASE statement has been extended to describe the properties of a database object.
Example:
DESCRIBE DATABASE jdbc_demo
Property Value¶
dbname = jdbc_demo
type = REGULAR
Owner = root
driver = mysql
mapred.jdbc.schema.name =
mapred.input.table.name = jdbc_demo.transactions_schemaed_demo
mapred.jdbc.driver.class = mysql
credentials.file = /opt/conf/jdbc_demo.conf
mapred.jdbc.db.name = jdbc_demo
mapred.jdbc.credentials.file = /opt/conf/jdbc_demo.conf
Data Manipulation Language (DML) Statements¶
Okera is not a distributed SQL engine.
It only supports a subset of SQL statements.
It does not support the other DML statements (such as INSERT
, DELETE
, UPDATE
, etc).
And even for SELECT
statements, only a subset of the SQL standard is supported.
A typical configuration is to run a SQL engine (on top of Spark or Presto) on top of Okera.
SELECT
statements with projection, filters and union all are fully supported.
COUNT(*)
with no grouping is the only aggregation supported.
Multiple records are returned for this query, each containing a partial count.
Summing up all the counts returns the complete result.
UNION ALL¶
Okera supports UNION ALL
but not UNION DISTINCT
.
Per the SQL standard, UNION
without a qualifier defaults to UNION DISTINCT
so ALL
needs to be explicitly specified.
For example:
SELECT * from okera_sample.sample UNION ALL SELECT * from okera_sample.sample
JOINs¶
When using views, Okera supports a limited set of joins for the purpose of restricting access to specific rows for particular users. A canonical use case could be having a fact dataset for user transactions, which contains a column for the user ID. Another, much-smaller dataset, contains the set of user ids that allow analytics to be carried out on their activity. Okera would support filtering the transactions dataset by creating a view that is a join over the two.
The specific limitations are:
- Only
INNER
andLEFT
(optionally withOUTER
) joins are allowed. - The smaller tables must be under a maximum configured size. If the smaller tables exceed the maximum configured size, the request fails at runtime.
Subquery rewrites are supported but must be executable subject to the same constraints.
Configurations¶
Configurations can be specified at cluster creation time using the CLI.
By default, joins are enabled with a maximum memory of 128MB per join.
See the Cluster Sizing document for more information on how much memory joins need and how that affects cluster node requirements.