Skip to content

Supported SQL

Okera enables the use of SQL for defining datasets. This is done by creating base datasets and then defining views. Okera is not a full massively parallel processing (MPP) analytics engine, as only a subset of SQL is permitted. This document describes the supported SQL subset.

Okera in general supports the identical data model as Apache Hive and is generally compatible with HiveQL. Supported functionality may differ depending on the client.

Data Definition Language (DDL) Statements

Okera generally supports and is compatible with HiveQL SQL commands. In some cases, Okera is not compatible. In other cases, the supported SQL has been extended for Okera-specific capabilities -- including all statements that modify the catalog and do not read any data (for example, CREATE, DROP, ALTER).

MSCK REPAIR Statement

Okera does not support the Hive MSCK REPAIR TABLE <table_name>. Instead it supports the alternative, ALTER TABLE <table_name> RECOVER PARTITIONS. This statement otherwise behaves identically, automatically adding partitions to the table based on the storage directory structure.

Extensions

Apply the WITH GRANT OPTION to the Okera admin_role

If the Okera configuration parameter INIT_WITH_GRANT_OPTION has not been set or is set to false, the admin user and any other user assigned the admin_role in Okera cannot edit permissions. To override the setting in your local environment, you can specify the following DDL in the Workspace:

GRANT ALL ON CATALOG TO ROLE ADMIN_ROLE WITH GRANT OPTION

Drop a Database and its Datasets or Views

You can drop a database using the SQL command DROP DATABASE <database-name>. However, you cannot drop a database if it contains datasets or internal views. To drop a dataset from a database, issue the SQL command DROP TABLE <database-name>.<dataset-name>. To drop an internal view from a database, issue the SQL command DROP VIEW <database-name>.<view-name>. SQL commands can be issued in the Workspace.

For example, the following statements drop a view named cancerward and a dataset named cancerpatients from a database named hospitaldb.

DROP VIEW hospitaldb.cancerward
DROP TABLE hospitaldb.cancerpatients
After all datasets and views are removed from a database, you can drop the database. The following example drops the entire hospitaldb database:

DROP DATABASE hospitaldb

Optionally Drop Permissions

There are use cases where it is valid to retain or drop permissions when the corresponding catalog object (database, table, or view) is dropped. Okera extends the DROP DATABASE, DROP TABLE, and DROP VIEW statements to optionally specify whether the associated permissions should also be dropped.

DROP DATABASE [IF EXISTS] db [CASCADE] [(INCLUDING | EXCLUDING) PERMISSIONS];
DROP TABLE [IF EXISTS] [db.]tbl [(INCLUDING | EXCLUDING) PERMISSIONS];
DROP VIEW [IF EXISTS] [db.]v [(INCLUDING | EXCLUDING) PERMISSIONS];

If INCLUDING PERMISSIONS is specified, the corresponding permissions are also dropped. Otherwise, permissions will not be dropped, and they will be applied to future catalog objects with that name. If CASCADE is specified, then all permissions on the tables and views in the database are dropped.

It is recommended that users default to the INCLUDING PERMISSIONS behavior and update existing workflows so as to not rely on permissions being retained longer than the object for which they are created.

Example: DROP directives with INCLUDING PERMISSIONS

DROP DATABASE IF EXISTS database1 CASCADE INCLUDING PERMISSIONS;
DROP TABLE IF EXISTS database2.table1 INCLUDING PERMISSIONS;
DROP VIEW IF EXISTS database2.view1 INCLUDING PERMISSIONS;

For users to drop the permissions, they must have grant permissions on the catalog object. For example, to be able to drop a database and its permissions, the user must be able to issue GRANT/REVOKE statements on the database. The user needs to be a catalog admin or having been granted grant permissions.

Create Table As Select (CTAS)

Okera supports tables created using this method with the following restrictions:

  • The table must be created from Hive in Amazon EMR.

  • The Hive warehouse must be configured to use Amazon S3 (as opposed to using the Amazon EMR-local HDFS cluster).

Create Table Over a Single File

Traditionally, Hive tables (and partitions) are defined over a directory of files. In Okera, you can define a table to a single file by specifying the full path to the file as the location. This allows creating tables without first having to create the parent directory and moving the data files.

Example: CREATE for a table to a single file

CREATE EXTERNAL TABLE tbl(s string)
LOCATION 's3://bucket/path/file.parquet'

In this example, when this table is scanned only the specified path is read, specifically, no other files in the s3://bucket/path/ directory are read.

Note that tables created this way may not read correctly if the client is bypassing Okera. For example, Hive clients that are directly reading this table without Okera, may fail.

Create Table With Storage File Format

Okera supports specifying the underlying storage type file format when creating a table. The storage type can be specified using the STORED AS <format-type> statement. The Valid storage types are:

  • AVRO
  • PARQUET
  • TEXTFILE
  • JSON
  • JDBC
  • ORC

Example: CREATE a table with STORED AS JSON format

CREATE EXTERNAL TABLE tbl(s string, value int)
STORED AS JSON
LOCATION 's3://bucket/path_to_dataset_as_json/'

This command can be used with the file formats listed above.

Create Table With Automatic Inference of Data Types

Okera supports creating tables for which field data types are auto-inferred by specifying a sample data file. The CREATE statement can be run with an option LIKE <path_to_sample_file>.

The auto-inference is currently supported for:

  • AVRO
  • PARQUET
  • TEXTFILE
  • JSON

Example: CREATE a table with LIKE <json_file>

CREATE EXTERNAL TABLE tbl(s string, value int)
LIKE JSON 's3://bucket/path_to_dataset_as_json/sample_json.json'
STORED AS JSON
LOCATION 's3://bucket/path_to_dataset_as_json/'

This statement can be used with the file formats listed above.

Note: Auto-inference for JSON is entirely based on the data provided in the JSON file and based on the sample data. The field types can vary. Okera recommends that you review the table and amend the data types as necessary. For example, if the sample data has values for a column named volume as 300000, 400000, the data type for that column would be INT. If the data type is actually BIGINT, you will need to amend it post-table creation.

NULL DEFINED AS Support

The Hive NULL DEFINED AS parameter, used to define the specific value for NULL, is supported.

Change Partition Location

Okera allows you to change the file location of any particular partition in an external table using SQL, after the external table has been created. An example is shown below.

Usage:

ALTER TABLE tbl PARTITION (part1="partition_name") SET LOCATION 's3://bucket/path_to_dataset'

Example: Change the location for partition snap_dt="20170809"

ALTER TABLE hp_test.cerebro_hr_emp PARTITION (snap_dt="20170809") SET LOCATION
's3://qhrcntrl/hr/appl/dde/cln/worker_non_api_cln.dat/snap_dt=20170809'

Registering Hive Serialization/Deserialization (SerDe) Libraries

See Extending Okera for the DDL grammar and other SerDe considerations.

Creating User-Defined Functions (UDFs)

See Extending Okera for the DDL grammar and other UDF considerations.

Internal Versus External Views

Okera views can be defined as either internal or external. This distinction defines how Okera evaluates data at runtime. It can have a profound effect when evaluating joins between tables and views.

In both internal and external cases, data resides in its source systems. Okera-managed data continues to be managed in Okera. External views continue to be managed by their non-Okera source.

The primary difference between internal and external views is that external data is not evaluated during an Okera query. External data does not have fine-grained access control, UDFs, and other features that Okera provides to managed datasets.

It is because of this property that joins are handled differently in internal views verses external views.

Okera manages the join of views defined in Okera created from two tables or views, internal or external, at the query level. Okera evaluates the join prior to it being sent to the analytics/compute engine for further processing. This allows for fine-grained access control and UDF functionality to be applied to the entire view, regardless of where the source data resides.

Note: Okera is not a compute engine. Full SQL functionality is not available through the Okera SQL interface. The use of a compute engine for full analytics functionality is required. For a list of known SQL incompatibilities, refer to the Known Incompatibilities section in this document.

External views created of two tables or views, internal or external, are evaluated in a slightly different way. Data managed by Okera continues to be evaluated within the Okera cluster. But the join between the two tables or views occurs in the analytics/compute engine. The advantage of this approach is that Okera continues to provide fine-grained access control and UDFs on Okera-managed data, while allowing the sometimes heavy compute of a join to be done outside the Okera system.

External views require an analytics engine such as Hive or Spark to process aggregate requests and complete the join prior to execution.

Creating External Views

This section provides a number of example common EXTERNAL view uses.

Example: Creating an external view that requires no evaluation in Okera

CREATE EXTERNAL VIEW random_user_subset AS SELECT *
FROM all_users WHERE rand() % 10 = 0

Note: Views on aggregate functions need to be created as EXTERNAL views, since the aggregates are calculated in compute applications like Hive or Spark.

Example: Creating external views with aggregate functions

CREATE EXTERNAL VIEW revenue_cal
AS SELECT min(revenue) as minRevenue,
max(revenue) as maxRevenue
FROM sales_transactions
WHERE region = 'california'

Since compute applications do not accept the "EXTERNAL view" syntax, it can be executed using the Okera Portal UI.

By default, views without EXTERNAL are evaluated in Okera, maintaining backwards compatibility with all previous Okera functions.

Last Partitions/Last Files

Okera extends the SQL grammar to easily restrict a table scan to just the most recent parts of a dataset. This can be very useful for development or ad hoc queries that only need to scan the most recent data, for example data from the last day or week. These clauses can be used during development, switching to the complete dataset once the application is developed.

OKERA supports LAST N PARTITIONS, LAST N FILES and LAST PARTITION. LAST PARTITION is just an alias for LAST 1 PARTITIONS.

For example, if the data is partitioned by day, LAST PARTITION can be used to return the results for the last day. LAST 30 PARTITIONS will return the last month and LAST 100 FILES will return the most recent 100 files worth of data.

These clauses can be added after the table name as an extension of the FROM clause.

Example: Returning the last days results using LAST N PARTITIONS

SELECT * from part_tbl;
SELECT * from part_tbl(LAST PARTITION);
SELECT * from part_tbl(LAST 10 PARTITIONS);
SELECT * from fact_tbl(LAST 20 FILES);

For the last partition clauses, the partitions returned refer to the newest partitions of the table after sorting by partition. Note that this may not be the partition from when the data was most recently added. It is not the last-modified partition.

As an example, if the table is partitioned by year, month, and day, it will always return the latest dates regardless of when the partitions were added or when the data in the partitions changed.

These LAST clause computations occur after partition pruning is done.

Example: Filtering the window for which to select the last partition

SELECT * from part_tbl(LAST PARTITION);
SELECT * from part_tbl(LAST PARTITION) where year = 2010;
SELECT * from part_tbl(LAST 3 PARTITIONS) where month = 6;
These queries respectively return the last partition, the last partition in 2010, and the last 3 partitions in June.

Since the LAST clauses are specified per table, it is possible to specify it for each table in a query independently.

For example:

SELECT * from t1(LAST 10 PARTITIONS)
JOIN t2 ...
JOIN t3(LAST PARTITION)

LAST Clause on Views

If the LAST clause is specified on a view, then it is propagated to each base table (recursively) in the view. It is invalid to specify the LAST clause on the view and on the LAST clause on an object in the view definition as this can be ambiguous.

Example:

CREATE VIEW v1 AS SELECT * FROM base_tbl;

-- This automatically propagates last to the base table and will only scan 1 partition.
SELECT * FROM v1(LAST PARTITION);

-- If the view is a UNION ALL, it will do so on each base table.
CREATE VIEW v2 AS SELECT * FROM base_tbl1 UNION ALL SELECT * FROM base_tbl2;

-- This will select 1 partition from each table
SELECT * FROM v2(LAST PARTITION);

-- Views can be created ontop of these with the last clause
CREATE VIEW v3 AS SELECT * FROM v1(LAST 10 PARTITIONS);

-- This will select up to 10 partitions each.
SELECT * FROM v3;

-- Both of these are invalid as it is not possible to have multiple LAST clauses
SELECT * FROM v3(LAST PARTITION); -- INVALID: last from v3 clause and SELECT clause
CREATE VIEW v3 AS SELECT * FROM v1(LAST 10 FILES); -- INVALID for same reason.

LAST Clause as Catalog Metadata

In the previous example, the last clause would need to be explicitly specified, either as part of the query, or as explicit views that users use. While the explicit view definition (e.g. sales_data.transactions_dev_sample, which is a view on top of sales_data.transactions) has benefits, it may be easier to just set it as a property on the base tables or views. This can be done by setting the okera.default-last-n-partitions property on the Okera catalog object. The value should be an integer, which specifies a default LAST N PARTITIONS clause.

For example,

-- Returns all partitions
SELECT * FROM sales_data.transactions;

-- Set catalog metadata so that queries by default will only scan 3 partitions
ALTER TABLE sales_data.transactions SET TBLPROPERTIES('okera.default-last-n-partitions', 3);

-- Now the same query will automatically scan only the last 3 partitions.
SELECT * FROM sales_data.transactions;

The default value in the metadata is only used if * No predicates are specified on the partition columns. If a query explicitly specified a partitioning column, this value is ignored. * No explicit LAST clause is specified.

This property can be set on base tables and views.

The expected use of this feature is to set this property on the largest datasets to implement automatic sampling. This can dramatically improve BI or ad hoc use cases and end users can get sampling transparently. As this is intended to be transparent to the end user, it may be difficult for users to know this is happening and this should be reserved for tables where sampling is appropriate.

Limitations
  • LAST PARTITIONS can only be used in internal views as other compute engines may not be able to support it.

    Example: Improper query with no view

    SELECT * from (SELECT * from t) (LAST PARTITION)
    

    Example: Recommended alternative with view creation

    CREATE VIEW t_last_partition
    AS SELECT * FROM t (LAST PARTITION);
    
    SELECT * FROM t_last_partition;
    
  • Catalog metadata configuration only supports partitions, not files.

Adding or Dropping Multiple Partitions in a Single ALTER TABLE Statement

Okera supports the addition of multiple partitions in a single ALTER TABLE statement. For example, this is a valid construct:

ALTER TABLE page_view ADD 
PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808' 
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';

Okera also supports dropping multiple partitions in a single ALTER TABLE statement. For example, this is a valid construct:

ALTER TABLE page_view DROP 
PARTITION (dt='2008-08-08', country='us')
PARTITION (dt='2008-08-09', country='us');

Dropping Partial or Nested Partitions in an ALTER TABLE Statement

Okera supports dropping partitions by specifying only part of the partition specification. It will also drop nested partitions within a parent partition when you drop the parent partition.

For example, suppose the following table (my_table) is partitioned on year and month (year=INT/month=INT):

year month #Rows #Files Size Format
2016 1 -1 0 0B TEXT
2016 2 -1 0 0B TEXT
2017 3 -1 0 0B TEXT
2017 4 -1 0 0B TEXT

The following ALTER TABLE statement will drop all partitions that have year set to 2016, including all nested partitions contained within the 2016 partition.

ALTER TABLE my_table DROP PARTITION(year=2016)

The resulting table would look like this:

year month #Rows #Files Size Format
2017 3 -1 0 0B TEXT
2017 4 -1 0 0B TEXT

Support for Referencing Dynamic Parameters in a URI

When the ENABLE_PARAMETRIZED_URI_GRANTS and the ENABLE_PARAMETRIZED_GRANTS configuration settings are set to true, you can reference dynamic parameters in a URI when doing a permission grant. For example, you can create a grant that allows every user to have ALL access to a URI that includes their username:

GRANT ALL ON URI s3://bucket/${user} TO GENERIC_ROLE

Note: The only currently supported parameter is ${user}.

Support for Referencing Dynamic Parameters in Database and Table GRANTs

When the ENABLE_PARAMETRIZED_GRANTS configuration setting is set to true, you can reference dynamic parameters in GRANTs on databases and tables. For example, For example, the following GRANTs will work:

GRANT ALL ON DATABASE user_${user} TO ROLE public_role
GRANT ALL ON DATABASE shared.${user}_tbl TO ROLE public_role
GRANT ALL ON DATABASE ${user}.${user}_tbl TO ROLE public_role

Note: The only currently supported dynamic parameter is ${user}.

Table-Level Attributes With Table or View Creation

Okera supports adding table-level attributes when creating a table or views using CREATE TABLE and CREATE VIEW. For example:

```
CREATE TABLE users (
    id BIGINT,
    name STRING
)
ATTRIBUTE classification.sensitive
```

DROP_TABLE_OR_VIEW Statement

Okera supports DROP_TABLE_OR_VIEW which can be used instead of DROP TABLE or DROP VIEW. This SQL command drops the dataset regardless of whether the dataset is a table or a view. Since this extends the SQL grammar, this is only available from tools that directly interact with Okera, such as the Okera Portal UI workspace. For example, this is not available in the Hive CLI.

Known Incompatibilities

Stricter Type Promotion

Hive/HiveQL is very permissive in type promotions, allowing implicit conversions between most types. In Okera, only lossless type promotion is implicit (for example, INT -> BIGINT). Explicit cast() functions may need to be added for existing SQL statements.

Disallowing Explicit Partitioning Clause when Creating Views

Hive/HiveQL is used for creating views with an explicit partitioning clause.

Example: Creating views with an explicit partitioning clause

CREATE VIEW v as SELECT ... FROM base_tbl
PARTITIONED BY c1

Okera does not allow partitioning to be specified for views. Partitioning is instead inferred based on the VIEW statement and base table. The partitioning on the base table is preserved for the view.

This is disallowed. It is unclear what the semantics are, if the partitioning specified in the view is different from the base table, and what the resulting performance implications might be.

ALTER DATABASE SET DBPROPERTIES Statement

Users can add additional properties to a catalog database using the ALTER DATABASE SET DBPROPERTIES statement.

Example:

ALTER DATABASE <db_name> SET DBPROPERTIES ('description'='Marketing database')

DESCRIBE Statement

DESCRIBE [FORMATTED] table_name|view_name

This statement displays the definition of the table or view. If the FORMATTED keyword is specified, additional details of the definition are displayed.

Example usage:

DESCRIBE tpch.nation

Example usage on view:

DESCRIBE FORMATTED partdb.ymdata_view

When used on a view, the DESCRIBE FORMATTED statement provides two important pieces of information about the view definition: the base tables used in the view and the information about partitioned columns in the view. In the following example, the view partdb.ymdata_view is a view on top of the table partdb.ymdata (sample year-month data). The partitioned columns on the partdb.ymdata table are year,month and that information is displayed as part of view definition as below.

Example output in the 'FORMATTED' display for a view with partitioned columns:

    partitioned_cols        year,month
    tables                  partdb.ymdata

DESCRIBE DATABASE Statement

The DESCRIBE DATABASE statement has been extended to describe the properties of a database object.

Example:

DESCRIBE DATABASE jdbc_demo

Property Value
dbname = jdbc_demo
type =  REGULAR
Owner =  root
driver =  mysql
mapred.jdbc.schema.name =
mapred.input.table.name = jdbc_demo.transactions_schemaed_demo
mapred.jdbc.driver.class = mysql
credentials.file = /opt/conf/jdbc_demo.conf
mapred.jdbc.db.name = jdbc_demo
mapred.jdbc.credentials.file = /opt/conf/jdbc_demo.conf

Data Manipulation Language (DML) Statements

Okera is not a distributed SQL engine. It only supports a subset of SQL statements. It does not support the other DML statements (such as INSERT, DELETE, UPDATE, etc). And even for SELECT statements, only a subset of the SQL standard is supported. A typical configuration is to run a SQL engine (on top of Spark or Presto) on top of Okera.

SELECT statements with projection, filters and union all are fully supported.

COUNT(*) with no grouping is the only aggregation supported. Multiple records are returned for this query, each containing a partial count. Summing up all the counts returns the complete result.

UNION ALL

Okera supports UNION ALL but not UNION DISTINCT. Per the SQL standard, UNION without a qualifier defaults to UNION DISTINCT so ALL needs to be explicitly specified.

For example:

SELECT * from okera_sample.sample UNION ALL SELECT * from okera_sample.sample

JOINs

When using views, Okera supports a limited set of joins for the purpose of restricting access to specific rows for particular users. A canonical use case could be having a fact dataset for user transactions, which contains a column for the user ID. Another, much-smaller dataset, contains the set of user ids that allow analytics to be carried out on their activity. Okera would support filtering the transactions dataset by creating a view that is a join over the two.

The specific limitations are:

  • Only INNER and LEFT (optionally with OUTER) joins are allowed.
  • The smaller tables must be under a maximum configured size. If the smaller tables exceed the maximum configured size, the request fails at runtime.

Subquery rewrites are supported but must be executable subject to the same constraints.

Configurations

Configurations can be specified at cluster creation time using the CLI.

By default, joins are enabled with a maximum memory of 128MB per join.

See the Cluster Sizing document for more information on how much memory joins need and how that affects cluster node requirements.