Skip to content

Extending Okera

You can extend Okera functionality in two ways:

  • Run user-defined functions (UDFs)
  • Use Hive serialization/deserialization libraries (SerDe).

These extend Okera capabilities by including support for:

  • UDF - Custom anonymization or data encryption algorithm
  • UDF - Complicated filters that are more naturally expressed in code vs. SQL
  • SerDe - Reading a file format (i.e. multi-delimited CSV) that is not supported by Okera
  • SerDe - Custom in-house file format

Security

User-defined functions (UDFs) and serialization/deserialization (SerDe) libraries run using the same process and the same permissions as the rest of Okera (typically run as the system user). These libraries have access to all the data that Okera has access to. A malicious library can potentially access data that is currently being processed. It is assumed that the libraries are trusted. Currently the only measure to protect against ill-behaving libraries is to restrict who can register them.

Only users who are catalog admins can register and unregister user-defined functions and SerDe. It is not possible to delegate this capability to other users (i.e. in the same way the permission to grant can be delegated).

As a best practice, it is recommended that library locations be secured to prevent other users from replacing it with malicious binaries (i.e. most users should not have write access to that location.)

User-Defined Functions (UDFs)

To use user-defined functions, the steps are:

  1. Register the user-defined function in the Okera catalog. Only the catalog admins can do this.
  2. Create views (or issue queries) that use the user-defined function with typical SQL statements. Any user can use the function (assuming they have access to the user-defined function database).
  3. Read data from the view. The user-defined function is evaluated by Okera before the data is returned.

For example, use a user-defined function that accepts strings and masks all characters in it:

mask('hello')-> 'xxxxx'.

Okera supports user-defined functions written against the Hive interface. These JARs should be compatible with Hive, and require no additional steps.

For this user-defined function, the code might look like:

package com.okera.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
  * UDF which masks all characters from the input string.
  *
  * SELECT mask('hello')
  * > 'xxxx'
  */
public class MaskUDF extends UDF {
  public Text evaluate(Text t) {
    if (t == null) return null;
    byte[] result = new byte[t.getBytes().length];
    for (int i = 0; i < result.length; i++) {
      result[i] = 'x';
    }
    return new Text(result);
  }
}

Registering the User-Defined Function

To register the user-defined function, use one of the Okera client tools (such as curl) to execute DDL statements against Okera.

Note: Creating user-defined functions via our client integrations (Hive, Spark, etc.) is not supported.

For example:

CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...])
RETURNS return_type
LOCATION 's3_path'
SYMBOL='class name of UDF'

Tip

Function overloading is supported. Functions can have the same name with different signatures.

For example, the MaskUDF:

CREATE FUNCTION okera_sample.mask(STRING) RETURNS STRING
LOCATION 's3://okera-public-east/udfs/mask-udf.jar'
SYMBOL='com.cerebro.hiveudf.MaskUDF'

Dropping a User-Defined Function

To drop a user-defined function, use the DROP FUNCTION statement:

DROP FUNCTION [IF EXISTS] [db_name.]function_name([arg_type[, arg_type...])

-- For example
DROP FUNCTION okera_sample.mask(STRING)

Globally Available User-Defined Functions

Few custom functions are generic and can be used across databases. With global user-defined function capability, you can reuse commonly used functions not available in builtins, across databases.

Creating Global User-Defined Functions

To use a user-defined function across databases, create it in the global database _okera_globals.

Accessing Global User-Defined Functions

By default all users have read access to this database. The global user-defined function can be accessed by qualifying the function with or without the namespace.

SELECT touppercase(record) FROM okera_sample.sample
-- Or
SELECT _okera_globals.touppercase(record) FROM okera_sample.sample

If the user-defined function shares the same name as a builtin function, the lookup precedence (when namespace not specified) will be builtin and global respectively.

Using a User-Defined Function Directly

A user-defined function is used in the same way as any other built-in function when directly issuing SQL against the Okera Policy Engine (planner).

For example:

SELECT record, okera_sample.mask(record) FROM okera_sample.sample
record okera_sample.mask(record)
This is a sample test file. xxxxxxxxxxxxxxxxxxxxxxxxxxx
It should consist of two lines. xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Using the User-Defined Function From Views

When accessing Okera with another SQL tool, possibly unable to handle the user-defined functions, you may "hide" the user-defined function behind an Okera view. In the on-going example, if you want to protect the okera_sample.sample dataset with the user-defined function, create a view that applies the function to the columns in it. Then, grant access to the view.

For example:

CREATE VIEW okera_sample.secure_sample as
SELECT okera_sample.mask(record) as record
FROM okera_sample.sample;

GRANT SELECT ON table okera_sample.secure_sample to ROLE analyst_role;

Then select from this view. The user-defined function is automatically applied.

SELECT * from okera_sample.secure_sample
record
xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

In this case, the user-defined function is completely hidden from the compute tool talking to Okera. The tool does not know about, and therefore does not have access to, the function binary, in any way.

Using a User-Defined Function From Hive EMR

Since user-defined functions are Hive-compatible, they are automatically usable from an Okera-integrated Hive EMR cluster. Hive user-defined function (i.e. function) commands work transparently. For example, if the user-defined function okera_sample.mask is registered in Okera, it can be used from Hive, with no additional changes or configs.

For example from the Hive CLI:

show functions like "okera_sample.*";
OK
okera_sample.mask

hive> select okera_sample.mask('abcd');
Added [/mnt/tmp/53ad548e-5f83-44dd-985b-3c9c267fe736_resources/mask-udf.jar] to class path
Added resources: [s3://okera-public-east/udfs/mask-udf.jar]
OK
xxxx

Note: You cannot register an Okera user-defined function from Hive. The function must be registered in Okera and then is visible through Hive. It is possible to register user-defined functions in Hive that are local to the EMR cluster, by creating the function (through Hive) in a local database.

Reviewing User-Defined Function Definitions

You can review the definitions of user-defined functions registered in Okera by running the following command:

show functions in <db>

Specify a relevant database name for <db>.

Common Errors

User is not admin

The user is not a catalog admin and sees:

AuthorizationException: User 'dev-user' does not have privileges to CREATE/DROP functions.

Dropping a function that is being used

The views that depend on it fails with the error:

AnalysisException: okera_sample.mask() unknown.

Limitations

Cannot grant on UDF

Access controls on the user-defined function is not provided. Only admins can create them before all users can use them. It is not possible to grant use of a user-defined function to a particular user or group.

Cannot issue CREATE/DROP FUNCTION through hive CLI/beeline

The Hive client integration currently does not support the DDL statement. It is expected to be added in the near future.

Serialization/Deserialization (SerDe) Libraries

Okera supports a subset of valid Hive serialization/deserialization (SerDe) libraries. The SerDe must use text serialization. That file format consists of line by line text, with arbitrary line serialization.

Examples of supported SerDe:

  • Regex SerDe - each line content can be extracted by RegEx
  • Thrift SerDe - each line is a serialized thrift object
  • Json SerDe - each line is json
  • Multi-Delimiter CSV SerDe - each line has a CSV structure with many options

Examples of unsupported SerDe include the following file formats for sophisticated file structures:

  • SequenceFile
  • Avro
  • Parquet
  • Orc

Note: SequenceFile, Avro, and Parquet are natively supported by Okera and do not require a SerDe.

Using a SerDe Library

The SerDe library must be specified when creating the table. The DDL to use a SerDe is an extension of the CREATE TABLE statement with additional options. The path to the SerDe jar must be specified; It can be any URI, but typically is an S3 path. The fully qualified Java class name must be specified as well. Optionally, any additional properties for the SerDe can be provided. The table can be partitioned or created with comments, in a similar way to other tables.

CREATE [EXTERNAL] [IF NOT EXISTS] TABLE <tbl>(<SCHEMA>)
ROW FORMAT SERDE '<PATH TO SERDE JAR>' SYMBOL '<CLASS NAME OF SERDE>'
[WITH SERDEPROPERTIES('<key>' = 'value')]
[LOCATION | COMMENT | ...]

For example, when using the Hive RegexSerDe, it assumes the library is available at: s3://my-company/serdes/regex-serde.jar:

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  ts STRING,
  request STRING,
  status STRING,
  size STRING,
  referrer STRING,
  agent STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
JAR_PATH 's3://my-company/serdes/regex-serde.jar'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)

Tutorial for Using Ç (Cedilla-C) as a Delimiter

As an end-to-end example, load a dataset in S3 that uses the Cedilla-C as the delimiter, and uses the Hive MultiDelimitSerde.

The folder used in this tutorial is available at: s3://cerebro-datasets/cedilla_sample/.

The folder contains a file named cedilla_sample.txt, with the following contents:

1Ç23
Ç45
The steps are:

  1. Create the database.

  2. Create a table that uses Ç as the field delimiter.

  3. Read from the table to make sure everything works.

  4. The output expected is:

    --- multibyte_db.cedilla_sample ---
    [
        {
            "int1": 1,
            "int2": 23
        },
        {
            "int1": null,
            "int2": 45
        }
    ]
    

Permission Cascading

When permissions are granted on S3 URIs, they are applied to any subdirectories contained within the directory to which the permissions are granted. This allows access to be granted on a top-level S3 bucket, or any arbitrarily deep subdirectory.