Extending ODAS

You can extend the functionality of the Okera Data Access Service (ODAS) in two ways:

  • running user-defined functions (UDFs)
  • using Hive serialization/deserialization libraries (SerDes)

UDFs and SerDes provide powerful ways to extend the capabilities of ODAS. Example use cases include support for:

  • UDF - Custom anonymization or data encryption algorithm
  • UDF - Complicated filters that are more naturally expressed in code than in SQL
  • SerDe - Reading a file format (e.g., multi-delimited CSV) that is not supported by ODAS
  • SerDe - Custom in-house file format

Security

UDFs and SerDes run in the same process and with the same permissions as the rest of ODAS (typically the system user). These libraries have access to all the data that ODAS can access, and a malicious library could read data that is currently being processed. Libraries are therefore assumed to be trusted; currently, the only measure against ill-behaving libraries is restricting who can register them.

Only catalog admins can register and unregister UDFs and SerDes. It is not possible to delegate this capability to other users (that is, in the way that the permission to grant can be delegated).

As a best practice, secure library locations so that other users cannot replace the binaries with malicious ones (that is, most users should not have write access to those locations).

UDFs

To use UDFs, the steps are:

  1. Register the UDF in the ODAS catalog. Only the catalog admins can do this.
  2. Create views (or issue queries) that use the UDF with typical SQL statements. Any user can use the UDF (assuming they have access to the UDF database).
  3. Read data from the view. The UDF is evaluated by ODAS before the data is returned.

For example, consider a UDF that accepts a string and masks all of its characters:

mask('hello') -> 'xxxxx'

ODAS supports UDFs written against the Hive UDF interface. JARs that are compatible with Hive require no additional steps.

For this UDF, the code might look like:

package com.okera.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * UDF which masks all characters from the input string.
 *
 * SELECT mask('hello')
 * > 'xxxxx'
 */
public class MaskUDF extends UDF {
  public Text evaluate(Text t) {
    if (t == null) return null;
    byte[] result = new byte[t.getBytes().length];
    for (int i = 0; i < result.length; i++) {
      result[i] = 'x';
    }
    return new Text(result);
  }
}

Registering the UDF

To register the UDF, use one of our client tools (odb, curl, etc.) to execute DDL statements against ODAS. The syntax is:

CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...]])
RETURNS return_type
LOCATION 's3_path'
SYMBOL='class name of UDF'

Tip: Function overloading is supported. Functions can have the same name with different signatures.

For example, the MaskUDF:

CREATE FUNCTION okera_sample.mask(STRING) RETURNS STRING
LOCATION 's3://okera-public-east/udfs/mask-udf.jar'
SYMBOL='com.okera.hiveudf.MaskUDF'
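
Because overloading is supported, a second signature could be registered under the same name. For example, a hypothetical two-argument variant that also takes the mask character might be registered as follows (the jar path and SYMBOL class are placeholders):

-- Hypothetical overload: same function name, different signature.
-- The jar path and SYMBOL class are placeholders.
CREATE FUNCTION okera_sample.mask(STRING, STRING) RETURNS STRING
LOCATION 's3://my-company/udfs/mask2-udf.jar'
SYMBOL='com.example.MaskWithCharUDF'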

To drop the UDF, use the DROP FUNCTION statement:

DROP FUNCTION [IF EXISTS] [db_name.]function_name([arg_type[, arg_type...]])

-- For example
DROP FUNCTION okera_sample.mask(STRING)

Using the UDF directly

When issuing SQL directly against the planner, the UDF is used like any other built-in function.

For example:

SELECT record, okera_sample.mask(record) FROM okera_sample.sample

record                             okera_sample.mask(record)
This is a sample test file.        xxxxxxxxxxxxxxxxxxxxxxxxxxx
It should consist of two lines.    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Using the UDF from views

When accessing ODAS from another SQL tool that may not be able to handle UDFs, you can “hide” the UDF behind an ODAS view. Continuing the example, to protect the okera_sample.sample dataset with the UDF, create a view that applies the UDF to its columns, and then grant access to the view.

For example:

CREATE VIEW okera_sample.secure_sample as
SELECT okera_sample.mask(record) as record
FROM okera_sample.sample;

GRANT SELECT ON TABLE okera_sample.secure_sample TO ROLE analyst_role;

Then select from this view. The UDF is automatically applied.

SELECT * FROM okera_sample.secure_sample

record
xxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

In this case, the UDF is completely hidden from the compute tool talking to ODAS. The tool does not know about the UDF binary and has no access to it.

Using the UDF from Hive on EMR

Since the UDFs are Hive compatible, they are automatically usable from an ODAS-integrated EMR Hive cluster. Hive UDF (function) commands work transparently. For example, if the UDF okera_sample.mask is registered in ODAS, it can be used from Hive with no additional changes or configuration.

For example from the Hive CLI:

hive> show functions like "okera_sample.*";
OK
okera_sample.mask

hive> select okera_sample.mask('abcd');
Added [/mnt/tmp/53ad548e-5f83-44dd-985b-3c9c267fe736_resources/mask-udf.jar] to class path
Added resources: [s3://okera-public-east/udfs/mask-udf.jar]
OK
xxxx

Note: It is not possible to register an ODAS UDF from Hive. The UDF must be registered in ODAS and is then visible through Hive. It is possible to register UDFs that are local to the EMR cluster by creating the UDF (through Hive) in a local database, as sketched below.
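
For instance, a cluster-local UDF might hypothetically be registered through Hive itself using standard Hive DDL; the database name, class name, and jar path below are placeholders:

-- Hypothetical cluster-local UDF, created through Hive in a local (non-ODAS) database.
-- local_db, com.example.MyLocalUDF, and the jar path are placeholders.
CREATE FUNCTION local_db.my_local_udf AS 'com.example.MyLocalUDF'
USING JAR 's3://my-company/udfs/my-local-udf.jar';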

Common Errors

User is not admin

The user is not a catalog admin and sees:

AuthorizationException: User 'dev-user' does not have privileges to CREATE/DROP functions.

Dropping a function that is being used

Views that depend on the function fail with the error:

AnalysisException: okera_sample.mask() unknown.

Limitations

Cannot grant on UDF

Access controls on UDFs are not provided: only catalog admins can create them, and once created, any user can use them. It is not possible to grant usage of a UDF to a particular user or group.

Cannot issue CREATE/DROP FUNCTION through the Hive CLI or Beeline

The Hive client integration currently does not support these DDL statements. Support is expected to be added in the near future.

SerDes

ODAS supports a subset of valid Hive SerDes: those that use text serialization. In this file format, the data consists of line-by-line text, and each line can be serialized arbitrarily.

Examples of supported SerDes:

  • Regex SerDe - each line's content is extracted with a regular expression
  • Thrift SerDe - each line is a serialized Thrift object
  • JSON SerDe - each line is a JSON object
  • Multi-Delimiter CSV SerDe - each line has a CSV structure with many options

Examples of unsupported SerDes include those for file formats with more sophisticated structures:

  • SequenceFile
  • Avro
  • Parquet
  • ORC

Note: SequenceFile, Avro, and Parquet are natively supported by ODAS and do not require a SerDe.

Using a SerDe

The SerDe library must be specified when creating the table. The DDL to use a SerDe is an extension of the CREATE TABLE statement with additional options. The path to the SerDe JAR must be specified; it can be any URI, but is typically an S3 path. The fully qualified Java class name of the SerDe must be specified as well. Optionally, any additional properties for the SerDe can be provided. The table can be partitioned or created with comments, just like other tables.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <tbl>(<SCHEMA>)
ROW FORMAT SERDE '<CLASS NAME OF SERDE>' JAR_PATH '<PATH TO SERDE JAR>'
[WITH SERDEPROPERTIES('<key>' = '<value>')]
[LOCATION | COMMENT | ...]

For example, to use the Hive RegexSerDe, assuming the library is available at s3://my-company/serdes/regex-serde.jar:

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  ts STRING,
  request STRING,
  status STRING,
  size STRING,
  referrer STRING,
  agent STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
JAR_PATH 's3://my-company/serdes/regex-serde.jar'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)

Tutorial for using Ç (Cedilla-C) as a delimiter

As an end-to-end example, load a dataset in S3 that uses Cedilla-C (Ç) as the delimiter, using the Hive MultiDelimitSerDe.

The folder used in this tutorial is available at: s3://cerebro-datasets/cedilla_sample/.

The folder contains a file named cedilla_sample.txt, with the following contents:

1Ç23
Ç45

The steps are:

  1. Create the database
odb dataset hive-ddl "create database if not exists multibyte_db"
  2. Create a table that uses Ç as the field delimiter
odb dataset hive-ddl "CREATE EXTERNAL TABLE multibyte_db.cedilla_sample(
  int1 int,
  int2 int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim' = 'Ç')
STORED AS TEXTFILE
LOCATION 's3://cerebro-datasets/cedilla_sample/'"
  3. Read from the table to make sure everything works
./odb dataset cat multibyte_db.cedilla_sample
  4. The expected output is:
--- multibyte_db.cedilla_sample ---
[
    {
        "int1": 1,
        "int2": 23
    },
    {
        "int1": 0,
        "int2": 45
    }
]
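
As with the earlier view example, access to the new table can then be granted; this optional follow-up assumes the analyst_role from that example:

GRANT SELECT ON TABLE multibyte_db.cedilla_sample TO ROLE analyst_role;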

Permission Cascading

When permissions are granted on S3 URIs, they are applied to any subdirectories contained within the directory to which the permissions are granted. This allows access to be granted on a top-level S3 bucket, or any arbitrarily deep subdirectory.
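
For example, a grant on a parent path also covers its subdirectories. This is a minimal sketch, assuming a GRANT ... ON URI form; the bucket, path, and role name are placeholders:

-- Assumed GRANT ... ON URI syntax; the bucket, path, and role are placeholders.
-- A grant on s3://company-bucket/data/ also covers s3://company-bucket/data/2023/, etc.
GRANT ALL ON URI 's3://company-bucket/data/' TO ROLE analyst_role;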