Native Python Integration (PyOkera)¶

PyOkera is the native Python library for clients interacting with Okera. Like the Java libraries, it calls the lower-level Okera services, interacting directly with the Okera Policy Engine (planner) and Enforcement Fleet worker services. Python clients can also reach Okera in other ways, in particular through the REST API. For simple applications, the REST API may be sufficient, but the native library offers more overall control and better performance. We recommend it for reading larger volumes of data.

Note: This library is currently in a preview phase. The APIs are subject to change and the performance characteristics are not in their final state.

Setup¶

PyOkera requires Okera 0.8.1 or greater. If running against an older version, scans will fail with a message to upgrade the server.

Dependencies¶

Required:

- Python 3.4+
- Linux: GCC (with C++ support)

Packages:
- easy_install
- pip
- six
- bit_array
- thriftpy2
- numpy *
- pandas *

* These packages are required to use the scan APIs, but they are not required for the metadata-related APIs.

Note: Python2 is not supported. There are no plans to support it.
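
Before installing, it can help to check the interpreter version and which packages are already importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is illustrative, and the import names passed to it are assumed to match the package names (an import name can differ from the PyPI name, so `bit_array` is left out of the example lists).

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# PyOkera requires Python 3.4 or later.
python_ok = sys.version_info >= (3, 4)

# Assumption: these import names match the package names listed above.
required = ["six", "thriftpy2"]
scan_only = ["numpy", "pandas"]  # needed only for the scan APIs

print("Python version OK:", python_ok)
print("Missing required packages:", missing_modules(required))
print("Missing scan-only packages:", missing_modules(scan_only))
```

An empty list means every module in that group was found.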

Installing from PyPI¶

The Python library is available on PyPI and can be installed using pip. These steps assume that Python 3 is already installed on the system.

Get pip.

curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"

Install pip.
```
sudo python3 get-pip.py
```
Install pyokera.
```
sudo pip3 install pyokera
```
Optionally install pandas by running:
```
sudo pip3 install pandas
```
To confirm that the install was successful, try importing it from the interpreter and getting the version.
```
import okera.odas
okera.odas.version()
# Should output the version string, for example 2.18.1
```

Note: The pandas install can fail due to missing system dependencies. For more information, see the pandas documentation or the full dependency installation examples below.

Installing with `easy_install`¶

If PyPI or pip is not accessible on the network, the client can be installed from Okera's release location in Amazon S3 using easy_install. After installing the dependencies, perform the following steps.

Download the library.

curl -O https://s3.amazonaws.com/okera-release-useast/2.18.1/client/pyokera.egg

Install the library.

easy_install --user pyokera.egg

Or, install system-wide.

[sudo] easy_install pyokera.egg

Full Dependency Installations¶

Here are some examples of how to install the dependencies in two different Amazon Web Services (AWS) based environments. Depending on your network restrictions, these may have to be adapted to use your package managers.

Full Installation on RHEL7¶

This assumes a minimal Red Hat Enterprise Linux 7 or CentOS 7 machine, for example, the base RHEL7 AMI on AWS.

# Basic Python install and required dependencies.
sudo rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y gcc-c++ python34.x86_64 python34-devel
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python3 get-pip.py
sudo pip3 install six bit_array thriftpy2==0.3.12
curl -O https://s3.amazonaws.com/okera-release-useast/2.18.1/client/pyokera.egg
sudo easy_install pyokera.egg

# Optional packages (installing pandas can take a while)
pip3 install Cython numpy pandas

Establishing Dependencies on Amazon EMR¶

These instructions install all dependencies on a fresh Amazon Elastic MapReduce (EMR) instance.

Note: This instance typically has Python2 and Python3 installed. PyOkera is only supported for Python3.

curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python3 get-pip.py
sudo /usr/local/bin/pip3 install six bit_array thriftpy2==0.3.12
curl -O https://s3.amazonaws.com/okera-release-useast/2.18.1/client/pyokera.egg
sudo easy_install-3.4 pyokera.egg

# Optional packages (installing pandas can take a while)
sudo /usr/local/bin/pip3 install numpy pandas

Using PyOkera¶

Once PyOkera is installed, development can begin.

The PyOkera API has its own documentation site, which describes each of the APIs in detail.

Getting Started¶

In a typical application, first create a context object. The context object represents state that is shared between Okera connections (and therefore also requests), including user credentials. From the context object, users create connection objects, which can be used to execute DDL and scan requests against Okera.

For further data manipulation, the connection object provides utilities to read an entire dataset into pandas.

The following example reads the first dataset from the okera_sample database:

from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    dataset = conn.list_dataset_names('okera_sample')[0]
    # Named df rather than pd to avoid shadowing the usual pandas alias
    df = conn.scan_as_pandas(dataset)
    print(df)
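
Hard-coding the host and port works for a local test, but scripts often run against more than one cluster. The following is a minimal sketch of one way to make the connection settings configurable. The environment variable names `OKERA_HOST` and `OKERA_PORT` and the helper `connection_settings` are hypothetical and not part of PyOkera; the defaults mirror the example above.

```python
import os


def connection_settings(env=None):
    """Build keyword arguments for ctx.connect() from environment variables.

    OKERA_HOST and OKERA_PORT are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("OKERA_HOST", "localhost"),
        "port": int(env.get("OKERA_PORT", "12050")),
    }


# A plain dict stands in for the environment here.
settings = connection_settings({"OKERA_HOST": "okera.example.com"})
print(settings)

# With PyOkera installed, the settings can be passed straight through:
#   from okera import context
#   ctx = context()
#   with ctx.connect(**connection_settings()) as conn:
#       print(conn.list_databases())
```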

Authentication¶

PyOkera supports authentication using Kerberos or tokens. It also supports connecting to unauthenticated servers, which should only be used in development. Authentication information is stored in the context object.

No Configuration Specified¶

When the context object is created without explicit configuration, it automatically configures either token-based authentication or no authentication. If a token for the current user is available (typically in ~/.okera/token), token authentication is configured; otherwise, no authentication is used.

Example: Configuring authentication

from okera import context
ctx = context()
ctx.get_auth() # Will be None or 'TOKEN'

# To disable authentication:
ctx.disable_auth()
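
When the reported authentication mode is surprising, checking for the token file is a quick first step. The following is a rough sketch that looks only at the typical location named above; PyOkera may consult other locations, so treat the result as a guess. The helper names `default_token_path` and `expected_auth` are illustrative.

```python
from pathlib import Path


def default_token_path(home=None):
    """Return the location where a user token is typically stored."""
    home = Path.home() if home is None else Path(home)
    return home / ".okera" / "token"


def expected_auth(home=None):
    """Guess what ctx.get_auth() is likely to report: 'TOKEN' or None."""
    return "TOKEN" if default_token_path(home).is_file() else None


print(default_token_path())
print(expected_auth())
```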

Enabling Token Auth¶

To set the user token, call enable_token_auth() and specify either the token_str or the token_file argument.

- The token_str should be the token text.
- The token_file is the path to a file containing the token.

Example: Enabling token authentication

from okera import context
ctx = context()
ctx.enable_token_auth(token_str='super-secret-token-string')
# OR
ctx.enable_token_auth(token_file='/path/to/super/secret')
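
To keep tokens out of source code, the choice between the two arguments can be driven by the environment. The following is a minimal sketch; the variable names `OKERA_TOKEN` and `OKERA_TOKEN_FILE` and the helper `token_auth_kwargs` are hypothetical and not part of PyOkera.

```python
import os


def token_auth_kwargs(env=None):
    """Choose between token_str and token_file for enable_token_auth().

    OKERA_TOKEN and OKERA_TOKEN_FILE are hypothetical variable names.
    Returns an empty dict when neither is set.
    """
    env = os.environ if env is None else env
    if env.get("OKERA_TOKEN"):
        return {"token_str": env["OKERA_TOKEN"]}
    if env.get("OKERA_TOKEN_FILE"):
        return {"token_file": env["OKERA_TOKEN_FILE"]}
    return {}


# A plain dict stands in for the environment here.
kwargs = token_auth_kwargs({"OKERA_TOKEN_FILE": "/path/to/super/secret"})
print(kwargs)

# With PyOkera installed:
#   from okera import context
#   ctx = context()
#   kwargs = token_auth_kwargs()
#   if kwargs:
#       ctx.enable_token_auth(**kwargs)
```

In this sketch the token text takes precedence when both variables are set.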

Enabling Kerberos¶

To enable Kerberos, call enable_kerberos() and specify the service principal name. This assumes you have already run kinit locally. The caller must specify the service name (the first segment of the principal) and can optionally specify the hostname.

Example: Connecting to a server with Kerberos principal okera/service@OKERA.COM

from okera import context
ctx = context()
ctx.enable_kerberos('okera', host_override='service')
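
If the principal is available as a single string, it can be split into the pieces that enable_kerberos() expects. The following is a small sketch; the helper `split_principal` is illustrative and handles only the plain service/host@REALM form.

```python
def split_principal(principal):
    """Split 'service/host@REALM' into (service, host, realm).

    host and realm are None when the principal omits them.
    """
    name, _, realm = principal.partition("@")
    service, _, host = name.partition("/")
    return service, host or None, realm or None


service, host, realm = split_principal("okera/service@OKERA.COM")
print(service, host, realm)

# With PyOkera installed and a valid ticket from kinit:
#   from okera import context
#   ctx = context()
#   ctx.enable_kerberos(service, host_override=host)
```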

Metadata APIs¶

connect() Creates a connection to Okera. To create a connection, call connect() on the context object. Callers should call close() when done or use a with block for scoped cleanup.

Arguments:
- host -- Hostname of the Okera Policy Engine.
- port -- The port at which the Okera Policy Engine is listening.

list_databases() This function takes no arguments and returns all the user-accessible databases.

list_dataset_names(db) This function returns the names of all the datasets in a database. A database name must be specified.

The following example collects the dataset names from every accessible database:

from okera import context
ctx = context()
# Configure auth if necessary
with ctx.connect(host='localhost', port=12050) as conn:
    all_datasets = []
    for db in conn.list_databases():
        all_datasets.append(conn.list_dataset_names(db))
    print(all_datasets)
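
When the database for each dataset matters, a dictionary keyed by database name can be easier to work with than a list. The following is a minimal sketch. `FakeConnection` is a stand-in used only so the snippet runs without a server, and the fully qualified names it returns are an assumption about the result format.

```python
def datasets_by_database(conn):
    """Map each accessible database name to the list of its dataset names."""
    return {db: conn.list_dataset_names(db) for db in conn.list_databases()}


class FakeConnection:
    """Stand-in for a PyOkera connection, used only for illustration."""

    def list_databases(self):
        return ["okera_sample"]

    def list_dataset_names(self, db):
        # Assumption: names come back fully qualified.
        return [db + ".sample", db + ".users"]


catalog = datasets_by_database(FakeConnection())
print(catalog)
```

With a real connection, pass conn from ctx.connect() instead of the stand-in.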

plan() This is a low-level API to plan a scan request.

Arguments:
- request (str) -- The fully qualified dataset name or SQL statement to scan. This argument is required. Note that if SQL is specified, it is subject to the same SQL restrictions that Okera supports.
- requesting_user (str) -- The name of the user for whom the plan is requested, if it is different from the current user. This argument is optional.
- client (enum) -- The TAuthorizeQueryClient enum value of the client to use for SQL rewrite planning. This argument is optional.
- min_task_size (int) -- For testing only, this controls the minimum number of tasks generated by Okera. This argument is optional.
- cluster_id (str) -- The name or ID of the external nScale cluster making the plan request. This argument is optional. Setting this makes the plan request return nScale tasks with presigned Amazon S3 URIs.
- defer_task_url_signing (bool) -- Indicates whether deferred nScale task URI signing is requested. Valid values are True and False. The default is False. This argument is optional.

The result of this API request is typically sent to the worker and contains an internally serialized binary payload. However, the result also contains useful low-level information:
```
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    result = conn.plan('okera_sample.users')
    print(result.warnings)
    # Warnings that were generated while planning
    print(len(result.tasks))
    # Total number of worker tasks that will need to run
    print(len(result.schema.cols))
    # Number of columns in result schema
```
It is also possible to pass a supported SQL request.
```
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    result = conn.plan('select mask_ccn(ccn) from okera_sample.users')
    assert len(result.schema.cols) == 1
```
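
For logging or quick comparisons between requests, the fields shown above can be gathered into one dictionary. The following is a minimal sketch. The helper `summarize_plan` is illustrative, and the `SimpleNamespace` object is a stand-in with made-up values that only mimics the attributes used in the earlier example.

```python
from types import SimpleNamespace


def summarize_plan(result):
    """Collect the counts from a plan result into a dictionary."""
    return {
        # "or []" guards against a missing warnings list (an assumption).
        "warnings": len(result.warnings or []),
        "tasks": len(result.tasks),
        "columns": len(result.schema.cols),
    }


# Stand-in object with the same attribute names, used only for illustration.
fake_result = SimpleNamespace(
    warnings=[],
    tasks=["task-0", "task-1"],
    schema=SimpleNamespace(cols=["ccn"]),
)
print(summarize_plan(fake_result))
```
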
execute_ddl(sql) This API takes a single string argument that is the SQL string. The supported SQL is the same as for any direct Okera API call. The result set is a table, returned as a list of lists of strings (row major).

As an example, the following lists the roles and formats the output using prettytable (pip3 install prettytable):
```
from okera import context
from prettytable import PrettyTable
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    result = conn.execute_ddl('show roles')
    t = PrettyTable()
    for row in result:
        t.add_row(row)
    print(t)
```
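
If installing prettytable is not an option, the row-major result can be aligned with the standard library alone. The following is a minimal sketch that assumes every row has the same number of cells; the helper `format_rows` and the sample values are made up for illustration.

```python
def format_rows(rows):
    """Left-align each column of a row-major list of lists of strings."""
    if not rows:
        return ""
    # Width of each column is the length of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append(" | ".join(cells).rstrip())
    return "\n".join(lines)


# Rows in the shape execute_ddl() returns; the values are made up.
sample = [["admin_role", "admins"], ["analyst", "analysts"]]
print(format_rows(sample))
```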

Scanning data¶

PyOkera supports two scan APIs: scan_as_pandas() and scan_as_json(). They behave identically, except in the result structure. The scan_as_pandas() API returns the result as a pandas DataFrame. The scan_as_json() API returns the result as a list of JSON objects.

If the results need further processing, use scan_as_pandas(), which is the faster of the two APIs.

Both APIs take the following arguments:

- request (str) -- Fully qualified dataset name or SQL statement to scan.
- max_records (int) -- Optional. Maximum number of records to return. The default is unlimited.

Example: Scanning the okera_sample.sample dataset as JSON

from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    results = conn.scan_as_json('okera_sample.sample')
    print(results)
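
A list of JSON objects is convenient to hand to other tools as newline-delimited JSON. The following is a minimal sketch that assumes each record is a dictionary. The helper `to_json_lines` and the sample records are made up, and `default=str` is a fallback for values the json module cannot serialize directly.

```python
import json


def to_json_lines(records):
    """Serialize a list of record dicts as newline-delimited JSON."""
    return "\n".join(
        json.dumps(record, sort_keys=True, default=str) for record in records
    )


# Made-up records in the assumed shape of a scan_as_json() result.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(to_json_lines(records))
```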

Example: Returning the first 10,000 ccn values from okera_sample.users as a pandas DataFrame

from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    df = conn.scan_as_pandas('SELECT ccn from okera_sample.users', max_records=10000)
    # Print the summary; outside an interactive session the bare call shows nothing
    print(df.describe())
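
When the column list comes from configuration or user input, the request string can be assembled and validated before it is sent. The following is a minimal sketch. The helper `build_select` is illustrative, and its identifier rule is a deliberately conservative assumption rather than a statement of what Okera SQL accepts.

```python
import re

# Conservative assumption: letters, digits and underscores only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_select(dataset, columns):
    """Build 'SELECT c1, c2 FROM db.table' from validated identifiers."""
    columns = list(columns)
    parts = dataset.split(".")
    names = columns + parts
    if len(parts) != 2 or not columns or not all(_IDENTIFIER.match(n) for n in names):
        raise ValueError("expected db.table and simple column names")
    return "SELECT {} FROM {}".format(", ".join(columns), dataset)


request = build_select("okera_sample.users", ["ccn"])
print(request)

# With PyOkera installed and an open connection:
#   df = conn.scan_as_pandas(request, max_records=10000)
```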