Native Python Integration (PyOkera)¶
PyOkera is the native Python library for clients interacting with ODAS. Like the Java libraries, it calls the lower-level ODAS services and interacts directly with the planner and worker services. Python clients can also reach ODAS in other ways, in particular through the REST API. For simple applications, the REST API may be sufficient. The native library provides more overall control and better performance, and we recommend it for reading larger volumes of data.
Note: This library is currently in a preview phase. The APIs are subject to change and the performance characteristics are not in their final state.
PyOkera requires ODAS 0.8.1 or greater. If running against an older version, scans will fail with a message to upgrade the server.
- Python 3.4+.
- Linux: GCC (with C++ support)
* These packages are required to use the scan APIs, but they are not required for the metadata-related APIs.
Note: Python2 is not supported. There are no plans to support it.
Installing from PyPI¶
The Python library is available on PyPI and can be installed using pip. The commands below assume that Python 3 is already installed on the system.
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python3 get-pip.py
sudo pip3 install pyokera
Optionally install pandas by running:
sudo pip3 install pandas
To confirm that the install was successful, try importing it from the interpreter and getting the version.
import okera.odas
okera.odas.version()  # Should output the version string, for example 2.10.0
Note: The pandas install can fail due to system dependencies. For more information, see the pandas docs or the docs below.
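As a quick follow-up to the install check, the sketch below reports which packages Python can locate without importing them. It uses only the standard library. The helper name `importable` is hypothetical and not part of PyOkera.

```python
import importlib.util


def importable(module_names):
    """Map each top-level module name to True if Python can locate it.

    Hypothetical helper for illustration; not part of PyOkera.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# 'okera' is required; 'pandas' is optional.
for name, found in importable(["okera", "pandas"]).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If `pandas` is reported as missing, the metadata-related calls can still be used.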
If PyPI or pip are not accessible on the network, it is possible to install the client from Okera's release location in S3 using
After installing the dependencies, perform the following steps.
Download the library.
curl -O https://s3.amazonaws.com/okera-release-useast/2.10.0/client/pyokera.egg
Install the library.
easy_install --user pyokera.egg
Or, install system-wide.
[sudo] easy_install pyokera.egg
Full Dependency Installations¶
Here are some examples of how to install the dependencies in two different Amazon Web Services (AWS) based environments. Depending on your network restrictions, these may have to be adapted to use your package managers.
Full Installation on RHEL7¶
This assumes a minimal Red Hat Enterprise Linux 7 or CentOS 7 machine, for example, the base RHEL7 AMI on AWS.
# Basic Python install and dependencies; this satisfies the requirements.
sudo rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y gcc-c++ python34.x86_64 python34-devel
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python3 get-pip.py
sudo pip3 install six bit_array thriftpy2==0.3.12
curl -O https://s3.amazonaws.com/okera-release-useast/2.10.0/client/pyokera.egg
sudo easy_install pyokera.egg
# Optional packages (installing pandas can take a while)
pip3 install Cython numpy pandas
Establishing Dependencies on Amazon EMR¶
These instructions install all dependencies on a fresh Amazon Elastic MapReduce (EMR) instance.
Note: This instance typically has Python2 and Python3 installed. PyOkera is only supported for Python3.
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python3 get-pip.py
sudo /usr/local/bin/pip3 install six bit_array thriftpy2==0.3.12
curl -O https://s3.amazonaws.com/okera-release-useast/2.10.0/client/pyokera.egg
sudo easy_install-3.4 pyokera.egg
# Optional packages (installing pandas can take a while)
sudo /usr/local/bin/pip3 install numpy pandas
Once PyOkera is installed, development can begin.
The PyOkera API has its own documentation site, which describes each of the APIs in detail.
In a typical application, first create a context object. The context object represents state that is shared between ODAS connections (and therefore also requests), including user credentials. From the context object, users create connection objects, which can be used to execute DDL and scan requests against ODAS.
For further data manipulation, the library object provides utilities to read an entire dataset into pandas.
The following example reads the first dataset from the
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    # list_dataset_names() returns a list of names; scan only the first one.
    dataset = conn.list_dataset_names('okera_sample')[0]
    df = conn.scan_as_pandas(dataset)
    print(df)
PyOkera supports authentication using Kerberos or tokens.
It also supports connecting to unauthenticated servers, which should only be used in development.
Authentication information is stored in the
No Configuration Specified¶
When the context object is created, it automatically configures token-based authentication or no authentication.
If a token for the current user is available (typically in ~/.okera/token), then token auth is configured.
Example: Configuring authentication
from okera import context
ctx = context()
ctx.get_auth()  # Will be None or 'TOKEN'
# To disable authentication:
ctx.disable_auth()
Enabling Token Auth¶
To set the user token, call enable_token_auth() and specify either the token_str or the token_file argument.
token_str should be the token text.
token_file is the path to a file containing the token.
Example: Enabling token authentication
from okera import context
ctx = context()
ctx.enable_token_auth(token_str='super-secret-token-string')
# OR
ctx.enable_token_auth(token_file='/path/to/super/secret')
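The choice between token text and a token file can be sketched as a small helper that returns the token from whichever source is given. This is a minimal illustration, not PyOkera's own logic. The name `read_token`, the "not both" rule, and the fallback to `~/.okera/token` are assumptions made for the example.

```python
import os
from pathlib import Path


def read_token(token_str=None, token_file=None):
    """Return the token text, or None if no token can be found.

    Hypothetical helper for illustration; not part of PyOkera.
    """
    if token_str is not None and token_file is not None:
        raise ValueError("specify token_str or token_file, not both")
    if token_str is not None:
        return token_str.strip()
    # Assumption: fall back to the typical per-user token location.
    default_path = os.path.expanduser("~/.okera/token")
    path = Path(token_file) if token_file is not None else Path(default_path)
    if not path.is_file():
        return None
    # Token files often end with a newline; strip surrounding whitespace.
    return path.read_text().strip()
```

The returned text could then be passed as `token_str` to `enable_token_auth()`.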
To enable Kerberos, call enable_kerberos() and specify the service principal name.
This assumes you have already run
The caller must specify the service name (the first segment of the principal) and can optionally specify the hostname.
Example: Connecting to a server with Kerberos principal
from okera import context
ctx = context()
ctx.enable_kerberos('okera', host_override='service')
connect() Creates a connection to ODAS. Callers should call close() when done or use a with statement for scoped cleanup. To create a connection, call connect() on the context object.
host -- Hostname of the planner.
port -- The port at which the planner is listening.
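When a with statement is not convenient, the explicit close() pattern described above can be sketched with try/finally. The helper name `with_connection` is hypothetical. It assumes only what the text states: that the context object has connect(host=..., port=...) and that the connection has close().

```python
def with_connection(ctx, host, port, action):
    """Open a connection, run action(conn), and always close the connection.

    Hypothetical helper for illustration; not part of PyOkera.
    """
    conn = ctx.connect(host=host, port=port)
    try:
        return action(conn)
    finally:
        # Runs whether action() returned normally or raised.
        conn.close()
```

For example, `with_connection(ctx, 'localhost', 12050, lambda conn: conn.list_databases())` would return the database list and close the connection afterward.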
list_databases() This function takes no arguments and returns all the user-accessible databases.
list_dataset_names(db) This function returns the names of all the datasets in a database. A database name must be specified.
The following example collects the dataset names from every accessible database:
from okera import context
ctx = context()
# Configure auth if necessary
with ctx.connect(host='localhost', port=12050) as conn:
    all_datasets = []
    for db in conn.list_databases():
        # extend() keeps one flat list of names across all databases.
        all_datasets.extend(conn.list_dataset_names(db))
    print(all_datasets)
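If the names should stay grouped by database, the same two calls can build a dictionary instead. This is a minimal sketch that relies only on list_databases() and list_dataset_names(db) as described above. The helper name `datasets_by_database` is hypothetical.

```python
def datasets_by_database(conn):
    """Return {database name: [dataset names]} for every accessible database.

    Hypothetical helper for illustration; not part of PyOkera.
    """
    return {
        db: list(conn.list_dataset_names(db))
        for db in conn.list_databases()
    }
```

It would be called inside the with block of the example above, passing the open connection.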
plan() This is a low-level API to plan a scan request.
request(str) -- The fully qualified dataset name or SQL statement to scan. This argument is required. Note that if SQL is specified, it is subject to the same SQL restrictions that ODAS supports.
requesting_user(str) -- The name of the user for whom the plan is requested, if it is different from the current user. This argument is optional.
client(enum) -- The TAuthorizeQueryClient enum value of the client to use for SQL rewrite planning. This argument is optional.
min_task_size(int) -- For testing only, this controls the minimum number of tasks generated by Okera. This argument is optional.
cluster_id(str) -- The name or ID of the external nScale cluster making the plan request. This argument is optional. Setting this makes the plan request return nScale tasks with presigned S3 URIs.
defer_task_url_signing(bool) -- Indicates whether deferred nScale task URI signing is requested. Valid values are true and false. The default is false. This argument is optional.
The result of this API request is typically sent to the worker and contains an internally serialized binary payload. The result also contains useful low-level information:
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    result = conn.plan('okera_sample.users')
    print(result.warnings)  # Warnings that were generated while planning
    print(len(result.tasks))  # Total number of worker tasks that will need to run
    print(len(result.schema.cols))  # Number of columns in result schema
It is also possible to pass a supported SQL request.
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    result = conn.plan('select mask_ccn(ccn) from okera_sample.users')
    assert len(result.schema.cols) == 1
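The fields used in the plan() examples can be gathered into a small summary for logging. This sketch touches only the attributes shown above (warnings, tasks, schema.cols). The helper name `summarize_plan` is hypothetical, and treating a missing warnings value as an empty list is an assumption.

```python
def summarize_plan(result):
    """Collect the low-level plan details into a plain dictionary.

    Hypothetical helper for illustration; not part of PyOkera.
    """
    return {
        # Assumption: warnings may be None when nothing was reported.
        "warnings": list(result.warnings or []),
        "task_count": len(result.tasks),
        "column_count": len(result.schema.cols),
    }
```

It would be called as `summarize_plan(conn.plan('okera_sample.users'))` on an open connection.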
execute_ddl(sql) This API takes a single string argument that is the SQL string. The supported SQL is the same as any direct ODAS API call. The result set is a table, returned as a list of lists of strings (row major).
As an example, to list the roles and format the output using prettytable (pip3 install prettytable):
from okera import context
from prettytable import PrettyTable
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    result = conn.execute_ddl('show roles')
    t = PrettyTable()
    for row in result:
        t.add_row(row)
    print(t)
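Where installing prettytable is not an option, a row-major list of lists of strings can be aligned with the standard library alone. This is a minimal sketch that assumes every row has the same number of cells. The helper name `format_rows` is hypothetical.

```python
def format_rows(rows):
    """Render a row-major list of lists of strings as aligned text columns.

    Hypothetical helper for illustration; not part of PyOkera.
    Assumes all rows have the same number of cells.
    """
    if not rows:
        return ""
    # zip(*rows) turns the row-major table into columns.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        padded = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)
```

With the result of `conn.execute_ddl('show roles')`, `print(format_rows(result))` would print one aligned line per row.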
PyOkera supports two scan APIs: scan_as_pandas() and scan_as_json().
They behave identically, except in the result structure.
The scan_as_pandas() API returns the result as a pandas DataFrame.
The scan_as_json() API returns the result as a list of JSON objects.
When there is a need to perform further processing, use the faster
Both APIs take as arguments:
request(str) -- Fully qualified dataset name or SQL statement to scan.
max_records(int) -- Optional: Maximum number of records to return. Default is unlimited.
Example: Scanning the okera_sample.sample dataset as JSON
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    results = conn.scan_as_json('okera_sample.sample')
    print(results)
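A list of JSON objects can be inspected without pandas. The sketch below counts the non-null values in each field. It assumes each record behaves like a Python dict keyed by column name, which the text does not state explicitly. The helper name `non_null_counts` is hypothetical.

```python
from collections import Counter


def non_null_counts(records):
    """Count, per field, how many records carry a non-None value.

    Hypothetical helper for illustration; not part of PyOkera.
    Assumes each record is a dict keyed by column name.
    """
    counts = Counter()
    for record in records:
        for key, value in record.items():
            if value is not None:
                counts[key] += 1
    return dict(counts)
```

It could be applied to the `results` list from the example above.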
Example: Returning the first 10000 values of the ccn column from okera_sample.users as a pandas DataFrame
from okera import context
ctx = context()
with ctx.connect(host='localhost', port=12050) as conn:
    df = conn.scan_as_pandas('SELECT ccn from okera_sample.users', max_records=10000)
    # Outside an interactive session, describe() must be printed to be seen.
    print(df.describe())