Okera Catalog Overview

Okera Catalog is a unified set of common services that provide vital details to users and to the Okera platform itself. It currently includes the following services:

  • Schema Registry
  • Policy Engine

Each of these is described in the sections that follow. First, though, here is a clarification of the user roles seen in practice and how they relate to the Okera Catalog services.

Okera User Roles

Data owners or stewards publish datasets and access policies. They audit access to data they want to share with others, such as for analytics or building new applications and data flows. Meanwhile, data analysts use the Catalog services to discover and understand datasets.

Data producers publish datasets that they want to share with others for analytics or building new applications and workflows. They typically use the following functionality:

  • Register datasets
  • Define tags and other metadata
  • Define descriptions of datasets
  • Define access policies
  • See audit trail of access

The first three tasks are handled by the Okera Schema Registry, while the access policies are handled by the Okera Policy Engine.

Further, consumers (for example, the data analysts) use the Catalog services to discover and understand datasets as well as request access where they don’t have access. Consumers typically use the following functionality:

  • Discover datasets
  • Understand metadata (description, schema)
  • Understand what they have access to
  • Preview samples
  • Request access

The first two are handled by the Schema Registry, while the last is done using the Policy Engine.

Okera Schema Registry

The Schema Registry gives big-data stacks the ability to store and consume dataset metadata in a platform-agnostic manner. It makes metadata, and the capabilities built around metadata, a function of the data itself rather than of the platform used to store or process the data. The disaggregation of the data stack has enabled companies to construct data platforms from different storage systems and different access engines, each suited to its role, without being limited to a vertically integrated solution whose functionality is constrained to what that one system provides.

The Schema Registry is a distributed service that coordinates the storage of dataset definitions and other dataset-related metadata. Typically, a single instance of the Okera Schema Registry is deployed and shared across multiple teams. The Schema Registry exposes the standard Hive Metastore interface as well as REST APIs for interacting with the metadata. This makes it the long-running, common metastore for the different clients, including the many Hadoop components, irrespective of which infrastructure they run in (on-premises or cloud) and which analytics tools are used for consuming data (such as AWS’s EMR, Cloudera, HWX, MapR, or a custom implementation of any of the frameworks they are based on).

Typically, a data producer (either a user working interactively with the system or an automated data pipeline job) sends a data-definition language (DDL) SQL statement to the Schema Registry service by means of the Catalog API. For example, the following statement creates a table over an existing location in AWS S3.

Example: Creating an external TABLE schema

CREATE EXTERNAL TABLE sales.transactions_schemaed(
  txnid BIGINT,
  dt_time STRING,
  sku STRING,
  userid INT,
  price FLOAT,
  creditcard STRING,
  ip STRING)
COMMENT 'Online transactions 2016'
LOCATION 's3://sales-data/transactions';

There are commands to CREATE, ALTER, or DROP objects, including databases and tables. In the context of the Okera platform, tables are generically referred to as datasets, such as the one shown in the example.
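
As a rough sketch of the other statement types, the following assumes Hive-style DDL syntax; the exact forms accepted by the Catalog are not confirmed here. The sales.transactions_staging table and the currency column are hypothetical names used only for illustration.

```sql
-- Assumption: Hive-style DDL. Names not used in the example above are hypothetical.

-- Create a database to hold datasets, if it does not exist yet.
CREATE DATABASE IF NOT EXISTS sales;

-- Add a column to the dataset registered above.
ALTER TABLE sales.transactions_schemaed ADD COLUMNS (currency STRING);

-- Remove a dataset definition that is no longer needed.
DROP TABLE IF EXISTS sales.transactions_staging;
```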

Further reading:

  • See the Catalog API document for the list of REST endpoints supported by the Catalog services.

Okera Policy Engine

The Policy Engine enables end users to request access to data as needed. It also enforces the access levels granted to users as they consume the data. Like the Schema Registry, the Policy Engine is a shared service that can serve many ODAS clusters.

Access to databases, datasets, and so on (commonly referred to as objects) is granted to arbitrary roles. These roles are then assigned to groups of known users, which enables role-based access control. The user groups are provided by a variety of sources, including Active Directory (AD), LDAP, JWT, and more. The next example creates a role and assigns it to two different user groups.

Example: Creating a role and assigning it to user groups

CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
GRANT ROLE analyst_role TO GROUP admins;

Then we can define a VIEW over the table created in the previous section, applying tokenization and masking functions that allow only the user admin to gain full access to the field data. All the other users would see the redacted content only. The GRANT statement is required to give read access to the shared role.

Example: Creating a VIEW with masking and allowing access to it

CREATE VIEW sales.transactions AS
SELECT
  if (has_access('sales.transactions_schemaed'), userid, tokenize(userid)) as userid,
  if (has_access('sales.transactions_schemaed'), creditcard, mask_ccn(creditcard)) as creditcard,
  if (has_access('sales.transactions_schemaed'), ip, cast(tokenize(ip) as STRING)) as ip
FROM sales.transactions_schemaed;

GRANT SELECT ON TABLE sales.transactions TO ROLE analyst_role;
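
To illustrate the effect, a member of the analysts group could then query the view as sketched below. The SHOW GRANT ROLE statement is an assumption borrowed from Hive/Impala-style SQL and may differ in Okera. No output is shown because it depends on the data.

```sql
-- Assumption: inspect the privileges held by the role (Hive/Impala-style statement).
SHOW GRANT ROLE analyst_role;

-- Preview a few rows. Without access to sales.transactions_schemaed,
-- userid and ip come back tokenized and creditcard comes back masked.
SELECT userid, creditcard, ip
FROM sales.transactions
LIMIT 5;
```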

As in an RDBMS, the GRANT <PRIVILEGE> ... WITH GRANT OPTION variant of the statement lets you delegate control over a subset of the data to a dedicated set of users (again via a role).
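
A minimal sketch of such a delegation, assuming the statement follows the same pattern as the GRANT shown earlier. The role sales_steward_role and the group sales_stewards are hypothetical names, and granting at the database level is an assumption.

```sql
-- Hypothetical role and group names, for illustration only.
CREATE ROLE sales_steward_role;
GRANT ROLE sales_steward_role TO GROUP sales_stewards;

-- Members of sales_stewards can read everything in the sales database
-- and can in turn grant SELECT on it to other roles.
GRANT SELECT ON DATABASE sales TO ROLE sales_steward_role WITH GRANT OPTION;
```

Delegating through a role rather than to individual users keeps the grant tied to group membership, so it follows people as they join or leave the group.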

The assigned privileges are stored by the Policy Engine and enforced by ODAS as data is accessed by a client.