Okera Metadata Services Overview

Okera Metadata Services are a unified, common set of services that provide vital details to users and the Okera platform itself. Currently the following services are included:

  • Okera Schema Registry
  • Okera Policy Engine

Each of these is described in the following sections. First, though, here is a clarification of the user roles we see in practice and how they relate to the Okera Metadata Services.

Okera User Roles

Data owners or stewards publish datasets and access policies. They audit access to data they want to share with others, such as for analytics or building new applications and data flows. Meanwhile, data analysts use the Catalog services to discover and understand datasets.

Data producers publish datasets that they want to share with others for analytics or building new applications and workflows. They typically use the following functionality:

  • Register datasets
  • Define tags and other metadata
  • Define descriptions of datasets
  • Define access policies
  • See audit trail of access

The first three tasks are handled by the Okera Schema Registry, while the access policies are handled by the Okera Policy Engine.

Further, consumers (for example, the data analysts) use the Catalog services to discover and understand datasets as well as request access where they don’t have access. Consumers typically use the following functionality:

  • Discover datasets
  • Understand metadata (description, schema)
  • Understand what they have access to
  • Preview samples
  • Request access

The first two are handled by the Schema Registry, while the last is done using the Policy Engine.

Okera Schema Registry

The Schema Registry gives big-data stacks the ability to store and consume dataset metadata in a platform-agnostic manner. It makes metadata, and the capabilities built around metadata, a function of the data itself rather than of the platform used to store or process the data. The disaggregation of the data stack has enabled companies to construct data platforms from different storage systems and access engines, each suited to its role, without being limited to a vertically integrated solution that constrains functionality to what that one system provides.

The Schema Registry is a distributed service that coordinates the storage of dataset definitions and other dataset-related metadata. Typically, a single instance of the Okera Schema Registry is deployed and shared across multiple teams and clusters, which can be located in different environments, for example on-premises and in the cloud, or across multiple cloud services. The Schema Registry exposes the standard Hive Metastore interface and REST APIs for interacting with the metadata. This makes it the long-running, common metastore for different clients, including the many Hadoop components, irrespective of the infrastructure they run in and the analytics tools used for consuming data (such as AWS’s EMR, Cloudera's CDH/CDP, or a custom implementation of any of the frameworks they are based on).

Typically, a data producer (either a user working interactively with the system or an automated data pipeline job) sends a data definition language (DDL) SQL statement to the Schema Registry service by means of the Metadata API. For example, the following statement creates a table over an existing location in AWS S3.

Example: Creating an external TABLE schema

CREATE EXTERNAL TABLE sales.transactions(
  txnid BIGINT,
  dt_time STRING,
  sku STRING,
  userid INT ATTRIBUTE misc.guid,
  price FLOAT,
  creditcard STRING ATTRIBUTE pii.credit_card,
  ip STRING ATTRIBUTE pii.ip_address)
COMMENT 'Online transactions 2016'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://sales-data/transactions';

There are commands to CREATE, ALTER, or DROP objects, including databases and tables. In the context of the Okera platform, tables are generically referred to as datasets, such as the one shown in the example.

Note also how some of the columns have been tagged with attributes that are later used to apply access policies with implicit transformations. More on the topic of attribute-based access control can be found in the documentation.
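
To make the tagging concrete, the following Python sketch models the column-to-attribute assignments from the sales.transactions example as plain data and looks up which columns carry an attribute from a given namespace. It is only an illustration: the COLUMN_ATTRIBUTES dictionary and the columns_with_namespace helper are hypothetical names and are not part of any Okera API.

```python
# Hypothetical in-memory model of the attributes assigned in the
# CREATE EXTERNAL TABLE example; this is not an Okera API.
COLUMN_ATTRIBUTES = {
    "txnid": [],
    "dt_time": [],
    "sku": [],
    "userid": ["misc.guid"],
    "price": [],
    "creditcard": ["pii.credit_card"],
    "ip": ["pii.ip_address"],
}


def columns_with_namespace(column_attributes, namespace):
    """Return the columns that carry at least one attribute in `namespace`."""
    prefix = namespace + "."
    return [
        column
        for column, attributes in column_attributes.items()
        if any(attribute.startswith(prefix) for attribute in attributes)
    ]


if __name__ == "__main__":
    # Lists the columns tagged with any attribute in the "pii" namespace.
    print(columns_with_namespace(COLUMN_ATTRIBUTES, "pii"))
```

A policy that refers to attributes rather than column names keeps working when further columns receive the same tag, which is the main reason to attach the tags at registration time.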

As an alternative to SQL statements, a data producer can use the Data Registration tab in the WebUI to automatically scan data sources and discover datasets. During registration, the auto-tagger service helps assign appropriate attributes to schemas. This allows onboarding datasets without writing a single line of SQL code.

Further reading:

  • See the Metadata Services API document for the list of REST endpoints supported by the Metadata Services.

Okera Policy Engine

The Policy Engine enables end users to request access to data as needed. It also enforces the granted access levels of users as they consume the data. Like the Schema Registry, the Policy Engine is a shared service that can serve many ODAS clusters across heterogeneous environments.

Access to databases, datasets, and other entities, commonly referred to as objects, is granted to arbitrary roles. These roles are then assigned to groups of known users, which enables role-based access control. The user groups can come from a variety of sources, including Active Directory (AD), LDAP, JWT, and more.
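
The chain from user to group to role to privilege can be sketched in a few lines of Python. This is a conceptual model and not how the Policy Engine is implemented. The users alice and bob and the group engineers are hypothetical, while analyst_role, analysts, and sales.transactions reuse names from the SQL examples on this page.

```python
# Hypothetical data for illustration only.
# Group membership would normally come from a source such as AD or LDAP.
USER_GROUPS = {"alice": {"analysts"}, "bob": {"engineers"}}
# Roles assigned to groups.
GROUP_ROLES = {"analysts": {"analyst_role"}}
# Privileges held by each role, as (privilege, object) pairs.
ROLE_PRIVILEGES = {"analyst_role": {("SELECT", "sales.transactions")}}


def is_allowed(user, privilege, obj):
    """Check access by walking user -> groups -> roles -> privileges."""
    for group in USER_GROUPS.get(user, set()):
        for role in GROUP_ROLES.get(group, set()):
            if (privilege, obj) in ROLE_PRIVILEGES.get(role, set()):
                return True
    return False


if __name__ == "__main__":
    print(is_allowed("alice", "SELECT", "sales.transactions"))
    print(is_allowed("bob", "SELECT", "sales.transactions"))
```

Because privileges attach to roles and never directly to users, changing who can read a dataset is a matter of changing group membership or role assignments, not of editing individual grants.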

Combined with the ability to tag objects, you can also use attribute-based access control, which allows, for example, data stewards to flag data in datasets to be treated in a specific manner. For instance, you can flag a column as containing credit card numbers and allow regular users access only to a masked version of the values.

The next example creates a role and assigns it to a user group.

Note

The following SQL statements are for more hands-on users. Consider managing roles in the WebUI for a code-free way to manage permissions.

Example: Creating a role and assigning it to a user group

CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;

Then we can define a policy that grants read (SELECT) access to the table created in the previous section, while implicitly masking the credit card and IP address column values. The policy is assigned to the analyst_role, which means all users that are part of the group analysts will be able to safely access the table.

Example: Creating a policy that allows read access with masking for tagged columns

GRANT SELECT ON TABLE sales.transactions
TRANSFORM IN(pii.credit_card, pii.ip_address) WITH mask()
TO ROLE analyst_role;

As in an RDBMS, the GRANT <PRIVILEGE> ... WITH GRANT OPTION variant of the statement lets you delegate control over a subset of the data to a dedicated set of users (again, by means of a role).

The assigned privileges are stored by the Policy Engine and enforced by ODAS as data is accessed by a client.
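
The effect of such a policy at read time can be illustrated with a short Python sketch. It is a conceptual model under stated assumptions, not the ODAS implementation: the POLICIES and COLUMN_ATTRIBUTES dictionaries, the apply_policy helper, and the sample row are hypothetical, and the mask function here simply replaces every character with an X, which may differ from what Okera's mask() actually returns.

```python
# Attributes assigned to columns (taken from the table example).
COLUMN_ATTRIBUTES = {
    "userid": {"misc.guid"},
    "creditcard": {"pii.credit_card"},
    "ip": {"pii.ip_address"},
}

# Hypothetical policy store: for each role that may read the table,
# the attributes whose columns must be masked.
POLICIES = {"analyst_role": {"pii.credit_card", "pii.ip_address"}}


def mask(value):
    """Assumed masking behavior: replace every character with 'X'."""
    return "X" * len(str(value))


def apply_policy(row, role):
    """Return the row as `role` would see it, masking tagged columns."""
    if role not in POLICIES:
        raise PermissionError(f"role {role!r} has no access to this table")
    masked_attributes = POLICIES[role]
    visible = {}
    for column, value in row.items():
        tags = COLUMN_ATTRIBUTES.get(column, set())
        # Mask the value if the column carries any attribute named in the policy.
        visible[column] = mask(value) if tags & masked_attributes else value
    return visible


if __name__ == "__main__":
    sample = {"txnid": 1, "sku": "A-100",
              "creditcard": "4111111111111111", "ip": "10.0.0.1"}
    print(apply_policy(sample, "analyst_role"))
```

Untagged columns such as txnid and sku pass through unchanged, so analysts can still join and aggregate on them while the sensitive values stay hidden.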