Using Glue as a 3rd Party Metadata Catalog

Starting with version 2.0.0, Okera supports using the AWS Glue catalog as a 3rd-party metadata catalog. This document explains the prerequisites, configuration, and advanced topics for this integration.

Prerequisites

This section deals with general prerequisites and general IAM privileges.

General Prerequisites

There are two prerequisites to integrating ODAS with your AWS Glue metadata catalog:

  1. ODAS version 2.0.0 or later.
  2. IAM and network policy based access between ODAS and the AWS Glue catalog. The IAM part of this is discussed in the next section.

General IAM Privileges

In addition to the general AWS prerequisites, the ODAS-Glue integration needs certain IAM privileges on top of the default ones to communicate with the AWS Glue catalog service.

The steps to set up IAM are as follows:

  1. Set up the ODAS cluster role with the default IAM privileges mentioned here.
  2. Log in to your AWS Web Console, and go to the IAM page.
  3. In the navigation panel, click Roles.
  4. Using the search bar, find the IAM role you created for ODAS earlier, and click on it.
  5. Click Attach policies.
  6. You'll see a list of pre-published default policies. On this screen, enter AWSGlueServiceRole in the search box. Select this policy from the shortlist, and click Attach policy at the bottom of the screen.

Once this is done, you can move on to the installation steps.
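The console steps above can also be scripted. The following is a minimal sketch, not an Okera-provided tool: it assumes the boto3 library and valid AWS credentials, and the role name `odas-cluster-role` is hypothetical. The IAM client is passed in as a parameter so the function can be exercised without touching a real account.

```python
# Managed policy named in the steps above, at the ARN path AWS publishes for it.
GLUE_SERVICE_ROLE_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def attach_glue_policy(iam_client, role_name):
    """Attach AWSGlueServiceRole to the role and return the attached policy ARNs.

    Only the first page of attached policies is read, which is enough for a
    role with a handful of policies.
    """
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_SERVICE_ROLE_ARN)
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    return [policy["PolicyArn"] for policy in response["AttachedPolicies"]]


# Usage (requires boto3 and AWS credentials; the role name is hypothetical):
#   import boto3
#   print(attach_glue_policy(boto3.client("iam"), "odas-cluster-role"))
```

Checking the returned list for the AWSGlueServiceRole ARN confirms the attachment before moving on.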

Note

For a more fine-grained set of policies to attach to your cluster's role, please look at the advanced topics section.

Configuration

This section deals with setup and cross-region support.

Setup

The configuration and setup process for ODAS using the AWS Glue catalog as its metadata catalog is as follows:

  • Follow the general steps to configure your ODAS cluster.
  • To use the ODAS-Glue integration, you need to modify your cluster configuration in your configuration file:
    • Look for the property named CATALOG_TYPE.
    • If CATALOG_TYPE does not exist, please add this as a new property.
    • Set the value of CATALOG_TYPE to glue.
    • Save this new config and apply it using okctl.

Cross-region support

Starting with version 2.1.0, ODAS clusters in one region can be configured to communicate with an AWS Glue metadata catalog in a different region. The only prerequisite is IAM and network policy based access between the region where ODAS runs and the region where the AWS Glue catalog is located.

To enable cross-region support between ODAS and Glue, modify your configuration file and apply it:

  • Add a new property to the config called GLUE_REGION.
  • Set the value of GLUE_REGION to the AWS region where the target AWS Glue metadata catalog is located.
  • Save this new config and apply it using okctl.
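Before applying the cross-region configuration, it can help to confirm that the cluster role can actually reach the Glue catalog in the target region. The sketch below is one way to do that, assuming boto3 and credentials for the ODAS cluster role; the region value shown in the usage comment is only an example and should match whatever GLUE_REGION is set to.

```python
def list_glue_databases(glue_client):
    """Return the names of all databases visible to the caller.

    Follows NextToken so that catalogs with many databases are listed fully.
    """
    names = []
    kwargs = {}
    while True:
        page = glue_client.get_databases(**kwargs)
        names.extend(db["Name"] for db in page["DatabaseList"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs = {"NextToken": token}


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue", region_name="us-west-2")  # same value as GLUE_REGION
#   print(list_glue_databases(glue))
```

An access-denied or timeout error here points to missing IAM or network access between the two regions rather than to an ODAS misconfiguration.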

Note

The ODAS-Glue integration works by communicating with the AWS Glue Metastore in real time for a seamless end-user experience. In cross-region mode, these calls may add network latency that end-users should be mindful of.

Advanced Topics

This section deals with the following two topics:

Restrictive IAM privileges

This section covers an overview, the essential properties, and example policies.

Overview

ODAS allows users to execute a whole range of metadata-related operations against different kinds of objects in the Glue catalog. These objects include, but are not limited to:

  • Catalog (The full scope of the metadata)
  • Databases
  • Tables
  • Table partitions

These operations are allowed selectively for a given user, based on the user's permissions as defined in ODAS' Policy Engine. As an example, these policies may:

  • Allow the user to create new databases, tables, and partitions (examples of metadata objects).
  • Allow the user to modify the properties on certain metadata objects.
  • Allow the user to only view the properties on certain metadata objects.
  • Deny the user access to certain metadata objects.

When using the AWS Glue catalog, the Amazon IAM privileges additionally limit what the Glue-integrated ODAS cluster is able to see in its view of the Glue catalog. This implicitly limits the operations that can be carried out by users using ODAS against the aforementioned visible set of the metadata.

While these restrictions help limit how far ODAS can affect crucial metadata that may be shared across multiple teams, some of these privileges are essential for the proper functioning of ODAS. The next sections explore the essential properties and restrictive IAM privileges by example.

Essential properties

When restricting the privileges of ODAS clusters to a limited set of Glue catalog objects, we have to take certain factors into account:

  • Glue IAM privileges are hierarchical:

    • If you want certain privileges on a catalog object, you have to make sure the requisite access to its parent is also available.
    • This implies that if you limit Glue privileges to a certain set of metadata objects (resources), the catalog resource always has to be included in the Allow set.
    • This is true for all services that interact with the AWS Glue catalog, and is not a limitation specific to Okera.
  • Okera needs read access and, for a non-limiting user experience, write and update access to certain catalog objects in order to function properly, including the following:

    • Databases starting with _okera and okera. These are used by Okera to manage its internal state and to run the Okera crawler.
    • The default database.
    • Tables in the above databases.
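The hierarchy rule above can be made concrete with a small helper that expands a list of database names into the Resource entries a restricted policy would need. This is an illustrative sketch only: the function name is hypothetical, the account ID is a placeholder, and the ARN shapes follow the catalog, database, and table forms used in the policy examples in this document.

```python
def glue_resources(region, account, databases):
    """Build the Resource list for a Glue policy limited to the given databases.

    The catalog ARN is always included because access to a database or table
    requires access to its parent, as described above.
    """
    prefix = f"arn:aws:glue:{region}:{account}"
    resources = [f"{prefix}:catalog"]
    resources += [f"{prefix}:database/{db}" for db in databases]
    resources += [f"{prefix}:table/{db}/*" for db in databases]
    return resources


# Databases Okera needs for its own state, plus one user database (example name).
needed = ["okera_*", "_okera_*", "default", "sample_database"]
for arn in glue_resources("us-west-2", "123456789012", needed):
    print(arn)
```

Generating the list this way avoids the common mistake of granting a table ARN while forgetting its database or the catalog itself.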

Minimal policy for optimal ODAS-Glue experience

ODAS allows users to execute a wide variety of metadata calls against Glue catalog objects. To set up ODAS to work against the entire Glue catalog (in these examples, the Glue catalog is in us-west-2), append the Glue IAM policy statement below to the cluster role's policy, substituting your own values for ${Region} and ${Account}. Keep in mind that matching S3 privileges on the underlying S3 objects are also needed in order to use ODAS to actually scan data.

{
    "Sid": "VisualEditor10",
    "Effect": "Allow",
    "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeleteConnection",
        "glue:BatchDeletePartition",
        "glue:BatchDeleteTable",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreatePartition",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTags",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:TagResource",
        "glue:UntagResource",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
    ],
    "Resource": [
        "arn:aws:glue:${Region}:${Account}:*"
    ]
}
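The block above is a single policy statement, so it has to sit inside a policy document before it can be attached to a role. A minimal sketch of that wrapping follows; the Sid, the abbreviated Action list, and the account ID are placeholders for illustration, and the full Action list from the statement above should be used in practice.

```python
import json


def policy_document(*statements):
    """Wrap one or more IAM statements in a complete policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


# Abbreviated for the sketch; use the full Action list from the statement above.
glue_statement = {
    "Sid": "OdasGlueFullCatalog",  # hypothetical Sid
    "Effect": "Allow",
    "Action": ["glue:GetDatabase", "glue:GetTable", "glue:CreateTable"],
    "Resource": ["arn:aws:glue:us-west-2:123456789012:*"],  # placeholder account
}

policy_json = json.dumps(policy_document(glue_statement), indent=4)
print(policy_json)
```

Serializing through `json.dumps` also guards against hand-editing slips such as a trailing comma, which IAM rejects.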

Policy for allowing read-only access on the entire Glue catalog

In certain scenarios, the user may wish to use a given ODAS cluster only for scanning data and reading metadata, without any ability to create databases, tables, partitions, or UDFs through ODAS. In that case, they can use the following statement:

{
    "Sid": "VisualEditor10",
    "Effect": "Allow",
    "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTags",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
    ],
    "Resource": [
        "arn:aws:glue:${Region}:${Account}:*"
    ]
}

Keep in mind that the above policy limits the ability of ODAS to log auditing information from this cluster in a completely autonomous way.
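A read-only statement like the one above can be sanity-checked before it is attached. The sketch below flags any action that is not a Get or BatchGet call. It assumes that these two prefixes are what "read-only" means for the actions used in this document; Glue has other read-style actions (for example List and Search calls) that this simple check would report as non-read.

```python
READ_PREFIXES = ("glue:Get", "glue:BatchGet")


def non_read_actions(actions):
    """Return the actions that are not Get*/BatchGet* calls, in input order."""
    return [action for action in actions if not action.startswith(READ_PREFIXES)]


read_only = ["glue:BatchGetPartition", "glue:GetDatabase", "glue:GetTables"]
mixed = read_only + ["glue:CreateTable", "glue:UpdatePartition"]

print(non_read_actions(read_only))
print(non_read_actions(mixed))
```

An empty result for the statement's Action list indicates that no create, update, delete, or tagging privileges slipped in.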

Policy for allowing optimal ODAS-Glue experience on a limited set of Glue catalog objects

If the user wishes to leverage ODAS for only a limited set of Glue catalog objects, they can allow the privileges described above on the catalog, on the databases and tables Okera needs internally, and on the objects where access is desired (sample_database in this example).

This looks as follows:

{
    "Sid": "VisualEditor10",
    "Effect": "Allow",
    "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeleteConnection",
        "glue:BatchDeletePartition",
        "glue:BatchDeleteTable",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreatePartition",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTags",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:TagResource",
        "glue:UntagResource",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
    ],
    "Resource": [
        "arn:aws:glue:${Region}:${Account}:catalog",
        "arn:aws:glue:${Region}:${Account}:database/okera_*",
        "arn:aws:glue:${Region}:${Account}:database/_okera_*",
        "arn:aws:glue:${Region}:${Account}:database/cerebro_*",
        "arn:aws:glue:${Region}:${Account}:database/default",
        "arn:aws:glue:${Region}:${Account}:database/sample_database",
        "arn:aws:glue:${Region}:${Account}:table/okera_*/*",
        "arn:aws:glue:${Region}:${Account}:table/_okera_*/*",
        "arn:aws:glue:${Region}:${Account}:table/cerebro_*/*",
        "arn:aws:glue:${Region}:${Account}:table/default/*",
        "arn:aws:glue:${Region}:${Account}:table/sample_database/*"
    ]
}
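To reason about which objects a restricted statement like this one covers, the Resource patterns can be matched locally against the ARNs an operation would touch. This is a rough sketch, not IAM's actual evaluation logic: it uses Python's `fnmatch` as an approximation of IAM wildcard matching, the function names are hypothetical, and the region and account values are placeholders.

```python
from fnmatch import fnmatchcase


def is_covered(arn, resource_patterns):
    """Return True if any Resource pattern matches the ARN (approximate)."""
    return any(fnmatchcase(arn, pattern) for pattern in resource_patterns)


def can_reach_table(region, account, database, table, resource_patterns):
    """Check the catalog, database, and table ARNs that a table call needs."""
    prefix = f"arn:aws:glue:{region}:{account}"
    needed = [
        f"{prefix}:catalog",
        f"{prefix}:database/{database}",
        f"{prefix}:table/{database}/{table}",
    ]
    return all(is_covered(arn, resource_patterns) for arn in needed)


PREFIX = "arn:aws:glue:us-west-2:123456789012"  # placeholder region and account
patterns = [
    f"{PREFIX}:catalog",
    f"{PREFIX}:database/okera_*",
    f"{PREFIX}:database/sample_database",
    f"{PREFIX}:table/okera_*/*",
    f"{PREFIX}:table/sample_database/*",
]

print(can_reach_table("us-west-2", "123456789012", "sample_database", "users", patterns))
print(can_reach_table("us-west-2", "123456789012", "finance", "ledger", patterns))
```

Dropping the catalog entry from `patterns` makes every check fail, which mirrors the hierarchy requirement described under Essential properties.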