
Using Glue as a Third-Party Metadata Catalog

Starting with version 2.0.0, Okera supports using the AWS Glue catalog as a third-party metadata catalog. This document provides information about how to configure and use the AWS Glue catalog.

General Prerequisites

There are two prerequisites to integrating Okera with your AWS Glue metadata catalog:

  1. Okera version 2.0.0 or later must be installed. See AWS prerequisites for details.
  2. AWS Identity and Access Management (IAM) and network policy-based access between Okera and the AWS Glue Catalog must be established.

General IAM Privileges

Beyond the general AWS prerequisites, the Okera-Glue integration requires certain IAM privileges, in addition to the default ones, so that Okera can communicate with the AWS Glue catalog service.

To set up IAM:

First, set up an Okera cluster role with the default IAM privileges. Then attach the Glue policy to that role:

  1. Log in to your AWS Web Console and go to the IAM page.

  2. In the navigation panel, select Roles.

  3. Using the search bar, find the IAM role you created for Okera earlier, and select it.

  4. Select Attach policies. You will see a list of existing default policies.

  5. On this screen, enter AWSGlueServiceRole in the search box. Select this policy from the shortlist and select Attach Policy at the bottom of the screen.

  6. After this is done, perform the installation steps.

Note: For a more fine-grained set of policies to attach to your cluster's role, see Advanced Topics.

Setup Process for Okera to Use AWS Glue

To configure Okera to use the AWS Glue catalog as its metadata catalog:

  1. Follow the general steps to configure your Okera cluster; see Configure Okera on EKS.

  2. To use the Okera-Glue integration, modify your cluster configuration in your configuration file:

    • Look for the property named CATALOG_TYPE.
    • If CATALOG_TYPE does not exist, add it as a new property.
    • Set the value of CATALOG_TYPE to glue.
    • Save the updated configuration file and apply it using the Okera Helm chart.
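
The property steps above amount to an update-or-append operation. The sketch below illustrates that logic only; it assumes the configuration file holds one `KEY: value` property per line, and the helper name `set_property` and the sample file contents are hypothetical rather than part of Okera.

```python
def set_property(lines, key, value):
    """Set `key` to `value` in a list of 'KEY: value' lines.

    Assumes one property per line (an assumption about the file layout).
    Returns a new list; the input list is left unchanged.
    """
    updated = []
    found = False
    for line in lines:
        name = line.split(":", 1)[0].strip()
        if name == key:
            # The property exists: overwrite its value.
            updated.append(f"{key}: {value}")
            found = True
        else:
            updated.append(line)
    if not found:
        # The property does not exist: add it as a new property.
        updated.append(f"{key}: {value}")
    return updated


# Hypothetical file contents, for illustration only.
config = ["EXAMPLE_PROPERTY: example", "CATALOG_TYPE: other"]
config = set_property(config, "CATALOG_TYPE", "glue")
print("\n".join(config))
```

The same helper would cover any other property set in the configuration file; the updated file still has to be applied with the Okera Helm chart as described in the last step.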

Cross-Region Support

Starting with Okera 2.1.0, users can configure an Okera cluster in one region to communicate with an AWS Glue metadata catalog in a different region. The only prerequisite is IAM and network policy-based access between the regions in which Okera and the AWS Glue catalog reside.

To enable cross-region support between Okera and Glue, modify your configuration file:

  • Add a new property to the configuration file called GLUE_REGION.
  • Set the value of GLUE_REGION to the AWS region where the target AWS Glue metadata catalog is located.
  • Save this updated configuration file and apply it using the Okera Helm chart.

Note: The Okera-Glue integration works by communicating with the AWS Glue metastore in real time for a seamless end-user experience. In this mode, additional network latency may occur.

Dataset Counts Problem Resolution

If you receive an InternalServiceException 500 from Glue while trying to open the Data page in the Okera UI, add the OKERA_GLUE_SILENCE_TBL_PAGINATOR_500 configuration parameter to your Okera configuration file and set it to true.

This is not a common problem, but it does occur in some situations.

Advanced Topics

This section describes restrictive IAM privileges, the properties essential to limiting access to Glue metadata objects, and some sample Glue policies.

Okera lets you execute a range of metadata-related operations against various objects in the Glue catalog. These objects include, but are not limited to:

  • Catalog (The full scope of the metadata)
  • Databases
  • Tables
  • Table partitions

The operations listed below are allowed selectively for a given user, based on the user's permissions as defined in Okera's policy engine. For example, these policies might:

  • Allow a user to create new databases, tables, and partitions (examples of metadata objects).
  • Allow a user to modify the properties on certain metadata objects.
  • Allow a user only to view the properties on certain metadata objects.
  • Deny a user access to certain metadata objects.

When using the AWS Glue catalog, AWS IAM privileges additionally limit what the Glue-integrated Okera cluster can see in its view of the Glue catalog. This implicitly limits the operations that users can perform through Okera.

While these restrictions help limit Okera's ability to affect crucial metadata that may be shared across multiple teams, some of these privileges are essential for the proper functioning of Okera.

The next sections explore properties essential for Okera and restrictive IAM privileges using examples.

Essential Properties

When restricting privileges of Okera clusters to a limited set of Glue catalog objects, consider these factors:

  • Glue IAM privileges are hierarchical:

    1. If you want certain privileges on a catalog object, make sure the requisite access to its parent is also granted.
    2. Consequently, if you limit Glue privileges to a certain set of metadata objects (resources), the catalog resource must always be included in the Allow set.
    3. This is true for all services that interact with the AWS Glue catalog and is not a limitation specific to Okera.
  • To function properly, Okera needs read access, and for a non-limiting user experience, write and update access, to certain catalog objects. These include:

    1. Databases starting with _okera and okera. These are used by Okera to manage its internal state, and to use the Okera crawler.
    2. The default database.
    3. Tables in the above databases.
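
The hierarchy rule can be checked mechanically before a policy is attached. The sketch below is a minimal illustration, not an Okera or AWS tool: the helper name `missing_parents` is hypothetical, the region and account number are placeholders, and it compares ARNs as exact strings rather than evaluating IAM wildcards.

```python
def missing_parents(resources):
    """List parent ARNs implied by `resources` but absent from it."""
    present = set(resources)
    missing = []
    for arn in resources:
        prefix, _, resource = arn.rpartition(":")
        if f"{prefix}:*" in present:
            continue  # a wildcard resource already covers every parent
        kind, _, path = resource.partition("/")
        if kind == "table":
            # A table needs both its database and the catalog.
            database = path.split("/", 1)[0]
            parents = [f"{prefix}:catalog", f"{prefix}:database/{database}"]
        elif kind == "database":
            # A database needs the catalog.
            parents = [f"{prefix}:catalog"]
        else:
            parents = []
        for parent in parents:
            if parent not in present and parent not in missing:
                missing.append(parent)
    return missing


base = "arn:aws:glue:us-west-2:123456789012"  # placeholder region and account
resources = [
    f"{base}:catalog",
    f"{base}:database/default",
    f"{base}:table/default/*",
    f"{base}:table/sales/orders",  # hypothetical table; its database is not listed
]
print(missing_parents(resources))
```

Here the check reports the `database/sales` ARN, because a table resource is listed without its parent database.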

Example Policies

Minimal Policy for an Optimal Okera-Glue Experience

Okera can execute a wide variety of metadata calls against Glue catalog objects. If you want to set Okera up to work against the entire Glue catalog (in these examples, the Glue catalog is in US-West-2), you can append the Glue IAM policy shown below. Bear in mind that to use Okera to scan the data, the cluster role must also have the requisite Amazon S3 privileges on the corresponding Amazon S3 objects.

{
    "Sid": "VisualEditor10",
    "Effect": "Allow",
    "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeleteConnection",
        "glue:BatchDeletePartition",
        "glue:BatchDeleteTable",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreatePartition",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTags",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:TagResource",
        "glue:UntagResource",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
    ],
    "Resource": [
        "arn:aws:glue:${Region}:${Account}:*"
    ]
}
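
The block above is a single policy statement, so appending it means adding it to the Statement list of an IAM policy document. The sketch below illustrates that step under some assumptions: the helper name `append_statement`, the Sid, and the account number are hypothetical, the action list is shortened for brevity, and the existing policy is assumed to store Statement as a list. It also rejects a duplicate Sid, since statement IDs must be unique within a policy.

```python
import json


def append_statement(policy, statement):
    """Return a copy of `policy` with `statement` added to its Statement list."""
    statements = list(policy.get("Statement", []))
    sids = {s["Sid"] for s in statements if "Sid" in s}
    if statement.get("Sid") in sids:
        raise ValueError(f"duplicate Sid: {statement['Sid']}")
    statements.append(statement)
    return {"Version": policy.get("Version", "2012-10-17"), "Statement": statements}


policy = {"Version": "2012-10-17", "Statement": []}
glue_statement = {
    "Sid": "OkeraGlueAccess",  # hypothetical Sid
    "Effect": "Allow",
    "Action": ["glue:GetDatabase", "glue:GetTable"],  # shortened for brevity
    "Resource": ["arn:aws:glue:us-west-2:123456789012:*"],  # placeholder account
}
policy = append_statement(policy, glue_statement)
print(json.dumps(policy, indent=4))
```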

Policy for Allowing Read-Only Access on the Entire Glue Catalog

If you want to use a specific Okera cluster only for scanning data and reading metadata, without any ability to create databases, tables, partitions, or UDFs through Okera, append the Glue IAM policy shown below:

{
    "Sid": "VisualEditor10",
    "Effect": "Allow",
    "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTags",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
    ],
    "Resource": [
        "arn:aws:glue:${Region}:${Account}:*"
    ]
}

Note that this policy limits Okera's ability to log auditing information from the cluster in a completely autonomous way.
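
A quick way to confirm that a statement is read-only is to look for actions that do not follow the read naming pattern. The sketch below is a heuristic only: it assumes that every read action starts with `glue:Get` or `glue:BatchGet`, which holds for the actions listed in this document but is not a complete classification of Glue actions. The helper name `non_read_actions` is hypothetical.

```python
# Prefixes treated as read-only here (a heuristic, not an AWS definition).
READ_PREFIXES = ("glue:Get", "glue:BatchGet")


def non_read_actions(statement):
    """Return the actions in `statement` that do not look read-only."""
    actions = statement["Action"]
    if isinstance(actions, str):
        actions = [actions]  # IAM allows a single action as a plain string
    return [a for a in actions if not a.startswith(READ_PREFIXES)]


statement = {
    "Effect": "Allow",
    "Action": ["glue:GetTable", "glue:CreateTable", "glue:BatchGetPartition"],
}
print(non_read_actions(statement))
```

An empty result suggests the statement grants only read access; any action returned deserves a second look before the policy is attached.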

Policy for Allowing an Optimal Okera-Glue Experience on a Limited Set of Glue Catalog Objects

If you want to use Okera for only a limited set of Glue catalog objects, grant the minimal privileges described above on the catalog resource as well as on each object that Okera should be able to access. Here is a sample Glue IAM policy:

{
    "Sid": "VisualEditor10",
    "Effect": "Allow",
    "Action": [
        "glue:BatchCreatePartition",
        "glue:BatchDeleteConnection",
        "glue:BatchDeletePartition",
        "glue:BatchDeleteTable",
        "glue:BatchGetPartition",
        "glue:CreateDatabase",
        "glue:CreatePartition",
        "glue:CreateTable",
        "glue:CreateUserDefinedFunction",
        "glue:DeleteDatabase",
        "glue:DeletePartition",
        "glue:DeleteTable",
        "glue:DeleteUserDefinedFunction",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTags",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions",
        "glue:TagResource",
        "glue:UntagResource",
        "glue:UpdateDatabase",
        "glue:UpdatePartition",
        "glue:UpdateTable",
        "glue:UpdateUserDefinedFunction"
    ],
    "Resource": [
        "arn:aws:glue:${Region}:${Account}:catalog",
        "arn:aws:glue:${Region}:${Account}:database/okera_*",
        "arn:aws:glue:${Region}:${Account}:database/_okera_*",
        "arn:aws:glue:${Region}:${Account}:database/cerebro_*",
        "arn:aws:glue:${Region}:${Account}:database/default",
        "arn:aws:glue:${Region}:${Account}:database/sample_database",
        "arn:aws:glue:${Region}:${Account}:table/okera_*/*",
        "arn:aws:glue:${Region}:${Account}:table/_okera_*/*",
        "arn:aws:glue:${Region}:${Account}:table/cerebro_*/*",
        "arn:aws:glue:${Region}:${Account}:table/default/*",
        "arn:aws:glue:${Region}:${Account}:table/sample_database/*"
    ]
}
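
The Resource list in the sample above follows a regular pattern: the catalog, then a database ARN and a table ARN for each database name. The sketch below generates such a list. It is an illustration only: the helper name `limited_resources` is hypothetical, the region and account number are placeholders, and the set of required database patterns is copied from the sample policy rather than from any other Okera reference.

```python
def limited_resources(region, account, databases):
    """Build a Resource list for the catalog, the Okera databases, and `databases`."""
    prefix = f"arn:aws:glue:{region}:{account}"
    # Database name patterns taken from the sample policy above.
    required = ["okera_*", "_okera_*", "cerebro_*", "default"]
    names = required + [d for d in databases if d not in required]
    resources = [f"{prefix}:catalog"]  # the catalog must always be included
    resources += [f"{prefix}:database/{name}" for name in names]
    resources += [f"{prefix}:table/{name}/*" for name in names]
    return resources


for arn in limited_resources("us-west-2", "123456789012", ["sample_database"]):
    print(arn)
```

Passing more database names extends the list while keeping the catalog and the Okera-internal databases in place.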