Integrate With External Catalogs¶

You can integrate other data catalogs, such as Collibra or Alation data catalogs, with Okera.

The Okera catalog integration process consists of two Python scripts. One script syncs business metadata from a catalog to Okera and the other script syncs technical metadata from Okera to a catalog. The integration scripts use PyOkera in combination with the respective catalog's API to perform end-to-end synchronization.

Catalog to Okera¶

Synchronizing from a catalog to Okera lets you transfer business metadata in attribute form, either out-of-the-box or custom, from a catalog to Okera. Attributes are applied to the corresponding Okera object as tags. The integration provides you with configuration options to define how catalog attributes map to Okera tags.

Okera to Catalog¶

Synchronizing from Okera to a catalog lets you create technical metadata (databases, datasets and columns) based on the corresponding data objects in Okera. This synchronization relies on a specific structure in each catalog to ensure objects are mapped and created correctly. The accepted catalog structure and hierarchy is explained in each catalog section.

Mapping¶

The script leverages different methods to ensure resiliency in mapping catalog objects to Okera objects. This means that if the names of objects change later in the catalog, they will still be mapped correctly to Okera.

Full Object Name¶

The integration’s primary mapping method uses the full name of the data object. For example, for the dob column in the okera_sample.users table, the expected full name is okera_sample.users.dob.

Warning

This script will not successfully sync attributes if the full name is not specified in the format above.
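As an illustration of the naming format, the following minimal sketch assembles a full name from its parts. The function full_object_name is hypothetical and is not part of the integration scripts; it only shows the dotted database.table.column form described above.

```python
def full_object_name(database, table=None, column=None):
    """Build the dotted full name used to match a catalog object to Okera.

    Hypothetical helper for illustration only.
    """
    if not database:
        raise ValueError("a database name is required")
    if column and not table:
        raise ValueError("a column needs its table name")
    parts = [database]
    if table:
        parts.append(table)
    if column:
        parts.append(column)
    return ".".join(parts)


print(full_object_name("okera_sample", "users", "dob"))  # okera_sample.users.dob
```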

Map Catalog Tables to Okera Using Catalog IDs¶

For catalogs with unique IDs, catalog tables can be mapped to Okera tables by ID. To map an existing table in the catalog to an existing dataset in Okera, the object ID of the catalog table must be added to the table properties of the corresponding dataset in Okera. The catalog object ID property is catalog_obj_id.

The following example creates a table in Okera with the Collibra asset ID specified:

CREATE EXTERNAL TABLE okera_sample.users (
uid STRING COMMENT 'Unique user id',
dob STRING COMMENT 'Formatted as DD-month-YY',
gender STRING,
ccn STRING COMMENT 'Sensitive data, should not be accessible without masking.'
)
COMMENT 'Default okera table.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS PARQUET
LOCATION 'file:/opt/data/users'
TBLPROPERTIES ("catalog_obj_id" = "123-abc-123abc-123")

The following example adds an asset ID to the Okera table properties:

ALTER TABLE okera_sample.users SET TBLPROPERTIES ("catalog_obj_id" = "123-abc-123abc-123")

Warning

This mapping is only valid for tables and datasets, not columns or databases. For columns, attributes and descriptions are synced only if the full name of the catalog object is specified in the format described in Full Object Name.
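If many tables need a catalog ID, the ALTER TABLE statement can be generated rather than typed by hand. The sketch below only builds the statement text; the helper name and the identifier check are assumptions made for this example, and running the statement against Okera is left to whichever client you already use.

```python
import re

# Assumption for this sketch: database and table names are plain identifiers.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def set_catalog_id_statement(database, table, catalog_obj_id):
    """Return an ALTER TABLE statement that sets the catalog_obj_id property."""
    for identifier in (database, table):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unexpected identifier: {identifier!r}")
    if '"' in catalog_obj_id:
        raise ValueError("catalog object ID must not contain double quotes")
    return (
        f"ALTER TABLE {database}.{table} "
        f'SET TBLPROPERTIES ("catalog_obj_id" = "{catalog_obj_id}")'
    )


print(set_catalog_id_statement("okera_sample", "users", "123-abc-123abc-123"))
```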

Logging¶

Each catalog integration produces two log files at run time. The location of the log files is set in the respective configuration file of the integration.

Stack Trace Log¶

The stack trace log tracks the synchronization activity (i.e., the operations the scripts are executing). These log files are named <catalog_name>_okera.log or okera_<catalog_name>.log, depending on the direction of the synchronization process. The logging level for this log file is configurable.

Output Log¶

The output log tracks any warnings or errors that occur during synchronization. After the synchronization completes, this log provides a summary of the synchronization process, including the number of objects fetched from both sides of the synchronization, the number of objects that were modified and the names of those objects, and finally a full list of all the objects fetched with all of their information (attributes, object type, location, etc.). This log file is named <catalog_name>_output.log.
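To locate the logs after a run, the file names can be derived from the catalog name and the sync direction. This is a sketch only: the function and the direction labels are hypothetical, and it assumes that <catalog_name>_okera.log belongs to the catalog-to-Okera direction, which the naming suggests but the text above does not state outright.

```python
import os


def log_file_paths(log_directory, catalog_name, direction):
    """Return the expected paths of the two log files for one sync run.

    direction is "catalog_to_okera" or "okera_to_catalog" (labels invented
    for this sketch).
    """
    if direction == "catalog_to_okera":
        activity_log = f"{catalog_name}_okera.log"
    elif direction == "okera_to_catalog":
        activity_log = f"okera_{catalog_name}.log"
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return {
        "stack_trace": os.path.join(log_directory, activity_log),
        "output": os.path.join(log_directory, f"{catalog_name}_output.log"),
    }


print(log_file_paths("log_dir", "collibra", "catalog_to_okera"))
```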

Collibra Integration¶

Before You Start¶

To ensure a seamless integration between the data objects in Collibra and Okera, the scripts accept the following Collibra asset type hierarchy:

Database
Table
Column

These assets may be stored in any domain or community type.

The integration maps Collibra attributes to Okera tags. You can use preexisting Collibra attributes or custom attributes.

Quick Start¶

config.yaml

Create a config.yaml file that stores Okera and Collibra connection information and the location of the assets file. Here is an example of a quick start configuration for the Collibra integration with only the required fields filled in.

# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
collibra_dgc: "https://example.collibra.com:443"
collibra_username: "username"
collibra_password: "password"
collibra_password_file: "creds.txt"
collibra_assets: "path/to/assets.yaml"
logging_level: "debug"
mapped_collibra_attributes:
sync_descriptions: False
mapped_collibra_statuses:
  statuses:
  okera_namespace:
full_name_prefixes:

Set the location of the config.yaml file as an environment variable with $ export CONF=path/to/config.yaml.

assets.yaml

Create an assets.yaml file that provides the name and ID of the Collibra community and the domain in which the asset is located or to which you would like to sync the asset.

# assets.yaml
communities:
    - name: "Example Community"
      id: "123-abc-123"
      domains:
        - name: "Example Domain"
          id: "456-def-456"
          databases:
          tables:
            - "my_db.example_table"

Run bootstrap.sh

$ ./bootstrap.sh

Run the Synchronization

To sync from Okera to Collibra:

$ python3 okera_catalog.py collibra

To sync from Collibra to Okera:

$ python3 catalog_okera.py collibra

Configure the Integration¶

The file bootstrap.sh installs all Python3 packages needed to run the script:

PyOkera
thriftpy
requests
PyYaml

User-specific integration information is supplied in config.yaml. In this file, you can define Okera and Collibra connection information and can leverage different configuration options for synchronizing attributes between Collibra and Okera.

Note: The location of config.yaml must be specified as an environment variable using this command: $ export CONF=path/to/config.yaml

Connection Information

# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
collibra_dgc: "https://example.collibra.com:443"
collibra_username: "username"
collibra_password: "password"
collibra_password_file: "creds.txt"
collibra_assets: "path/to/assets.yaml"
logging_level: "debug"
# …

In the field log_directory, specify the directory to which the log files will be added.

There are multiple ways to set Collibra credentials. They can either be in the configuration file using the fields collibra_username and collibra_password or they can be set as environment variables (see Collibra Environment Variables). If the credentials are defined in config.yaml, you can read the Collibra password from a text file. With this configuration, the field collibra_password would be blank and the collibra_password_file would specify the name and location of the text file containing the password.
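The password-file option can be pictured as follows. This is a minimal sketch of one way to resolve the password from an already loaded config.yaml (represented here as a plain dictionary), not the script's actual code; the function name and the stripping of surrounding whitespace are assumptions.

```python
from pathlib import Path


def collibra_password(config):
    """Return the Collibra password from the config or from the password file."""
    password = config.get("collibra_password")
    if password:
        return password
    password_file = config.get("collibra_password_file")
    if password_file:
        # Assumption: the file holds only the password, possibly followed
        # by a newline.
        return Path(password_file).read_text(encoding="utf-8").strip()
    raise ValueError("no Collibra password or password file configured")
```

With collibra_password left blank and collibra_password_file set to creds.txt, the function returns the contents of creds.txt.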

In the collibra_assets field, specify the path to the yaml file that lists the assets to be synced. See Collibra Assets File.

The logging level can be set to either debug or info, depending on how much information should be logged while running the sync. Running the script automatically creates the two log files, described in Logging.

Attribute Mapping

# config.yaml
mapped_collibra_attributes:
  - { attribute_name: "Collibra Classification",
      attribute_id: "123-abc-123abc-123",
      okera_namespace: "collibra_classification",
      prioritize_column_attribute: True
    }
  - { attribute_name: "Custom Attribute 2",
      attribute_id: "456-def-456def-456",
      okera_namespace: "attribute_2",
      prioritize_column_attribute: False
    }
sync_descriptions: True
# …

Under mapped_collibra_attributes, specify the namespaces that should be mapped to Collibra custom attributes.

If the Okera namespace already exists, the values of that Collibra attribute are added as Okera tags. If it does not exist, a new namespace is created in Okera with the name provided.

The field prioritize_column_attribute lets you further customize the application of column and table attributes. See Prioritize Column Attributes.

For example, suppose you have a custom Collibra attribute called “Collibra Classification,” with its value set to “Restricted.” This creates the collibra_classification namespace, and adds the tag collibra_classification.restricted to the corresponding object in Okera.
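The naming rule in this example can be written as a small function. The sketch below is illustrative: attribute_to_tag is a hypothetical name, and lowercasing the value is inferred from the example above ("Restricted" becomes restricted). How the real script treats values that contain spaces or punctuation is not covered here.

```python
def attribute_to_tag(okera_namespace, attribute_value):
    """Turn a Collibra attribute value into a namespace.tag string."""
    value = attribute_value.strip().lower()
    if not value:
        raise ValueError("attribute value is empty")
    return f"{okera_namespace}.{value}"


print(attribute_to_tag("collibra_classification", "Restricted"))
# collibra_classification.restricted
```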

The field sync_descriptions defines whether descriptions should be synced. If set to False, the descriptions are not synced to or from Okera.

Status Mapping

# config.yaml
mapped_collibra_statuses:
  statuses:
      - "Accepted"
      - "Reviewed"
  okera_namespace: "collibra_status"
#...

In addition to mapping custom attributes to Okera's tags, you can also map Collibra statuses. Statuses that will be mapped to an Okera tag namespace are listed under mapped_collibra_statuses. You can choose the Okera tag namespace to which you want the status added and the types of statuses that should be mapped.

For example, using the example values above, only the statuses Accepted and Reviewed are synced to Okera and the Okera tag namespace to which these statuses are added is collibra_status. This tags the assets with those statuses as collibra_status.accepted or collibra_status.reviewed.
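The filtering described here might look like the following sketch. It takes the mapped_collibra_statuses section as a dictionary and returns a tag only for listed statuses. The function name is hypothetical, and the exact, case-sensitive comparison is an assumption.

```python
def status_to_tag(status, mapped_statuses):
    """Return the Okera tag for a Collibra status, or None if it is not mapped."""
    allowed = mapped_statuses.get("statuses") or []
    if status not in allowed:
        return None
    return f"{mapped_statuses['okera_namespace']}.{status.lower()}"


mapping = {"statuses": ["Accepted", "Reviewed"], "okera_namespace": "collibra_status"}
print(status_to_tag("Accepted", mapping))   # collibra_status.accepted
print(status_to_tag("Candidate", mapping))  # None
```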

Prefixes

# config.yaml
full_name_prefixes:
  - "Prefix One"
  - "Prefix Two"

If the full name of an asset in Collibra has a prefix that the asset does not have in Okera, the prefix must be added under full_name_prefixes. Here you can add a list of prefixes that the script splits off to correctly map the name of the asset in Collibra to the name of the asset in Okera.

For example, suppose the database my_db has the prefix abc in Collibra, making its full name abc.my_db. The prefix abc should be added to the list full_name_prefixes. This ensures it is mapped correctly to my_db in Okera.
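Prefix handling can be sketched as below. The function is hypothetical and assumes that the prefix is separated from the rest of the name by a dot, as in abc.my_db, and that only the first matching prefix is removed.

```python
def strip_full_name_prefix(full_name, prefixes):
    """Remove a configured prefix from a Collibra full name, if present."""
    for prefix in prefixes or []:
        marker = prefix + "."
        if full_name.startswith(marker):
            return full_name[len(marker):]
    return full_name


print(strip_full_name_prefix("abc.my_db", ["abc"]))  # my_db
```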

Collibra Environment Variables¶

The script accepts a number of environment variables that allow you to define fields and locations outside of config.yaml. Fields that are set as environment variables can be left empty in config.yaml. The only required environment variable is the location of the config.yaml file.

Okera Token

$ export TOKEN=$TOKEN

Collibra Credentials

$ export collibra_username=$username
$ export collibra_password=$password
$ export collibra_password_file=$password_file
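One way to picture how these variables relate to config.yaml is a merge step that fills empty fields from the environment. The sketch below is an assumption-laden illustration, not the script's code: the function name, the mapping of TOKEN to okera_token, and the choice to leave non-empty config values untouched are all assumptions.

```python
import os

# Environment variable name -> config.yaml field (mapping assumed for this sketch).
ENV_TO_FIELD = {
    "TOKEN": "okera_token",
    "collibra_username": "collibra_username",
    "collibra_password": "collibra_password",
    "collibra_password_file": "collibra_password_file",
}


def fill_from_environment(config, environ=None):
    """Return a copy of config with empty fields filled from the environment."""
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for env_name, field in ENV_TO_FIELD.items():
        if not merged.get(field) and environ.get(env_name):
            merged[field] = environ[env_name]
    return merged
```

Passing environ explicitly makes the merge easy to check without touching the real environment.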

Prioritize Column Attributes¶

If prioritize_column_attribute is set to True, then if a table has a value for an attribute at both the table level and at the column level, the script only applies the column-level attributes. This is because Okera table-level attributes are implicitly applied to all the columns in the table, which you may not want to occur. See Okera's ABAC docs.

For example, suppose you added the Collibra Classification "Confidential" attribute to a table and "Restricted" to one of its columns. If prioritize_column_attribute is set to True for the Collibra Classification attribute, then in Okera the column is tagged with collibra_classification.restricted and the table is not tagged at all. If prioritize_column_attribute is set to False (which is the default), Collibra attributes are applied in Okera on the table and columns as usual.
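The rule can be expressed compactly in code. The following sketch is a hypothetical model of the behavior described above, with invented function and parameter names; it returns the tag that would be applied to the table and to each column.

```python
def tags_for_table(namespace, table_value, column_values,
                   prioritize_column_attribute=False):
    """Compute table and column tags for one mapped attribute.

    column_values maps column names to attribute values (or None).
    """
    column_tags = {
        column: f"{namespace}.{value.lower()}"
        for column, value in column_values.items()
        if value
    }
    # Skip the table-level tag only when the flag is set AND at least one
    # column carries its own value for the attribute.
    skip_table = prioritize_column_attribute and bool(column_tags)
    table_tag = None
    if table_value and not skip_table:
        table_tag = f"{namespace}.{table_value.lower()}"
    return {"table": table_tag, "columns": column_tags}


print(tags_for_table("collibra_classification", "Confidential",
                     {"ccn": "Restricted", "uid": None},
                     prioritize_column_attribute=True))
```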

Note: Only tags within these specified namespaces are modified when synchronizing from Collibra to Okera. Existing tags in Okera are not affected by the synchronization.

Collibra Assets File¶

The assets to be synced between Collibra and Okera must be defined in the assets.yaml file. You can specify multiple databases and tables from different communities and domains. This file can be renamed and moved as needed, but the structure must be as follows:

communities:
  # name and ID of first community
  - name: "Community 1"
    id: "36fd10bf-f6f3-44a6-807e-915c6a24c521"
    domains:
      # name and ID of first domain
      - name: "Domain 1"
        id: "87a5c7ae-46ca-4e3d-8a6a-b15772fc7264"
        databases:
        tables:
          - "db_1.table_1"
          - "db_2.table_2"
      # name and ID of second domain
      - name: "Domain 2"
        id: "53f54ef8-46ca-4e3d-8a6a-b15772fc7264"
        databases:
          - "db_a"
          - "db_b"
          - "db_c"
        tables:
          - "db_d.table_a"
  # name and ID of second community
  - name: "Community 2"
    id: "45fd10bf-f6f3-66a6-807e-915c6a25c521"
    domains:
      - name: "Domain 3"
        id: "98f5c7ae-46ca-4e3d-8a6a-b17652fc7264"
        databases:
          - "db_123abc"
        tables:

In the example above, assets from two different communities are synced between Okera and Collibra.

In the first community, Community 1, assets are synced from two domains, Domain 1 and Domain 2.

In Domain 1, the tables db_1.table_1 and db_2.table_2 are synced. No entire database is synced so the databases: list is left empty.
In Domain 2, the databases db_a, db_b, and db_c are synced along with one table, db_d.table_a.

In Community 2, the database db_123abc from the domain Domain 3 is synced. No tables are synced from Domain 3 so the tables: list is left empty.

Note: When listing databases and tables, the full name of the asset must be entered including any prefixes. If a prefix has been added to the full name in Collibra (e.g. Prefix One.okera_sample.users), the prefix must also be added to full_name_prefixes in config.yaml.
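To check an assets file before a run, the structure can be flattened into a list of entries. The sketch below works on the file's contents after loading (shown as a Python dictionary so that it runs without a YAML library); list_assets is a hypothetical helper. It treats an empty databases: or tables: key as an empty list, since a YAML key with no value is typically loaded as None.

```python
def list_assets(assets):
    """Flatten the assets structure into (community, domain, kind, name) tuples."""
    entries = []
    for community in assets.get("communities") or []:
        for domain in community.get("domains") or []:
            for kind in ("databases", "tables"):
                for name in domain.get(kind) or []:
                    entries.append((community["name"], domain["name"], kind, name))
    return entries


example = {
    "communities": [
        {
            "name": "Community 1",
            "id": "36fd10bf-f6f3-44a6-807e-915c6a24c521",
            "domains": [
                {
                    "name": "Domain 1",
                    "id": "87a5c7ae-46ca-4e3d-8a6a-b15772fc7264",
                    "databases": None,  # empty "databases:" key
                    "tables": ["db_1.table_1", "db_2.table_2"],
                }
            ],
        }
    ]
}

for entry in list_assets(example):
    print(entry)
```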

Alation Integration¶

Before You Start¶

To ensure a seamless integration between the data objects in Alation and Okera, the scripts accept the following Alation object type hierarchy:

Data source
Schema
Table
Column

Synchronizing from Okera to Alation does not create any data sources, only schemas that map to Okera databases. The integration maps Alation fields to Okera tags. You can use preexisting Alation fields or custom fields.

Quick Start¶

config.yaml

Create a config.yaml file that contains Okera and Alation connection information and the location of the objects file. Here is an example of a quick start configuration for the Alation integration with only the required fields filled in.

# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
alation_host: "https://example.alationcatalog.com"
alation_token: "678ghi91011jkl"
alation_objects: "path/to/objects.yaml"
logging_level: "debug"
mapped_alation_fields:

Set the location of config.yaml as an environment variable with $ export CONF=path/to/config.yaml.

objects.yaml

Create an objects.yaml file to store the ID of the Alation data source in which the object is located or to which you would like to sync the object.

# objects.yaml
data_sources:
    - id: 123
      schemas:
      tables:
        - "my_schema.example_table"

Run bootstrap.sh

$ ./bootstrap.sh

Run the Synchronization

To sync from Okera to Alation:

$ python3 okera_catalog.py alation

To sync from Alation to Okera:

$ python3 catalog_okera.py alation

Configure the Integration¶

The file bootstrap.sh installs all Python3 packages needed to run the script:

PyOkera
thriftpy
requests
PyYaml

User-specific integration information is supplied in the config.yaml file. In this file you define Okera and Alation connection information and can leverage different configuration options for synchronizing attributes between Alation and Okera.

Note: The location of the config.yaml file must be specified as an environment variable using this command: $ export CONF=path/to/config.yaml

Connection Information

# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
alation_host: "https://example.alationcatalog.com"
alation_token: "678ghi91011jkl"
alation_objects: "path/to/objects.yaml"
logging_level: "info"
# …

In the log_directory field, specify the directory to which the log files alation_okera.log and okera_alation.log will be added.

In the alation_objects field, specify the yaml file that lists the Alation objects to be synced. See Alation Objects File.

The logging_level can be set to either debug or info, depending on how much information should be logged while running the synchronization. Running the script automatically creates the two log files described in Logging.

Attribute Mapping

# config.yaml
mapped_alation_fields:
    - {name: "Alation Classification",
       okera_namespace: "alation_classification"}
    - {name: "Custom Field 2",
       okera_namespace: "custom_field2"}

Under mapped_alation_fields, specify the namespaces that should be mapped to Alation fields. If the Okera namespace already exists, the value of the Alation field is added as an Okera tag. If it does not exist, a new namespace is created in Okera with the name provided.

For example, suppose you have a custom Alation field called Alation Classification with a value of Confidential. The synchronization will create the alation_classification namespace and add the tag alation_classification.confidential to the corresponding object in Okera.
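Applied to a whole object, the mapping amounts to a lookup over the configured fields. The sketch below is illustrative only: tags_from_fields is a hypothetical name, field_values stands in for whatever the Alation API returns for an object, and lowercasing the value follows the example above.

```python
def tags_from_fields(mapped_fields, field_values):
    """Return Okera tags for the mapped Alation fields that have a value."""
    tags = []
    for mapping in mapped_fields or []:
        value = field_values.get(mapping["name"])
        if value:
            tags.append(f"{mapping['okera_namespace']}.{value.strip().lower()}")
    return tags


mapped_alation_fields = [
    {"name": "Alation Classification", "okera_namespace": "alation_classification"},
    {"name": "Custom Field 2", "okera_namespace": "custom_field2"},
]
print(tags_from_fields(mapped_alation_fields,
                       {"Alation Classification": "Confidential"}))
# ['alation_classification.confidential']
```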

Environment Variables¶

The script accepts a number of environment variables that allow you to define fields and locations outside of config.yaml. Fields that are set as environment variables can be left empty in config.yaml. The only required environment variable is the location of the config.yaml file.

Okera Token

$ export TOKEN=$TOKEN

Alation Token

$ export alation_token=$alationtoken

Alation Objects File¶

The assets to be synced between Alation and Okera must be defined in the objects.yaml file. You can specify multiple schemas and tables from different data sources. This file can be renamed and moved as needed, but the structure must be as follows:

data_sources:
  - id: 12
    schemas:
      - "schema_1"
      - "schema_2"
    tables:
  - id: 34
    schemas:
    tables:
      - "schema_a.table_a"
      - "schema_b.table_b"
  - id: 56
    schemas:
      - "schema_a1"
    tables:
      - "schema_b2.table_1"

In the example above, objects from three different data sources are synchronized between Okera and Alation.

In data source 12, the schemas schema_1 and schema_2 are synced. No tables are synced within this data source.
In data source 34, the tables schema_a.table_a and schema_b.table_b are synced. No entire schema is synced so the schemas: list is left empty.
In data source 56, the schema schema_a1 and the table schema_b2.table_1 are synced.

Note: When listing schemas and tables, the full name of the object must be entered.
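A quick sanity check of an objects file can catch table entries that are missing their schema. The sketch below operates on the loaded file contents as a dictionary; the function name is hypothetical, and the check that a table's full name contains a dot is an assumption based on the schema.table form used in the example.

```python
def objects_by_data_source(objects):
    """Group schemas and tables by data source ID, validating table names."""
    result = {}
    for source in objects.get("data_sources") or []:
        schemas = list(source.get("schemas") or [])
        tables = list(source.get("tables") or [])
        for table in tables:
            if "." not in table:
                raise ValueError(f"table {table!r} is not a full name (schema.table)")
        result[source["id"]] = {"schemas": schemas, "tables": tables}
    return result


example_objects = {
    "data_sources": [
        {"id": 12, "schemas": ["schema_1", "schema_2"], "tables": None},
        {"id": 34, "schemas": None, "tables": ["schema_a.table_a", "schema_b.table_b"]},
    ]
}
print(objects_by_data_source(example_objects))
```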