Integrate With External Catalogs¶
You can integrate other data catalogs, such as Collibra or Alation data catalogs, with Okera.
The Okera catalog integration process consists of two Python scripts. One script syncs business metadata from a catalog to Okera, and the other syncs technical metadata from Okera to a catalog. The integration scripts use PyOkera in combination with the respective catalog's API to perform end-to-end synchronization.
Catalog to Okera¶
Synchronizing from a catalog to Okera lets you transfer business metadata in attribute form, either out-of-the-box or custom, from a catalog to Okera. Attributes are applied to the corresponding Okera object as tags. The integration provides you with configuration options to define how catalog attributes map to Okera tags.
Okera to Catalog¶
Synchronizing from Okera to a catalog lets you create technical metadata (databases, datasets, and columns) based on the corresponding data objects in Okera. This synchronization relies on a specific structure in each catalog to ensure objects are mapped and created correctly. The accepted catalog structure and hierarchy are explained in each catalog section.
Mapping¶
The script leverages different methods to ensure resiliency in mapping catalog objects to Okera objects. This means that if the names of objects change later in the catalog, they will still be mapped correctly to Okera.
Full Object Name¶
The integration’s primary mapping method is using the full name of the data object.
For example, for the dob column in the okera_sample.users table, the expected full name is okera_sample.users.dob.
Warning
This script will not successfully sync attributes if the full name is not specified in the format above.
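As a minimal sketch of the naming rule above, a full name can be built by joining the database, table, and column names with dots. The helper name full_name is hypothetical and is not part of the integration scripts.

```python
def full_name(database, table=None, column=None):
    """Join the parts of an Okera object into a dotted full name.

    Hypothetical helper for illustration only.
    """
    if column is not None and table is None:
        raise ValueError("a column name requires a table name")
    parts = [database, table, column]
    # Skip the parts that were not supplied (database-only or table-only names).
    return ".".join(part for part in parts if part is not None)


print(full_name("okera_sample", "users", "dob"))
```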
Map Catalog Tables to Okera Using Catalog IDs¶
For catalogs with unique IDs, catalog tables can be mapped to Okera tables by ID.
To map an existing table in the catalog to an existing dataset in Okera, the object ID of the catalog table must be added to the table properties of the corresponding dataset in Okera.
The catalog object ID property is catalog_obj_id.
The following example creates a table in Okera with the Collibra asset ID specified:
CREATE EXTERNAL TABLE okera_sample.users (
uid STRING COMMENT 'Unique user id',
dob STRING COMMENT 'Formatted as DD-month-YY',
gender STRING,
ccn STRING COMMENT 'Sensitive data, should not be accessible without masking.'
)
COMMENT 'Default okera table.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS PARQUET
LOCATION 'file:/opt/data/users'
TBLPROPERTIES ("catalog_obj_id" = "123-abc-123abc-123")
The following example adds an asset ID to the Okera table properties:
ALTER TABLE okera_sample.users SET TBLPROPERTIES ("catalog_obj_id" = "123-abc-123abc-123")
Warning
This mapping is only valid for tables and datasets, not columns or databases. For columns, attributes and descriptions are synced only if the full name of the catalog object is specified in the format described in Full Object Name.
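The two mapping methods can be combined as in the following sketch. It assumes that an ID match is tried before the full name, which the text above does not state, and it uses plain dictionaries as hypothetical stand-ins for catalog and Okera objects. The function name find_okera_table is also hypothetical.

```python
def find_okera_table(catalog_table, okera_tables):
    """Return the Okera table that matches a catalog table, or None.

    catalog_table: dict with "id" and "full_name" keys (hypothetical shape).
    okera_tables: list of dicts with "name" and "properties" keys
    (hypothetical shape).
    """
    catalog_id = catalog_table.get("id")
    if catalog_id is not None:
        # Assumption: the ID is checked first so that a table renamed in the
        # catalog still maps to the same Okera dataset.
        for table in okera_tables:
            properties = table.get("properties") or {}
            if properties.get("catalog_obj_id") == catalog_id:
                return table
    # Otherwise fall back to the dotted full name.
    for table in okera_tables:
        if table["name"] == catalog_table["full_name"]:
            return table
    return None
```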
Logging¶
Each catalog integration produces two log files at run time. The location of the log files is set in the respective configuration file of the integration.
Stack Trace Log¶
The stack trace log tracks the synchronization activity (i.e., the operations the scripts are executing). These log files are named <catalog_name>_okera.log or okera_<catalog_name>.log, depending on the direction of the synchronization process. The logging level for this log file is configurable.
Output Log¶
The output log tracks any warnings or errors that occur during synchronization. After the synchronization completes, this log provides a summary of the synchronization process: the number of objects fetched from both sides of the synchronization, the number and names of the objects that were modified, and a full list of all the objects fetched with all of their information (attributes, object type, location, etc.). This log file is named <catalog_name>_output.log.
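The naming scheme for the two log files can be sketched as follows. The function and the direction labels are hypothetical. The sketch also assumes that <catalog_name>_okera.log belongs to the catalog-to-Okera direction and okera_<catalog_name>.log to the Okera-to-catalog direction, mirroring the script names; the text above does not confirm which name goes with which direction.

```python
import os


def log_file_paths(log_directory, catalog_name, direction):
    """Return (stack_trace_log, output_log) paths for one sync run.

    direction is "catalog_to_okera" or "okera_to_catalog" (hypothetical labels).
    """
    if direction == "catalog_to_okera":
        trace_name = f"{catalog_name}_okera.log"
    elif direction == "okera_to_catalog":
        trace_name = f"okera_{catalog_name}.log"
    else:
        raise ValueError(f"unknown direction: {direction}")
    # The output log name does not depend on the direction.
    output_name = f"{catalog_name}_output.log"
    return (
        os.path.join(log_directory, trace_name),
        os.path.join(log_directory, output_name),
    )
```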
Collibra Integration¶
Before You Start¶
To ensure a seamless integration between the data objects in Collibra and Okera, the scripts accept the following Collibra asset type hierarchy:
- Database
- Table
- Column
These assets may be stored in any domain or community type.
The integration maps Collibra attributes to Okera tags. You can use preexisting Collibra attributes or custom attributes.
Quick Start¶
config.yaml
Create a config.yaml file that stores Okera and Collibra connection information and the location of the assets file.
Here is an example of a quick start configuration for the Collibra integration with only the required fields filled in.
# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
collibra_dgc: "https://example.collibra.com:443"
collibra_username: "username"
collibra_password: "password"
collibra_password_file: "creds.txt"
collibra_assets: "path/to/assets.yaml"
logging_level: "debug"
mapped_collibra_attributes:
sync_descriptions: False
mapped_collibra_statuses:
  statuses:
  okera_namespace:
full_name_prefixes:
Set the location of the config.yaml file as an environment variable:
$ export CONF=path/to/config.yaml
assets.yaml
Create an assets.yaml file that provides the name and ID of the Collibra community and domain in which the asset is located or to which you would like to sync the asset.
# assets.yaml
communities:
  - name: "Example Community"
    id: "123-abc-123"
    domains:
      - name: "Example Domain"
        id: "456-def-456"
        databases:
        tables:
          - "my_db.example_table"
Run bootstrap.sh
$ ./bootstrap.sh
Run the Synchronization
To sync from Okera to Collibra:
$ python3 okera_catalog.py collibra
To sync from Collibra to Okera:
$ python3 catalog_okera.py collibra
Configure the Integration¶
The file bootstrap.sh installs all Python 3 packages needed to run the script:
- PyOkera
- thriftpy
- requests
- PyYAML
User-specific integration information is supplied in config.yaml. In this file, you define Okera and Collibra connection information and can use different configuration options for synchronizing attributes between Collibra and Okera.
Note: The location of config.yaml must be specified as an environment variable using this command:
$ export CONF=path/to/config.yaml
Connection Information
# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
collibra_dgc: "https://example.collibra.com:443"
collibra_username: "username"
collibra_password: "password"
collibra_password_file: "creds.txt"
collibra_assets: "path/to/assets.yaml"
logging_level: "debug"
# …
In the log_directory field, specify the directory to which the log files will be added.
There are multiple ways to set Collibra credentials. They can be set in the configuration file using the fields collibra_username and collibra_password, or they can be set as environment variables (see Collibra Environment Variables).
If the credentials are defined in config.yaml, you can read the Collibra password from a text file instead. With this configuration, leave the collibra_password field blank and set collibra_password_file to the name and location of the text file containing the password.
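The credential options above could be resolved along the lines of the following sketch. The lookup order (inline value, then environment variable, then password file) is an assumption, as the text does not state a precedence, and resolve_collibra_password is a hypothetical helper rather than a function of the integration scripts.

```python
import os


def resolve_collibra_password(config, environ=None):
    """Return the Collibra password from the config, environment, or a file.

    config: dict parsed from config.yaml. Blank YAML fields arrive as None.
    environ: mapping of environment variables (defaults to os.environ).
    """
    environ = os.environ if environ is None else environ
    # Assumed order: an inline password wins over the environment variable.
    password = config.get("collibra_password") or environ.get("collibra_password")
    if password:
        return password
    password_file = (
        config.get("collibra_password_file")
        or environ.get("collibra_password_file")
    )
    if password_file:
        with open(password_file, encoding="utf-8") as handle:
            # Drop the trailing newline that text editors usually add.
            return handle.read().strip()
    raise ValueError("no Collibra password configured")
```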
In the collibra_assets field, specify the path to the YAML file that lists the assets to be synced. See Collibra Assets File.
The logging level can be set to either debug or info, depending on how much information should be logged while running the sync. Running the script automatically creates the two log files described in Logging.
Attribute Mapping
# config.yaml
mapped_collibra_attributes:
  - { attribute_name: "Collibra Classification",
      attribute_id: "123-abc-123abc-123",
      okera_namespace: "collibra_classification",
      prioritize_column_attribute: True
    }
  - { attribute_name: "Custom Attribute 2",
      attribute_id: "456-def-456def-456",
      okera_namespace: "attribute_2",
      prioritize_column_attribute: False
    }
sync_descriptions: True
# …
Under mapped_collibra_attributes, specify the Okera namespaces that should be mapped to Collibra custom attributes. If the Okera namespace already exists, the values of that Collibra attribute are added to it as Okera tags. If it does not exist, a new namespace is created in Okera with the name provided.
The field prioritize_column_attribute lets you further customize the application of column and table attributes. See Prioritize Column Attributes.
For example, suppose you have a custom Collibra attribute called "Collibra Classification" with its value set to "Restricted." The synchronization creates the collibra_classification namespace and adds the tag collibra_classification.restricted to the corresponding object in Okera.
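The mapping from attribute values to tags can be sketched as below. The function name and the dictionary shape for an asset's attributes are hypothetical, and lowercasing the value is inferred from the example above ("Restricted" becomes restricted). How the scripts treat values that contain spaces is not covered here.

```python
MAPPED_COLLIBRA_ATTRIBUTES = [
    {
        "attribute_name": "Collibra Classification",
        "attribute_id": "123-abc-123abc-123",
        "okera_namespace": "collibra_classification",
        "prioritize_column_attribute": True,
    },
]


def tags_for_asset(asset_attributes, mapped_attributes):
    """Return the Okera tags for one asset.

    asset_attributes: dict of Collibra attribute ID to value (hypothetical shape).
    mapped_attributes: the mapped_collibra_attributes list from config.yaml.
    """
    tags = []
    for mapping in mapped_attributes:
        value = asset_attributes.get(mapping["attribute_id"])
        if value:
            # Assumption: the tag name is the lowercased attribute value.
            tags.append(f'{mapping["okera_namespace"]}.{value.strip().lower()}')
    return tags


print(tags_for_asset({"123-abc-123abc-123": "Restricted"}, MAPPED_COLLIBRA_ATTRIBUTES))
```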
The field sync_descriptions defines whether descriptions should be synced. If set to False, the descriptions are not synced to or from Okera.
Status Mapping
# config.yaml
mapped_collibra_statuses:
  statuses:
    - "Accepted"
    - "Reviewed"
  okera_namespace: "collibra_status"
#...
In addition to mapping custom attributes to Okera tags, you can also map Collibra statuses. Statuses that will be mapped to an Okera tag namespace are listed under mapped_collibra_statuses. You can choose the Okera tag namespace to which the statuses are added and the statuses that should be mapped.
For example, using the values above, only the statuses Accepted and Reviewed are synced to Okera, and they are added to the Okera tag namespace collibra_status. Assets with those statuses are tagged as collibra_status.accepted or collibra_status.reviewed.
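A short sketch of the status filter described above: statuses that are not listed produce no tag. The function name status_tag is hypothetical, and lowercasing the status follows the example tags shown above.

```python
MAPPED_COLLIBRA_STATUSES = {
    "statuses": ["Accepted", "Reviewed"],
    "okera_namespace": "collibra_status",
}


def status_tag(status, mapping):
    """Return the Okera tag for a Collibra status, or None if it is not mapped."""
    if status not in (mapping.get("statuses") or []):
        return None
    return f'{mapping["okera_namespace"]}.{status.lower()}'


print(status_tag("Accepted", MAPPED_COLLIBRA_STATUSES))
```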
Prefixes
# config.yaml
full_name_prefixes:
  - "Prefix One"
  - "Prefix Two"
If the full name of an asset in Collibra has a prefix that the asset does not have in Okera, the prefix must be added under full_name_prefixes. Here you can list the prefixes that the script splits off to correctly map the name of the asset in Collibra to the name of the asset in Okera.
For example, suppose the database my_db has the prefix abc in Collibra, making its full name abc.my_db. The prefix abc should be added to the list full_name_prefixes. This ensures it is mapped correctly to my_db in Okera.
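Splitting off a prefix can be sketched as follows. The helper strip_prefix is hypothetical, and it assumes that the prefix is separated from the rest of the name by a dot, as in abc.my_db.

```python
def strip_prefix(collibra_full_name, full_name_prefixes):
    """Remove the first matching prefix from a Collibra full name."""
    for prefix in full_name_prefixes:
        marker = prefix + "."
        # Match the whole prefix segment, so "abc" does not match "abcd.my_db".
        if collibra_full_name.startswith(marker):
            return collibra_full_name[len(marker):]
    return collibra_full_name


print(strip_prefix("abc.my_db", ["abc"]))
```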
Collibra Environment Variables¶
The script accepts a number of environment variables that allow you to define fields and locations outside of config.yaml. Fields that are set as environment variables can be left empty in config.yaml. The only required environment variable is the location of the config.yaml file.
Okera Token
$ export TOKEN=$TOKEN
Collibra Credentials
$ export collibra_username=$username
$ export collibra_password=$password
$ export collibra_password_file=$password_file
Prioritize Column Attributes¶
If prioritize_column_attribute is set to True and a table has a value for an attribute at both the table level and the column level, the script applies only the column-level attributes. This is because Okera table-level attributes are implicitly applied to all the columns in the table, which you may not want to occur. See Okera's ABAC docs.
For example, suppose you added the Collibra Classification attribute with the value "Confidential" to a table and with the value "Restricted" to one of its columns. If prioritize_column_attribute is set to True for the Collibra Classification attribute, then in Okera the column is tagged with collibra_classification.restricted and the table is not tagged at all.
If prioritize_column_attribute is set to False (which is the default), Collibra attributes are applied in Okera on the table and columns as usual.
Note: Only tags within these specified namespaces are modified when synchronizing from Collibra to Okera. Existing tags in Okera are not affected by the synchronization.
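The prioritization rule can be sketched for a single attribute as follows. The function name and its inputs are hypothetical; the sketch only mirrors the behavior described in this section.

```python
def plan_tags(table_value, column_values, namespace, prioritize_column_attribute=False):
    """Return (table_tags, column_tags) for one mapped attribute.

    table_value: attribute value on the table, or None.
    column_values: dict of column name to attribute value (None if unset).
    """
    def tag(value):
        return f"{namespace}.{value.lower()}"

    column_tags = {
        column: [tag(value)] for column, value in column_values.items() if value
    }
    # With prioritization on, any column-level value suppresses the table tag.
    skip_table = prioritize_column_attribute and bool(column_tags)
    table_tags = [tag(table_value)] if table_value and not skip_table else []
    return table_tags, column_tags


print(plan_tags("Confidential", {"ccn": "Restricted"}, "collibra_classification", True))
```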
Collibra Assets File¶
The assets to be synced between Collibra and Okera must be defined in the assets.yaml file. You can specify multiple databases and tables from different communities and domains. This file can be renamed and moved as needed, but the structure must be as follows:
communities:
  # name and ID of first community
  - name: "Community 1"
    id: "36fd10bf-f6f3-44a6-807e-915c6a24c521"
    domains:
      # name and ID of first domain
      - name: "Domain 1"
        id: "87a5c7ae-46ca-4e3d-8a6a-b15772fc7264"
        databases:
        tables:
          - "db_1.table_1"
          - "db_2.table_2"
      # name and ID of second domain
      - name: "Domain 2"
        id: "53f54ef8-46ca-4e3d-8a6a-b15772fc7264"
        databases:
          - "db_a"
          - "db_b"
          - "db_c"
        tables:
          - "db_d.table_a"
  # name and ID of second community
  - name: "Community 2"
    id: "45fd10bf-f6f3-66a6-807e-915c6a25c521"
    domains:
      - name: "Domain 3"
        id: "98f5c7ae-46ca-4e3d-8a6a-b17652fc7264"
        databases:
          - "db_123abc"
        tables:
In the example above, assets from two different communities are synced between Okera and Collibra.
In the first community, Community 1, assets are synced from two domains, Domain 1 and Domain 2.
- In Domain 1, the tables db_1.table_1 and db_2.table_2 are synced. No entire database is synced, so the databases: list is left empty.
- In Domain 2, the databases db_a, db_b, and db_c are synced along with one table, db_d.table_a.
In Community 2, the database db_123abc from the domain Domain 3 is synced. No tables are synced from Domain 3, so the tables: list is left empty.
Note: When listing databases and tables, the full name of the asset must be entered, including any prefixes. If a prefix has been added to the full name in Collibra (e.g., Prefix One.okera_sample.users), the prefix must also be added to config.yaml under full_name_prefixes.
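Once the assets file has been parsed, its nested structure can be walked as in the sketch below. The dictionary stands in for the result of parsing a trimmed version of the file above with a YAML library, where an empty databases: or tables: key comes back as None. The generator name iter_assets is hypothetical.

```python
ASSETS = {
    "communities": [
        {
            "name": "Community 1",
            "id": "36fd10bf-f6f3-44a6-807e-915c6a24c521",
            "domains": [
                {
                    "name": "Domain 1",
                    "id": "87a5c7ae-46ca-4e3d-8a6a-b15772fc7264",
                    "databases": None,  # empty list in the YAML file
                    "tables": ["db_1.table_1", "db_2.table_2"],
                },
            ],
        },
    ],
}


def iter_assets(assets):
    """Yield (community, domain, kind, full_name) for every listed asset."""
    for community in assets.get("communities") or []:
        for domain in community.get("domains") or []:
            # "or []" covers keys that were left empty in the file.
            for database in domain.get("databases") or []:
                yield (community["name"], domain["name"], "database", database)
            for table in domain.get("tables") or []:
                yield (community["name"], domain["name"], "table", table)


for entry in iter_assets(ASSETS):
    print(entry)
```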
Alation Integration¶
Before You Start¶
To ensure a seamless integration between the data objects in Alation and Okera, the scripts accept the following Alation object type hierarchy:
- Data source
- Schema
- Table
- Column
Synchronizing from Okera to Alation does not create any data sources, only schemas that map to Okera databases. The integration maps Alation fields to Okera tags. You can use preexisting Alation fields or custom fields.
Quick Start¶
config.yaml
Create a config.yaml file that contains Okera and Alation connection information and the location of the objects file.
Here is an example of a quick start configuration for the Alation integration with only the required fields filled in.
# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
alation_host: "https://example.alationcatalog.com"
alation_token: "678ghi91011jkl"
alation_objects: "path/to/objects.yaml"
logging_level: "debug"
mapped_alation_fields:
Set the location of config.yaml as an environment variable:
$ export CONF=path/to/config.yaml
objects.yaml
Create an objects.yaml file to store the ID of the Alation data source in which the object is located or to which you would like to sync the object.
# objects.yaml
data_sources:
  - id: 123
    schemas:
    tables:
      - "my_schema.example_table"
Run bootstrap.sh
$ ./bootstrap.sh
Run the Synchronization
To sync from Okera to Alation:
$ python3 okera_catalog.py alation
To sync from Alation to Okera:
$ python3 catalog_okera.py alation
Configure the Integration¶
The file bootstrap.sh installs all Python 3 packages needed to run the script:
- PyOkera
- thriftpy
- requests
- PyYAML
User-specific integration information is supplied in the config.yaml file. In this file, you define Okera and Alation connection information and can use different configuration options for synchronizing attributes between Alation and Okera.
Note: The location of the config.yaml file must be specified as an environment variable as follows:
$ export CONF=path/to/config.yaml
Connection Information
# config.yaml
log_directory: "log_dir"
okera_host: "example.okera.com"
okera_port: 12050
okera_token: "123abc456def"
alation_host: "https://example.alationcatalog.com"
alation_token: "678ghi91011jkl"
alation_objects: "path/to/objects.yaml"
logging_level: "info"
# …
In the log_directory field, specify the directory to which the log files alation_okera.log and okera_alation.log will be added.
In the alation_objects field, specify the YAML file that lists the Alation objects to be synced. See Alation Objects File.
The logging_level field can be set to either debug or info, depending on how much information should be logged while running the synchronization. Running the script automatically creates the two log files described in Logging.
Attribute Mapping
# config.yaml
mapped_alation_fields:
  - {name: "Alation Classification",
     okera_namespace: "alation_classification"}
  - {name: "Custom Field 2",
     okera_namespace: "custom_field2"}
Under mapped_alation_fields, specify the Okera namespaces that should be mapped to Alation fields. If the Okera namespace already exists, the value of the Alation field is added as an Okera tag. If it does not exist, a new namespace is created in Okera with the name provided.
For example, suppose you have a custom Alation field called Alation Classification with a value of Confidential. The synchronization creates the alation_classification namespace and adds the tag alation_classification.confidential to the corresponding object in Okera.
Environment Variables¶
The script accepts a number of environment variables that allow you to define fields and locations outside of config.yaml. Fields that are set as environment variables can be left empty in config.yaml. The only required environment variable is the location of the config.yaml file.
Okera Token
$ export TOKEN=$TOKEN
Alation Token
$ export alation_token=$alationtoken
Alation Objects File¶
The objects to be synced between Alation and Okera must be defined in the objects.yaml file. You can specify multiple schemas and tables from different data sources. This file can be renamed and moved as needed, but the structure must be as follows:
data_sources:
  - id: 12
    schemas:
      - "schema_1"
      - "schema_2"
    tables:
  - id: 34
    schemas:
    tables:
      - "schema_a.table_a"
      - "schema_b.table_b"
  - id: 56
    schemas:
      - "schema_a1"
    tables:
      - "schema_b2.table_1"
In the example above, objects from three different data sources are synchronized between Okera and Alation.
- In data source 12, the schemas schema_1 and schema_2 are synced. No tables are synced within this data source.
- In data source 34, the tables schema_a.table_a and schema_b.table_b are synced. No entire schema is synced, so the schemas: list is left empty.
- In data source 56, the schema schema_a1 and the table schema_b2.table_1 are synced.
Note: When listing schemas and tables, the full name of the object must be entered.
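The following sketch checks whether a given table falls within the scope of the objects file, either because its whole schema is listed or because the table is listed by name. The function is_in_scope is hypothetical. It assumes that listing a schema covers every table in it and that the schema name is the part of the full name before the first dot. The dictionary stands in for the parsed example file above, with empty keys represented as None.

```python
OBJECTS = {
    "data_sources": [
        {"id": 12, "schemas": ["schema_1", "schema_2"], "tables": None},
        {"id": 34, "schemas": None, "tables": ["schema_a.table_a", "schema_b.table_b"]},
        {"id": 56, "schemas": ["schema_a1"], "tables": ["schema_b2.table_1"]},
    ],
}


def is_in_scope(objects, data_source_id, table_full_name):
    """Return True if the table is covered by the objects file."""
    schema_name = table_full_name.split(".", 1)[0]
    for source in objects.get("data_sources") or []:
        if source.get("id") != data_source_id:
            continue
        # Assumption: a listed schema covers every table inside it.
        if schema_name in (source.get("schemas") or []):
            return True
        if table_full_name in (source.get("tables") or []):
            return True
    return False


print(is_in_scope(OBJECTS, 56, "schema_b2.table_1"))
```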