Quick Start Guide: Configuring Auto-Tagger

Introduction

This document describes how to configure and verify the Auto-Tagger feature in Okera.

What is Auto-Tagger?

The Auto-Tagger is a feature built into Okera that classifies dataset data column by column. At a high level, the Auto-Tagger samples data in a dataset and suggests Tags based on the sampled content and the evaluation of a set of rules. Each rule is defined as a Regular Expression and the name of the tag to set when enough of a column's sampled values match the regex. Rules can be configured to tag dataset columns as containing data such as Phone Numbers, Social Security Numbers, or other PII.

Configuring Okera for Auto-Tagger

The Auto-Tagger is configured using a well-formed JSON document stored in a cloud path that is accessible by Okera. The location of this document is set in the cluster’s environment via the OKERA_AUTOTAGGER_CONFIGURATION environment variable.

For example, the following environment variable will cause the Planner to configure the Auto-Tagger feature with the settings specified in the document stored in S3:

export OKERA_AUTOTAGGER_CONFIGURATION=s3://bucket/path/to/autotagger-config.json
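
Before restarting anything, it can help to confirm what the variable actually points to. The following Python sketch is only an illustration and is not part of Okera: the function name autotagger_config_location is hypothetical, and it simply reads the variable and splits the URI into scheme, bucket, and key with the standard library.

```python
import os
from urllib.parse import urlparse


def autotagger_config_location(env=os.environ):
    """Return (scheme, bucket, key) for the configured document, or None if unset.

    Hypothetical helper for checking the variable; Okera does not ship it.
    """
    uri = env.get("OKERA_AUTOTAGGER_CONFIGURATION")
    if not uri:
        return None
    parsed = urlparse(uri)
    # For s3://bucket/path/to/file.json, netloc is the bucket and path is "/path/to/file.json".
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


# Uses a plain dictionary in place of the real environment.
print(autotagger_config_location(
    {"OKERA_AUTOTAGGER_CONFIGURATION": "s3://bucket/path/to/autotagger-config.json"}
))
```

A result of None means the variable is not set in the environment the script runs in, which may differ from the Planner's environment.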

Configuring Okera for Well-Known Tags

The Okera Team has authored an out-of-the-box configuration file that contains a set of rules for some Well-Known tags. The configuration file can be used as a starting point for customizing the Auto-Tagger for your specific business needs.

The Well-Known tags configuration file can be downloaded from S3 at either of the following locations:

s3://okera-release-useast/1.4.0/autotagger/autotagger-config-wellknowns.json
s3://okera-release-uswest/1.4.0/autotagger/autotagger-config-wellknowns.json

To configure your cluster with this configuration file, perform the following steps:

  1. Download a copy of the autotagger-config-wellknowns.json file.
  2. Copy the file to a cloud storage location that Okera can access (e.g. an S3 bucket/key path).
  3. Set the OKERA_AUTOTAGGER_CONFIGURATION environment variable to that location, either in your cluster’s env.sh (then restart the Deployment Manager) or in the cerebro-planner Kubernetes deployment.

Authoring Auto-Tagger Rules

The configuration file for the Auto-Tagger is a well-formed JSON document containing an array named ‘rules’ with the following format:

{
    "rules": [
        { <rule 1 details> },
        { <rule 2 details> },
        ...
    ]
}
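
A quick local check of this top-level shape can catch a malformed document before it is uploaded. The sketch below is an assumption-laden illustration, not Okera's loader: load_rules is a hypothetical helper, and the tag values in the usage line are placeholders.

```python
import json


def load_rules(text):
    """Parse a configuration document and return its list of rule dictionaries.

    Hypothetical helper; it only checks the shape described above.
    """
    # json.loads raises json.JSONDecodeError (a ValueError) if the text is not well-formed JSON.
    document = json.loads(text)
    rules = document.get("rules") if isinstance(document, dict) else None
    if not isinstance(rules, list):
        raise ValueError('expected a JSON object containing a "rules" array')
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            raise ValueError(f"rule {index} is not a JSON object")
    return rules


# Placeholder document with two skeletal rules.
print(len(load_rules('{"rules": [{"tag": "email"}, {"tag": "phone"}]}')))
```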

A Rule contains the criteria describing when the Auto-Tagger should set a tag on a dataset column. The Auto-Tagger samples a number of rows from the dataset and evaluates each Rule’s expression against the sampled values of each column. After sampling, if the proportion of matches for a column is greater than the Rule’s minimumMatchRate, the Auto-Tagger sets the tag on that column.

Each rule is a dictionary containing the following fields:

- description: a description of what this rule matches
- namespace: a namespace name for the tag; this is good for grouping similar tags
- tag: a tag name; this is the name of the tag that is set when the rule is evaluated
- expression: a Java-compatible Regular Expression encoded as a JSON string; because it is a JSON string, each backslash in the regex must be written as two backslashes
- minimumRows: the minimum number of rows to sample before enabling this rule for evaluation; if a dataset does not contain the minimum number of rows then the tag will never be set
- maximumRows: the maximum number of rows to sample; this limits the number of dataset rows sampled
- minimumMatchRate: the proportion (expressed as a number between 0.0 and 1.0) of expression matches that a dataset column requires to warrant setting the tag
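
The field list above can be turned into a simple pre-upload check. This Python sketch is illustrative only and rests on several assumptions that the guide does not state: that every field is required, that minimumRows should not exceed maximumRows, and that compiling the expression with Python's re module is a useful approximation (Okera uses Java regular expressions, whose syntax differs in places). The name rule_problems is hypothetical.

```python
import re

# Assumed types for each field, inferred from the descriptions above.
EXPECTED_FIELDS = {
    "description": str,
    "namespace": str,
    "tag": str,
    "expression": str,
    "minimumRows": int,
    "maximumRows": int,
    "minimumMatchRate": (int, float),
}


def rule_problems(rule):
    """Return a list of human-readable problems found in one rule dictionary."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in rule:
            problems.append(f"missing field: {name}")
        elif isinstance(rule[name], bool) or not isinstance(rule[name], expected):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems  # later checks need every field to be present and typed
    try:
        re.compile(rule["expression"])  # Python approximation of a Java regex check
    except re.error as exc:
        problems.append(f"expression does not compile: {exc}")
    if rule["minimumRows"] > rule["maximumRows"]:
        problems.append("minimumRows exceeds maximumRows")
    if not 0.0 <= rule["minimumMatchRate"] <= 1.0:
        problems.append("minimumMatchRate is outside 0.0 to 1.0")
    return problems
```

An empty list means the rule passed these checks; it does not guarantee that Okera will accept the rule.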

An example Auto-Tagger Rule:

{
    "description": "Matches well-formed email addresses",
    "namespace": "autotagger",
    "tag": "email",
    "expression": "^(\\D)+(\\w)*((\\.(\\w)+)?)+@(\\D)+(\\w)*((\\.(\\D)+(\\w)*)+)?(\\.)[a-z]{2,}$",
    "minimumRows": 10,
    "maximumRows": 100,
    "minimumMatchRate": 0.5
}

In the above example, the Auto-Tagger sets the autotagger:email tag on a dataset column when the dataset has at least 10 rows of data and the expression matches enough of that column’s sampled values to satisfy the minimumMatchRate of 0.5 (50% of the sampled rows). The Auto-Tagger samples a maximum of 100 rows of data from the dataset.
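
To make the arithmetic concrete, the decision for one column can be sketched in Python using the example rule. This is a simplified model, not Okera's implementation, and it makes several assumptions: the sample is taken as the first maximumRows values (the real sampling strategy is not documented here), the comparison is strictly greater than minimumMatchRate, and Python's re module stands in for Java's regex engine. The function should_tag and the sample values are hypothetical.

```python
import re

# The example rule after JSON decoding: each "\\" in the JSON text is a single backslash here.
EMAIL_RULE = {
    "namespace": "autotagger",
    "tag": "email",
    "expression": r"^(\D)+(\w)*((\.(\w)+)?)+@(\D)+(\w)*((\.(\D)+(\w)*)+)?(\.)[a-z]{2,}$",
    "minimumRows": 10,
    "maximumRows": 100,
    "minimumMatchRate": 0.5,
}


def should_tag(column_values, rule):
    """Decide whether a column's values satisfy the rule (simplified model)."""
    if len(column_values) < rule["minimumRows"]:
        return False  # too few rows: the rule is never enabled
    sample = column_values[: rule["maximumRows"]]  # assumed sampling: first N values
    if not sample:
        return False
    pattern = re.compile(rule["expression"])
    matches = sum(1 for value in sample if pattern.search(value))
    return matches / len(sample) > rule["minimumMatchRate"]


# Made-up column: 6 of the 10 values look like email addresses.
column = [
    "alice@example.com", "bob@example.org", "carol@example.net",
    "dave@example.com", "erin@example.org", "frank@example.net",
    "n/a", "unknown", "555-1234", "",
]
print(should_tag(column, EMAIL_RULE))      # 6/10 = 0.6 exceeds 0.5
print(should_tag(column[:5], EMAIL_RULE))  # fewer than minimumRows values
```

Running candidate rules against a small export of real column values in this way gives a rough idea of whether minimumMatchRate is set too high or too low.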

Troubleshooting the Auto-Tagger

The Auto-Tagger will run automatically on datasets discovered by a Crawler if the OKERA_AUTOTAGGER_CONFIGURATION environment variable has been set. The suggested tags are visible from the Data Registration page (click on Edit Schema). If you’re finding that the Auto-Tagger is not setting tags as expected, try the following:

First, create a Crawler and verify that it has discovered some datasets that you expect the Auto-Tagger to tag. Then do the following:

  1. Check that the Planner logs contain the string "Started executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name". If this string is missing from the logs, verify that the OKERA_AUTOTAGGER_CONFIGURATION environment variable exists in the Planner’s environment.

  2. Check that the Planner logs do not contain the string "Failure in RegexAutotagger loading configuration". If they do, Okera either cannot access the JSON configuration file or the file is not well-formed JSON in the expected format.

  3. Check that the Planner logs contain the string "Finished executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name". If this string appears for the expected dataset, the Auto-Tagger ran to completion, so review the rule’s expression and its minimumRows, maximumRows, and minimumMatchRate settings for correctness.
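
The three checks above can be applied to a saved copy of the Planner log with a short script. The sketch below is hypothetical and not an Okera tool: it assumes the log has already been exported to a string, it searches only for the fixed portion of each message (treating Database.Name as a placeholder for the actual name, which is an assumption), and the function name diagnose and its return values are invented for illustration.

```python
# Fixed portions of the log messages quoted in the steps above.
STARTED = "Started executing MaintenanceTask: AutotaggerMaintenanceTask"
LOAD_FAILURE = "Failure in RegexAutotagger loading configuration"
FINISHED = "Finished executing MaintenanceTask: AutotaggerMaintenanceTask"


def diagnose(log_text):
    """Map Planner log contents to the troubleshooting step that applies."""
    if STARTED not in log_text:
        return "check that OKERA_AUTOTAGGER_CONFIGURATION is set for the Planner"
    if LOAD_FAILURE in log_text:
        return "check that the configuration file is accessible and well-formed"
    if FINISHED in log_text:
        return "review the rule expression and thresholds"
    return "no finish message found yet"


# Made-up log excerpt for illustration.
sample_log = (
    "Started executing MaintenanceTask: AutotaggerMaintenanceTask: sales.orders\n"
    "Finished executing MaintenanceTask: AutotaggerMaintenanceTask: sales.orders\n"
)
print(diagnose(sample_log))
```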