Configuring Auto-tagging

Introduction

This document describes how to configure and verify the Auto-tagging feature in Okera.

What is Auto-tagging?

Auto-tagging provides the ability to automate the tagging of columns based on the data inside them. At a high level, Okera samples the data as it's crawled and suggests tags that are appropriate based on the evaluation of a set of rules. These rules can be configured to tag dataset columns containing sensitive data such as Phone Numbers, Social Security Numbers or other PII.

Enable Auto-tagging

Auto-tagging is configured using the AUTOTAGGER_CONFIGURATION cluster environment variable (it is enabled by default on clusters that leverage the quickstart configuration).

Once enabled, Okera's out-of-the-box auto-tagging rules run whenever a new crawler is created on the data registration page.

Creating Auto-tagging Rules

Okera ships with several out-of-the-box auto-tagging rules, which appear under the pii namespace. These built-in rules cannot be edited. However, custom user-defined, regex-based auto-tagging rules can be created on the tags page in the UI.

A rule contains the criteria for when to automatically set a tag on a dataset column. Auto-tagging samples a number of rows from the dataset and evaluates each rule's expression against every sampled row to determine whether it matches. After all sampled rows have been evaluated, if the percentage of matches is greater than the rule's minimumMatchRate, the tag is set for the dataset column.

An example Auto-tagging rule:

[Figure: Okera autotagging rule]

Each rule contains the following fields:

  • Tag: the name of the tag that is set when the rule matches
  • Description: a description of what this rule matches
  • Apply rule to: whether the rule matches only cell content, or also metadata such as column names and comments
  • Expression: a Java-compatible regular expression, encoded as a JSON string
  • Minimum Rows: the minimum number of rows to sample before this rule is evaluated; if a dataset contains fewer rows than this, the tag is never set
  • Maximum Rows: the maximum number of dataset rows to sample
  • Minimum Match Rate: the fraction of sampled rows (expressed as a number between 0.0 and 1.0) that must match the expression for the tag to be set on a dataset column

You can also test your expression against a sample value.
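Because the expression is stored as a JSON string, every backslash in the regular expression has to be doubled. The following is a minimal sketch of that encoding step, followed by an offline check against sample values. The phone-number pattern is a hypothetical example, not one of Okera's built-in rules, and Python's re module is used here only as a stand-in for a Java regex engine; the two agree on simple patterns like this one but differ in some details.

```python
import json
import re

# Hypothetical pattern for illustration; not taken from Okera's built-in rules.
pattern = r"^\d{3}-\d{3}-\d{4}$"

# json.dumps doubles each backslash and adds the surrounding quotes,
# producing the form that a JSON string requires.
encoded = json.dumps(pattern)
print(encoded)  # "^\\d{3}-\\d{3}-\\d{4}$"

# Decoding the JSON string returns the original pattern unchanged.
assert json.loads(encoded) == pattern

# Offline check against sample values (assumption: Python's re behaves like
# the Java engine for this simple pattern).
print(bool(re.search(pattern, "555-867-5309")))  # True
print(bool(re.search(pattern, "not a phone")))   # False
```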

In the example rule pictured above, Auto-tagging sets the custom:my_tag tag on a dataset column when the dataset has at least 10 rows of data and the expression evaluates to true for 90% of the sampled rows. At most 1000 rows are sampled from the dataset to evaluate this rule.
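The interaction between Minimum Rows, Maximum Rows, and Minimum Match Rate can be sketched in a few lines of Python. This is an illustration of the rule semantics described in this section, not Okera's implementation: the function name should_set_tag and the phone-number pattern are hypothetical, the defaults mirror the example values (10, 1000, 0.9), and both the use of a partial match (re.search) and the treatment of a match rate exactly at the threshold as sufficient are assumptions.

```python
import re
from itertools import islice
from typing import Iterable, Optional


def should_set_tag(
    values: Iterable[Optional[str]],
    expression: str,
    min_rows: int = 10,
    max_rows: int = 1000,
    min_match_rate: float = 0.9,
) -> bool:
    """Hypothetical sketch: decide whether a column's sample satisfies a rule."""
    # Sample at most max_rows values from the column.
    sample = list(islice(values, max_rows))
    # With fewer than min_rows values (or none at all) the tag is never set.
    if not sample or len(sample) < min_rows:
        return False
    regex = re.compile(expression)
    # Null values count as non-matching (assumption).
    matches = sum(1 for v in sample if v is not None and regex.search(v))
    # Assumption: a rate exactly equal to the threshold counts as a match.
    return matches / len(sample) >= min_match_rate


# Hypothetical pattern and data for illustration.
phone = r"^\d{3}-\d{3}-\d{4}$"
column = [f"555-010-{i:04d}" for i in range(19)] + ["n/a"]

print(should_set_tag(column, phone))      # True: 19 of 20 sampled values match
print(should_set_tag(column[:5], phone))  # False: fewer than 10 rows available
```

Running a rule through a sketch like this with a handful of representative values is a quick way to see whether a tag is being missed because of the expression itself or because of the row and match-rate thresholds.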

Troubleshooting Auto-tagging

Auto-tagging runs automatically on datasets discovered by a Crawler if the AUTOTAGGER_CONFIGURATION environment variable has been set. The suggested tags are visible from the Data Registration page (click on Edit Schema).

If the tags have not been set as expected, first create a Crawler and verify that it has discovered the datasets you expect Auto-tagging to be applied to. Then work through the following checks:

  1. Check that the Planner logs contain the string "Started executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name". If this string is missing from the logs, verify that the OKERA_AUTOTAGGER_CONFIGURATION environment variable exists in the Planner's environment.

  2. Check that the Planner logs do not contain the string "Failure in RegexAutotagger loading configuration". If they do, Okera either cannot access the JSON configuration file or the file is not well-formed JSON in the expected format.

  3. Check that the Planner logs contain the string "Finished executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name". If you find this string for the expected dataset, the task ran to completion, so the rule's expression and its row and match-rate settings should be reevaluated for correctness.
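The three log checks above can be scripted once the Planner logs have been exported as text. The sketch below is one possible way to do that. The function name diagnose and the wording of its findings are hypothetical, and it matches only the fixed prefix of each log message, on the assumption that "Database.Name" stands for the actual database name in real log lines.

```python
from typing import Iterable, List

# Log fragments quoted from the troubleshooting steps; the trailing
# database name is omitted so that any database matches (assumption).
STARTED = "Started executing MaintenanceTask: AutotaggerMaintenanceTask:"
FINISHED = "Finished executing MaintenanceTask: AutotaggerMaintenanceTask:"
CONFIG_FAILURE = "Failure in RegexAutotagger loading configuration"


def diagnose(log_lines: Iterable[str]) -> List[str]:
    """Hypothetical helper: map Planner log contents to the checks above."""
    lines = list(log_lines)
    started = any(STARTED in line for line in lines)
    finished = any(FINISHED in line for line in lines)
    failed = any(CONFIG_FAILURE in line for line in lines)

    findings = []
    if not started:
        findings.append("task never started: check the autotagger environment variable")
    if failed:
        findings.append("configuration failed to load: check the JSON file")
    if started and finished and not failed:
        findings.append("task completed: review the rule expression and thresholds")
    return findings


# Example with made-up log lines; real Planner log lines will carry extra
# fields such as timestamps, which substring matching tolerates.
sample_log = [
    "Started executing MaintenanceTask: AutotaggerMaintenanceTask: sales",
    "Finished executing MaintenanceTask: AutotaggerMaintenanceTask: sales",
]
print(diagnose(sample_log))
```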