This document describes how to configure and verify the Auto-tagging feature in Okera.
What is Auto-tagging?
Auto-tagging provides the ability to automate the tagging of columns based on the data inside them. At a high level, Okera samples the data as it's crawled and suggests tags that are appropriate based on the evaluation of a set of rules. These rules can be configured to tag dataset columns containing sensitive data such as Phone Numbers, Social Security Numbers or other PII.
Auto-tagging is configured using the AUTOTAGGER_CONFIGURATION cluster environment variable (it is enabled by default on clusters that use the quickstart configuration).
Once enabled, Okera's out-of-the-box auto-tagging rules run when a new crawler is created on the Data Registration page.
Creating Auto-tagging Rules
Okera has several out of the box auto-tagging rules that appear under the
These tagging rules cannot be edited; however, custom user-defined, regex-based auto-tagging rules can be created on the Tags page in the UI.
A rule contains the criteria for when to automatically set a tag on a dataset column.
Auto-tagging samples a number of rows from the dataset and evaluates each rule's expression against them to determine whether it matches.
After sampling, if the percentage of matching rows is greater than the rule's minimumMatchRate, the tag is set for the dataset column.
An example Auto-tagging rule:
Each rule contains the following fields:
- Tag: the name of the tag that is set when the rule matches
- Description: a description of what this rule matches
- Apply rule to: whether this rule should match only cell content, or also metadata such as column names and comments
- Expression: a Java-compatible regular expression, encoded as a JSON string
- Minimum Rows: the minimum number of rows to sample before enabling this rule for evaluation; if a dataset contains fewer rows than this, the tag will never be set
- Maximum Rows: the maximum number of rows to sample; this limits the number of dataset rows sampled
- Minimum Match Rate: the fraction of sampled rows (a number between 0.0 and 1.0, so 0.9 means 90%) that must match the expression for the tag to be set on a dataset column
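Because the Expression field is a regular expression encoded as a JSON string, backslashes in the pattern are doubled once it is encoded. The following is a minimal sketch of that encoding step in Python; the Social Security Number pattern is a hypothetical example, not one of Okera's built-in rules, and Python's `re` module is only an approximation of Java's regex dialect, so patterns that use Java-specific constructs should be checked with the UI's own expression test.

```python
import json
import re

# Hypothetical pattern for values shaped like 123-45-6789.
raw_expression = r"^\d{3}-\d{2}-\d{4}$"

# JSON encoding wraps the pattern in quotes and escapes each backslash.
encoded = json.dumps(raw_expression)
print(encoded)  # "^\\d{3}-\\d{2}-\\d{4}$"

# Decoding returns the original pattern, which can then be tried on a value.
decoded = json.loads(encoded)
print(bool(re.fullmatch(decoded, "123-45-6789")))  # True
```

Round-tripping the pattern through `json.dumps` and `json.loads` is a quick way to confirm that no escape was lost before pasting an expression into a configuration file.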
You also have the ability to test your expression against a value.
In the above example, Auto-tagging sets the custom:my_tag tag on a dataset column when the dataset has at least 10 rows of data and the expression matches 90% of the rows sampled for that column.
A maximum of 1000 rows of data will be sampled from the dataset to evaluate this rule.
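The way Minimum Rows, Maximum Rows, and Minimum Match Rate interact can be sketched in a few lines of Python. This is an illustration of the behavior described above, not Okera's implementation: the function name `should_set_tag` and the phone-number pattern are hypothetical, the sketch assumes the whole cell value must match the expression, and it treats a match rate exactly equal to the threshold as sufficient, which the text does not state precisely.

```python
import re
from itertools import islice
from typing import Iterable, Optional


def should_set_tag(
    values: Iterable[Optional[str]],
    expression: str,
    minimum_rows: int = 10,
    maximum_rows: int = 1000,
    minimum_match_rate: float = 0.9,
) -> bool:
    """Hypothetical helper: decide whether one column's values warrant the tag."""
    pattern = re.compile(expression)
    # Sample at most maximum_rows values from the column.
    sampled = list(islice(values, maximum_rows))
    # With too few rows the rule is not evaluated, so the tag is not set.
    if not sampled or len(sampled) < minimum_rows:
        return False
    # Assumption: a value counts only if the entire value fits the expression.
    matches = sum(1 for v in sampled if v is not None and pattern.fullmatch(v))
    # Assumption: a rate equal to the threshold is enough to set the tag.
    return matches / len(sampled) >= minimum_match_rate


phone = r"\d{3}-\d{3}-\d{4}"
column = ["555-123-4567"] * 19 + ["n/a"]  # 19 of 20 values match (95%)
print(should_set_tag(column, phone))      # True
print(should_set_tag(column[:5], phone))  # False: fewer than 10 rows
```

The defaults mirror the example rule (10, 1000, and 0.9), so the second call returns False purely because the column has too few rows, regardless of how well the values match.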
Auto-tagging runs automatically on datasets discovered by a Crawler if the AUTOTAGGER_CONFIGURATION environment variable has been set. The suggested tags are visible from the Data Registration page (click on Edit Schema).
If the tags have not been set as expected, first create a Crawler and verify that it has discovered the datasets that you expect Auto-tagging to be applied to. Then check the following:
- Check that the Planner logs contain the string Started executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name. If this is missing from the logs, verify that the OKERA_AUTOTAGGER_CONFIGURATION environment variable exists in the Planner's environment.
- Check that the Planner logs do not contain the string Failure in RegexAutotagger loading configuration. If they do, Okera either cannot access the JSON configuration file or the file is not well-formed JSON in the expected format.
- Check that the Planner logs contain the string Finished executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name. If you find this string for the expected dataset, Auto-tagging ran for it, so review the rule's expression and its Minimum Rows, Maximum Rows, and Minimum Match Rate settings for correctness.
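The three log checks above can be applied to a saved copy of the Planner logs with a small script. The sketch below is only a convenience: the function name `diagnose` and its return messages are hypothetical, it assumes the log text has already been collected into a string (how to retrieve Planner logs is not covered here), and it matches the log strings without the trailing Database.Name part so that any dataset matches.

```python
STARTED = "Started executing MaintenanceTask: AutotaggerMaintenanceTask"
FAILED = "Failure in RegexAutotagger loading configuration"
FINISHED = "Finished executing MaintenanceTask: AutotaggerMaintenanceTask"


def diagnose(log_text: str) -> str:
    """Hypothetical helper: map Planner log contents to the next thing to check."""
    if STARTED not in log_text:
        return "Task never started: check the environment variable on the Planner."
    if FAILED in log_text:
        return "Configuration failed to load: check JSON file access and format."
    if FINISHED in log_text:
        return "Task finished: review the rule expression and row/match settings."
    # Assumption: started without a failure or finish line means still running.
    return "Task started but has not finished: check the logs again later."


sample = (
    "Started executing MaintenanceTask: AutotaggerMaintenanceTask: sales.orders\n"
    "Finished executing MaintenanceTask: AutotaggerMaintenanceTask: sales.orders\n"
)
print(diagnose(sample))
```

The checks run in the same order as the list above, so the first message returned points at the earliest step that went wrong.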