Configuring Auto-tagging

Introduction

This document explains how to configure and verify the Auto-tagging feature in Okera.

What is Auto-tagging?

Auto-tagging automates the tagging of columns based on the data they contain. At a high level, Okera samples the data as it is crawled and suggests tags based on the evaluation of a set of rules. These rules can be configured to tag dataset columns containing sensitive data such as phone numbers, Social Security numbers, or other personally identifiable information (PII).

Enable Auto-tagging

Auto-tagging is configured using a well-formed JSON document stored in the cloud at a path that is accessible by Okera. The location of this document is set in the cluster’s environment via the OKERA_AUTOTAGGER_CONFIGURATION environment variable.

For example, the following environment variable causes the Planner to configure the Auto-tagging feature with the settings specified in a document stored in S3:

export OKERA_AUTOTAGGER_CONFIGURATION=s3://bucket/path/to/autotagger-config.json

Once enabled, the Auto-tagging rules specified in the JSON document run when a new crawler is created on the Data Registration page.

Quickstart: Out-of-the-box Auto-tagging

The Okera team has authored an out-of-the-box configuration file that contains a set of rules for detecting common sensitive data. This file can be used as a starting point for customizing Auto-tagging for your specific business needs.

The well-known tags configuration file can be downloaded from S3 at any of the following locations (replace $VERSION with your version of ODAS):

s3://okera-release-useast/$VERSION/autotagger/autotagger-config-wellknowns.json
s3://okera-release-uswest/$VERSION/autotagger/autotagger-config-wellknowns.json
s3://okera-release-euwest/$VERSION/autotagger/autotagger-config-wellknowns.json

To configure your cluster with this configuration file, perform the following steps:

  1. Download a copy of the autotagger-config-wellknowns.json file
  2. Copy the file to a cloud storage location (e.g. an S3 bucket/key path)
  3. Set the OKERA_AUTOTAGGER_CONFIGURATION environment variable in your cluster’s env.sh and restart the Deployment Manager
      export OKERA_AUTOTAGGER_CONFIGURATION=s3://bucket/path/to/autotagger-config.json
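The path handling in these steps can be sketched in Python. This is only an illustration: the helper names (wellknowns_source, export_line) are hypothetical, the bucket names and path layout are copied from the list above, and the version string passed in is whatever your ODAS version happens to be.

```python
# Hypothetical helpers; nothing here is part of Okera itself.
# Bucket names and the path layout are taken from the list of locations above.
REGION_BUCKETS = {
    "useast": "okera-release-useast",
    "uswest": "okera-release-uswest",
    "euwest": "okera-release-euwest",
}


def wellknowns_source(version: str, region: str = "useast") -> str:
    """Return the S3 location of the well-known rules file for one ODAS version."""
    bucket = REGION_BUCKETS[region]
    return f"s3://{bucket}/{version}/autotagger/autotagger-config-wellknowns.json"


def export_line(config_path: str) -> str:
    """Return the line to add to env.sh for a given configuration location."""
    return f"export OKERA_AUTOTAGGER_CONFIGURATION={config_path}"
```

Calling export_line with the location you copied the file to yields the same line shown in step 3.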
    

You can then edit the rules in this configuration file as needed; the registration crawler evaluates all specified rules on each new crawl.

Creating Auto-tagging Rules

The configuration file for Auto-tagging is a well-formed JSON document containing an array named ‘rules’ with the following format:

{
    "rules": [
        { <rule 1 details> },
        { <rule 2 details> },
        ...
    ]
}

A rule contains the criteria describing when to automatically set a tag on a dataset column. Auto-tagging samples a number of rows from the dataset and evaluates each rule’s expression against them to determine whether it matches. After sampling, if the fraction of matches is greater than the rule’s minimumMatchRate, the tag is set on the dataset column.

Note: If the namespace or tag specified doesn’t already exist, it will be automatically created.

Each rule is a dictionary containing the following fields:

  • description: a description of what this rule matches
  • namespace: a namespace name for the tag; this is good for grouping similar tags
  • tag: a tag name; this is the name of the tag that is set when the rule is evaluated
  • expression: a Java-compatible regular expression, encoded as a JSON string
  • minimumRows: the minimum number of rows to sample before enabling this rule for evaluation; if a dataset does not contain at least this many rows, the tag will never be set
  • maximumRows: the maximum number of rows to sample; this limits the number of dataset rows sampled
  • minimumMatchRate: the fraction of expression matches (expressed as a number between 0.0 and 1.0) that a dataset column requires to warrant setting the tag
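Because the expression is a regular expression stored inside a JSON string, every backslash has to be doubled in the file. The sketch below shows one way to let Python's json module do that escaping and to sanity-check a rule before uploading it. It rests on several assumptions: validate_rule and the SSN rule are hypothetical, all seven fields are treated as required (the text above does not say which are optional), the minimumRows/maximumRows comparison is an inferred check, and Python's re module only approximates Java's regular expression dialect.

```python
import json
import re

# Assumption: every field listed above is required and has this JSON type.
REQUIRED_FIELDS = {
    "description": str,
    "namespace": str,
    "tag": str,
    "expression": str,
    "minimumRows": int,
    "maximumRows": int,
    "minimumMatchRate": (int, float),
}


def validate_rule(rule: dict) -> list:
    """Return a list of problems found in one rule (empty if none were found)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in rule:
            problems.append(f"missing field: {name}")
        elif not isinstance(rule[name], expected):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems
    try:
        # Python's regex dialect is close to Java's but not identical,
        # so this only catches gross syntax errors.
        re.compile(rule["expression"])
    except re.error as exc:
        problems.append(f"expression does not compile: {exc}")
    if not 0.0 <= rule["minimumMatchRate"] <= 1.0:
        problems.append("minimumMatchRate must be between 0.0 and 1.0")
    if rule["minimumRows"] > rule["maximumRows"]:
        # Inferred check: such a rule could never sample enough rows.
        problems.append("minimumRows exceeds maximumRows")
    return problems


# Hypothetical rule. The raw string holds the regex exactly as the regex
# engine should see it, with single backslashes.
ssn_rule = {
    "description": "Matches US Social Security Numbers",
    "namespace": "autotagger",
    "tag": "ssn",
    "expression": r"^\d{3}-\d{2}-\d{4}$",
    "minimumRows": 10,
    "maximumRows": 100,
    "minimumMatchRate": 0.5,
}

# json.dumps doubles each backslash, producing the form the file needs.
config_text = json.dumps({"rules": [ssn_rule]}, indent=4)
```

Writing config_text to a file gives a document in the format described above; reading it back with json.loads recovers the original single-backslash expression.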

An example Auto-tagging rule:

{
    "description": "Matches well-formed email addresses",
    "namespace": "autotagger",
    "tag": "email",
    "expression": "^(\\D)+(\\w)*((\\.(\\w)+)?)+@(\\D)+(\\w)*((\\.(\\D)+(\\w)*)+)?(\\.)[a-z]{2,}$",
    "minimumRows": 10,
    "maximumRows": 100,
    "minimumMatchRate": 0.5
}

In the above example, Auto-tagging will set the autotagger:email tag on a dataset column when the dataset has at least 10 rows of data and the expression matches that column in 50% of the sampled rows. A maximum of 100 rows of data will be sampled from the dataset to evaluate this rule.
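The decision described here can be approximated in a few lines of Python. This is a rough model, not Okera's implementation: should_tag is a hypothetical name, taking the first maximumRows values stands in for whatever sampling the crawler really performs, Python's re stands in for the Java regex engine, and a simple SSN-style expression is used in place of the email expression. The comparison uses "greater than", following the rule description; how a match rate exactly equal to minimumMatchRate is treated is not specified here.

```python
import re


def should_tag(values, rule):
    """Model whether a rule would tag a column, given the column's values in row order."""
    # Assumption: the sample is simply the first maximumRows values.
    sample = values[: rule["maximumRows"]]
    # Too few rows: the rule is never enabled, so the tag is never set.
    if len(sample) < rule["minimumRows"]:
        return False
    pattern = re.compile(rule["expression"])
    matches = sum(1 for value in sample if pattern.search(str(value)))
    # "Greater than" follows the rule description; the boundary case is an assumption.
    return matches / len(sample) > rule["minimumMatchRate"]


# Hypothetical rule using the same thresholds as the example above.
rule = {
    "expression": r"^\d{3}-\d{2}-\d{4}$",
    "minimumRows": 10,
    "maximumRows": 100,
    "minimumMatchRate": 0.5,
}

mostly_ssns = ["123-45-6789"] * 6 + ["n/a"] * 4   # 60% of 10 rows match
too_few_rows = ["123-45-6789"] * 9                # fewer than minimumRows
```

With these inputs, should_tag(mostly_ssns, rule) is True, while should_tag(too_few_rows, rule) is False even though every value matches, because the column has fewer than minimumRows rows.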

Troubleshooting Auto-tagging

Auto-tagging runs automatically on datasets discovered by a Crawler if the OKERA_AUTOTAGGER_CONFIGURATION environment variable has been set. The suggested tags are visible from the Data Registration page (click on Edit Schema).

If tags have not been set as expected, first create a Crawler and verify that it has discovered the datasets you expect Auto-tagging to apply to. Then check the following:

  1. Check that the Planner logs contain the string Started executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name. If it is missing, verify that the OKERA_AUTOTAGGER_CONFIGURATION environment variable exists in the Planner’s environment.

  2. Check that the Planner logs do not contain the string Failure in RegexAutotagger loading configuration. If they do, Okera either cannot access the JSON configuration file or the file is not well-formed JSON in the expected format.

  3. Check that the Planner logs contain the string Finished executing MaintenanceTask: AutotaggerMaintenanceTask: Database.Name. If this string appears for the expected dataset, Auto-tagging ran for it, so review the rule’s expression and its minimumRows, maximumRows, and minimumMatchRate settings for correctness.
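The three checks above can be applied mechanically to a copy of the Planner log text. The following sketch assumes you have already obtained that text by some means (how to collect Planner logs is not covered here); diagnose is a hypothetical helper, the search strings are the ones quoted in the steps above without the trailing database name, and the final fallback message is an addition for the case the steps do not cover.

```python
# Log fragments quoted in the checks above (database name omitted).
STARTED = "Started executing MaintenanceTask: AutotaggerMaintenanceTask"
LOAD_FAILURE = "Failure in RegexAutotagger loading configuration"
FINISHED = "Finished executing MaintenanceTask: AutotaggerMaintenanceTask"


def diagnose(log_text: str) -> str:
    """Apply the three checks in order and return the first applicable hint."""
    if STARTED not in log_text:
        return "not started: verify OKERA_AUTOTAGGER_CONFIGURATION in the Planner's environment"
    if LOAD_FAILURE in log_text:
        return "config not loaded: check access to the JSON file and that it is well-formed"
    if FINISHED in log_text:
        return "ran to completion: review the rule expression and row/match-rate settings"
    # Not covered by the checks above; added so the function always answers.
    return "started but not finished: none of the checks above applies"
```

For example, passing a log that contains both the Started and Finished lines returns the "ran to completion" hint, which points at the rule definitions rather than the environment.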