Crawler Overview

Crawlers search a data store to discover its datasets and infer their schemas. After the datasets have been discovered, they can be registered to an Okera database. This topic describes how to create and run a crawler. For information about registering datasets, see Register Structured Data (Datasets).

Note: Unstructured data URIs are not necessarily registered using a crawler. See Register Unstructured Data URIs.

To see the programmatic SQL commands for crawlers, see Programmatic Registration.

Before you create a crawler, an Okera connection must already exist for the data store you want to crawl. See Connect to Data Sources.

Warning

Okera does not recommend registering the same dataset twice from the same connection. Doing so might lead to confusing results when the permissions on the two registrations of the dataset conflict.

Automated tagging, or autotagging, can reduce the manual work of tagging by detecting when a column of data is likely to contain a certain type of formatted data, such as a phone number, and then applying the relevant tag to that column. If autotagging is enabled on the Okera cluster, autotags are applied to newly discovered datasets that contain data matching the specified autotag rules. After a dataset has been registered to the Okera catalog, you can see the tags that were applied to its data. For more information on configuring autotagging, see Configure Autotagging.
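The general idea of rule-based autotagging can be illustrated with a short Python sketch. This is not Okera's implementation or API: the rule names, regular expressions, match threshold, and the functions `suggest_tags` and `autotag_dataset` are all hypothetical, chosen only to show how sampled column values might be matched against format rules.

```python
import re

# Hypothetical autotag rules: tag name -> pattern a value must fully match.
# These patterns are illustrative only; they are not Okera's built-in rules.
AUTOTAG_RULES = {
    "phone_number": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def suggest_tags(sample_values, threshold=0.8):
    """Return tags whose pattern matches at least `threshold` of the non-empty samples."""
    values = [v.strip() for v in sample_values if v and v.strip()]
    if not values:
        return []
    tags = []
    for tag, pattern in AUTOTAG_RULES.items():
        matches = sum(1 for v in values if pattern.fullmatch(v))
        if matches / len(values) >= threshold:
            tags.append(tag)
    return tags


def autotag_dataset(columns):
    """Map each column name to its suggested tags, given {column name: sample values}."""
    return {name: suggest_tags(samples) for name, samples in columns.items()}


# Assumed sample data standing in for values read from a discovered dataset.
sample = {
    "contact": ["555-867-5309", "(212) 555-0100", "415.555.0199"],
    "notes": ["called twice", "left voicemail", "n/a"],
}
print(autotag_dataset(sample))  # {'contact': ['phone_number'], 'notes': []}
```

In this sketch, a tag is suggested only when most sampled values match the rule, so a single stray phone number in a free-text column does not tag the whole column. The threshold of 0.8 is an arbitrary choice for illustration.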

See the following sections for more information on crawlers: