Registering Data With Crawlers

This document outlines how to use Okera's crawlers to register data from source connections.

The Data Registration feature allows users to bulk crawl connections, automatically discover datasets, infer their schemas, and easily register them in the catalog.

Who Has Access to Crawlers?

  • Users who can create crawlers (CREATE_CRAWLER_AS_OWNER on CATALOG)
  • Users who have access to see metadata about a crawler (Any access level on CRAWLER scope)
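
For instance, granting the ability to create crawlers might look like the following. This is a minimal sketch assuming Okera's standard GRANT syntax; the role name is hypothetical:

GRANT CREATE_CRAWLER_AS_OWNER ON CATALOG TO ROLE data_steward_role;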

Create a Crawler

A crawler automatically connects to your data source and discovers any datasets there.

  1. On the Registration page, select the option to create a new crawler.
  2. Enter the following parameters to create your crawler:

    • Choose a name for your crawler.
    • Select the connection you wish to crawl. For more information on connections, see Connections Overview.
    • For relational data connections, specify the database schema information; for object storage, specify the path of the bucket or folder you wish to crawl.

      The following cloud object storage URI structures are supported:

      • AWS - s3://mybucket/

      • Azure ADLS Gen2 - abfss://<file_system>@<account_name>.dfs.core.windows.net/mypath/

        Note: Okera supports Azure Blob Filesystem Storage (abfs) dfs URIs (*.dfs.core.windows.net), but does not support blob URIs (*.blob.core.windows.net).

      • Google - gs://mybucket/

      Configuring Object Storage Crawlers
      By default, a crawler assumes that each dataset is a directory containing one or more files (following the Hive convention), but it can be configured instead to treat each file in a directory as its own dataset.

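      For instance, with a hypothetical bucket, the default (Hive convention) treats the sales directory below as a single dataset:

      s3://mybucket/sales/part-00000.parquet
      s3://mybucket/sales/part-00001.parquet

      With the file-per-dataset configuration, each file below would be registered as its own dataset:

      s3://mybucket/reports/jan.csv
      s3://mybucket/reports/feb.csv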

  3. After you select Run data crawler, the new crawler appears in the list of crawlers with the status Crawling…. See Errors During Crawler Creation to diagnose any common errors during crawler creation.

Register Datasets

Once the crawler has finished crawling, its status changes to 'Crawl complete'. Discovered datasets are now ready to be registered.

Select the crawler to see a list of unregistered datasets on the Unregistered datasets tab.

To see the datasets from the connection that have been registered in the catalog, select the Registered datasets tab.

Note: Object storage crawlers automatically filter out registered datasets. If you wish to re-register an already registered path as a new dataset, you will need to do it manually through the Workspace.
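
A minimal sketch of such a manual registration in the Workspace, assuming a hypothetical Parquet dataset under s3://mybucket/transactions/ and hypothetical database and table names:

CREATE EXTERNAL TABLE salesdb.transactions_v2 (
  id BIGINT,
  amount DOUBLE,
  created_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://mybucket/transactions/';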

Select the datasets you wish to register using the checkboxes, and select a database to register them to. To register datasets to a brand-new database, enter a new database name and choose the 'Create new...' option in the dialog.

Before registering, you can also edit the dataset name and description. See Edit Schemas next.

View SQL

Select the View SQL button associated with a dataset to see the SQL CREATE TABLE statement for that dataset. You can then copy this statement to the clipboard and run it manually in the Workspace.

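For a CSV dataset, the generated statement might look roughly like the following. This is a sketch with hypothetical names and columns; the actual statement depends on the inferred schema:

CREATE EXTERNAL TABLE marketing_db.leads (
  lead_id BIGINT,
  email STRING,
  signup_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://mybucket/leads/';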

Object Name Size Limitations

The maximum length of database, dataset (table), and column names in Okera is 128 characters.

Edit Schemas

Select the name of the dataset to which you would like to make changes. The data schema for the dataset appears.

Within the schema view, you can select the pencil icon next to various fields to edit them.

Select Save or the checkmark to exit editing mode and apply your changes.

For CSV files, you can edit the column names and types within the schema, as well as set the 'skip n lines' option. You can also view the inferred delimiter. If the inferred delimiter is incorrect, see The Inferred Delimiter Is Incorrect below.

If a dataset has been detected as partitioned, the partitioning columns are also listed here. If your partitions were not correctly discovered, see My Dataset Was Not Discovered as Partitioned Correctly below.

Preview

Select Preview to preview a dataset's contents.

Complete Registration

After you have selected the datasets you wish to register, select the icon associated with each dataset, or select the Register Selected button at the top of the page to register all of the selected datasets. A dialog appears, prompting you to select the database for the registered datasets.

You have two database options:

  • Select Existing database to register the datasets to an existing database. Select the database name in the drop-down list.

  • Select Create new to create a new database for the registered datasets. The dialog expands and prompts you for the new database name and description.

After selecting the database option and supplying any required information, select Register dataset to register the datasets.
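
Registering to a new database is roughly equivalent to running DDL in the Workspace. A sketch, with a hypothetical database name (the per-dataset statements are the ones shown by View SQL):

CREATE DATABASE IF NOT EXISTS marketing_db COMMENT 'Datasets discovered by the marketing crawler';
-- ...followed by the generated CREATE statement for each selected dataset.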

Automated Tagging

Automated tagging can reduce the manual work of tagging by detecting when a column is likely to contain a certain type of formatted data, such as a phone number, and then applying the relevant tag to that column. If auto-tagging has been enabled on the cluster, you will see the auto-tags applied on newly discovered datasets containing data that matches the specified rules. For more information on configuring auto-tagging, see here.

Verifying Auto-Tags

When a crawler is finished crawling and is marked Crawl complete, select it to view its unregistered datasets. Then select the name of any unregistered dataset to view its schema. If auto-tagging has detected data that matches a tagging rule, the relevant columns will already be tagged. In the example below, the street address and zip code have been auto-tagged.

To delete an auto-tag that has been applied to a column, select the edit icon to the right of the tag. Then select the X next to any tags you want to delete, and select the checkmark to save these changes.

After the dataset has been registered, you will see that the tags have been applied.

Troubleshooting and Common Questions

Errors During Crawling

In some cases, a crawler may not be able to crawl all datasets from the specified connection. This appears as a crawler-level error. For example, the underlying connection may have a table name that contains an unsupported character, such as $.

Errors While Registering Relational Metadata

When loading table definitions from a connection, any columns that error out are skipped by default and the table is still registered. The skipped columns are annotated with the prefix jdbc.unsupported.column- and can be seen in the describe formatted <table> output, in the Table Properties section.

For example:

jdbc.unsupported.column-NCHAR_COL   Unsupported JDBC type NCHAR

In the Okera Portal UI, the table will show the warning message 'Unsupported type error(s)'.
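
To check for skipped columns, run the following in the Workspace (the table name here is hypothetical) and inspect the Table Properties section of the output for jdbc.unsupported.column-* entries:

DESCRIBE FORMATTED salesdb.customers;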

Errors During Registration

Errors may occur during registration (for example, a table already exists with the same name). The registration for the dataset will fail, and you will be notified in the post-registration dialog. You can see the specific error by selecting the error icon, so you can try to rectify it before attempting to register the dataset again.

My File Wasn't Discovered by the Crawler

There may be some cases in which the crawler is not able to discover a dataset or infer its schema correctly. Currently, the crawler can only detect the Parquet, Avro, JSON, and CSV file formats. Verify that the filename includes the appropriate file extension, e.g., sample.json.

The Inferred Delimiter Is Incorrect

In some instances, the crawler may not infer the correct delimiter for a CSV file. You have two choices:

  • Manually register the dataset by copying the SQL statement and editing the line ...ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'.

  • Alter the dataset to use the correct delimiter after registration by running the following command in the Workspace: ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
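
For instance, if a hypothetical table salesdb.web_logs was registered with a comma delimiter but the underlying file is actually pipe-delimited:

ALTER TABLE salesdb.web_logs SET SERDEPROPERTIES ('field.delim'='|');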

My Dataset Was Not Discovered as Partitioned Correctly

The Okera crawler uses the folder directory naming structure to automatically detect partitions. Make sure your data files have been separated into partition folders and that those folders are named correctly. Once datasets are registered in Okera, the partitioning columns cannot be edited.

An example:

s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/
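
If the layout is correct but the partitions were still not discovered, one option is to register the table manually with an explicit partition column and then recover the partitions. This is a sketch with hypothetical names, assuming your deployment supports Hive-style partition recovery:

CREATE EXTERNAL TABLE salesdb.events (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://bucket/';

ALTER TABLE salesdb.events RECOVER PARTITIONS;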

How Is Crawling Handled?

When a new crawler is created, a crawler task is added to the top of a background maintenance task queue for that cluster. A continuous thread picks up the background maintenance tasks in this queue; automatic partition recovery also runs on this same thread.

How Do I Recrawl a Path?

Select the Rerun button on the crawler.

Note: If you delete from the underlying storage a dataset that was previously discovered by that crawler, the dataset will still appear after selecting Rerun; only new datasets will be added.

Errors During Crawler Creation

Crawler Source Path 's3a://.../' Is Not Accessible

Okera does not have read access to this bucket. Verify that the correct bucket policy has been added.

Crawler Already Exists

The crawler name you’ve chosen already exists. Choose a unique crawler name.

Syntax Error

The crawler name cannot contain spaces or special characters (underscores are allowed), e.g., !"#$%&'()*+,-./:;<=>?@[\]^`{|}~

Bucket Does Not Exist

The bucket could not be found in S3. Verify that the bucket path has been entered correctly.

Errors During Dataset Registration

Duplicate Dataset Names

The dataset could not be registered because a dataset with this name already exists in the database.
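
To check whether a dataset with the same name already exists before retrying, you can list the datasets in the target database (the database name here is hypothetical):

SHOW TABLES IN marketing_db;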

Dataset Failed to Register

One or more of the datasets failed to register.