
Registering data with crawlers

This document outlines how to use Okera's crawlers to register data from source connections.

The Data Registration feature allows users to bulk crawl connections, automatically discover datasets, infer their schemas, and easily register them in the catalog.

Who has access to crawlers?

  • Users who can create crawlers (CREATE_CRAWLER_AS_OWNER on CATALOG)
  • Users who have access to see metadata about a crawler (Any access level on CRAWLER scope)
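For example, crawler-creation rights can typically be granted with Okera's GRANT syntax in the Workspace. A minimal sketch, assuming a role named data_steward_role already exists (the role name is hypothetical):

    -- Hypothetical role name; allows members of the role to create crawlers.
    GRANT CREATE_CRAWLER_AS_OWNER ON CATALOG TO ROLE data_steward_role;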

Create a crawler

A crawler automatically connects to your data source and discovers any datasets there.

  • Select the Create Crawler button on the Registration page
    Create crawler button
  • You will need to input these parameters to create your crawler:

    Create crawler

    • Choose a name for your crawler
    • Select the connection you wish to crawl. For more info on connections see Connections Overview.
    • Input either the database schema information for relational data connections or, for object storage, the path of the bucket or folder you wish to crawl.

    Note

    Supported cloud object storage URI structure

    AWS - s3://mybucket/
    Azure ADLS Gen2 - abfss://<file_system>@<account_name>.dfs.core.windows.net/mypath/
    Azure ADLS Gen1 - adls://<data_lake_store_name>.azuredatalakestore.net/mypath/
    Google - gs://mybucket/
    

    Configuring object storage crawlers

    By default, a crawler treats each directory as its own dataset (following the HDFS convention), but it can be configured to instead treat each individual file as its own dataset.

    configure object storage crawler

  • Once you click Run data crawler, the new crawler will appear in the list of crawlers with the status "Crawling…". See Errors during crawler creation below to diagnose common problems.

Register datasets

Once the crawler has finished crawling, its status will change to ‘Crawl complete’. Discovered datasets are now ready to be registered.

Crawl complete status

You can click into the crawler to see a list of unregistered datasets found.

Unregistered tables

You can see the datasets from that connection that have already been registered in the catalog by toggling the view to ‘Registered’.

Note

Object storage crawlers automatically filter out registered datasets. If you wish to re-register an already registered path as a new dataset, you will need to do it manually through the Workspace.

Registered tables


Select the datasets you wish to register using the checkboxes and select a database to register them to. If you want to register datasets to a brand new database, you can create one by entering a new database name and choosing the 'Create new...' option in the modal.

Create database

Before registering, you can also edit the dataset name and description by clicking the dataset you would like to modify. This will take you to the schema view for this dataset where you can make changes to the names, tags, and more.

Unregistered tables

Actions

Edit schema
Click the dataset name you would like to make changes to. This will open up the data schema for this dataset so you can verify it is correct.

Within the schema view, you can click on the pencil icon next to the various fields shown below to enter editing mode.

Edit Schema

Select the save or checkmark button to exit editing mode once you are ready to apply these changes as shown below.

Edit name

For CSV files, you can edit the column names and types within the schema, as well as set the ‘skip n lines’ option. You can also view the inferred delimiter. If the inferred delimiter is incorrect, see The inferred delimiter is incorrect below.
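The ‘skip n lines’ option typically corresponds to the standard Hive text-table property skip.header.line.count (this mapping is an assumption, not stated above). A sketch of the equivalent post-registration change, with a hypothetical table name:

    -- Assumption: 'skip n lines' maps to the standard Hive property
    -- skip.header.line.count; the table name is hypothetical.
    ALTER TABLE salesdb.weblogs SET TBLPROPERTIES ('skip.header.line.count'='1');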

If a dataset has been detected as partitioned, you will also see the partitioning columns listed here. If your partitions were not discovered correctly, see My dataset was not discovered as partitioned correctly below.

Preview
Click the 'Preview' action to preview a dataset's contents.

Preview table

View SQL
Click the 'View SQL' action to see the SQL CREATE TABLE statement for a dataset. You can then copy this statement to the clipboard and run it manually in the Workspace.

View SQL
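As an illustration only, the generated statement for a CSV dataset generally resembles Hive-style external-table DDL; the database, table, columns, and path below are hypothetical, not output from the product:

    -- Hypothetical sketch of generated DDL for a CSV dataset;
    -- names, columns, and the S3 path are illustrative only.
    CREATE EXTERNAL TABLE marketing.customers (
      id BIGINT,
      name STRING,
      phone STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://mybucket/customers/';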

Completing registration

Once you have selected the datasets you wish to register and the databases you wish to register them to, click the 'Register' icon next to each dataset or the 'Register Selected' option at the top of the page. You will then see a dialog listing the datasets that were successfully registered.

Register Tables

Automated tagging

Automated tagging can reduce the manual work of tagging by detecting when a column is likely to contain a certain type of formatted data, such as a phone number, and then applying the relevant tag to that column. If auto-tagging has been enabled on the cluster, you will see the auto-tags applied on newly discovered datasets that have data matching the specified rules. For more information on configuring auto-tagging, see here.

Verifying auto-tags

When a crawler is finished crawling and is marked ‘Crawl complete’, click on it to view the unregistered datasets. Then click the dataset name on any unregistered dataset to view its schema. If the Auto-Tagger has detected certain formatted data, it will have automatically tagged the columns with that data as shown below.

Auto-tags in registration

To delete an auto-tag that has been applied to a column, click the ‘X’ on any tags you wish to delete, then select the checkmark to save these changes in the Schema view.

When you register the dataset, you will see that the tags have been applied.

Troubleshooting and common questions

Errors during crawling

In some cases a crawler may not be able to crawl all datasets from the specified connection. This will appear as a crawler-level error. An example might be an underlying connection with a table name that contains an unsupported character, e.g. $.

Crawler Error

Errors while registering relational metadata

When loading table definitions from a connection, any columns that error out are skipped by default and the table is still registered. The skipped columns are annotated with the prefix jdbc.unsupported.column- and can be seen in the DESCRIBE FORMATTED <table> output, in the Table Properties section.

For example:

    jdbc.unsupported.column-NCHAR_COL   Unsupported JDBC type NCHAR

In the Okera WebUI, the table will show the warning message Unsupported type error(s).
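For instance, assuming a registered table named salesdb.orders (a hypothetical name), you can inspect its Table Properties in the Workspace with:

    -- Table name is hypothetical; look for jdbc.unsupported.column-*
    -- entries in the Table Properties section of the output.
    DESCRIBE FORMATTED salesdb.orders;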

Errors during registration

In some cases there might be an error during registration, e.g. Table already exists with the same name. The registration for that dataset will fail and you will be notified in the post-registration dialog. You can see the specific error by clicking the error icon, so you can try to rectify it before attempting to register the dataset again.

Registration Error Dialog

Unregistered table Error icon

My file wasn't discovered by the crawler

There may be some cases where the crawler is not able to discover a dataset or infer its schema correctly. Currently the crawler can only detect Parquet, Avro, JSON, and CSV file formats. Please check that the filename includes the file extension, e.g. sample.json.

The inferred delimiter is incorrect

In some instances the crawler may not infer the correct delimiter for a CSV file. You can either register the dataset manually by copying the SQL statement and editing the line

    ...ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'

or alter the dataset to use the correct delimiter after registration by running the following command in the Workspace:

    ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
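For instance, assuming a hypothetical table salesdb.weblogs whose files are actually pipe-delimited:

    -- Hypothetical table name; sets the field delimiter to a pipe character.
    ALTER TABLE salesdb.weblogs SET SERDEPROPERTIES ('field.delim'='|');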

My dataset was not discovered as partitioned correctly

The Okera crawler uses the folder directory naming structure to automatically detect partitions. Please make sure your data files have been separated into partition folders that are named correctly. Once datasets are registered in Okera, the partitioning columns cannot be edited.

An example:

s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/
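If you instead register such a layout manually, a minimal sketch of the DDL (names, columns, and path are hypothetical) declares the partition column with PARTITIONED BY:

    -- Hypothetical names and path; 'year' matches the year=<value> folder
    -- naming convention shown in the layout above.
    CREATE EXTERNAL TABLE logs.events (
      id BIGINT,
      payload STRING
    )
    PARTITIONED BY (year STRING)
    STORED AS PARQUET
    LOCATION 's3://bucket/';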

How is crawling handled?

When a new crawler is created, a crawler task gets added to the top of a background maintenance task queue for that cluster. There is a continuous thread running that picks up background maintenance tasks in this queue – automatic partition recovery is also run on this same thread.

How do I re-crawl a path?

You can click the Rerun button on the crawler. Note: if you delete a dataset in the underlying storage that was previously discovered by that crawler, that dataset will still appear after selecting Rerun; only new datasets will be added.

Errors during crawler creation

Crawler source Path 's3a://.../' is not accessible

Okera does not have read access to this bucket. You will need to make sure the correct bucket policy has been added.

Crawler already exists

The crawler name you’ve chosen already exists. You will need to choose a unique crawler name.

Syntax error

The crawler name cannot contain spaces or special characters other than underscores, e.g. !"#$%&'()*+,-./:;<=>?@[\]^`{|}~

Bucket does not exist

The bucket could not be found in S3. Please check that the bucket path has been entered correctly.

Errors during dataset registration

Duplicate dataset names

Could not register this dataset because a dataset with this name already exists in [database name]

Dataset failed to register

1 or more of the datasets failed to register