Registering Data With Crawlers¶
This document outlines how to use Okera's crawlers to register data from source connections.
The Data Registration feature allows users to bulk crawl connections, automatically discover datasets, infer their schemas, and easily register them in the catalog.
- Create a Crawler
- Register Data From a Crawler
- Enable Autotagging
- Troubleshooting and Common Questions
Who Has Access to Crawlers?¶
- Users who can create crawlers
- Users who have access to see metadata about a crawler
Create a Crawler¶
A crawler automatically connects to your data source and discovers any datasets there.
- On the Registration page, select the option to create a new crawler.
You will need to input these parameters to create your crawler:
- Choose a name for your crawler.
- Select the connection you wish to crawl. For more information on connections, see Connections Overview.
- Specify the database schema information for relational data connections or, for object storage, the path of the bucket or folder you wish to crawl.
The following cloud object storage URI structures are supported:
- Azure ADLS Gen2 (abfs)

Note: Okera supports Azure Blob Filesystem Storage (abfs) addresses (*.dfs.core.windows.net), but does not support other Azure address formats.
Configuring Object Storage Crawlers: By default, a crawler assumes that each dataset is a directory containing many files (following the Hive convention), but it can be configured instead to treat each file in a directory as its own dataset.
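The two grouping modes described above can be sketched in Python (a conceptual illustration, not Okera's implementation; the function name and bucket paths are hypothetical):

```python
import os
from collections import defaultdict

def group_into_datasets(paths, dataset_per_file=False):
    """Group object-store file paths into candidate datasets.

    By default (Hive convention), all files sharing a parent directory
    form one dataset; with dataset_per_file=True, every file becomes
    its own dataset.
    """
    if dataset_per_file:
        return {path: [path] for path in paths}
    datasets = defaultdict(list)
    for path in paths:
        datasets[os.path.dirname(path)].append(path)
    return dict(datasets)

files = [
    "s3://bucket/sales/part-0.parquet",
    "s3://bucket/sales/part-1.parquet",
    "s3://bucket/users/users.csv",
]
print(group_into_datasets(files))
# two datasets: s3://bucket/sales and s3://bucket/users
```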
After you select Run data crawler, the new crawler appears in the list of crawlers with the status 'Crawling...'. See Errors During Crawler Creation to diagnose common errors.
Once the crawler has finished crawling, its status changes to 'Crawl complete'. Discovered datasets are now ready to be registered.
Register Data From a Crawler¶
Select the crawler to see a list of unregistered datasets on the Unregistered datasets tab.
To see the datasets from the connection that have been registered in the catalog select the Registered datasets tab.
Note: Object storage crawlers automatically filter out registered datasets. If you wish to re-register an already registered path as a new dataset, you will need to do it manually through the Workspace.
Select the datasets you wish to register using the checkboxes and select a database to register them to. If you want to register datasets to a brand new database, create one by entering a new database name and choosing the 'Create new...' option in the modal.
Before registering, you can also edit the dataset name and description. See Edit Schemas next.
View SQL: Select the View SQL button associated with a dataset to see the SQL CREATE TABLE statement for that dataset. You can copy this statement to the clipboard and run it manually in the Workspace.
Object Name Size Limitations¶
The maximum length of database, dataset (table), and column names in Okera is 128 characters.
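A quick way to check identifier lengths before registering, sketched in Python (the helper is hypothetical; only the 128-character limit comes from the documentation above):

```python
MAX_NAME_LENGTH = 128  # Okera's limit for database, table, and column names

def is_valid_length(name: str) -> bool:
    """Return True if the identifier fits within Okera's length limit."""
    return len(name) <= MAX_NAME_LENGTH

print(is_valid_length("customer_transactions_2023"))  # True
print(is_valid_length("x" * 200))                     # False
```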
Edit Schemas¶
Select the name of the dataset to which you would like to make changes. The data schema for the dataset appears.
Within the schema view, you can select the pencil icon next to various fields to edit them.
Select Save or the checkmark to exit editing mode and apply your changes.
For CSV files, you can edit the column names and types within the schema, as well as set the 'skip n lines' option. You can also view the inferred delimiter. If the inferred delimiter is incorrect, see The Inferred Delimiter Is Incorrect below.
If a dataset has been detected as partitioned, you will also see the partitioning columns listed here. If your partitions were not correctly discovered, see My Dataset Was Not Discovered as Partitioned Correctly below.
Select Preview to preview a dataset's contents.
After you have selected the datasets you wish to register, select the icon associated with each dataset or select the Register Selected button at the top of the page to register all of the selected datasets. A dialog appears prompting you to select the database for the registered datasets.
You have two database options:
Select Existing database to register the datasets to an existing database. Select the database name in the drop-down list.
Select Create new to create a new database for the registered datasets. The dialog expands and prompts you for the new database name and description.
After selecting the database option and supplying any required information, select Register dataset to register the datasets.
Enable Autotagging¶
Automated tagging can reduce the manual work of tagging by detecting when a column is likely to contain a certain type of formatted data, such as a phone number, and then applying the relevant tag to that column. If auto-tagging has been enabled on the cluster, auto-tags are applied to newly discovered datasets containing data that matches the configured rules. For more information on configuring auto-tagging, see here.
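Conceptually, rule-based auto-tagging works like the following Python sketch (the tag names, regex patterns, and match threshold are illustrative assumptions, not Okera's actual rules):

```python
import re

# Hypothetical tagging rules: tag name -> pattern column values must match.
RULES = {
    "pii.phone_number": re.compile(r"^\+?\d{3}[-.\s]?\d{3}[-.\s]?\d{4}$"),
    "pii.zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def suggest_tags(sample_values, threshold=0.8):
    """Return tags whose pattern matches at least `threshold` of the sample."""
    tags = []
    for tag, pattern in RULES.items():
        matches = sum(1 for v in sample_values if pattern.match(v))
        if sample_values and matches / len(sample_values) >= threshold:
            tags.append(tag)
    return tags

print(suggest_tags(["94105", "10001", "60614"]))   # ['pii.zip_code']
```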
When a crawler is finished crawling and is marked Crawl complete, select it to view its unregistered datasets. Then select the dataset name of any unregistered dataset to view its schema. If autotagging has detected data it thinks requires tagging, the relevant columns are automatically tagged. For example, street address and zip code columns may be auto-tagged.
To delete an auto-tag that has been applied to a column, select the edit icon to the right of the tag. Then select the X next to any tags you want to delete and select the checkmark to save these changes.
After the dataset has been registered, you will see that the tags have been applied.
Troubleshooting and Common Questions¶
Errors During Crawling¶
In some cases, a crawler may not be able to crawl all datasets from the specified connection. This appears as a crawler-level error. For example, the underlying connection may have a table name that contains an unsupported character.
Errors While Registering Relational Metadata¶
When loading table definitions from a connection, any columns that produce errors are skipped by default, and the table is still registered. The skipped columns are annotated with the prefix jdbc.unsupported.column- and can be seen in the describe formatted <table> output, in the Table Properties section. For example:

jdbc.unsupported.column-NCHAR_COL  Unsupported JDBC type NCHAR

These are reported as unsupported type error(s).
Errors During Registration¶
Errors may occur during registration (for example, a table with the same name already exists). The registration for that dataset fails and you are notified in the post-registration dialog. Select the error icon to see the specific error so you can rectify it before attempting to register the dataset again.
My File Wasn't Discovered by the Crawler¶
There may be some cases in which the crawler is not able to discover a dataset or infer its schema correctly. Currently, the crawler is only able to detect Parquet, Avro, JSON, and CSV file formats. Verify that the filename specifies the file format, e.g., with a .csv extension.
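The extension check can be sketched in Python (illustrative only; the crawler's real detection logic is not documented here and may also inspect file contents):

```python
# Map of recognizable file extensions to the formats the crawler supports.
KNOWN_FORMATS = {".parquet": "parquet", ".avro": "avro",
                 ".json": "json", ".csv": "csv"}

def detect_format(filename: str):
    """Return the format implied by the extension, or None if unknown."""
    for ext, fmt in KNOWN_FORMATS.items():
        if filename.lower().endswith(ext):
            return fmt
    return None  # not discoverable by extension alone

print(detect_format("events.CSV"))   # csv
print(detect_format("events.dat"))   # None
```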
The Inferred Delimiter Is Incorrect¶
In some instances, the crawler may not infer the correct delimiter for a CSV file. You have two choices:
* You can manually register the dataset by copying the SQL statement and editing the line
...ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'.
* You can alter the dataset to use the correct delimiter after registration by running the following command in the Workspace:
ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
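To check what delimiter a sample of your file implies before choosing either option, you can approximate the inference locally with Python's standard-library csv.Sniffer (this is not Okera's inference logic; the sample data is hypothetical):

```python
import csv

# A small sample of the file's text, pipe-delimited in this example.
sample = "id|name|signup_date\n1|alice|2023-01-05\n2|bob|2023-02-11\n"

# Restrict the sniffer to plausible delimiters for more reliable results.
dialect = csv.Sniffer().sniff(sample, delimiters="|,;\t")
print(dialect.delimiter)   # |
```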
My Dataset Was Not Discovered as Partitioned Correctly¶
The Okera crawler uses the folder directory naming structure to automatically detect partitions. Make sure your data files have been separated into partition folders and named correctly, for example:

s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/

Once datasets are registered in Okera, the partitioning columns cannot be edited.
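Hive-style partition discovery relies on key=value path segments, as in this Python sketch (illustrative only; the concrete paths and function name are hypothetical):

```python
def partition_values(path: str):
    """Extract Hive-style key=value partition components from a path."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(partition_values("s3://bucket/year=2021/month=07/part-0.parquet"))
# {'year': '2021', 'month': '07'}
```

A path without key=value segments yields no partition columns, which is why misnamed folders prevent partition detection.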
How Is Crawling Handled?¶
When a new crawler is created, a crawler task gets added to the top of a background maintenance task queue for that cluster. There is a continuous thread running that picks up background maintenance tasks in this queue; automatic partition recovery also runs on this same thread.
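The described scheme, a single continuously running thread draining a shared maintenance queue, can be sketched in Python (illustrative only; task names are assumptions, and for simplicity this sketch uses a FIFO queue, whereas the text notes new crawler tasks go to the top of the queue):

```python
import queue
import threading

tasks = queue.Queue()   # shared background maintenance queue
completed = []

def worker():
    # One continuously running thread picks up maintenance tasks.
    while True:
        task = tasks.get()
        if task is None:        # sentinel to stop the worker
            break
        completed.append(f"ran {task}")

thread = threading.Thread(target=worker, daemon=True)
thread.start()

tasks.put("crawl s3://bucket/raw/")           # a new crawler task
tasks.put("recover partitions for db.sales")  # partition recovery shares the thread
tasks.put(None)
thread.join()
print(completed)
```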
How Do I Recrawl a Path?¶
Select the Rerun button on the crawler.
Note: If you delete, in the underlying storage, a dataset that was previously discovered by that crawler, the dataset will still appear after selecting Rerun; only new datasets are added.
Errors During Crawler Creation¶
Crawler Source Path 's3a://.../' Is Not Accessible¶
Okera does not have read access to this bucket. Verify that the correct bucket policy has been added.
Crawler Already Exists¶
The crawler name you’ve chosen already exists. Choose a unique crawler name.
Crawler names cannot contain spaces or special characters other than underscores.
Bucket Does Not Exist¶
The bucket could not be found in S3. Verify that the bucket path has been entered correctly.
Errors During Dataset Registration¶
Duplicate Database Names¶
The dataset could not be registered because a dataset with this name already exists in the database.
Dataset Failed to Register¶
One or more of the datasets failed to register.