Data Registration via the crawler¶
Only System Admins (users who have ‘ALL’ access at the ‘CATALOG’ scope) have access to Data Registration.
The Data Registration feature allows data administrators to crawl paths, automatically discover datasets, infer their schemas, and easily register them in the catalog. When configured, tags can also automatically be applied during data discovery and registration to aid in categorizing, searching, and applying access controls to newly registered datasets.
Create a crawler¶
A crawler automatically connects to your data store and discovers the datasets there.
To create a crawler, use the inputs on the Data Registration page. Choose a name for your crawler and then specify the path for your bucket or folder.
You can also configure your crawler based on the structure of your files. By default, a crawler treats each directory as its own dataset (following the HDFS convention), but it can be configured to instead treat each individual file as its own dataset.
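For example, with the default directory-per-dataset convention, a crawler pointed at a bucket laid out like the following (a hypothetical layout for illustration) would discover two datasets, `sales` and `customers`:

```
s3://my-bucket/sales/part-00000.parquet
s3://my-bucket/sales/part-00001.parquet
s3://my-bucket/customers/data.csv
```

With the file-per-dataset configuration, the same bucket would instead yield three datasets, one per file.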
Once you click ‘Create’, the new crawler will appear in the list of crawlers with the status ‘Crawling…’. See here to diagnose any common errors during crawler creation.
Make sure Okera already has read access to your bucket; you cannot create a crawler on a path Okera cannot read.
Once the crawler has finished crawling, its status will change to ‘Crawl complete’. Discovered datasets are now ready to be registered.
You can click into the crawler to see a list of unregistered datasets found. These paths have not yet been registered in the catalog.
Currently the crawler is only able to detect Parquet, Avro, JSON and CSV file formats.
You can see the paths that were found to have already been registered in the catalog by toggling the view to ‘Registered’.
If you wish to register an already registered path as a new dataset, you will need to do it manually through the workspace.
Select the datasets you wish to register using the checkboxes and select a database to register them to. If you want to register datasets to a brand new database, you can create one by simply entering a new database name and choosing 'Create (your new database name)' from the dropdown.
Before registering, you can also edit the dataset name and description.
Click the 'Edit schema' action to view the dataset schema and verify it is correct.
Since Parquet, Avro and JSON files are self-describing, you cannot edit their schemas during registration. If you wish to make a change to the schema for one of these file types, you must edit the schema definition in the file.
For CSV files, you can edit the column names and types within the schema, as well as set the ‘skip n lines’ option. You can also view the inferred delimiter. If the inferred delimiter is incorrect see here.
Once you’re done editing a schema, click the ‘Save’ button to apply your changes.
If a dataset has been detected as partitioned, you will also see the partitioning columns listed here. If your partitions were not correctly discovered, see here.
Click the 'Preview' action to preview a dataset's contents.
Click the 'View SQL' action to see the SQL CREATE table statement for a dataset. You can then copy this statement to the clipboard and run it manually in the workspace.
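As a rough illustration of what the generated statement might look like, here is a sketch of a Hive-style CREATE statement for a CSV dataset. The database, table, columns, and path are all hypothetical, and the exact clauses will vary with the file format the crawler detected:

```sql
-- Hypothetical example of a generated CREATE statement for a CSV dataset
CREATE EXTERNAL TABLE marketing.sales (
  id BIGINT,
  amount DOUBLE,
  sale_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales/';
```

Copying the statement into the workspace lets you adjust any part of the definition before running it manually.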
Once you have selected the datasets you wish to register and the databases to register them to, click the 'Register Tables' button. A dialog then lists the datasets that were successfully registered.
Automated tagging can reduce the manual work of tagging by detecting when a column is likely to contain a certain type of formatted data, such as a phone number, and then applying the relevant tag to that column. If auto-tagging has been enabled on the cluster, you will see the auto-tags applied on newly discovered datasets that have data matching the specified rules. For more information on configuring auto-tagging, see here.
When a crawler is finished crawling and is marked ‘Crawl complete’, click on it to view the unregistered datasets. Then click ‘Edit Schema’ on any unregistered dataset to view its schema. If Auto-Tagger has detected certain formatted data, it will have automatically tagged the columns with that data.
To delete an auto-tag that has been applied to a column, click the ‘X’ on any tags you wish to delete. When you have finished deleting tags, click ‘Save’. When you register the dataset, you will see that the tags have been applied.
Troubleshooting and common questions¶
Errors during registration¶
In some cases there may be an error during registration, e.g. 'Table already exists with the same name'. The registration for that dataset will fail and you will be notified in the post-registration dialog. Click the error icon to see the specific error so you can rectify it before attempting to register the dataset again.
My file wasn't discovered by the crawler¶
There may be cases where the crawler is not able to discover a dataset or infer its schema correctly. Currently the crawler can only detect Parquet, Avro, JSON, and CSV file formats. Check that the file name includes the appropriate file extension for its format.
The inferred delimiter is incorrect¶
In some instances the crawler may not infer the correct delimiter for a CSV file. You can either register the dataset manually by copying the SQL statement and editing the line

```sql
...ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
```

or alter the dataset after registration by running the following command in the workspace, replacing '.' with the correct delimiter:

```sql
ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
```
My dataset was not discovered as partitioned correctly¶
The Okera crawler uses the directory naming structure to automatically detect partitions. Make sure your data files are separated into partition folders named with the `key=value` convention, for example:

```
s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/
```
How is crawling handled?¶
When a new crawler is created, a crawler task gets added to the top of a background maintenance task queue for that cluster. There is a continuous thread running that picks up background maintenance tasks in this queue – automatic partition recovery is also run on this same thread.
How do I re-crawl a path?¶
Click the 'Rerun' button on the crawler. Note: if you delete from the underlying storage a dataset that was previously discovered by that crawler, the dataset will still appear after selecting 'Rerun'; only new datasets will be added.
Errors during crawler creation¶
Crawler source Path 's3a://.../' is not accessible¶
Okera does not have read access to this bucket. Make sure the correct bucket policy has been added.
Crawler already exists¶
The crawler name you’ve chosen already exists. You will need to choose a unique crawler name.
Crawler names cannot contain spaces or special characters, except underscores.
Bucket does not exist¶
The bucket could not be found in S3. Check that the bucket path was entered correctly.