Register Datasets (Structured Data)¶
When a crawler has finished crawling a data source, its status changes to
The datasets (structured data) discovered by the crawler are listed on the Registration page after you select the crawler name. After datasets are discovered by the crawler, they can be registered to an Okera database. A dataset must be registered to an Okera database to be added to the Okera catalog, where Okera policies can be enforced.
Note: Before registering a dataset, you can edit its name, description, and other fields. Dataset (table) and column names in Okera can be at most 128 characters long. See Edit Schemas.
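The 128-character limit can be checked before you open the Registration page. The following Python snippet is a minimal sketch, not part of Okera: the function name find_overlong_names and the constant MAX_NAME_LENGTH are hypothetical, and the only fact taken from this page is the limit itself.

```python
# Hypothetical pre-registration check; only the 128-character limit comes from the docs.
MAX_NAME_LENGTH = 128


def find_overlong_names(dataset_name, column_names, limit=MAX_NAME_LENGTH):
    """Return the names (dataset first, then columns) that exceed `limit` characters."""
    candidates = [dataset_name, *column_names]
    return [name for name in candidates if len(name) > limit]


if __name__ == "__main__":
    # A 129-character column name is one character over the limit and is reported.
    print(find_overlong_names("baby_names", ["year", "x" * 129]))
```

Any name the function returns would need to be shortened before the dataset is registered.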
Okera does not recommend registering the same dataset twice from the same connection. Doing so might lead to confusing results when permissions on the two copies of the dataset conflict.
To register a dataset to the Okera catalog:
Select Register on the UI menu to access the Registration page. All the crawlers created in the Okera environment are listed.
Alternatively, you can select the button on the Call-to-Action Home page in the Registered datasets section and follow the prompts.
Select the crawler name to see a list of its unregistered datasets on the Unregistered datasets tab.
To see the datasets from the crawler that have already been registered in the catalog, select the Registered datasets tab.
Note: Object storage crawlers automatically filter out registered datasets. If you wish to re-register an already registered path as a new dataset, you will need to do it manually using Workspace.
Use the checkboxes on the Unregistered datasets tab to select the datasets you wish to register and then select the button. Alternatively, select the option for an individual dataset. In either case, the Register selected datasets dialog appears.
Select Use Existing database on the dialog to register the datasets to an existing database, or select Create database to create a new Okera database for the registered datasets.
If you select the Use Existing database option, select the name of an existing database from the Database drop-down list.
If you select the Create database option, the Register selected datasets dialog expands so you can supply a unique name and description for the new Okera database.
Select on the dialog to register the datasets.
Select the View SQL button associated with a dataset to see the SQL CREATE EXTERNAL TABLE statement for that dataset. You can then copy this statement to the clipboard and run it manually in the Workspace or another SQL editor.
Edit Schemas after Registration¶
After selecting a crawler on the Registration page, select the name of the dataset to which you want to make changes. The Schema tab for the dataset appears. In schema view, you can select the pencil icon next to various fields to edit them. This includes the dataset name, description, and tags.
For CSV files, you can also edit the column names and column types. In addition, on the Details tab for a CSV file, you can change the skip n lines option and view the inferred delimiter for the CSV file. If the inferred delimiter is not correct, see The Inferred Delimiter Is Incorrect.
The following screenshots depict changing the name of the
baby_names dataset to
Select Save to save your changes.
If a dataset has been detected as partitioned, the partitioning columns are listed on the Details tab. If your partitions were not correctly discovered, see The Dataset Partitions Were Not Discovered Correctly.
Column Errors During Registration¶
When loading table definitions from a relational data store connection, any columns that error out are skipped by default and the table is still registered. The skipped columns are annotated with the prefix jdbc.unsupported.column- and can be seen in the Table Properties section of the describe formatted <table> output.
jdbc.unsupported.column-NCHAR_COL Unsupported JDBC type NCHAR
Unsupported type error(s).
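If you have already collected the table properties into a key/value mapping, the skipped columns can be picked out by their prefix. The following Python snippet is a minimal sketch under that assumption: the function name skipped_columns and the dictionary input are hypothetical, and how the properties are read from the describe formatted output is not covered here.

```python
# Prefix documented above for columns that were skipped during registration.
UNSUPPORTED_PREFIX = "jdbc.unsupported.column-"


def skipped_columns(table_properties):
    """Map each skipped column name to the error message recorded for it.

    `table_properties` is assumed to be a dict of property name -> value.
    """
    return {
        key[len(UNSUPPORTED_PREFIX):]: message
        for key, message in table_properties.items()
        if key.startswith(UNSUPPORTED_PREFIX)
    }


if __name__ == "__main__":
    properties = {
        "jdbc.unsupported.column-NCHAR_COL": "Unsupported JDBC type NCHAR",
        "some.other.property": "value",  # hypothetical unrelated property
    }
    print(skipped_columns(properties))
```

Properties without the prefix are ignored, so the result lists only the columns that were left out of the registered table.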
Other Registration Errors¶
Errors may occur during dataset registration. When an error occurs, dataset registration fails and you are notified in the post-registration dialog.
Select the error icon to see details about the specific error so you can correct it before attempting to register the dataset again.
Here are some errors that may occur.
Duplicate Dataset Names¶
The dataset could not be registered because a dataset with this name already exists in the database.
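Name collisions can be spotted before registration by comparing the names you plan to register against the names already in the target database. The following Python snippet is a minimal sketch: the function name find_name_conflicts is hypothetical, both lists are assumed to have been gathered by you, and the case-insensitive comparison is an assumption rather than documented Okera behavior.

```python
def find_name_conflicts(new_names, existing_names):
    """Return the new names that collide with an existing name or an earlier new name.

    Names are compared case-insensitively; that is an assumption, not documented behavior.
    """
    existing = {name.lower() for name in existing_names}
    seen = set()
    conflicts = []
    for name in new_names:
        key = name.lower()
        if key in existing or key in seen:
            conflicts.append(name)
        seen.add(key)
    return conflicts


if __name__ == "__main__":
    # "baby_names" already exists in the database, so it is reported.
    print(find_name_conflicts(["baby_names", "sales"], ["baby_names"]))
```

Renaming any reported dataset before registering it avoids this error.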
Dataset Failed to Register¶
One or more of the datasets failed to register.