
Register Datasets (Structured Data)

When a crawler has finished crawling a data source, its status changes to Crawl complete.

Crawl complete status

After you select the crawler name, the datasets (structured data) discovered by the crawler are listed on the Registration page. Discovered datasets can then be registered to an Okera database. A dataset must be registered to an Okera database before it is added to the Okera catalog and Okera policies can be enforced on it.

Note: Before registering a dataset, you can edit its name, description, and other fields. The maximum length of dataset (table) and column names in Okera is 128 characters. See Edit Schemas.
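The 128-character limit can be checked before you edit names in the UI. The following is a minimal sketch, not part of Okera: the function name `find_overlong_names` and the sample names are hypothetical, and the only fact taken from this page is the 128-character limit.

```python
# Limit stated in the note above for dataset (table) and column names.
MAX_NAME_LENGTH = 128


def find_overlong_names(dataset_name, column_names, limit=MAX_NAME_LENGTH):
    """Return the names (dataset first, then columns) longer than the limit.

    Hypothetical helper for illustration; it only measures string length
    and does not contact Okera.
    """
    candidates = [dataset_name, *column_names]
    return [name for name in candidates if len(name) > limit]


# Example: one column name is 129 characters long, so it is reported.
too_long = find_overlong_names("baby_names", ["year", "x" * 129])
```

An empty result means every name fits within the limit; anything returned should be shortened before registration.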

Warning

Okera does not recommend registering the same dataset twice from the same connection. Doing so might lead to confusing results when permissions on the two copies of the dataset conflict.

To register a dataset to the Okera catalog:

  1. Select Register on the UI menu to access the Registration page. All the crawlers created in the Okera environment are listed.

    Alternatively, you can select the button on the Call-to-Action Home page in the Registered datasets section and follow the prompts.

  2. Select the crawler name to see a list of its unregistered datasets on the Unregistered datasets tab.

    Unregistered tables

    To see the datasets from the crawler that have already been registered in the catalog, select the Registered datasets tab.

    Note: Object storage crawlers automatically filter out registered datasets. If you wish to re-register an already registered path as a new dataset, you will need to do it manually using Workspace.

    Registered tables

  3. Use the checkboxes on the Unregistered datasets tab to select the datasets you wish to register and then select the button. Alternatively, select the option for an individual dataset. In either case, the Register selected datasets dialog appears.

    Register selected datasets dialog

  4. Select Use Existing database on the dialog to register the datasets to an existing database, or select Create database to create a new Okera database for the registered datasets.

    • If you select the Use Existing database option, select the name of an existing database from the Database drop-down list.

    • If you select the Create database option, the Register selected datasets dialog expands so you can supply a unique name and description for the new Okera database.

  5. Select on the dialog to register the datasets.

See Column Errors During Registration and Other Registration Errors for information on errors that might occur during registration.

View SQL

Select the View SQL button associated with a dataset to see the SQL CREATE EXTERNAL TABLE statement for that dataset. You can then copy this statement to the clipboard and run it manually in the Workspace or another SQL editor.

View SQL

Edit Schemas after Registration

After selecting a crawler on the Registration page, select the name of the dataset to which you want to make changes. The Schema tab for the dataset appears. In schema view, you can select the pencil icon next to various fields to edit them. This includes the dataset name, description, and tags.

For CSV files, you can also edit the column names and column types. In addition, on the Details tab for a CSV file, you can change the skip n lines option and view the inferred delimiter for the CSV file. If the inferred delimiter is not correct, see The Inferred Delimiter Is Incorrect.

The following screenshots depict changing the name of the baby_names dataset to baby_names1.

Edit Schema

Select Save to save your changes.

Edit name

If a dataset has been detected as partitioned, the partitioning columns are listed on the Details tab. If your partitions were not correctly discovered, see The Dataset Partitions Were Not Discovered Correctly.

Column Errors During Registration

When table definitions are loaded from a relational data store connection, any columns that cause errors are skipped by default and the table is still registered. Each skipped column is recorded as a table property with the prefix jdbc.unsupported.column- and can be seen in the Table Properties section of the describe formatted <table> output.

For example:

    jdbc.unsupported.column-NCHAR_COL   Unsupported JDBC type NCHAR

In the Okera UI, the table will show the warning message Unsupported type error(s).
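If you need to list the skipped columns programmatically, the prefix makes them easy to pick out. The sketch below assumes you have already collected the table properties into a Python dictionary (how you obtain them is outside this example); the function name `skipped_columns` is hypothetical, and only the prefix and the sample property come from this page.

```python
# Prefix that marks a column skipped during registration (from the text above).
PREFIX = "jdbc.unsupported.column-"


def skipped_columns(table_properties):
    """Map each skipped column name to its error message.

    table_properties is assumed to be a dict of property name to value,
    for example built from the Table Properties section of the
    describe formatted output.
    """
    return {
        key[len(PREFIX):]: message
        for key, message in table_properties.items()
        if key.startswith(PREFIX)
    }


# Example using the property shown above plus an unrelated, made-up property.
props = {
    "jdbc.unsupported.column-NCHAR_COL": "Unsupported JDBC type NCHAR",
    "some.other.property": "value",
}
skipped = skipped_columns(props)
```

Properties without the prefix are ignored, so the result contains only the columns that were skipped.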

Other Registration Errors

Errors may occur during dataset registration. When an error occurs, dataset registration fails and you are notified in the post-registration dialog.

Registration Error Dialog

To see details about a specific error, select the error icon. You can then correct the error before attempting to register the dataset again.

Unregistered table Error icon

Here are some errors that may occur.

Duplicate Dataset Names

The dataset could not be registered because a dataset with this name already exists in the database.
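One way to avoid this error is to compare the names you plan to register against the names already in the target database before you start. The following is a minimal sketch under stated assumptions: both lists of names are supplied by you, the function name `find_name_conflicts` is hypothetical, and names are compared exactly as written (this page does not say how Okera treats letter case).

```python
from collections import Counter


def find_name_conflicts(selected_names, existing_names):
    """Return names that would collide when registered to one database.

    A name collides if it already exists in the target database or if it
    appears more than once in the selection itself. Comparison is exact;
    case handling is an assumption, not documented behavior.
    """
    existing = set(existing_names)
    counts = Counter(selected_names)
    conflicts = []
    for name in counts:  # Counter keeps first-seen order
        if name in existing or counts[name] > 1:
            conflicts.append(name)
    return conflicts


# Example: baby_names already exists, and sales was selected twice.
conflicts = find_name_conflicts(["baby_names", "sales", "sales"], ["baby_names"])
```

Any name returned can be renamed before registration, as in the baby_names to baby_names1 example earlier on this page.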

Dataset Failed to Register

One or more of the datasets failed to register.