Crawler Questions and Answers

In some cases, a crawler cannot crawl every dataset in the specified connection. This appears as a crawler-level error. For example, it can happen when the underlying connection has a table whose name contains an unsupported character (for example, $).

This section describes some of the errors that you might encounter.

File Is Not Discovered by the Crawler

The crawler may not be able to discover a dataset or infer its schema correctly. The crawler currently detects only the Parquet, Avro, JSON, and CSV file formats. Verify that the filename includes the file format extension (for example, sample.json).
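As a pre-check before running a crawler, a script along these lines could flag files whose names do not end in one of the four detectable formats. This is a minimal sketch: the helper name, the exact extension spellings, and the case-insensitive match are assumptions for illustration, not documented Okera behavior.

```python
from pathlib import PurePosixPath

# Assumed extension spellings for the four formats the crawler detects.
DETECTABLE_EXTENSIONS = {".parquet", ".avro", ".json", ".csv"}


def has_detectable_extension(path: str) -> bool:
    """Return True if the file name ends in one of the assumed extensions."""
    # Lower-casing the suffix is an assumption; the crawler may be stricter.
    return PurePosixPath(path).suffix.lower() in DETECTABLE_EXTENSIONS


for name in ["sample.json", "data/part-0000", "logs/events.CSV"]:
    print(name, has_detectable_extension(name))
```

Files reported as False are candidates for renaming before the crawler runs.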

The Inferred Delimiter Is Incorrect

The crawler may not infer the correct delimiter for a CSV file. To resolve this, you have two choices:

  • You can manually register the dataset by copying the SQL statement and altering the delimiter specified in the ...ROW FORMAT section.

  • You can alter the dataset to use the correct delimiter after registration by running a command like the following in Workspace, replacing the quoted value with the delimiter your file actually uses: ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
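If you are unsure which delimiter a file uses, one option is to guess it from a small sample and then fill in the ALTER TABLE statement. The sketch below uses Python's standard csv.Sniffer for the guess. The function names, the candidate delimiter list, and the table name mydb.mytable are hypothetical, and the generated statement still has to be run in Workspace by hand.

```python
import csv


def sniff_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV text sample with the standard library."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def build_alter_statement(table: str, delimiter: str) -> str:
    """Build the ALTER TABLE statement for a single-character delimiter."""
    if len(delimiter) != 1 or delimiter in "'\\":
        raise ValueError("expected one character other than a quote or backslash")
    return f"ALTER TABLE {table} SET SERDEPROPERTIES ('field.delim'='{delimiter}');"


# A few lines copied from the start of the file are usually enough.
sample = "id|name|city\n1|Ada|London\n2|Alan|Wilmslow\n"
print(build_alter_statement("mydb.mytable", sniff_delimiter(sample)))
```

csv.Sniffer raises csv.Error when it cannot decide, so treat its answer as a suggestion and confirm it against the file.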

The Dataset Partitions Were Not Discovered Correctly

The Okera crawler uses the directory naming structure to detect partitions automatically. Make sure your data files are separated into partition folders and that the folders are named correctly. After a dataset is registered in Okera, its partitioning columns cannot be edited.

An example:

s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/
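To check a layout before crawling, you can list which partition columns a path implies. This is a minimal sketch that assumes the key=value folder naming shown in the example. The function name and the sample paths are hypothetical, and the crawler's own detection logic may differ.

```python
from urllib.parse import urlparse


def parse_partitions(path: str) -> dict:
    """Collect key=value folder names from a path, in the order they appear."""
    partitions = {}
    for segment in urlparse(path).path.split("/"):
        key, sep, value = segment.partition("=")
        # Keep only segments that look like key=value with both parts present.
        if sep and key and value:
            partitions[key] = value
    return partitions


print(parse_partitions("s3://bucket/year=2020/month=01/part-0000.parquet"))
print(parse_partitions("s3://bucket/2020/01/part-0000.parquet"))
```

The second path yields an empty result because its folders carry values only, with no column names for a partition to be derived from.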

How Is Crawling Handled?

When a new crawler is created, a crawler task is added to the top of a background maintenance task queue for the Okera cluster. A continuously running thread picks up and runs the tasks in this queue. Automatic partition recovery also runs on this same thread.
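The queueing behavior described above can be pictured with a toy model. This is an illustrative sketch only, not Okera's implementation: the class, method, and task names are invented, and the only behaviors modeled are that a new crawler task goes to the front of the queue and that a single worker thread runs every task.

```python
import threading
from collections import deque


class MaintenanceQueue:
    """Toy model of one queue drained by one worker thread."""

    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()
        self.completed = []

    def add_background_task(self, name):
        with self._lock:
            self._tasks.append(name)  # routine work joins the back

    def add_crawler_task(self, name):
        with self._lock:
            self._tasks.appendleft(name)  # a new crawler goes to the front

    def run_until_empty(self):
        while True:
            with self._lock:
                if not self._tasks:
                    return
                task = self._tasks.popleft()
            self.completed.append(task)  # stand-in for doing the real work


queue = MaintenanceQueue()
queue.add_background_task("partition_recovery")
queue.add_crawler_task("crawler:sales")

worker = threading.Thread(target=queue.run_until_empty)
worker.start()
worker.join()
print(queue.completed)
```

In this model the crawler task runs before the partition recovery task that was queued earlier, because both share the one worker.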

Crawler Source Path Is Not Accessible

Okera does not have read access to this source path or bucket. Verify that the source path is correct and that a bucket policy granting Okera read access has been added.

Crawler Already Exists

The specified crawler name already exists. Choose a unique crawler name.

Syntax Error

The crawler name cannot contain spaces or special characters other than underscores. For example, it cannot contain any of these characters: !"#$%&'()*+,-./:;<=>?@[\]^`{|}~
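A name can be checked against this rule before the crawler is created. The sketch below rejects whitespace and the characters listed above, which happen to be Python's string.punctuation minus the underscore. The function name is hypothetical, and rejecting an empty name is an assumption.

```python
import string

# The characters listed above: ASCII punctuation without the underscore.
FORBIDDEN = set(string.punctuation) - {"_"}


def is_valid_crawler_name(name: str) -> bool:
    """Return True if the name has no whitespace or forbidden characters."""
    if not name:
        return False  # assumption: an empty name is rejected
    return not any(ch.isspace() or ch in FORBIDDEN for ch in name)


for candidate in ["sales_2020", "sales-2020", "sales 2020"]:
    print(candidate, is_valid_crawler_name(candidate))
```

Only the first candidate passes. The other two contain a hyphen and a space.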

Bucket Does Not Exist

This bucket does not exist in S3. Verify that the bucket path has been specified correctly.