Crawler Questions and Answers

In some cases a crawler may not be able to crawl all datasets from the specified connection. This will appear as a crawler-level error. For example, this might happen when the underlying connection has a table name that contains an unsupported character (for example, $).

This section describes some of the errors that you might encounter.

File Is Not Discovered by the Crawler

The crawler may not be able to discover a dataset or infer its schema correctly. Currently, the crawler can detect only the Parquet, Avro, JSON, and CSV file formats. Verify that the filename includes the appropriate file extension (for example, sample.json).

The Inferred Delimiter Is Incorrect

The crawler may not infer the correct delimiter for a CSV file. To resolve this, you have two choices:

  • Manually register the dataset by copying the SQL statement and changing the delimiter specified in the ...ROW FORMAT section (see the sketch after this list).

  • Alter the dataset to use the correct delimiter after registration by running the following command in Workspace, substituting your file's actual delimiter for the period (.) shown here: ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
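
For reference, here is a minimal sketch of both approaches. The database, table, columns, and S3 location are hypothetical, and the pipe character (|) stands in for whatever delimiter your file actually uses:

-- Hypothetical manual registration; adjust the names, columns, and location to your data.
CREATE EXTERNAL TABLE salesdb.transactions (
    txn_id BIGINT,
    amount DECIMAL(10, 2),
    txn_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://company-bucket/sales/transactions/';

-- Alternatively, correct the delimiter on an already registered dataset:
ALTER TABLE salesdb.transactions SET SERDEPROPERTIES ('field.delim'='|');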

File Columns That Cannot Be Crawled

In file types such as Parquet or CSV, the raw data might contain column names that Okera does not support, and Okera fails to crawl files that contain them. Characters that Okera does not allow in column names include periods (.), colons (:), backticks or backward quotation marks (`), and exclamation points (!). For example, if a file contains a column name with a period in it, Okera will fail to crawl the file.

The Dataset Partitions Were Not Discovered Correctly

The Okera crawler uses the folder directory naming structure to detect partitions automatically. Make sure your data files are separated into partition folders that are named correctly. After a dataset is registered in Okera, its partitioning columns cannot be edited.

An example:

s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/
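
With a layout like this, the crawler treats year as a partition column and each year=... folder as one partition value. For illustration, a registration for such a layout might look like the following sketch; the database, table, columns, and file format (Parquet here) are hypothetical:

CREATE EXTERNAL TABLE salesdb.events (
    event_id BIGINT,
    payload STRING
)
PARTITIONED BY (year INT)
STORED AS PARQUET
LOCATION 's3://bucket/';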

How Is Crawling Handled?

When a new crawler is created, a crawler task is added to the top of a background maintenance task queue for the Okera cluster. A continuously running thread picks up and runs the background maintenance tasks in this queue. Automatic partition recovery also runs on this same thread.

How Can I Increase the Number of Crawlers That Can Run Concurrently?

Crawlers for different connections can run concurrently. To control how many run concurrently, use the MAINTENANCE_TASKS_THREAD_LIMIT configuration parameter. This parameter specifies the maximum number of Okera background tasks (which includes crawlers) that can run concurrently. Valid values are positive integers. The default is 1. If you need to set this value higher than 8, please contact Okera technical support.
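
As an illustration, assuming your deployment sets cluster configuration parameters in a YAML configuration file (the exact file and its surrounding structure are deployment-specific and not shown here), raising the limit might look like:

MAINTENANCE_TASKS_THREAD_LIMIT: 4

This would allow up to four background maintenance tasks, including crawlers, to run concurrently.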

Crawler Source Path Is Not Accessible

Okera does not have read access to the specified source path or bucket. Verify that the source path is correct and that the appropriate bucket policy has been added.

Crawler Already Exists

The specified crawler name already exists. Choose a unique crawler name.

Valid Crawler Names

The crawler name cannot contain spaces or special characters other than underscores (for example, it cannot contain any of these characters: !"#$%&'()*+,-./:;<=>?@[\]^`{|}~).

Bucket Does Not Exist

This bucket does not exist in Amazon S3. Verify that the bucket path has been specified correctly.