Crawler Questions and Answers¶
In some cases, a crawler may not be able to crawl all datasets from the specified connection. This appears as a crawler-level error. For example, it can happen when the underlying connection has a table name that contains an unsupported character (for example, $).
This section describes some of the errors that you might encounter.
File Is Not Discovered by the Crawler¶
The crawler may fail to discover a dataset or to infer its schema correctly. Currently, the crawler can only detect the Parquet, Avro, JSON, and CSV file formats. Verify that the filename includes an extension that identifies the file format (for example, sample.json).
The Inferred Delimiter Is Incorrect¶
The crawler may not infer the correct delimiter for a CSV file. To resolve this, you have two choices:

- Manually register the dataset by copying the SQL statement and altering the delimiter specified in the ...ROW FORMAT section.
- Alter the dataset to use the correct delimiter after registration by running the following command in Workspace:

  ALTER TABLE <dbname.tablename> SET SERDEPROPERTIES ('field.delim'='.');
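As a hedged illustration (the database and table names here are hypothetical), setting a pipe character as the delimiter for a table registered as sales.transactions would look like:

```sql
-- Set the CSV field delimiter for an already-registered dataset.
-- 'sales.transactions' is a placeholder; substitute your own database and table.
ALTER TABLE sales.transactions SET SERDEPROPERTIES ('field.delim'='|');
```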
File Columns That Cannot Be Crawled¶
In file types such as Parquet or CSV, the raw data might contain column names that Okera does not support, and Okera fails to crawl files that contain such names. Characters that Okera does not allow in column names include periods (.), colons (:), backticks or backward quotation marks (`), and exclamation points (!). For example, Okera fails to crawl a file containing a column name with a period in it.
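For instance, a raw CSV file with a header row like the following (a made-up example) would fail to crawl, because the header uses a period, a colon, and an exclamation point in its column names:

```
user.id,order:date,total!
1,2021-01-01,9.99
```

Renaming the offending columns in the source data (for example, to user_id, order_date, and total) allows the crawler to proceed.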
The Dataset Partitions Were Not Discovered Correctly¶
The Okera crawler uses the folder directory naming structure to automatically detect partitions. Make sure your data files are separated into partition folders that are named correctly. After a dataset has been registered in Okera, its partitioning columns cannot be edited.
An example:
s3://bucket/year=YYY1/
s3://bucket/year=YYY2/
s3://bucket/year=YYY3/
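If partition folders were added to storage after the crawl ran, the partitions can usually be reloaded manually. As a sketch (mydb.mytable is a placeholder; confirm the exact statement against the documentation for your Okera version), running the following in Workspace rescans the dataset's storage path for partition directories:

```sql
-- Rescan the table's base path and register any newly discovered partitions.
ALTER TABLE mydb.mytable RECOVER PARTITIONS;
```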
How Is Crawling Handled?¶
When a new crawler is created, a crawler task gets added to the top of a background maintenance task queue for the Okera cluster. A continuously running thread picks up and runs background maintenance tasks in this queue. Automatic partition recovery is also run on this same thread.
How Can I Increase the Number of Crawlers That Can Run Concurrently?¶
Crawlers for different connections can run concurrently. To control how many run concurrently, use the MAINTENANCE_TASKS_THREAD_LIMIT configuration parameter. This parameter specifies the maximum number of Okera background tasks (including crawlers) that can run concurrently. Valid values are positive integers; the default is 1. If you need to set this value higher than 8, contact Okera technical support.
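How this parameter is applied depends on your deployment. As an illustrative sketch only (assuming a YAML-based cluster configuration file in which settings are supplied as key-value pairs; the actual file name and structure vary by installation), raising the limit to 4 might look like:

```yaml
# Hypothetical configuration fragment: allows up to 4 background
# maintenance tasks (including crawlers) to run at the same time.
MAINTENANCE_TASKS_THREAD_LIMIT: 4
```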
Crawler Source Path Is Not Accessible¶
Okera does not have read access to this source path or bucket. Verify that the correct source path or bucket policy has been added.
Crawler Already Exists¶
The specified crawler name already exists. Choose a unique crawler name.
Valid Crawler Names¶
The crawler name cannot contain spaces or special characters other than underscores (for example, it cannot contain any of these characters: !"#$%&'()*+,-./:;<=>?@[\]^`{|}~).
Bucket Does Not Exist¶
This bucket does not exist in Amazon S3. Verify that the bucket path has been specified correctly.