Bucketing Tables

This document outlines ODAS support for bucketed (clustered) tables.

Introduction

Bucketing, or clustering, is a feature by which a large dataset is divided into buckets by one or more bucketing columns. Each bucket of data is stored in a single file (for Hive bucketing) or in multiple similarly named files (for Spark bucketing). The file name is derived from a hash of the bucketing column(s) value. ODAS currently supports Hive bucketing, where there is a single file per bucket. See the Hive documentation for more on Hive bucketing.

Users can also choose the number of buckets to create as part of the bucketed table definition. When the data is generated via a Hive client, each row automatically lands in one of the bucketed files based on the hash of the bucketing column(s). More details below.

ODAS supports bucketed tables starting from the 1.6 release.

Getting Started

Create bucketed tables

ODAS supports SQL syntax for bucketed tables. This syntax is very similar to Hive's CLUSTERED BY syntax. For example, below is a sample bucketed table for San Francisco Bay Area population stats by city. Note that this data is representative.

CREATE EXTERNAL TABLE IF NOT EXISTS buckets_test.sfbayarea_popl_stats_bucketed (
  name            STRING,
  type            STRING,
  county          STRING,
  population      BIGINT,
  landareasqmi    DECIMAL(5,2),
  landareasqkm    DECIMAL(5,2)
)
CLUSTERED BY (name) INTO 10 BUCKETS
STORED AS PARQUET
LOCATION 's3://buckets_test/sfbaystats_bucketed/';
  • Here, CLUSTERED BY is the keyword used to identify the bucketing column(s). The keyword is followed by a list of bucketing columns in parentheses. For example, here the bucketing column is name, so the SQL syntax has CLUSTERED BY (name). Multiple columns can be specified as bucketing columns, in which case, when Hive is used to insert/update data in this dataset, the bucketed files are named based on the hash of the bucketing columns.
  • Users can also choose the number of buckets into which the data is bucketed/grouped. This is done by specifying an INTO <N> BUCKETS clause along with CLUSTERED BY (<bucketing columns>). For example, with N=10 as in the example above, the syntax is INTO 10 BUCKETS.
  • Together, the full bucketing syntax is CLUSTERED BY (name) INTO 10 BUCKETS. This denotes that the table is bucketed by the name column into 10 buckets. A data-loading sketch follows below.
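
For illustration, here is a minimal Hive sketch for populating the table above. It assumes a hypothetical unbucketed staging table buckets_test.sfbayarea_popl_stats_raw with the same schema; with bucketed writes enforced, Hive hashes name and routes each row into one of the 10 bucket files.

-- Needed explicitly on older Hive; bucketed writes are always enforced in Hive 2.x+.
SET hive.enforce.bucketing = true;

-- buckets_test.sfbayarea_popl_stats_raw is a hypothetical unbucketed staging table.
INSERT OVERWRITE TABLE buckets_test.sfbayarea_popl_stats_bucketed
SELECT name, type, county, population, landareasqmi, landareasqkm
FROM buckets_test.sfbayarea_popl_stats_raw;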

Efficient Bucketed Joins

A primary reason to use bucketed tables is to support joins between two large datasets.

For example, consider a scenario with two large tables, Table A and Table B, each 1TB in size. Without bucketing, Table B would be broadcast so its rows can be matched against each row of Table A, which can cause OOM errors or require infrastructure with a huge amount of memory to support such a broadcast. With bucketing, this problem goes away: the join only needs to pair up the corresponding buckets, and Table B does not have to be broadcast/streamed in full for every split of Table A.

ODAS 1.6 brings significant improvements for bucketed table joins (restrictions apply; see below). Table A in the example above is scanned in multiple splits, one per bucket. For example, if there are 100 buckets, Table A is scanned by 100 splits (each split spread across one task per worker). To join with Table B, instead of reading all of Table B for each split, only the corresponding bucket of Table B is streamed and joined for a given split of Table A. For example, if a split is processing bucket 20 of Table A, only bucket 20 of Table B is scanned/streamed to that node to perform the join. This not only makes the join evaluation memory-efficient, it also speeds up the query because far fewer join row combinations are evaluated.
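
As a sketch of the scenario above, assuming hypothetical tables table_a and table_b, both CLUSTERED BY (id) INTO 100 BUCKETS, a plain equi-join on the bucketing column is eligible for this bucket-by-bucket evaluation:

-- table_a and table_b are hypothetical tables, both CLUSTERED BY (id) INTO 100 BUCKETS.
-- Each of the 100 splits pairs bucket i of table_a with bucket i of table_b,
-- so a split streams at most 1/100th of table_b instead of the whole table.
SELECT a.id, b.*
FROM table_a a
JOIN table_b b
  ON a.id = b.id;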

Important restrictions for bucketed table joins

To support bucketed table joins as explained above, the datasets need to:

  • Have matching bucketing columns on both tables (the same number of columns; the names may differ).
  • Have matching data types for the bucketing columns.
  • Be generated with the same client application, so that both use the same bucketing scheme. For example, a bucketed table generated by Hive cannot be used with Spark-generated bucketed tables, because Spark uses a different bucketing mechanism than Hive.
  • Have the same order of bucketing columns in both tables.

For example, let's consider the two tables below and discuss the bucketing restrictions:

CREATE EXTERNAL TABLE IF NOT EXISTS buckets_test.sfbayarea_popl_stats_bucketed (
  name            STRING,
  type            STRING,
  county          STRING,
  population      BIGINT,
  landareasqmi    DECIMAL(5,2),
  landareasqkm    DECIMAL(5,2)
)
CLUSTERED BY (name) INTO 10 BUCKETS
STORED AS PARQUET
LOCATION 's3://buckets_test/sfbaystats_bucketed/';
CREATE EXTERNAL TABLE IF NOT EXISTS buckets_test.sfbayarea_housing_stats_bucketed (
  city_name           STRING,
  type                STRING,
  county              STRING,
  number_of_houses    BIGINT
)
CLUSTERED BY (city_name) INTO 10 BUCKETS
STORED AS PARQUET
LOCATION 's3://buckets_test/sfbay_housing_stats_bucketed/';

For ODAS bucketed join support to work:

  • The data for both tables should be generated via the same client, for example Hive.
  • If using Hive, each table's dataset would consist of 10 files, one per bucket.
  • Both tables have the city's name as the bucketing column. Note that the column names differ: sfbayarea_popl_stats_bucketed has name as its bucketing column and sfbayarea_housing_stats_bucketed has city_name, but both hold similar data. This is a very typical scenario, where a primary key in one table appears as a foreign key with a different name in another table. This scenario is valid and supported by ODAS as long as the data types match, as the sketch below shows.
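
As a hedged sketch, a join between the two tables on their bucketing columns would be eligible for ODAS's bucketed join:

-- Both tables are bucketed INTO 10 BUCKETS on a STRING column (name / city_name),
-- so each of the 10 splits only needs the matching bucket from each side.
SELECT p.name, p.county, p.population, h.number_of_houses
FROM buckets_test.sfbayarea_popl_stats_bucketed p
JOIN buckets_test.sfbayarea_housing_stats_bucketed h
  ON p.name = h.city_name;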

Some invalid scenarios for bucketed joins:

  • If sfbayarea_popl_stats_bucketed has name as its bucketing column and sfbayarea_housing_stats_bucketed has number_of_houses as its bucketing column, no bucketed join will kick in, because the columns have different data types (STRING vs. BIGINT).
  • Similarly, if sfbayarea_popl_stats_bucketed has name as its bucketing column and sfbayarea_housing_stats_bucketed has (city_name, type) as its bucketing columns, the join will not be a bucketed join, because the two tables have a different number of bucketing columns.
  • Also, if sfbayarea_popl_stats_bucketed has (name, type) and sfbayarea_housing_stats_bucketed has (number_of_houses, city_name) as bucketing columns, the join will not be a bucketed join, because the data types and column ordering do not match.

Note

Since ODAS does not restrict the column names used for bucketing (it only enforces the data types), it is up to the end user to pick the right bucketing columns; otherwise the join will produce incorrect results. For example, joining on bucketing column name from sfbayarea_popl_stats_bucketed and bucketing column county from sfbayarea_housing_stats_bucketed will still be executed as a bucketed join, because the data types and the number of bucketing columns match, but the result is semantically incorrect. Users therefore need to be careful not to pair such mismatched bucketing columns.

FAQs

  • Currently, Hive bucketing is well supported in ODAS. With a Hive client generating the data, there is exactly one file per bucket, and ODAS can join between the corresponding files of two tables.
  • Spark bucketing can also work, provided the datasets for both tables are generated with the same Spark client so that both tables end up with the same bucketing scheme and number of files. Spark generates multiple files per bucket, and the file names are not an exact hash of the bucketing columns.
  • The auto crawler detects bucketed datasets as regular tables. To convert such a table to a bucketed table, users have to recreate it with the appropriate bucketing keyword (CLUSTERED BY) along with the bucketing information (INTO N BUCKETS), as sketched below.
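
A minimal sketch of that recreation, assuming the files at the location were originally written by Hive with 10 buckets on name:

-- Drop only the crawler's table definition; EXTERNAL means the files stay intact.
DROP TABLE IF EXISTS buckets_test.sfbayarea_popl_stats_bucketed;

-- Recreate over the same location, this time with the bucketing metadata.
CREATE EXTERNAL TABLE buckets_test.sfbayarea_popl_stats_bucketed (
  name            STRING,
  type            STRING,
  county          STRING,
  population      BIGINT,
  landareasqmi    DECIMAL(5,2),
  landareasqkm    DECIMAL(5,2)
)
CLUSTERED BY (name) INTO 10 BUCKETS
STORED AS PARQUET
LOCATION 's3://buckets_test/sfbaystats_bucketed/';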