Cluster Sizing Guide

This document helps in planning the resources needed to run an Okera cluster. It is intended for system architects and related roles responsible for determining how many servers an ODAS setup requires, and how large each machine should be with respect to its role.

General Information

An Okera ODAS cluster is foremost a data-access platform. It has active components that run on managed hosts. Some of these components handle metadata, such as database and dataset information, access policies, and cluster state. Others perform data processing and migration between storage backends and frontend consumers.

Operation of an ODAS cluster is mostly automated, with a management and orchestration layer, and the actual ODAS services hosted within.

Deployment Manager (DM)

This service bootstraps the rest of the ODAS cluster. It stores the managed cluster details in a relational database. The DM then installs the ODAS software artifacts on the remaining cluster machines. If necessary, and in the case of infrastructure-as-a-service (IaaS) providers, the DM is also responsible for executing an instance-provisioning script, which typically uses the provider's command-line tools to bring up the necessary infrastructure.

One instance of the DM can manage multiple ODAS clusters.
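
To illustrate, on AWS such a provisioning step might boil down to a call like the following, sketched here in Python around the AWS CLI. This is a hedged approximation of what such a script does; the AMI ID and instance count are placeholders, not values used by the actual DM scripts:

    # Minimal sketch of an instance-provisioning step on AWS.
    # The real DM script is provider-specific; values here are placeholders.
    import subprocess

    subprocess.run(
        [
            "aws", "ec2", "run-instances",
            "--image-id", "ami-0123456789abcdef0",  # placeholder AMI
            "--instance-type", "m5.2xlarge",
            "--count", "3",                         # one per planned worker
        ],
        check=True,  # fail fast if the CLI call does not succeed
    )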

Cluster Host Roles

Each host in the cluster is responsible for running one or more of several roles. Some role services, such as the Planner and Worker, run on all cluster nodes.

Okera Catalog

The Catalog is required to serve the database and dataset information, including the dataset schemas and access policies. All of this information is stored in a relational database and served internally to the remaining ODAS components as necessary.

ODAS Planner

The Planner receives the initial request to process a query. It uses the Catalog and cluster information to generate a plan that can be distributed and executed across the ODAS worker machines. The Planner is comparatively lightweight and does not handle any data from the available datasets.

ODAS Worker

The Worker role is responsible for the heavy lifting. It moves data between the calling client and the federated storage systems that underlie the platform. No data is stored on disk during this task, which places the load on network, memory, and CPU resources instead.

ODAS REST Service

This service provides access for non-native clients and Okera's own Web UI to internal resources, such as the Catalog. The REST service exposes an API that allows HTTP clients to access datasets. This is not the recommended way for high-throughput data processing; it is best suited for accessing or updating the Catalog metadata.

ODAS Web UI

The Web UI provides access to the Catalog information, and enables users to discover and browse databases and datasets. It is an interactive tool that is mostly concerned with metadata.

Cluster Technical Roles

Other node roles are required to orchestrate an ODAS cluster. None of these roles is part of the data path. Therefore, they require only moderate resources.

Kubernetes Master

The ODAS roles are managed by Kubernetes, a tool for automated deployment, scaling, and management of containerized applications. It has a master/worker architecture, where the master is responsible for the so-called Kubernetes control plane, which is the main controlling unit of the managed machines. The master provides essential services, such as the controller manager, scheduler, and REST API server.

Kubernetes Worker

The worker role is where the ODAS components are hosted. That is, based on the scheduling and assignment of the Kubernetes master, the worker nodes execute the containerized services provided by ODAS. All cluster services (everything except the Deployment Manager) run on the Kubernetes workers.

Canary

ODAS also comes with its own monitoring for service availability and status. The Canary role is an auxiliary service that knows how to check the main ODAS services. It is responsible for updating the Deployment Manager regarding the status of all managed ODAS clusters.

ZooKeeper

Finally, various services need a reliable, distributed consensus and membership subsystem, which ZooKeeper provides.

Cluster Node Types

Since the above roles are assigned to shared nodes, we can combine the resource requirements into the following major classes:

Deployment Manager (DM)

This is a host for the Deployment Manager role. One DM instance can manage multiple ODAS clusters. It is not actually part of the cluster, and it is not a resource-intensive application.

Requirements:

  • Moderate resources
  • Database access

Kubernetes Master (KM)

Also a single-role machine, hosting the Kubernetes control plane.

Requirements:

  • Moderate resources

Worker Node (Worker)

These machines run all the remaining roles, as scheduled and assigned first by the Deployment Manager and then by the Kubernetes master.

Requirements:

  • Elevated resources
  • Database access for certain roles (ODAS Catalog)

Depending on whether you use IaaS (for example, a cloud provider) or bare metal, the node types require specific features. These include, but are not limited to, the number of CPUs and cores, network bandwidth and connectivity, storage capacity and throughput, and available main memory.

Most node types require only moderate resources, and it is often enough to ensure that any virtualization (for example, in a cloud environment) is not causing undue strain on the basic resources. The exception to the rule is the Worker, which needs memory when executing server-side joins (see Supported SQL for details on which joins are executed within ODAS). These joins need to load the joined table into memory, after any projection and filtering of that table has taken place.

For example, assume you have a table with 100 million rows and 100 columns that is used to filter records from other user tables. Further assume a particular join reduces the filter table to one million rows and two columns as part of the JOIN statement in the internal Okera view. One column is a LONG data type with 4 bytes, while the other is a STRING with an average length of 10 bytes. You will need 1M * (4 + 10) bytes, or roughly 14MB, for the filter table in worker memory.

In addition, if you are running about 10 queries concurrently, each using the same filter table with the assumed projection and filtering applied, then you will need 10 * 14MB, or 140MB, of available memory on each worker machine. (This assumes all workers participate equally in query execution.)

The default memory per worker per query is set to 128MB, which suffices in this example, though concurrency drives the combined memory usage up. The more concurrent queries you run on a worker, the more memory is needed.
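
The arithmetic generalizes to any filter table. The following is a minimal sketch of the estimate; the helper function is illustrative only (not an ODAS API), and the numbers mirror the example above:

    # Back-of-envelope estimate of worker memory needed for server-side joins.

    def join_memory_bytes(rows, avg_row_bytes, concurrent_queries=1):
        # Memory the projected and filtered join table occupies per worker.
        return rows * avg_row_bytes * concurrent_queries

    # 1M rows after filtering; two columns: LONG (4 bytes) + STRING (~10 bytes).
    per_query = join_memory_bytes(1_000_000, 4 + 10)      # ~14 MB
    combined = join_memory_bytes(1_000_000, 4 + 10, 10)   # ~140 MB at concurrency 10

    DEFAULT_PER_QUERY_MEMORY = 128 * 1024 * 1024          # the 128MB default
    assert per_query <= DEFAULT_PER_QUERY_MEMORY          # the example fits

    print(f"{per_query / 1e6:.0f} MB per query, {combined / 1e6:.0f} MB combined")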

Cluster Types

Broadly speaking, two types of environments are typically set up with and for ODAS nodes. Development and testing implementations use minimal resources, while production environments maximize resource allocation and utilization in line with business goals.

Proof-of-Concept/Development Environment (PoC/DEV)

Typically, minimal resources are assigned to a development or testing environment. Where provisioning assigns coarser-grained resource classes, it is important not to under-provision the system resources. For example, it has been observed that using a shared development database with a heavily limited number of concurrent connections can cause I/O errors that severely reduce ODAS functionality, or could lead to service outages.

Production Environment (PROD)

The resources needed for production are a function of the business users' requirements. In other words, each resource has a certain maximum capacity, and only parallelization overcomes this limitation. No node should be overloaded, either. That is, its workload should use at most 80% of the theoretical maximum, leaving crucial resources for the operating system itself. Otherwise, the node could become unresponsive or fail completely.
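
Expressed as a quick sanity check (the 32GB figure is only an example, matching an m5.2xlarge worker):

    # Plan for at most 80% utilization of any node resource.
    theoretical_memory_gb = 32                       # e.g., an m5.2xlarge worker
    usable_memory_gb = 0.80 * theoretical_memory_gb  # ~25.6 GB for ODAS workloads
    print(usable_memory_gb)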

Conversely, assigning too many resources may drive up cost, which is most likely a cause for concern in the long term.

We will discuss the various offerings and resource types below in the context of these cluster types, setting boundaries for allocation ranges (where applicable).

Cloud Providers

Cloud providers typically do not offer direct configuration of low-level system resources, such as memory or CPU cores. Instead, they bundle these attributes into resource classes. For example, the selection of a (virtual) server type implicitly defines the available level of the essential system resources.

For cloud services, compute and storage (which includes object and database storage) are commonly configured separately. Furthermore, other related resources, such as load balancing or DNS, can be configured as needed. For the purposes of this document, we primarily look at compute and storage resources, but mention others where necessary.

For now, the only cloud-services provider explored here is Amazon’s AWS suite.

Amazon Web Services

Amazon Web Services (AWS) is well known for a plethora of hosted offerings, making it a popular choice for companies and enterprises that want to forgo managing their own IT resources, or where the pace of innovation has overtaken internal procurement and provisioning cycles. Spinning up virtual machines is available at the click of a button in EC2, AWS's Elastic Compute Cloud product.

For ODAS, it is important to select the appropriate EC2 instance types for the cluster and node types described above. This way, resources are available at the required level and are not wasted through over-provisioning. The selection also varies based on the environment ODAS is installed into.

The following discusses compute and storage requirements separately.

Compute Instance Types

For production-level ODAS clusters on EC2, the following is recommended (especially for cluster nodes):

Hardware Virtual Machines (HVM): Select instance types and AMIs that use HVM virtualization instead of the slower paravirtual (PV) option.

Avoid EBS Only: For the lowest latencies, do not rely exclusively on Amazon's Elastic Block Store. Use instance types that have local storage, which means solid-state drives (SSD) in most cases. This reduces the impact of operating-system-level tasks on ODAS services.

Placement Groups: Put all instances into a Placement Group, which optimizes the network connectivity between those nodes. While instances usually observe up to 10Gbps, placing them into a placement group makes up to 25Gbps possible (see the sketch following these recommendations).

Enhanced Networking: Use instance types that support Enhanced Networking, which improves the observed packets-per-second (PPS) performance and lowers network jitter and latency.

S3 Storage: From a cost/performance standpoint, the AWS Simple Storage Service (S3) is the best option. Using instance storage, such as EBS or ephemeral storage, risks complicating data management, a disadvantage that outweighs S3 being slightly slower. Tests have shown that S3 can be read at 600 MB/s per EC2 instance, roughly half the theoretical limit of a 10GigE network uplink.
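
As referenced above, placement groups and Enhanced Networking are established when the instances are launched. A hedged sketch using boto3, assuming configured AWS credentials; the AMI ID and group name are placeholders:

    # Sketch: launch ODAS worker instances inside a cluster placement group.
    import boto3

    ec2 = boto3.client("ec2")

    # The "cluster" strategy packs instances close together for low-latency,
    # high-throughput networking between the nodes.
    ec2.create_placement_group(GroupName="odas-workers", Strategy="cluster")

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder HVM AMI
        InstanceType="m5.2xlarge",        # supports Enhanced Networking (ENA)
        MinCount=3,
        MaxCount=3,
        Placement={"GroupName": "odas-workers"},
    )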

As far as ODAS on EC2 is concerned, worker nodes need access to as much memory as possible for larger JOIN operations, especially in situations where concurrency is high (that is, many queries being executed at the same time). The M5 class EC2 instances, while always using EBS as their instance storage, can be scaled sufficiently to accommodate these requirements. In addition, their EBS connectivity is optimized and better than other instance/EBS combinations.

Conversely, the M3 class instances, which offer SSD instance storage, are limited to only 30GB of total memory, which is too restrictive. They also do not support Enhanced Networking.

Another option is the R3/R4 instance types. These are memory-optimized and scale up to 488GB of memory. The r4.2xlarge and r4.4xlarge types are a good starting point for production clusters, giving ODAS twice the memory of the M5 types with the same number of cores.

On the other hand, the C5 class types emphasize cores over memory. They shift the ratio in the other direction, giving half as much memory compared to M5 instances with the same number of cores.

For ODAS, most of the work is network-based, which means that CPU cores are important, but so is memory for caching intermediate data.

As a starting point, we offer the following recommendations.

PoC/DEV Cluster

  Service   Node Type   Instance Type
  EC2       DM          t2.medium
  EC2       Worker      t2.xlarge

PROD Cluster

  Service   Node Type   Instance Type
  EC2       DM          m5.xlarge
  EC2       Worker      m5.2xlarge

This setup offers the option to increase (or decrease) the instance sizes or switch to a more memory- or core-heavy instance type.

Storage Instance Types

In practice, most data lakes in AWS follow the above recommendations and use S3 as their storage layer. An additional advantage is that you can suspend or stop the compute resources at any time to save cost, without any danger of losing data, and the data can still be accessed without a running compute cluster.

The following lists the common choices for both compute-local (albeit network-attached) storage and database-related instance types.

Ephemeral instance storage: The local filesystem in an instance is only used by the operating system. Should an instance fail, its ephemeral storage is reset on restart (hence the name).

Elastic Block Store (EBS) instance storage: For persistent storage across EC2 instance restarts, EBS is commonly used to attach storage volumes that hold their data until they are permanently deleted. EBS is often used as scratch space for running instances and the services they host. Some EC2 instances only support EBS volumes, in which case EBS is the only option for running such a virtual machine.

As discussed earlier, when the utmost performance is wanted, choose an EC2 instance that supports SSD-backed EBS volumes and offers EBS-optimized network connectivity.

Simple Storage Service (S3) object storage:

As discussed above, S3 is a separate, object-store-based service that can act as a storage source and target for EC2 instances. The difference is that S3 can be accessed by other services too, without the need to run any additional compute resources. S3 itself comes in two types: the standard offering and a version optimized for long-term storage. Only the former is recommended as far as ODAS is concerned.

Relational Database Service (RDS) for DB storage:

The number of concurrent connections to an RDS instance is determined by the amount of memory the underlying EC2 instance has. In more general terms, RDS uses EC2 virtual machines to provision a managed service, which is the difference between IaaS (infrastructure-as-a-service) and PaaS (platform-as-a-service). Since concurrent connections to a database require a certain amount of memory, the available EC2 memory is used to compute the maximum connection count at provisioning time. It is possible to modify that setting for a given instance, for example increasing or decreasing it to match requirements. However, the side effects of doing so are not predictable, and it is therefore not recommended practice.

For example, an RDS db.t2.small instance, which is based on the t2.small EC2 instance, has 2 GB of memory available, limiting the number of connections to about 170. The formula is documented and discussed online:

Serverfault: “The current RDS MySQL max_connections setting is default by {DBInstanceClassMemory/12582880}, if you use t2.micro with 512MB RAM, the max_connections could be (512 * 1024 * 1024)/12582880 ~= 40, and so on.”

The recommendation is to determine the number of connections needed for a database instance (including any other shared usage) and select an RDS instance type that allows for at least 1.5 times as many connections.
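
The following sketch applies the quoted default formula and the 1.5x headroom recommendation; the helper names are illustrative:

    # Estimate RDS MySQL max_connections from instance memory, using the
    # default parameter {DBInstanceClassMemory/12582880} quoted above.
    DIVISOR = 12582880

    def rds_max_connections(memory_gib):
        return int(memory_gib * 1024**3 / DIVISOR)

    def min_memory_gib(needed_connections):
        # Pick an instance whose default allows ~1.5x the needed connections.
        return needed_connections * 1.5 * DIVISOR / 1024**3

    print(rds_max_connections(2))         # db.t2.small (2 GiB)  -> ~170
    print(rds_max_connections(0.5))       # t2.micro (512 MiB)   -> ~42 (the quote rounds to 40)
    print(round(min_memory_gib(100), 2))  # 100 needed connections -> ~1.76 GiB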