Okera Platform Memory Usage

In general, there are many reasons for a process to manage its own memory instead of relying on the operating system. Requesting and freeing memory from the operating system carries a substantial performance cost, so Okera minimizes the number of calls to the operating system to allocate or free memory. We use the [tcmalloc library](http://goog-perftools.sourceforge.net/doc/tcmalloc.html), in addition to internal memory pooling, for efficiency.
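As a rough sketch of what internal pooling means here (illustrative only, not Okera's actual allocator), a worker-side pool might hand out buffers carved from large blocks it retains, so freeing a buffer returns it to the pool rather than to the OS:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bump-pointer memory pool sketch. Chunks are carved out of large
// blocks obtained from the process allocator (e.g. tcmalloc); nothing is
// returned until the pool is cleared or destroyed.
class MemPool {
 public:
  explicit MemPool(size_t block_size = 1 << 20) : block_size_(block_size) {}
  ~MemPool() { Clear(); }

  // Returns 'len' bytes from the current block, allocating a new block if needed.
  uint8_t* Allocate(size_t len) {
    if (blocks_.empty() || offset_ + len > current_capacity_) NewBlock(len);
    uint8_t* result = blocks_.back() + offset_;
    offset_ += len;
    return result;
  }

  // Frees all blocks back to the process allocator (not necessarily to the OS).
  void Clear() {
    for (uint8_t* block : blocks_) delete[] block;
    blocks_.clear();
    offset_ = current_capacity_ = 0;
  }

 private:
  void NewBlock(size_t min_len) {
    current_capacity_ = min_len > block_size_ ? min_len : block_size_;
    blocks_.push_back(new uint8_t[current_capacity_]);
    offset_ = 0;
  }

  size_t block_size_;
  size_t current_capacity_ = 0;
  size_t offset_ = 0;
  std::vector<uint8_t*> blocks_;
};
```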

Internal Memory Pooling Behavior

In its default configuration, ODAS assumes it is not running on the same VMs as other resource-intensive processes, so returning memory to the OS provides little benefit. Even when other resource-intensive processes are present, there are better ways to tune the process; constantly trading memory back and forth with the OS leads to unpredictable behavior, which is generally undesirable (we prefer VM-level isolation).
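For contrast, if a deployment did need to hand cached memory back to the OS more aggressively, gperftools tcmalloc exposes a `MallocExtension` API for that. The sketch below assumes the process links gperftools; it is shown only to illustrate the trade-off, not as an Okera configuration option:

```cpp
#include <gperftools/malloc_extension.h>

int main() {
  // Ask tcmalloc to return as much of its cached free memory to the OS as it
  // can. By default, freed memory stays cached inside tcmalloc so later
  // allocations avoid system calls, which matches the dedicated-VM assumption.
  MallocExtension::instance()->ReleaseFreeMemory();
  return 0;
}
```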

For monitoring purposes, memory usage is not a proxy for service load. CPU usage and network IO are much better metrics to track.

Only the memory usage of the workers is expected to be high. All the other services maintain very little state.

The general calling pattern for a scan is:

A user makes a request to the Planner, which returns a list of tasks. Since this only involves the Planner, it has minimal memory impact on the Planner. Expect to see high IO and CPU load, however, as generating the splits can be compute- and IO-intensive for large datasets.
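A toy sketch of the planning step, using hypothetical types and a simplistic fixed-size splitter (the real Planner accounts for file layout, partitions, and access policies, which is what makes planning compute- and IO-heavy):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical task descriptor, for illustration only (not the Okera client API).
struct ScanTask {
  uint64_t offset;  // byte offset of the split within the dataset
  uint64_t length;  // number of bytes this task scans
};

// Toy stand-in for the Planner: split a dataset into fixed-size tasks.
std::vector<ScanTask> PlanScan(uint64_t dataset_bytes, uint64_t split_bytes) {
  std::vector<ScanTask> tasks;
  for (uint64_t off = 0; off < dataset_bytes; off += split_bytes) {
    tasks.push_back({off, std::min(split_bytes, dataset_bytes - off)});
  }
  return tasks;
}

int main() {
  // ~128GB dataset split into 128MB tasks -> 1,000 tasks, matching the example below.
  auto tasks = PlanScan(1000ULL * 128 * 1024 * 1024, 128ULL * 1024 * 1024);
  std::cout << "planner returned " << tasks.size() << " tasks\n";
}
```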

For large datasets, the list of tasks will be far larger than the number of workers or processor cores in the compute cluster. For example, the Planner may return 1,000 tasks, but the EMR cluster only runs 64 at a time (based on EMR cluster size and configuration). Continuing with this example, the 64 tasks run from EMR are distributed randomly across all the workers. In a 10-node ODAS cluster, each node would be expected to see 6-7 concurrent tasks, but you will see some variance in practice due to randomness; it would be unsurprising if the range across the workers was 4-10 concurrent tasks.
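That spread can be illustrated with a toy balls-into-bins simulation (not the actual scheduling logic), assigning 64 concurrent tasks to 10 workers uniformly at random:

```cpp
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Toy simulation of randomly assigning 64 concurrent tasks across a 10-node
// ODAS cluster, to show the expected spread per worker.
int main() {
  const int kWorkers = 10;
  const int kConcurrentTasks = 64;
  std::mt19937 rng(std::random_device{}());
  std::uniform_int_distribution<int> pick(0, kWorkers - 1);

  std::vector<int> per_worker(kWorkers, 0);
  for (int t = 0; t < kConcurrentTasks; ++t) ++per_worker[pick(rng)];

  std::cout << "avg tasks/worker: " << double(kConcurrentTasks) / kWorkers << "\n";
  std::cout << "min: " << *std::min_element(per_worker.begin(), per_worker.end())
            << " max: " << *std::max_element(per_worker.begin(), per_worker.end())
            << "\n";  // typically lands somewhere in the 4-10 range
}
```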

When each task begins running, it starts consuming memory. When it finishes, it frees that memory (back to the worker, not to the OS, as described above), and EMR schedules the next task, which does the same. The peak memory required on a worker is therefore equal to the memory requirement per task multiplied by the peak number of concurrent tasks (in this example: 10).

The memory requirement per task is typically low (a few buffers that read data from S3), but it is workload dependent; 10-100MB per task is typical. In this example, that works out to a peak usage of roughly 1GB per worker. In scenarios with large numbers of concurrent users, a worker may run 200 concurrent tasks (so 2,000 across a 10-node ODAS cluster), putting the peak closer to 20GB per worker, which we consider acceptable for typical VMs.
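The back-of-the-envelope arithmetic, using the example figures above (100MB per task, the high end of the range):

```cpp
#include <cstdint>
#include <iostream>

// Peak memory per worker = memory per task * peak concurrent tasks on that
// worker. Per-task memory is workload dependent; 10-100MB is the rough range
// mentioned above.
uint64_t PeakWorkerMemoryMB(uint64_t mb_per_task, uint64_t peak_concurrent_tasks) {
  return mb_per_task * peak_concurrent_tasks;
}

int main() {
  // Example above: up to 10 concurrent tasks at ~100MB each -> ~1GB peak.
  std::cout << PeakWorkerMemoryMB(100, 10) << " MB\n";
  // Heavy concurrency: 200 concurrent tasks per worker -> ~20GB peak.
  std::cout << PeakWorkerMemoryMB(100, 200) << " MB\n";
}
```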