Okera Platform Architecture Overview

The Metadata Services Overview and ODAS Overview documents introduce the major components of the Okera Active Data Access Platform. This document takes the next step, explaining the Platform's architecture and showing the components in action for various client requests. It also looks more closely at the possible deployment architectures, which can vary significantly based on user requirements.

Platform Architecture

On a more technical level, the Okera Platform is divided into services that are accessible to clients and a few others that are only accessible internally, by administrators or other services. This split is intentional: it reduces complexity and concentrates client access in just a few locations. The following diagram shows all of the platform services in context.

Architecture Overview

First we introduce how clients access the Okera Platform, and then break down the services shown in the diagram based on their scope.

Client Access

The diagram starts in the top right corner with clients of various kinds communicating with the platform. There are three major access points:

Okera Portal Web UI

Clients can use Okera Portal to interact with the underlying platform. The web UI provides access to the Metadata Services, such as the Schema Registry and the Policy Engine, and enables self-service dataset discovery. It also offers a workspace to issue SQL statements against the platform. Behind the scenes, this web application uses the REST API gateway service to interact with your platform's services, just as your bespoke or customized applications can.

REST Gateway

Many clients have support for the ubiquitous REST protocol, avoiding most of the issues that arise when dealing with custom protocols. The REST gateway service gives access to all the necessary client API calls. This includes both the Schema Registry, for creating, altering, or dropping various objects, and the Policy Engine, for granting or revoking access to all registered objects. See the Metadata Services API documentation for the list of supported REST calls.
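
As a hedged illustration, the following Python sketch calls the REST gateway with the requests library. The gateway address, route, and authentication header are hypothetical placeholders, not the documented Okera API; consult the Metadata Services API documentation for the actual endpoints and authentication scheme.

```python
# A minimal sketch of a REST gateway call using Python's requests library.
# The gateway URL, route, and header below are illustrative placeholders,
# not the documented Okera API.
import requests

GATEWAY = "https://okera.example.com:8083"  # hypothetical gateway address
TOKEN = "<your-access-token>"               # token issued for your user

resp = requests.get(
    f"{GATEWAY}/api/datasets",              # hypothetical route
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for dataset in resp.json():
    print(dataset)
```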

Client Libraries

Finally, Okera provides a set of libraries for interfacing directly with the Metadata and ODAS services, including the Planners and Workers. This is the traditional way to integrate with the Metadata and ODAS services. See Deployment Architecture for more on this topic.

Client Accessible Components

The Platform architecture diagram has a dashed line that separates the external services, which provide the client connectivity, from the internal support and management services. The following discusses the external and internal components separately, and how they work together to serve all clients alike.

The external components in the architecture provide the major functionality of the platform, namely data access and metadata management, for clients ranging from interactive users to automated applications.

Note how the above referred to services such as the REST API, Planner, and Metadata services. For fail-over purposes and continuous availability, it is common in practice to run multiple instances of each of these services. This is explained in greater detail in the High Availability documentation. In short, you can use the provided administrative functionality to instruct the platform to add redundant instances of the services (expressed as multiple, stacked boxes in the diagram) so that the system continues to operate in case of a node failure.

The Planner service is also responsible for exposing the Metadata Services: the Schema Registry and Policy Engine. In effect, the clients use the REST API or client libraries as proxies (as shown in the diagram) to these two Metadata Services, provided by the Planner instances. The clients also communicate with the Planner as part of a query execution (see Communication below).

The ODAS client libraries are used by many of the provided higher-level client integrations, including Hive, Presto, Spark, and others. This integration support hides Okera as much as possible behind common tools that users are already familiar with. For example, enabling the Hive integration allows existing Hive setups to switch from direct use of the underlying data sources to using ODAS instead. There is no need to alter any metadata at all.

For Python, there are two ways of communicating with an ODAS cluster: the REST API or the native PyOkera API.

Tip

Wherever possible, use PyOkera, as it performs markedly better than REST.
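
As a minimal sketch of the PyOkera route, the following connects to a Planner and scans a sample dataset into a pandas DataFrame. The host, port, token, and dataset name are placeholders; verify the exact call names against the PyOkera version you have installed.

```python
# Minimal PyOkera sketch: connect to the Planner and scan a dataset.
# Host, port, token, and dataset name are placeholders.
from okera import context

ctx = context()
ctx.enable_token_auth(token_str="<your-token>")  # or another supported auth method
with ctx.connect(host="planner.example.com", port=12050) as conn:
    df = conn.scan_as_pandas("okera_sample.sample")  # returns a pandas DataFrame
    print(df.head())
```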

Communication

Once the cluster is operational, a client can communicate with the cluster services. Refer to the Authentication documentation for more details on how to secure the interactions between clients and an ODAS cluster.

Read Request

The following sequence diagram shows how Presto, as an example, interacts with ODAS after the Okera integration is enabled (see Hadoop ecosystem tools integration).

Read Request Sequence Diagram

Depicted is a read request issued by Zeppelin (an interactive notebook service), showing the various calls made between the client and the services. Once a query is submitted in Zeppelin, it uses the Presto JDBC driver to plan and execute the request. The ODAS Planner is used to retrieve the dataset schema and instruct Presto how to distribute the query across its workers. Each worker is handed a list of tasks to perform, each of which queries a subset of the resulting records.
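
To make the flow concrete, here is an illustrative Python sketch of the plan-then-execute pattern described above. The plan() and execute() functions are hypothetical stand-ins, not the actual ODAS client API; they only mirror the shape of the interaction: the Planner produces tasks, and each worker executes its tasks to return a subset of the records.

```python
# Illustrative sketch of the plan/execute flow; plan() and execute() are
# hypothetical stand-ins for the Planner and Worker roles, not real APIs.
from concurrent.futures import ThreadPoolExecutor

def plan(query):
    # The Planner resolves the schema and splits the query into tasks.
    return [f"task-{i} of {query!r}" for i in range(4)]

def execute(task):
    # A Worker executes one task, returning a subset of the records.
    return [f"record produced by {task}"]

tasks = plan("SELECT * FROM sales.transactions")
with ThreadPoolExecutor() as pool:  # workers process their tasks in parallel
    records = [r for subset in pool.map(execute, tasks) for r in subset]
print(len(records), "records")
```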

Deployment Architecture

Depending on how clients interact with data, the ODAS platform is flexible enough to adjust for the most suitable integration. This is important to strike the best balance between functionality, performance, and, eventually, cost.

Overview

In practice, ODAS is often provisioned on separate (virtual) machines, which incur operational costs. ODAS is highly optimized for I/O: the core engine is written in C++, using the latest hardware acceleration features in a massively parallel processing (MPP) architecture. This means that a small number of ODAS cluster nodes can serve compute engine clusters two orders of magnitude larger.

Having said that, there are use cases where the resource requirements are even lower. For instance, there are highly capable compute engines that already include many access control related features. If a client operation only requires features that the engine provides, ODAS can take a step back and, after rewriting the query, leave the operation execution to the engine for best performance.

Conversely, when the compute engine is ephemeral, that is, provisioned for a short amount of time but at a very large scale (think end-of-month analytics and report generation), provisioning a matching ODAS cluster might become cumbersome. If those engines run on provisioned machines that allow co-location of auxiliary services (such as those offered by AWS EC2 or Azure VMs), then ODAS can be co-located during the provisioning step, avoiding both resizing an existing ODAS cluster beforehand and incurring extra cost for separate ODAS cluster machines.

Common Deployment Architecture

The following diagram shows a common deployment architecture, where clients talk to multiple ODAS clusters that serve different workloads (such as ad-hoc versus data pipeline clients) but share the same metadata.

Common ODAS Deployment

This setup has many flexible parts. For instance, you can run the ODAS clusters, which are managed by Kubernetes, in a load-dependent manner. ODAS clusters export operational metrics, which can be used with the infrastructure services to react to workloads. One option is using AWS Auto Scaling Groups to scale clusters out and in automatically, as sketched below.
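
As a hedged sketch of this pattern, the following uses boto3 to adjust an Auto Scaling Group's desired capacity based on an exported load metric. The group name and thresholds are made up for illustration; a production setup would more likely wire the metrics into CloudWatch alarms and scaling policies rather than call the API directly.

```python
# Sketch: scale an ODAS worker Auto Scaling Group based on a load metric.
# The group name and thresholds are illustrative assumptions.
import boto3

asg = boto3.client("autoscaling")

def scale_cluster(load, group="odas-workers"):
    desired = 3 if load < 0.5 else 8  # illustrative capacity levels
    asg.set_desired_capacity(
        AutoScalingGroupName=group,
        DesiredCapacity=desired,
    )

scale_cluster(load=0.7)  # e.g., react to a spike in query traffic
```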

You can also set up clusters across multiple datacenters (or regions) as well as across different infrastructure providers for any level of redundancy and fail-over safety. For high performance, the ODAS clusters are commonly deployed as close to the data as possible. The more lightweight Metadata Services can be shared, as shown in the Multi and Hybrid Cloud section of this document.

Separated Worker Architecture

The first change to the common setup is to separate out the components that see the highest saturation. For the ODAS clusters, this is the Worker service. Separating the Workers lets the shared Metadata Services, provided by the Planners, operate more reliably and significantly reduces the impact of scaling clusters: only the Worker clusters need to be scaled out or in, while the metadata-related services stay the same.

Separated Worker Deployment

In addition, the Worker clusters can be operated with fewer access permissions on the storage services. Instead, the shared Metadata Services provide temporary access to the required resources as necessary; in AWS, for instance, this is done with pre-signed URLs for S3 files. This increases security overall, as fewer services need infrastructure-level permissions.
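
The following boto3 snippet sketches the general pre-signed URL mechanism mentioned above. It shows the AWS pattern itself, not Okera's internal implementation, and the bucket and key are placeholders.

```python
# Sketch of the AWS pre-signed URL pattern; bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-bucket", "Key": "data/part-0000.parquet"},
    ExpiresIn=300,  # access is temporary: the URL expires after 5 minutes
)
# A Worker holding this URL can read the object without S3 credentials.
print(url)
```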

Co-located Worker Architecture

Building on the previous Worker separation, it is now possible to co-locate the Workers on the compute engine nodes. This works for all services that allow auxiliary processes to be deployed during node provisioning. For instance, in AWS EC2 this is done using bootstrap scripts that are executed while the virtual machine is set up.

Co-located Worker Deployment

This has the mentioned advantage of Workers running in a non-elevated security context. It also addresses the cost concerns that may arise when operating ODAS separately and being forced to match the ODAS cluster's scale-out to the temporary compute cluster.

Multi and Hybrid Cloud

On top of running ODAS services redundantly within a single cluster using service scale-out, and the ability to scale using multiple ODAS clusters, there is also the option to run ODAS clusters across inherently heterogeneous environments. More specifically, this comprises the following:

  • Multi-Cloud Support

    This deployment architecture is used by enterprises that are wary of depending on a single infrastructure provider. There are many reasons why diversifying infrastructure is a reasonable choice, not just compliance with laws and regulations.

    In practice, this means that services are spread out across multiple vendors, for example Amazon Web Services (AWS), Microsoft Azure, Google Cloud, and so on.

  • Hybrid-Cloud Support

    Another heterogeneous architecture is the combination of on-premises services with public cloud ones. This may be due to the same compliance reasons, but it also suits enterprises that are in the process of migrating to the cloud.

    ODAS clusters are platform neutral, since they are provided as packaged container images that can be executed in common container environments. For on-premises deployments, this may be Red Hat OpenShift, which also uses Kubernetes under the hood to manage the services.

The following diagram shows how ODAS clusters share Metadata Services across multiple, distinct infrastructure services.

Multi-Cloud ODAS Deployment

The deployment architectures shown above give great freedom on how to provision the Okera Platform. Please reach out to us if you have any questions.