Okera Installation - AWS Guide

This document addresses all the Amazon Web Services (AWS) specific steps that are required to set up a working Okera Data Access Services cluster. The general installation instructions are linked where applicable.

Introduction

The Okera Data Access Platform (ODAP) consists of two components:

  • Deployment Manager (DM)
  • Okera Data Access Services (ODAS)

The DM is a dedicated application used to manage one or more ODAS clusters. It is typically the first component installed and handles all further management tasks. The state of the DM, which includes the list of deployed and managed clusters, is stored in a relational database management system (RDBMS). The state of each ODAS cluster is also persisted in an RDBMS, often the same instance used for the DM (but a different instance can also be configured). Both DM and ODAS need industry-standard server hardware and software, as outlined in the Installation Prerequisites.

When deployed in AWS, Okera makes use of many of the provided AWS services, which are explained in the next section.

AWS Architecture Overview

The most basic setup of ODAP within AWS is shown in this diagram:

The basic setup has the following properties:

  • Both DM and ODAS are deployed as AWS EC2 instances, running in the customer VPC. For resiliency, the ODAS instances are (optionally) maintained as an Auto Scaling Group (ASG).
  • Both DM and ODAS instances are assigned their own subnet(s) and security group(s).
  • A shared RDBMS is provided by AWS RDS, using MySQL (or a compatible engine, such as Aurora), with the ability to replicate data to a standby replica in another availability zone (AZ). This provides resilience against AZ failures, so that a copy of the state data remains accessible.
  • The shared RDBMS runs in the same subnet as ODAP.
  • Both DM and ODAS instances are assigned dedicated AWS IAM roles. For the DM, the role includes the ability to talk to RDS and S3, as well as to start EC2 instances. For the ODAS machines, the role includes access to RDS, S3, and EC2.
  • The ODAS machines may access user and group membership information, which may be provided by AWS Directory Services.
  • Other AWS services such as EMR can be configured to use ODAS as a data access service.
  • Third-party vendor services may also be configured to access ODAS, through some means of network connectivity, such as VPC Peering.
  • DNS, by means of AWS Route53, may be used to assign other host names to the EC2 instances.

For more complex setups, there are advanced installation features:

  • When using Amazon Aurora, data should be replicated across multiple AWS Regions to keep the Okera state data safe and available.
  • For failover safety, it is also recommended to replicate the S3 buckets across the same Regions.
  • Multiple subnets may be used to separate external-facing services from internal ones. This also applies to security groups, which should be split between administrative and service-level access.
  • A Network Load Balancer (NLB) can be used to decouple service endpoints from internal IP addresses or ephemeral hostnames.

This guide will discuss the basic setup only. Experienced administrators can use the above advanced choices to implement a resilient and highly available service.

Note: Okera does not need any AWS keys or secrets to work.

Prerequisites

AWS Prerequisites

Before installing Okera, the following AWS resources must be set up: an S3 bucket, an RDS instance, and IAM credentials.

  • S3 Bucket

    Okera uses a dedicated S3 bucket to store log files and to stage intermediate configuration files. The S3 bucket must be readable and writable by all instances running Okera components. The S3 bucket is referred to as OKERA_S3_STAGING_DIR throughout the documentation.

  • RDS Instance

    Okera is backed by a relational database, either MySQL 5.6 or newer, or Aurora (MySQL 5.6+ compatible), provisioned with the configuration of your choice. Okera instances require read and write access to the OKERA_DB_URL database (which includes the ability to create tables, insert and delete records, and so on).

  • IAM Credentials

    In general, all Okera components (that is, DM and ODAS) require read and write access to the S3 bucket set by OKERA_S3_STAGING_DIR and the RDS instance set by OKERA_DB_URL. On top of that, the following is needed:

    • The DM requires the ability to provision EC2 instances.
    • The ODAS cluster nodes require (at least) read credentials to data stored in S3 buckets.

    These permissions can be granted through a single IAM role, or through two separate roles (the latter is assumed throughout this document). If encryption is being used with keys in KMS, these roles will also need KMS access (see S3 Encryption Support). The IAM role for the Deployment Manager (DM) is referred to as IAM_MANAGER and the IAM role for the cluster nodes is referred to as IAM_CLUSTER.

See Configure AWS Resources, which addresses the required resources in detail.
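
The S3 bucket and RDS instance described above can also be provisioned with the AWS CLI. The following is a minimal sketch; the bucket name, database identifier, instance class, and password are placeholders and should be adapted to your environment:

# Create the staging bucket (referred to as OKERA_S3_STAGING_DIR)
aws s3 mb s3://my-okera-staging --region us-east-1

# Create a MySQL-compatible RDS instance (its endpoint becomes part of OKERA_DB_URL)
aws rds create-db-instance \
    --db-instance-identifier okera-db \
    --db-instance-class db.t2.medium \
    --engine mysql \
    --allocated-storage 50 \
    --master-username okera \
    --master-user-password <password> \
    --multi-az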

Okera Host Prerequisites

The following are specific AWS related host requirements. See Installation Prerequisites for general host requirements.

  • Deployment Manager Host Requirements

    • Assign the IAM_MANAGER IAM role to the EC2 instance.
    • Minimum instance type should be t2.medium. Also see Cluster Sizing for more details on instance sizing.
    • Ensure the awscli command-line tool is installed (see below).
  • Install the AWS CLI Tool

    Install the awscli command-line tool on a newly provisioned Deployment Manager host, using these commands (a short verification sketch follows this list):

      sudo yum install wget
      sudo wget https://bootstrap.pypa.io/get-pip.py
      sudo python get-pip.py
      sudo pip install awscli
      aws configure
    
  • Cluster Node Requirements

    • Use a CentOS 7/RHEL 7 AMI. The AMI can be a stock RHEL image or one maintained by your enterprise with additional components pre-installed.
    • Assign the IAM_CLUSTER IAM role to the EC2 instances.
    • Minimum instance type should be t2.large with a minimum of 40GB local storage. Also see Cluster Sizing for more details on instance sizing.
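
Once the awscli tool is installed and the IAM_MANAGER role is attached to the Deployment Manager host, the setup can be sanity-checked with a minimal sketch like the following (the bucket name is a placeholder for OKERA_S3_STAGING_DIR):

# Should report the attached IAM role rather than an IAM user
aws sts get-caller-identity

# Should list the staging bucket contents without credential errors
aws s3 ls s3://my-okera-staging/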

Configure AWS Resources

The following details the setup of the necessary AWS resources, as listed above. Most steps require administrative (or otherwise privileged) access to the AWS Console.

AWS IAM Roles

Following the general steps to create an IAM role, as outlined in the AWS Documentation, perform these steps to create the necessary Okera roles:

  1. Open the IAM console at https://console.aws.amazon.com/iam/.

  2. In the navigation pane, choose Roles, Create role.

  3. On the Select role type page, choose AWS Service and the EC2 use case. Choose Next: Permissions.

  4. On the Attach permissions policy page, select the following AWS managed policies:

    • AmazonRDSFullAccess
    • AmazonS3FullAccess
    • AmazonEC2FullAccess

    Then choose Next: Tags.

  5. Add any tags you want, though none are needed by Okera. Choose Next: Review.

  6. On the Review page, type a name for the role, such as OkeraDmRole and choose Create role.

This screenshot shows the review page, with the role for the DM:

Repeat the steps to create another IAM role for the ODAS cluster nodes, this time selecting only these permissions:

  • AmazonRDSFullAccess
  • AmazonS3FullAccess

Enter OkeraOdasRole for the role name and choose Create role.

Notes:

  • If encryption is being used with keys in KMS, these roles will also need KMS access (see S3 Encryption Support).
  • End users do not need these roles to use the Okera services.
  • There is no need to create IAM users or any other IAM resources.
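
The same roles can also be created with the AWS CLI instead of the console. The following is a minimal sketch for the DM role; the trust policy file name is a placeholder, and the ODAS role (OkeraOdasRole) is created the same way with only the RDS and S3 policies attached:

# ec2-trust.json: trust policy allowing EC2 instances to assume the role
# {
#   "Version": "2012-10-17",
#   "Statement": [{"Effect": "Allow", "Principal": {"Service": "ec2.amazonaws.com"}, "Action": "sts:AssumeRole"}]
# }

aws iam create-role --role-name OkeraDmRole \
    --assume-role-policy-document file://ec2-trust.json

# Attach the AWS managed policies listed above
aws iam attach-role-policy --role-name OkeraDmRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonRDSFullAccess
aws iam attach-role-policy --role-name OkeraDmRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name OkeraDmRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

# Wrap the role in an instance profile so it can be assigned to EC2 instances
aws iam create-instance-profile --instance-profile-name OkeraDmRole
aws iam add-role-to-instance-profile --instance-profile-name OkeraDmRole \
    --role-name OkeraDmRole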

AWS Security Groups

Following the general steps to create a Security Group, as outlined in the AWS Documentation, perform these steps to create the necessary Okera security groups:

  1. Open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. In the navigation pane, choose Security Groups.

  3. Choose Create Security Group.

  4. Enter a name for the security group, such as OkeraOdasServices, and provide a description of your choice.

  5. Select the ID of your VPC from the VPC menu.

  6. Use the Add Rule button to add the following Inbound rules that pertain to machines outside of the security group:

    Note: This assumes the default ports were chosen for the ODAS services, and that they should be accessible by machines outside of the current security group. Also see Port Configuration.

    Type | Protocol | Port Range | Source | Description
    SSH | TCP | 22 | 0.0.0.0/0 | SSH access
    Custom TCP | TCP | 10000 | 0.0.0.0/0 | Grafana Dashboard UI
    Custom TCP | TCP | 10001 | 0.0.0.0/0 | Kubernetes Web UI
    Custom TCP | TCP | 5000 | 0.0.0.0/0 | ODAS REST Server
    Custom TCP | TCP | 8083 | 0.0.0.0/0 | ODAS Web UI
    Custom TCP | TCP | 11050 | 0.0.0.0/0 | ODAS Planner Web UI
    Custom TCP | TCP | 11051 | 0.0.0.0/0 | ODAS Worker Web UI
    Custom TCP | TCP | 12050 | 0.0.0.0/0 | ODAS Planner
    Custom TCP | TCP | 13050 | 0.0.0.0/0 | ODAS Worker

    Note: If you use 0.0.0.0/0 as the source value, you enable all IPv4 addresses to access these ports. To restrict access, enter a specific IP address or range of addresses.

  7. Add any other existing Security Groups to this new group as needed.

  8. Choose the Create button to finalize the new security group.

The following shows the Create Security Group dialog as the Okera group is created:

Note: Setting Outbound rules is typically not required.

Next, edit the new group and add additional rules that only apply to machines within the same security group, including access to the MySQL RDS instance. Click on the Inbound tab and select Edit. Then, for each entry in the following table, click on Add Rule and enter the given details:

Type | Protocol | Port Range | Source | Description
MYSQL/Aurora | TCP | 3306 | <security-group-id> | MySQL RDS instance
Custom TCP | TCP | 6443 | <security-group-id> | Kubernetes API Server
Custom TCP | TCP | 6783 | <security-group-id> | Kubernetes Weave (TCP)
Custom TCP | TCP | 10250 | <security-group-id> | Kubernetes Kubelet Server
Custom TCP | TCP | 8085 | <security-group-id> | ODAS Agent
Custom TCP | TCP | 9083 | <security-group-id> | ODAS HMS
Custom TCP | TCP | 9091 | <security-group-id> | ODAS ZooKeeper
Custom TCP | TCP | 9098 | <security-group-id> | ODAS Canary
Custom TCP | TCP | 11060 | <security-group-id> | ODAS Sentry

Replace <security-group-id> with the ID of the newly created security group.

Note: Okera does not require the definition of access control lists (ACLs).
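
As with the IAM roles, the security group and its rules can also be scripted with the AWS CLI. The following is a minimal sketch showing one external and one intra-group rule; the VPC ID is a placeholder, and the remaining ports from the tables above are added with additional authorize calls:

# Create the group and capture its ID
SG_ID=$(aws ec2 create-security-group \
    --group-name OkeraOdasServices \
    --description "Okera ODAS services" \
    --vpc-id vpc-0123456789abcdef0 \
    --query GroupId --output text)

# External rule: ODAS Web UI (port 8083), open to all IPv4 addresses here
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
    --protocol tcp --port 8083 --cidr 0.0.0.0/0

# Intra-group rule: MySQL RDS instance (port 3306), only reachable from the group itself
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
    --protocol tcp --port 3306 --source-group "$SG_ID"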

AWS Resource Tagging

Both the DM and ODAS machines support tagging of the used resources. Since the DM is usually set up using the EC2 Wizard, there is a step included to tag the machine as needed. For ODAS machines, the resource tagging is part of the EC2 launch scripts. Okera ships with examples for regular EC2 and ASG launching. Here is the section from the regular EC2 launch script that allows setting custom tags for the created EC2 machines:

...
################################################################################
# USER:
# Add any tags to the launched instances. This is an example of how to add tags.
################################################################################
add-tag $instance_id LaunchedBy Okera

wait-for-instance-state $instance_id "running"
debug "$instance_id running."
...
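
Additional tags can be set by repeating the add-tag helper call shown above. The keys and values in the following sketch are purely illustrative:

add-tag $instance_id Environment production   # illustrative tag
add-tag $instance_id CostCenter data-platform # illustrative tag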

Once the AWS resources are provisioned and configured, continue with the Installation of the Deployment Manager.

Operations

The next subsections address specific topics regarding the operations of ODAS clusters.

Health Checks

Okera comes with its own health check system, whose state is exposed by internal APIs. Administrators can use the Deployment Manager and the provided ocadm CLI tool to determine the state of the DM and each managed ODAS cluster. For example:

$ ocadm status
Service available. Status: Running

$ ocadm clusters status 1
READY

$ ocadm clusters status 1 --detail
{
    "timestamp": "2019-01-29 18:31:35.485",
    "message": "All services running.",
    "elapsedTime": "3h12m41s997ms",
    "code": "READY"
}

The API and CLI tool can be used to automate the monitoring and alerting needed to detect service problems. In addition, ODAS clusters can be configured to graph metrics using Grafana dashboards and to show service details using the Kubernetes dashboard. See Monitoring for details.
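
As an example of such automation, the CLI output can be polled from a simple script. The following is a minimal sketch, assuming ocadm is on the PATH of the Deployment Manager host, cluster ID 1 is being monitored, and alerting happens via a hypothetical notify-operator command:

#!/bin/bash
# Poll the cluster status and alert if it is not READY.
STATUS=$(ocadm clusters status 1)
if [ "$STATUS" != "READY" ]; then
    echo "ODAS cluster 1 is not ready: $STATUS"
    # Log the detailed JSON status for later inspection
    ocadm clusters status 1 --detail
    # Hypothetical alerting hook; replace with your monitoring integration
    notify-operator "ODAS cluster 1 status: $STATUS"
fi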

Software Updates

Updating ODAP requires two steps:

  1. Update DM to the new version

    Newer versions of the DM are backwards compatible (unless stated otherwise in the release notes). That means the new DM instance can manage running ODAS clusters from a previous version.

  2. Create replacement ODAS clusters

    The easiest way to update ODAS clusters is to replace them with newer ones. This has the advantage that the new clusters can be tested and prepared before being activated. Since each ODAS cluster is based on a shared-nothing, massively parallel processing architecture, clusters can be replaced without any detrimental effect.

Failure Scenarios

The following discusses the implications of common failure scenarios:

  • Instance Failure

    This applies when one of the EC2 instances fails. The administrator can add a new instance as needed, or, when using the ASG launch script, rely on the ASG to backfill the instance.

  • AZ or Region Failure

    All the state of the ODAP components is stored in an RDBMS, which should be configured in a replicated setup, across multiple AWS Regions. That way, losing entire clusters, such as during an AZ or Region failure, does not cause loss of service. It does require monitoring and automated deployment of new ODAS clusters in the failover locations.

  • Storage Capacity Issues

    ODAS does not store local data. Log files are periodically uploaded to the S3 staging bucket for the cluster. This means that storage on the EC2 side is not an issue. For S3, there is no concern about storage limitations either, as the ODAS cluster files are comparatively small and S3 imposes no practical size limit.

Service Recovery

In case of a large failure, such as a region outage, the following steps can be taken:

  1. Decide which region to migrate to.
  2. Activate the database replica in that region, making it the new master (a sketch follows below).
  3. Provision a DM machine.
  4. Deploy new ODAS clusters.
  5. Update DNS or LB target groups to match the new location.

The steps for provisioning a DM and ODAS clusters can be highly automated. A standby DM instance can be kept running or stopped until it is needed. Deploying ODAS clusters is a matter of minutes.

This should keep the recovery point objective (RPO) of the state data to what is offered by the AWS database replication service, and the recovery time objective (RTO) to mere minutes.
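
For step 2, the exact command depends on how the database replication was set up. With a cross-Region RDS read replica, the promotion can look like the following sketch (the instance identifier and Region are placeholders; Aurora global databases use a different failover mechanism):

# Promote the cross-Region read replica to a standalone instance in the failover Region
aws rds promote-read-replica \
    --db-instance-identifier okera-db-replica \
    --region us-west-2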

Perform Recovery Testing

It is advised to test failover scenarios on a regular basis. Following the steps in the previous subsection, a customer can stand up a replacement DM and ODAS cluster in another region, test their functionality, and then terminate them (or stop the DM) once the test is complete.

Additional Notes

Okera Cost Model

Okera offers an annual subscription rate that is based on an annual platform fee, plus a variable fee based on the size of the data lake. Please contact info@okera.com for more information.

AWS Billable Services

The following table lists the AWS resources used by Okera and states whether they are required and billable:

Service | Required | Billable | Notes
EC2 | Yes | Yes | Free tier instance types are not recommended by Okera.
EBS | Yes | Yes | Many instance types only offer EBS as instance storage.
RDS/Aurora | Yes | Yes | Free tier instance types are not recommended by Okera.
S3 | Yes | Yes | Okera needs a storage bucket for logs and staging files. Cost is dependent on how many of these objects are retained.
VPC | Yes | No | Only additional services for VPCs, like a NAT Gateway, are billable.
CloudTrail | No | Yes | Only if enabled for related services, such as EC2 instances.
CloudWatch | No | Yes | Only if enabled for related services, such as EC2 instances.
Route 53 | No | Yes | Only if custom hostnames are configured as DNS records.
Directory Service | No | Yes | Only if used in combination with ODAS.

Okera Support Levels

See the “How to get Support” FAQ entry for details on how to get in contact with the Okera Support department.

Note: Support is bundled with the annual subscription fees and does not incur additional costs.

The following lists how Okera prioritizes the severity of reported issues, along with the service-level agreement (SLA) for each. The SLA describes how quickly issues will be addressed and the amount of time spent to resolve them.

Severity 1 – Critical

  Description:
  • The issue brings the production system down and makes it unusable for all or most users.
  • There is no known workaround available.

  SLA:
  • Respond within 1 business hour if the outage occurs during business hours Pacific Time.
  • Respond within 3 hours if the outage occurs outside of business hours Pacific Time.
  • Work around the clock until resolved, with updates every 4 hours.

Severity 2 – High

  Description:
  • The issue impacts production systems and results in a degraded or unusable experience for all or most users. This could include excessively slow system response times.
  • Some functionality may be available or can be restored with a workaround.

  SLA:
  • Respond within 3 business hours.
  • Work during business hours until resolved.

Severity 3 – Medium

  Description:
  • Most parts of the system are functional.
  • The issue does not cause material impact on customer experience and poses no significant risk to the business.

  SLA:
  • Respond within 1 business day.

Severity 4 – Low

  Description:
  • Cosmetic defects that don’t impact system functionality.
  • The issue does not cause downtime.

  SLA:
  • Respond within 5 business days.