OkeraEnsemble nScale Mode Deployment in Amazon EMR Environments

You can elect to deploy the OkeraEnsemble access proxy in nScale mode, so its workload is distributed across your cluster nodes and scales up and down with your clusters. To do this, the OkeraEnsemble access proxy retrieves AWS credentials from the Okera Policy Engine (planner). To communicate with the Okera cluster, the access proxy generates its own system token if it is configured with the JWT private key used by the Okera cluster (using the JWT_PRIVATE_KEY configuration property). This is done for you if you use the odas-emr-bootstrap.sh script with the --install-jwt-key argument (specifying the Amazon S3 path to the key). You must use the same private key used by the Okera cluster.

Alternatively, the access proxy uses the system token specified by the SYSTEM_TOKEN configuration parameter.

If both the JWT_PRIVATE_KEY and SYSTEM_TOKEN configuration parameters are specified, JWT_PRIVATE_KEY takes precedence and is used to generate the required JWT token.

The following diagram depicts the processing flow for OkeraEnsemble with Amazon S3 in nScale mode.

OkeraEnsemble with Amazon EMR in nScale mode

In nScale mode, the OkeraEnsemble access proxy:

  1. Services a request from an Amazon S3 client that is bound for Amazon S3.
  2. Validates the user's permission against Okera's policies to access data requested.
  3. If permission is granted, modifies the request and re-signs the Amazon S3 authorization header using its AWS credentials.

Installation Instructions

The following instructions explain how to provision the OkeraEnsemble access proxy service in Amazon EMR using Okera’s odas-emr-bootstrap.sh provisioning script. To install using this script, you must make some modifications to the standard Amazon EMR provisioning steps described in Amazon Web Services (AWS) EMR Integration.

  1. Follow the instructions for the Amazon EMR node bootstrap script.

  2. Follow the instructions for Setting Up Spark, but add the following property to spark-defaults:

    "spark.extraListeners":"com.okera.recordservice.spark.OkeraSparkListener"
    
  3. Add another map after the one representing spark-defaults, but this one for Hadoop’s core-site.xml. Note the addition of fs.s3a.endpoint:

     {
       "Classification": "core-site",
       "Properties": {
         "fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
         "fs.s3a.aws.credentials.provider": "com.okera.recordservice.hadoop.OkeraCredentialsProvider",
         "recordservice.token-provisioner": "https://<Okera REST server host>:8083",
         "fs.s3a.connection.ssl.enabled": "true",
         "fs.s3a.s3.client.factory.impl": "com.okera.recordservice.hadoop.OkeraS3ClientFactory",
         "okerafs.default.region": "us-west-2",
         "okerafs.<mybucket>.region": "us-east-1",
         "fs.s3a.endpoint": "http://localhost:5010",
         "fs.s3a.path.style.access": "true"
       }
     }
    

    When deploying OkeraEnsemble nScale in an Amazon EMR 5 environment, set the fs.s3a.s3.client.factory.impl property to org.apache.hadoop.fs.s3a.OkeraS3ClientFactory. When deploying OkeraEnsemble nScale in an Amazon EMR 6 environment, set the fs.s3a.s3.client.factory.impl property to com.okera.recordservice.hadoop.OkeraS3ClientFactory.

    The fs.s3a.endpoint setting must be set to localhost. In nScale mode, Hadoop filesystems that make requests to Amazon S3 use the OkeraEnsemble access proxy on the local host rather than in the Okera cluster.

    Note: The example fs.s3a.endpoint setting above is suitable when the access proxy is not configured to listen for SSL/TLS connections. If you want to activate SSL/TLS for the Amazon S3 client-to-proxy connection, configure a DNS A record that maps a subdomain (one covered by the SSL certificate with which the access proxy is configured) to an IP address of the host loopback interface, such as 127.0.0.1. This allows clients to establish secure connections to the access proxy on the same host.

    Add an okerafs.<mybucket>.region property for each bucket that resides in a region different from the default. The okerafs.default.region property defines the default region. When okerafs.default.region is not defined, the default is the AWS default, us-east-1.

  4. Follow the instructions in Step 3: Set your cluster name and bootstrap scripts, but append the following arguments in the Okera libraries bootstrap script:

       --rest-server-hostports <Okera REST server host>:8083
       --access-proxy-hostports <Okera REST server host>:5010
       --aws-cli-autoconfig-omit-users <emr-username1>[,<emr-username2>]...
       --use-access-proxy-aws-cli
       --install-jwt-key "<s3 path to JWT private key>"
    

    When --access-proxy-hostports is passed to odas-emr-bootstrap.sh, the bootstrap script sets the Amazon S3 environment variable that activates the access proxy, running on port 5010.

    The --install-jwt-key "<s3 path to JWT private key>" argument specifies the Amazon S3 path to the JWT private key. The script installs this key on the Amazon EMR host and provisions it for use by the Okera access proxy running in nScale mode on that host. You must specify the same private key used by the Okera cluster.

    The --aws-cli-autoconfig-omit-users argument specifies a list of Amazon EMR host usernames for which the AWS CLI should not be configured to route through Okera for authorization. When this argument is not specified, the list contains only the root user. If you specify this argument, be sure to include root in the list if it is needed. The --aws-cli-autoconfig-omit-users argument must be specified before the --use-access-proxy-aws-cli argument.

    When the odas-emr-bootstrap.sh script runs with the --use-access-proxy-aws-cli setting and these other parameter settings, it installs and configures the Okera AWS CLI plugin and makes the ~/.aws/config file changes necessary to integrate it with the Okera cluster. That file also provides information that the CLI needs to authenticate to Okera (see Credential Processing). When --authserver <algorithm> arguments are passed to odas-emr-bootstrap.sh, the script sets the token_source value in the ~/.aws/config configuration to authserver, and the AWS CLI uses authserver as its source for the users’ JSON Web Tokens. The odas-emr-bootstrap.sh script also sets up some /etc/profile.d scripts that configure the Okera plugin and AWS CLI automatically for new users of a multitenant Amazon EMR cluster.
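The core-site settings from step 3 can be sketched in Python as a small generator, which may help when the same configuration is produced for several clusters. This is only an illustrative sketch: the function names core_site_classification and resolve_region, the version-to-class mapping, and the host okera.example.com are assumptions made for this example, not part of Okera's tooling. The property names, the factory class names, and the region fallback order come from the text above.

```python
import json

# Factory class names as given in step 3; keying them by EMR major version
# is this sketch's own convention.
S3_CLIENT_FACTORY = {
    5: "org.apache.hadoop.fs.s3a.OkeraS3ClientFactory",
    6: "com.okera.recordservice.hadoop.OkeraS3ClientFactory",
}

AWS_DEFAULT_REGION = "us-east-1"


def core_site_classification(emr_major, rest_host, default_region=None,
                             bucket_regions=None):
    """Build the core-site map for nScale mode (hypothetical helper)."""
    props = {
        "fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
        "fs.s3a.aws.credentials.provider":
            "com.okera.recordservice.hadoop.OkeraCredentialsProvider",
        "recordservice.token-provisioner": f"https://{rest_host}:8083",
        "fs.s3a.connection.ssl.enabled": "true",
        "fs.s3a.s3.client.factory.impl": S3_CLIENT_FACTORY[emr_major],
        # In nScale mode the endpoint is the proxy on the local host.
        "fs.s3a.endpoint": "http://localhost:5010",
        "fs.s3a.path.style.access": "true",
    }
    if default_region:
        props["okerafs.default.region"] = default_region
    effective_default = default_region or AWS_DEFAULT_REGION
    for bucket, region in (bucket_regions or {}).items():
        # Only buckets outside the default region need their own property.
        if region != effective_default:
            props[f"okerafs.{bucket}.region"] = region
    return {"Classification": "core-site", "Properties": props}


def resolve_region(props, bucket):
    """Fallback order: bucket property, then default property, then us-east-1."""
    return (props.get(f"okerafs.{bucket}.region")
            or props.get("okerafs.default.region", AWS_DEFAULT_REGION))


if __name__ == "__main__":
    conf = core_site_classification(
        6, "okera.example.com", "us-west-2", {"mybucket": "us-east-1"})
    print(json.dumps(conf, indent=2))
```

Running the module prints a map equivalent to the example in step 3, with the placeholder host filled in. The resolve_region helper only mirrors the documented lookup order so that a generated configuration can be checked before it is deployed.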

Amazon S3 Bucket Access Considerations

Okera can grant you access to Amazon S3 buckets, including Amazon S3 buckets that are defined for Amazon's assume secondary role feature as well as buckets to which the Okera cluster is granted access using its IAM permission.

Because nScale deploys the OkeraEnsemble access proxy with least-privilege access to Amazon EMR, the proxy has no IAM permissions of its own and retrieves the credentials it uses to sign Amazon S3 requests from the Okera Policy Engine (planner). Consequently, when you deploy OkeraEnsemble in nScale mode, you must provide access to the Amazon S3 buckets using either of two methods:

  1. Using Amazon S3's assume secondary role feature. For Amazon S3 buckets that use assume secondary roles (bucket role map), the OkeraEnsemble access proxy retrieves the AWS Security Token Service (STS) credentials associated with the Amazon Resource Name (ARN) for the Amazon S3 bucket.

  2. By setting the OKERA_SYSTEM_IAM_ROLE_ARN configuration parameter in the Okera configuration file to the IAM Amazon Resource Name (ARN) associated with the Okera cluster. When this is activated, Okera can grant OkeraEnsemble nScale users access to buckets to which the Okera cluster has access by permission through its IAM role. When OKERA_SYSTEM_IAM_ROLE_ARN is set to a role ARN, Okera adds a trust relationship with the role itself. For example, if the role ARN is arn:aws:iam::1234567890:role/odap-iam-role, the trust relationship would be:

    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/odap-iam-role"
      },
      "Action": "sts:AssumeRole"
    }
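
The trust relationship above can also be generated from the role ARN, for example when reviewing what the role's trust policy should contain. The following is a minimal sketch: the helper names trust_statement and trust_policy_document are hypothetical, and the ARN check is a loose sanity test written for this example rather than full ARN validation. The "2012-10-17" value is the standard IAM policy language version.

```python
import json


def trust_statement(role_arn):
    """Return a statement that lets the role assume itself (hypothetical helper)."""
    # Loose sanity check only; this is not a complete ARN validator.
    if not role_arn.startswith("arn:") or ":role/" not in role_arn:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    return {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "sts:AssumeRole",
    }


def trust_policy_document(role_arn):
    """Wrap the statement in a policy document envelope."""
    return {"Version": "2012-10-17", "Statement": [trust_statement(role_arn)]}


if __name__ == "__main__":
    arn = "arn:aws:iam::1234567890:role/odap-iam-role"
    print(json.dumps(trust_policy_document(arn), indent=2))
```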
    

Referencing Amazon S3 Objects in Okera Configuration Parameters in nScale Amazon EMR Deployments

You can reference objects stored in Amazon S3 as Okera configuration parameters for odas-emr-bootstrap. Okera pulls the objects referenced in the configuration parameters and mounts them in the nScale container, making the Amazon S3 paths available to Okera for processing. For example, this is helpful when configuring the SSL certificate and key required to start the OkeraEnsemble Amazon EMR access proxy in TLS/SSL mode:

--external-objects-to-container SSL_CERTIFICATE_FILE=s3://bucket/certificate-object, SSL_KEY_FILE=s3://bucket/key-object

If SSL_CERTIFICATE_FILE specifies the path to the SSL certificates file in Amazon S3 and SSL_KEY_FILE specifies the path to the SSL key file in Amazon S3, these paths can be used by the OkeraEnsemble access proxy for any necessary TLS/SSL processing.

OkeraEnsemble nScale System Token Duration Controls

You can specify the duration, in minutes, of the JWT system token for OkeraEnsemble nScale processing. Set the SYSTEM_TOKEN_DURATION_MIN configuration parameter on the nScale container using the Okera Amazon EMR odas-emr-bootstrap.sh script. Valid values are positive integers. The default value is one day (1440 minutes). For example, passing the following argument to the odas-emr-bootstrap.sh script sets the system token duration to 300 minutes:

--local-worker-env-vars "-e SYSTEM_TOKEN_DURATION_MIN=300"

This configuration setting takes effect only when the nScale proxy is configured using JWT_PRIVATE_KEY, not SYSTEM_TOKEN. When configured using JWT_PRIVATE_KEY, the nScale access proxy generates its own token and the SYSTEM_TOKEN_DURATION_MIN setting determines how long that token is valid. When configured with SYSTEM_TOKEN, the SYSTEM_TOKEN_DURATION_MIN setting has no effect because the JWT token identified by the SYSTEM_TOKEN path includes an embedded expiration time that cannot be governed by the SYSTEM_TOKEN_DURATION_MIN setting. If both JWT_PRIVATE_KEY and SYSTEM_TOKEN are specified, JWT_PRIVATE_KEY is used and SYSTEM_TOKEN is ignored.
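
The interaction between these three settings can be summarized as a small decision function. This is a sketch of the documented rules only, not the access proxy's actual implementation: the names TokenPlan and plan_system_token are hypothetical, and raising an error when neither parameter is set is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_DURATION_MIN = 1440  # one day, the documented default


@dataclass
class TokenPlan:
    source: str                  # "JWT_PRIVATE_KEY" or "SYSTEM_TOKEN"
    duration_min: Optional[int]  # None when the duration setting has no effect


def plan_system_token(env):
    """Decide which setting supplies the system token (hypothetical helper)."""
    if env.get("JWT_PRIVATE_KEY"):
        # JWT_PRIVATE_KEY takes precedence; the proxy generates its own token,
        # so SYSTEM_TOKEN_DURATION_MIN applies.
        raw = env.get("SYSTEM_TOKEN_DURATION_MIN")
        duration = DEFAULT_DURATION_MIN if raw is None else int(raw)
        if duration <= 0:
            raise ValueError("SYSTEM_TOKEN_DURATION_MIN must be a positive integer")
        return TokenPlan("JWT_PRIVATE_KEY", duration)
    if env.get("SYSTEM_TOKEN"):
        # The supplied token carries its own expiration; duration is ignored.
        return TokenPlan("SYSTEM_TOKEN", None)
    # Assumption for this sketch: treat a missing token source as an error.
    raise ValueError("set JWT_PRIVATE_KEY or SYSTEM_TOKEN")


if __name__ == "__main__":
    print(plan_system_token({
        "JWT_PRIVATE_KEY": "/path/to/key",
        "SYSTEM_TOKEN": "/path/to/token",
        "SYSTEM_TOKEN_DURATION_MIN": "300",
    }))
```

With both parameters present, the function selects JWT_PRIVATE_KEY and a 300-minute duration, matching the precedence rule described above.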