
OkeraFS nScale Mode Deployment in EMR Environments

You can elect to deploy the OkeraFS access proxy in nScale mode, so that its workload is distributed across your cluster nodes and scales up and down with your clusters. In this mode, the OkeraFS access proxy retrieves AWS credentials from the Okera Policy Engine (planner). To communicate with the Okera cluster, the access proxy generates its own system token, provided it is configured with the JWT private key used by the Okera cluster (via the JWT_PRIVATE_KEY configuration property). This is done for you if you use the odas-emr-bootstrap.sh script with the --install-jwt-key argument, specifying the S3 path to the key. You must use the same private key used by the Okera cluster.
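
For reference, the relevant bootstrap argument looks like the following sketch; the S3 path shown is a placeholder for your own key location, and the full argument list appears in the installation steps below:

    --install-jwt-key "s3://<your-bucket>/keys/okera-jwt-private.pem"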

The following diagram depicts the processing flow for OkeraFS with AWS S3 in nScale mode.

OkeraFS with Amazon EMR in nScale mode

In nScale mode, the OkeraFS access proxy:

  1. Services a request from an S3 client that is bound for AWS S3.
  2. Validates the user's permission to access the requested data against Okera's policies.
  3. If the permission check passes, modifies the request and re-signs the AWS S3 authorization header using its own AWS credentials.
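
For illustration, any S3 client on a cluster node can be pointed at the local proxy explicitly; a simple AWS CLI listing request then exercises this flow (the bucket name is a placeholder, and port 5010 is the proxy's default as configured below):

    aws s3 ls s3://<mybucket>/ --endpoint-url http://localhost:5010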

Installation Instructions

The following instructions explain how to provision the OkeraFS access proxy service in Amazon EMR using Okera's odas-emr-bootstrap.sh provisioning script. To install using this script, you must make some modifications to the standard EMR provisioning steps described in Amazon Web Services (AWS) EMR Integration.

  1. Follow the instructions for the EMR node bootstrap script.

  2. Follow the instructions for Setting Up Spark, but add the following property to spark-defaults:

    "spark.extraListeners":"com.okera.recordservice.spark.OkeraSparkListener"
    

  3. Add another map after the one representing spark-defaults, but this one for Hadoop’s core-site.xml. Note the addition of fs.s3a.endpoint:

     "Classification": "core-site",
     "Properties": {
       "fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
       "fs.s3a.aws.credentials.provider": "com.okera.recordservice.hadoop.OkeraCredentialsProvider",
       "recordservice.token-provisioner": "https://<Okera REST server host>:8083",
       "fs.s3a.connection.ssl.enabled": "true",
       "fs.s3a.s3.client.factory.impl": "com.okera.recordservice.hadoop.OkeraS3ClientFactory",
       "okerafs.default.region": "us-west-2",
       "okerafs.<mybucket>.region": "us-east-1",
       "fs.s3a.endpoint": "http://localhost:5010",
       "fs.s3a.path.style.access": "true"
     }
    

    The fs.s3a.endpoint property must point to localhost. In nScale mode, Hadoop filesystems that make requests to S3 use the OkeraFS access proxy on the local host rather than the one in the Okera cluster.

    Note: The example fs.s3a.endpoint setting above is suitable when the access proxy is not configured to listen for SSL/TLS connections. If you want to activate SSL/TLS for the S3 client-to-proxy connection, configure a DNS A record that associates a subdomain (compatible with the SSL certificate the access proxy is configured with) with an IP address on the host loopback interface, such as 127.0.0.1. This allows clients to establish secure connections to the access proxy on the same host.
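
    For example, assuming a certificate valid for a hypothetical name such as okerafs.example.com, you could map that name to the loopback interface (via a DNS A record or, on a single node, an /etc/hosts entry) and point the S3A endpoint at it:

       # /etc/hosts entry mapping the certificate's subdomain to the loopback interface
       127.0.0.1   okerafs.example.com

       # core-site property using the TLS-enabled proxy endpoint
       "fs.s3a.endpoint": "https://okerafs.example.com:5010"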

    Add an okerafs.<mybucket>.region property for each bucket that resides in a region different from the default. The okerafs.default.region property defines that default. When okerafs.default.region is not set, the default is the AWS default, us-east-1.
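
    For example, with hypothetical bucket names sales-data and eu-archive, the region properties might look like this:

       "okerafs.default.region": "us-west-2",
       "okerafs.sales-data.region": "us-east-1",
       "okerafs.eu-archive.region": "eu-west-1"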

  4. Follow the instructions in Step 3: Set your cluster name and bootstrap scripts, but append the following arguments to the Okera libraries bootstrap script:

       --rest-server-hostports <Okera REST server host>:8083
       --access-proxy-hostports <Okera REST server host>:5010
       --aws-cli-autoconfig-omit-users <emr-username1>[,<emr-username2>]...
       --use-access-proxy-aws-cli
       --install-jwt-key "<s3 path to JWT private key>"
    

    When --access-proxy-hostports is passed to odas-emr-bootstrap.sh, the bootstrap script sets the S3 environment variable that activates the access proxy, which runs on port 5010.

    The --install-jwt-key "<s3 path to JWT private key>" argument specifies the S3 path to the JWT private key to install on the EMR host and provision for use by the OkeraFS access proxy running there in nScale mode. You must specify the same private key used by the Okera cluster.

    The --aws-cli-autoconfig-omit-users argument specifies a list of EMR host usernames for which the AWS CLI should not be configured to route requests through Okera for authorization. When this argument is not specified, only the root user is omitted. If you specify this argument, be sure to include root in the list if it should remain omitted. The --aws-cli-autoconfig-omit-users argument must be specified before the --use-access-proxy-aws-cli argument.
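
    For example, to keep the AWS CLI unconfigured for the hypothetical usernames root and hadoop while still enabling the integration, order the arguments like this:

       --aws-cli-autoconfig-omit-users root,hadoop
       --use-access-proxy-aws-cli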

    When the odas-emr-bootstrap.sh script runs with the --use-access-proxy-aws-cli setting and these other parameter settings, it installs and configures the Okera AWS CLI plugin and makes the ~/.aws/config file changes necessary to integrate it with the Okera cluster. That file also provides the information the CLI needs to authenticate to Okera (see Credential Processing). When --authserver <algorithm> arguments are passed to odas-emr-bootstrap.sh, the token_source value in the ~/.aws/config configuration is set to authserver, and the AWS CLI uses authserver as its source for users' JSON Web Tokens. The odas-emr-bootstrap.sh script also sets up /etc/profile.d scripts that automatically configure the Okera plugin and the AWS CLI for new users of a multitenant EMR cluster.
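
    Putting these together, a complete set of nScale-related arguments might look like the following sketch, appended to the standard bootstrap arguments described in Amazon Web Services (AWS) EMR Integration; okera.example.com and the S3 key path are placeholders for your environment:

       --rest-server-hostports okera.example.com:8083
       --access-proxy-hostports okera.example.com:5010
       --aws-cli-autoconfig-omit-users root
       --use-access-proxy-aws-cli
       --install-jwt-key "s3://<your-bucket>/keys/okera-jwt-private.pem"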

S3 Bucket Access Considerations

Okera can grant you access to S3 buckets, including buckets that are defined for Amazon's assume secondary role feature as well as buckets to which the Okera cluster is granted access through its IAM permissions.

Because nScale mode deploys the OkeraFS access proxy with least-privilege access to EMR, the proxy has no IAM permissions of its own and retrieves the credentials it uses to sign S3 requests from the Okera Policy Engine (planner). Consequently, when you deploy OkeraFS in nScale mode, you must provide access to the S3 buckets using one of two methods:

  1. Using S3's assume secondary role feature. For S3 buckets that use assume secondary roles (bucket role map), the OkeraFS access proxy retrieves the AWS Security Token Service (STS) credentials associated with the Amazon Resource Name (ARN) for the S3 bucket.

  2. By setting the OKERA_SYSTEM_IAM_ROLE_ARN configuration parameter in the Okera configuration file to the IAM role Amazon Resource Name (ARN) associated with the Okera cluster. When this is activated, Okera can grant OkeraFS nScale users access to buckets to which the Okera cluster has access through its IAM role. When OKERA_SYSTEM_IAM_ROLE_ARN is set to a role ARN, Okera adds a trust relationship with the role itself. For example, if the role ARN is arn:aws:iam::1234567890:role/odap-iam-role, the trust relationship would be:

     {
       "Effect": "Allow",
       "Principal": {
         "AWS": "arn:aws:iam::1234567890:role/odap-iam-role"
       },
       "Action": "sts:AssumeRole"
     }
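
     For context, a complete trust policy document wraps that statement in the standard IAM policy envelope:

     {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Principal": {
             "AWS": "arn:aws:iam::1234567890:role/odap-iam-role"
           },
           "Action": "sts:AssumeRole"
         }
       ]
     }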
    

OkeraFS nScale System Token Duration Controls

You can specify the duration, in minutes, of the JWT system token used for OkeraFS nScale processing. The SYSTEM_TOKEN_DURATION_MIN configuration parameter can be set on the nScale container using the Okera odas-emr-bootstrap.sh script. For example, passing the following argument with the odas-emr-bootstrap.sh script configures the system token duration to 300 minutes. Valid values are positive integers. The default value is equivalent to one day (1440 minutes).

--local-worker-env-vars "-e SYSTEM_TOKEN_DURATION_MIN=300"

This configuration setting only works when the nScale proxy is configured with JWT_PRIVATE_KEY, not with SYSTEM_TOKEN. When configured with JWT_PRIVATE_KEY, the nScale access proxy generates its own token, and the SYSTEM_TOKEN_DURATION_MIN setting determines how long that token remains valid. When configured with SYSTEM_TOKEN, the SYSTEM_TOKEN_DURATION_MIN setting has no effect because the JWT identified by the SYSTEM_TOKEN path includes an embedded expiration time that SYSTEM_TOKEN_DURATION_MIN cannot override. If both JWT_PRIVATE_KEY and SYSTEM_TOKEN are specified, JWT_PRIVATE_KEY is used and SYSTEM_TOKEN is ignored.
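
For example, a container environment sketch that combines JWT_PRIVATE_KEY mode with a five-hour token lifetime might look like this; the key path is a hypothetical placeholder, so confirm how your deployment supplies JWT_PRIVATE_KEY:

--local-worker-env-vars "-e JWT_PRIVATE_KEY=/etc/okera/jwt_private_key.pem -e SYSTEM_TOKEN_DURATION_MIN=300"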