nScale Enforcement Fleet Workers

Okera supports running Okera Enforcement Fleet worker nodes collocated with Amazon EMR or Dataproc worker nodes (Okera nScale). The steps described below for starting up and configuring nScale workers need to be followed in addition to the steps above for standard Okera-integrated Amazon EMR or Dataproc environments.

We can break the steps down as follows:

Review encryption settings for nScale.
Review deferred URI signing settings for nScale.
Make bootstrap changes for nScale. The updated bootstrap will include the Presto configuration for Amazon EMR with nScale Okera workers.
Make Hive configuration changes for nScale.
Make Spark configuration changes for nScale.

Encryption Settings for nScale

You can encrypt and decrypt tasks using the Advanced Encryption Standard (AES). To specify the key, set the encryption_key_path option for RS_ARGS in Okera's configuration file to the path of the file that contains the key. There are no constraints on the size of this file, except that it cannot be empty. Okera applies SHA-256 to the file contents to derive a 32-byte (256-bit) key.
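The key derivation described above can be sketched in a few lines. This is a minimal illustration rather than Okera's implementation: the function name `derive_task_key` is hypothetical, and hashing the raw file bytes (including any trailing newline) is an assumption.

```python
import hashlib
from pathlib import Path


def derive_task_key(key_path):
    """Hash the key file's contents with SHA-256 to get a 32-byte key.

    Hypothetical helper: assumes the raw bytes of the file are hashed as-is.
    """
    contents = Path(key_path).read_bytes()
    if not contents:
        # The key file may be any size, but it cannot be empty.
        raise ValueError("encryption key file must not be empty")
    return hashlib.sha256(contents).digest()
```

Because any non-empty file hashes to exactly 32 bytes, the file itself does not need to be a particular length to serve as the source of a 256-bit AES key.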

Another option for specifying the key uses two environment variables:

ENABLE_TASK_ENCRYPTION can be set to true or false.
TASK_ENCRYPTION_KEY specifies the path of the file that contains the key.

If ENABLE_TASK_ENCRYPTION is set to true but TASK_ENCRYPTION_KEY is not set, Okera attempts to use the JWT_PRIVATE_KEY as the encryption key if it is present. If it is not present, an error occurs.
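The fallback order for the two environment variables can be sketched as follows. The function name `resolve_task_encryption_key` is hypothetical, and two details are assumptions not stated above: that encryption is off when ENABLE_TASK_ENCRYPTION is unset, and that the value is compared case-insensitively.

```python
def resolve_task_encryption_key(env):
    """Return (source_name, value) for the key to use, or None if encryption is off.

    Hypothetical helper mirroring the documented fallback order.
    """
    # Assumption: an unset variable means encryption is disabled.
    enabled = env.get("ENABLE_TASK_ENCRYPTION", "false").strip().lower() == "true"
    if not enabled:
        return None
    # Preferred source: the explicitly configured key file path.
    if env.get("TASK_ENCRYPTION_KEY"):
        return ("TASK_ENCRYPTION_KEY", env["TASK_ENCRYPTION_KEY"])
    # Fallback: reuse the JWT private key if it is present.
    if env.get("JWT_PRIVATE_KEY"):
        return ("JWT_PRIVATE_KEY", env["JWT_PRIVATE_KEY"])
    # Neither source is available, so this is an error.
    raise RuntimeError("task encryption is enabled but no encryption key is configured")
```

Passing `os.environ` as `env` would evaluate the current process environment; a plain dictionary works equally well for checking a planned configuration.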

Finally, when integrating with Amazon EMR, you can add the --local-worker-encryption-key argument to odas-emr-bootstrap.sh to set the path to the encryption key used on EMR. This argument does not apply to Dataproc environments.

Deferred URI Signing Settings for nScale

An optional performance-tuning nScale setting called recordservice.task.plan.defer-signing-urls indicates whether presigning URIs in all tasks should be deferred for nScale requests initiated from Spark and Hive. Valid values are true (defer presigning URIs in requests) and false (continue presigning URIs in requests). The default is false.

The following server-side RS_ARGS options can be used to configure deferred URI signing for nScale:

Use defer_signed_urls_initial_num_tasks to specify the default number of nScale tasks for which deferred signing should be performed. The default is 64.
Use defer_signed_urls_initial_percent_tasks to specify the percentage of nScale tasks for which deferred signing should be performed. The default is 25%.

Okera defers signing for either the number of tasks specified by defer_signed_urls_initial_num_tasks or for the percentage of tasks specified by defer_signed_urls_initial_percent_tasks, whichever is lower.
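As a worked example of the "whichever is lower" rule, the sketch below computes how many tasks would have signing deferred for a given request. The function name `deferred_task_count` is hypothetical, and rounding the percentage down to a whole task is an assumption.

```python
def deferred_task_count(total_tasks, initial_num_tasks=64, initial_percent_tasks=25):
    """Return the lower of the fixed task count and the percentage of all tasks.

    Defaults mirror the documented defaults (64 tasks, 25 percent).
    Assumption: the percentage is rounded down to a whole number of tasks.
    """
    by_percent = total_tasks * initial_percent_tasks // 100
    return min(initial_num_tasks, by_percent)
```

With the defaults, a request of 1000 tasks is limited by the fixed count (64 is lower than 250), while a request of 100 tasks is limited by the percentage (25 is lower than 64).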

All requests to refresh tasks containing presigned URIs are logged in the audit log.

Bootstrap Changes for nScale

To enable nScale, add some extra flags to your Amazon EMR Node Bootstrap script or your Dataproc metadata.

The following options are available.

Option	Supported in	Description
`--external-objects-to-container`	EMR	Reference objects stored in Amazon S3 as Okera configuration parameters. Okera pulls the objects referenced in the configuration parameters and mounts them in the nScale container, making the Amazon S3 paths available to Okera for processing. For example, this is helpful when configuring the SSL certificate and key required to start the OkeraEnsemble Amazon EMR access proxy in TLS/SSL mode: `--external-objects-to-container SSL_CERTIFICATE_FILE=s3://bucket/certificate-object, SSL_KEY_FILE=s3://bucket/key-object`
`--init <initScript>`	EMR	The Amazon S3 path of the `init` script to run as part of bootstrap. This is useful for setting environment variables such as `HTTP_PROXY` and `OKERA_BITS_REGION`.
`--local-worker-args <space-separated arguments>`	EMR	Use this to pass `RS_ARGS` to the nScale worker.
`--local-worker-audit-dir <location>`	EMR Dataproc	The Amazon S3 URI location or local storage location (for Dataproc) to which the container's audit logs should be uploaded. With that, one way to call the script now becomes: `2.18.1 --planner-hostports <hostport> --local-worker-webui-port <webui-port> --local-worker-port <worker-port> hive presto spark-2.x`
`--local-worker-encryption-key`	EMR	Specifies the worker encryption key. See Encryption Settings for nScale.
`--local-worker-env-vars`	EMR	Environment variables for the nScale worker. These are supplied in the following format (space-separated with a `-e` before each key-value pair): `-e envKey1=value -e envKey2=value`.
`--local-worker-log-dir <location>`	EMR Dataproc	The Amazon S3 URI location or local storage location (for Dataproc) to which the container's logs should be uploaded.
`--local-worker-port <port>`	EMR Dataproc	The port of the nScale local worker. This option should be the last option set, right before the `<list of components>`.
`--local-worker-version <version>`	EMR Dataproc	By default, nScale workers start up at version 2.18.1. This option allows you to bootstrap with a different version of the nScale Okera workers.
`--local-worker-webui-port <port>`	EMR	The port on which the nScale worker's debug UI is exposed. By default, the UI is not exposed.
`--planner-hostports <hostports>`	EMR Dataproc	Link to the `cerebro_planner:planner` endpoint.
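The ordering constraint in the table (the local worker port goes last, right before the component list) can be sketched as a small helper that assembles the bootstrap arguments in the same order as the sample invocation shown in the table. The function name `build_bootstrap_args` is hypothetical, it covers only a few of the options, and the values used with it are placeholders.

```python
def build_bootstrap_args(version, planner_hostports, worker_port, components, webui_port=None):
    """Assemble bootstrap arguments with --local-worker-port placed last.

    Hypothetical helper: it mirrors the sample invocation in the table
    and does not cover every available option.
    """
    args = [version, "--planner-hostports", planner_hostports]
    if webui_port is not None:
        # The debug UI is only exposed when a port is given.
        args += ["--local-worker-webui-port", str(webui_port)]
    # Last option, immediately before the list of components.
    args += ["--local-worker-port", str(worker_port)]
    args += list(components)
    return args


# Placeholder values, kept in the same angle-bracket style as the table.
print(" ".join(build_bootstrap_args(
    "2.18.1", "<hostport>", "<worker-port>",
    ["hive", "presto", "spark-2.x"], webui_port="<webui-port>")))
```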

Hive Configuration Changes for nScale

Note: Hive configuration is only supported for nScale in Amazon EMR environments. It is not supported for nScale in Dataproc environments.

To interact with the nScale worker, Hive requires that the recordservice.workers.local-port configuration setting be specified. In hive-site, specify the recordservice.workers.local-port key, with the local worker port (<local-worker-port>) as its value. With this change, the hive-site.xml configuration settings look like this:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.fetch.task.conversion": "minimal",
      "hive.metastore.rawstore.impl": "com.cerebro.hive.metastore.CerebroObjectStore",
      "recordservice.planner.hostports": "<planner-host>:<planner-port>",
      "recordservice.workers.local-port": "<local-worker-port>"
    }
  }
]
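If the configuration is generated by a script rather than edited by hand, serializing it with a JSON library avoids syntax slips such as a missing colon or comma. The sketch below builds the same hive-site classification; the function name `hive_site_classification` is hypothetical, and the property names and values are taken from the settings above.

```python
import json


def hive_site_classification(planner_hostports, local_worker_port):
    """Build the hive-site classification with the nScale local worker port set."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.fetch.task.conversion": "minimal",
                "hive.metastore.rawstore.impl": "com.cerebro.hive.metastore.CerebroObjectStore",
                "recordservice.planner.hostports": planner_hostports,
                # Configuration property values are strings, so convert the port.
                "recordservice.workers.local-port": str(local_worker_port),
            },
        }
    ]


print(json.dumps(
    hive_site_classification("<planner-host>:<planner-port>", "<local-worker-port>"),
    indent=2))
```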

Spark Configuration Changes for nScale

Note: Spark configuration changes are only supported for nScale in Amazon EMR environments. They are not necessary for nScale in Dataproc environments.

To interact with the nScale worker, Spark requires that two configuration settings be specified:

In spark-defaults, specify the spark.recordservice.workers.local-port key, with the local worker port number (<local-worker-port>) as its value.
In spark-hive-site, specify the recordservice.workers.local-port key, with the local worker port number (<local-worker-port>) as its value.

With these updates, the configuration settings become:

[
  {
    "Classification":"spark-defaults",
    "Properties": {
       "spark.recordservice.planner.hostports":"odas-planner-1.internal.net:12050",
       "spark.recordservice.workers.local-port":"<local-worker-port>"
     }
  },
  {
    "Classification":"spark-hive-site",
    "Properties":{
      "recordservice.planner.hostports":"odas-planner-1.internal.net:12050",
      "recordservice.workers.local-port":"<local-worker-port>"
    }
  }
]
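Because Spark needs the port in two places, it is easy to set one key and forget the other. The following sketch checks a configuration document for both settings before it is submitted; the function name `missing_local_port_settings` is hypothetical, and the check only looks at the two keys named above.

```python
import json

# The local worker port key expected in each Spark classification.
REQUIRED_KEYS = {
    "spark-defaults": "spark.recordservice.workers.local-port",
    "spark-hive-site": "recordservice.workers.local-port",
}


def missing_local_port_settings(config_json):
    """Return a list of 'classification: key' entries that are not set."""
    configs = json.loads(config_json)
    properties = {c["Classification"]: c.get("Properties", {}) for c in configs}
    missing = []
    for classification, key in REQUIRED_KEYS.items():
        if key not in properties.get(classification, {}):
            missing.append(f"{classification}: {key}")
    return missing
```

An empty result means both settings are present. Parsing the document with `json.loads` also surfaces syntax errors before the cluster is launched.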