nScale Enforcement Fleet Workers¶
Okera supports running Okera Enforcement Fleet worker nodes collocated with Amazon EMR or Dataproc worker nodes (Okera nScale). The steps described below for starting up and configuring nScale workers need to be followed in addition to the steps above for standard Okera-integrated Amazon EMR or Dataproc environments.
We can break the steps down as follows:
- Review encryption settings for nScale.
- Review deferred URI signing settings for nScale.
- Make bootstrap changes for nScale. The updated bootstrap will include the Presto configuration for Amazon EMR with nScale Okera workers.
- Make Hive configuration changes for nScale.
- Make Spark configuration changes for nScale.
Encryption Settings for nScale¶
You can encrypt and decrypt tasks using the Advanced Encryption Standard (AES). To specify the key, set the encryption_key_path option for RS_ARGS in Okera's configuration file to the path of the file that contains the key. The file can be any size, but it cannot be empty. Okera applies SHA-256 to the file contents to derive a 32-byte (256-bit) key.
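The key derivation described above can be illustrated with a minimal Python sketch. It only shows the general technique (SHA-256 over the file contents yields a 32-byte key); the function names are hypothetical and this is not Okera's actual implementation.

```python
import hashlib
from pathlib import Path


def key_from_contents(contents: bytes) -> bytes:
    """Hash arbitrary key material with SHA-256 to obtain a 32-byte key."""
    if not contents:
        # The key file can be any size, but it cannot be empty.
        raise ValueError("key file must not be empty")
    return hashlib.sha256(contents).digest()


def derive_key(key_file: str) -> bytes:
    """Read the key file (hypothetical helper) and derive the AES key from it."""
    return key_from_contents(Path(key_file).read_bytes())
```

Because the digest is always 32 bytes, the size of the key file does not affect the size of the resulting key.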
Alternatively, you can specify the key using two environment variables:
- ENABLE_TASK_ENCRYPTION can be set to true or false.
- TASK_ENCRYPTION_KEY specifies the path of the file that contains the key.
If ENABLE_TASK_ENCRYPTION is set to true but TASK_ENCRYPTION_KEY is not set, Okera attempts to use the JWT_PRIVATE_KEY as the encryption key if it is present. If it is not present, an error occurs.
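The fallback order for the environment variables can be sketched as follows. This is an illustration of the rules stated above, not Okera's code: the function name is hypothetical, and treating an unset ENABLE_TASK_ENCRYPTION as false and comparing the value case-insensitively are assumptions.

```python
from typing import Mapping, Optional


def resolve_task_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the key source to use, or None when task encryption is disabled."""
    # Assumption: an unset variable means "false"; comparison ignores case.
    if env.get("ENABLE_TASK_ENCRYPTION", "false").lower() != "true":
        return None
    # An explicitly configured key file takes precedence.
    if env.get("TASK_ENCRYPTION_KEY"):
        return env["TASK_ENCRYPTION_KEY"]
    # Otherwise fall back to the JWT private key, if one is configured.
    if env.get("JWT_PRIVATE_KEY"):
        return env["JWT_PRIVATE_KEY"]
    raise RuntimeError(
        "ENABLE_TASK_ENCRYPTION is true, but neither TASK_ENCRYPTION_KEY "
        "nor JWT_PRIVATE_KEY is set"
    )
```

For example, `resolve_task_key({"ENABLE_TASK_ENCRYPTION": "true", "JWT_PRIVATE_KEY": "/etc/jwt.pem"})` returns the JWT key value, while the same call without JWT_PRIVATE_KEY raises an error. The path `/etc/jwt.pem` is a placeholder.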
Finally, when integrating with Amazon EMR, you can add the --local-worker-encryption-key argument to odas-emr-bootstrap.sh to set the path to the encryption key used on EMR. This argument does not apply to Dataproc environments.
Deferred URI Signing Settings for nScale¶
An optional performance-tuning setting for nScale, recordservice.task.plan.defer-signing-urls, indicates whether presigning URIs in all tasks should be deferred for nScale requests initiated from Spark and Hive. Valid values are true (defer presigning URIs in requests) and false (continue presigning URIs in requests). The default is false.
The following server-side RS_ARGS options can be used to configure deferred URI signing for nScale:
- Use defer_signed_urls_initial_num_tasks to specify the default number of nScale tasks for which deferred signing should be performed. The default is 64.
- Use defer_signed_urls_initial_percent_tasks to specify the percentage of nScale tasks for which deferred signing should be performed. The default is 25%.
Okera defers signing for either the number of tasks specified by defer_signed_urls_initial_num_tasks or the percentage of tasks specified by defer_signed_urls_initial_percent_tasks, whichever is lower.
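As a worked illustration of the "whichever is lower" rule, the sketch below computes the number of deferred tasks for a given total. The function name is hypothetical, and rounding the percentage down to a whole task is an assumption; the source does not say how Okera rounds.

```python
def deferred_task_count(
    total_tasks: int,
    initial_num_tasks: int = 64,       # default of defer_signed_urls_initial_num_tasks
    initial_percent_tasks: int = 25,   # default of defer_signed_urls_initial_percent_tasks
) -> int:
    """Return the lower of the fixed task count and the percentage of tasks."""
    # Assumption: the percentage is rounded down to a whole number of tasks.
    by_percent = total_tasks * initial_percent_tasks // 100
    return min(initial_num_tasks, by_percent)
```

With the defaults, a request with 100 tasks is limited by the percentage (25 tasks), while a request with 1,000 tasks is limited by the fixed count (64 tasks).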
All requests to refresh tasks containing presigned URIs are logged in the audit log.
Bootstrap Changes for nScale¶
To enable nScale, add some extra flags to your Amazon EMR Node Bootstrap script or your Dataproc metadata.
The following options are available.
Option | Supported in | Description |
---|---|---|
--external-objects-to-container | EMR | Reference objects stored in Amazon S3 as Okera configuration parameters. Okera pulls the objects referenced in the configuration parameters and mounts them in the nScale container, making the Amazon S3 paths available to Okera for processing. For example, this is helpful when configuring the SSL certificate and key required to start the OkeraEnsemble Amazon EMR access proxy in TLS/SSL mode: --external-objects-to-container SSL_CERTIFICATE_FILE=s3://bucket/certificate-object, SSL_KEY_FILE=s3://bucket/key-object |
--init <initScript> | EMR | The Amazon S3 path of an init script to run as part of bootstrap. This is useful for setting environment variables such as HTTP_PROXY and OKERA_BITS_REGION. |
--local-worker-args <space-separated arguments> | EMR | Use this to pass RS_ARGS to the nScale worker. |
--local-worker-audit-dir <location> | EMR, Dataproc | The Amazon S3 URI location or local storage location (for Dataproc) to which the container's audit logs should be uploaded. With that, one way to call the script now becomes: 2.16.0 --planner-hostports <hostport> --local-worker-webui-port <webui-port> --local-worker-port <worker-port> hive presto spark-2.x |
--local-worker-encryption-key | EMR | Specifies the worker encryption key. See Encryption Settings for nScale. |
--local-worker-env-vars | EMR | Environment variables for the nScale worker. These are supplied in the following format (space-separated with a -e before each key-value pair): -e envKey1=value -e envKey2=value. |
--local-worker-log-dir <location> | EMR, Dataproc | The Amazon S3 URI location or local storage location (for Dataproc) to which the container's logs should be uploaded. |
--local-worker-port <port> | EMR, Dataproc | This option should be the last option set, right before the <list of components>. |
--local-worker-version <version> | EMR, Dataproc | By default, nScale workers start up with 2.16.0 workers. This option allows you to bootstrap with a custom version of nScale Okera workers. |
--local-worker-webui-port <port> | EMR | The port on which the nScale worker's debug UI is exposed. By default, the UI is not exposed. |
--planner-hostports <hostports> | EMR, Dataproc | Link to the cerebro_planner:planner endpoint. |
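Two formatting rules from the table lend themselves to a short sketch: the -e prefix for each environment variable, and placing --local-worker-port last, right before the list of components. The helper names, host name, and port numbers below are hypothetical, and how the resulting arguments are passed to the bootstrap script depends on your environment.

```python
from typing import Dict, List


def env_var_args(env: Dict[str, str]) -> str:
    """Format variables for --local-worker-env-vars as '-e key=value' pairs."""
    return " ".join(f"-e {key}={value}" for key, value in env.items())


def bootstrap_args(
    options: Dict[str, str], worker_port: int, components: List[str]
) -> List[str]:
    """Order arguments so --local-worker-port comes last, before the components."""
    args: List[str] = []
    for flag, value in options.items():
        args += [flag, value]
    # The worker port is appended after every other option.
    args += ["--local-worker-port", str(worker_port)]
    # The component list (for example hive, presto, spark-2.x) always goes last.
    return args + components


if __name__ == "__main__":
    # Placeholder host and ports, for illustration only.
    opts = {"--planner-hostports": "planner.example.internal:12050"}
    print(" ".join(bootstrap_args(opts, 13050, ["hive", "presto", "spark-2.x"])))
```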
Hive Configuration Changes for nScale¶
Note: Hive configuration is only supported for nScale in Amazon EMR environments. It is not supported for nScale in Dataproc environments.
To interact with the nScale worker, Hive requires that the recordservice.workers.local-port configuration setting be specified.
In hive-site, specify the recordservice.workers.local-port key, with the local worker port (<local-worker-port>) as its value. With this change, the hive-site.xml configuration settings look like this:
[
{
"Classification": "hive-site",
"Properties": {
"hive.fetch.task.conversion": "minimal",
"hive.metastore.rawstore.impl": "com.cerebro.hive.metastore.CerebroObjectStore",
"recordservice.planner.hostports": "<planner-host>:<planner-port>",
"recordservice.workers.local-port": "<local-worker-port>"
}
}
]
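If you generate this configuration programmatically, a minimal Python sketch such as the following can help avoid syntax slips in hand-edited JSON. The function name and the sample host and port values are hypothetical; the property names are the ones shown above.

```python
import json


def hive_site_config(planner_hostport: str, local_worker_port: int) -> str:
    """Build the hive-site classification for nScale as a JSON string."""
    config = [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.fetch.task.conversion": "minimal",
                "hive.metastore.rawstore.impl": "com.cerebro.hive.metastore.CerebroObjectStore",
                "recordservice.planner.hostports": planner_hostport,
                # Property values are strings, so the port is converted.
                "recordservice.workers.local-port": str(local_worker_port),
            },
        }
    ]
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Placeholder host and port, for illustration only.
    print(hive_site_config("planner.example.internal:12050", 13050))
```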
Spark Configuration Changes for nScale¶
Note: Spark configuration changes are only supported for nScale in Amazon EMR environments. They are not necessary for nScale in Dataproc environments.
To interact with the nScale worker, Spark requires that two configuration settings be specified:
- In spark-defaults, specify the spark.recordservice.workers.local-port key, with the local worker port number (<local-worker-port>) as its value.
- In spark-hive-site, specify the recordservice.workers.local-port key, with the local worker port number (<local-worker-port>) as its value.
With these updates, the configuration settings become:
[
{
"Classification":"spark-defaults",
"Properties": {
"spark.recordservice.planner.hostports":"odas-planner-1.internal.net:12050",
"spark.recordservice.workers.local-port":"<local-worker-port>"
}
},
{
"Classification":"spark-hive-site",
"Properties":{
"recordservice.planner.hostports":"odas-planner-1.internal.net:12050",
"recordservice.workers.local-port":"<local-worker-port>"
}
}
]
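Before submitting a configuration like the one above, it can be useful to confirm that both classifications carry their local-port key. The following is a minimal sketch of such a check; the function and constant names are hypothetical, and it is not part of Okera's tooling.

```python
import json
from typing import List

# Key required in each classification for Spark to reach the nScale worker.
REQUIRED_KEYS = {
    "spark-defaults": "spark.recordservice.workers.local-port",
    "spark-hive-site": "recordservice.workers.local-port",
}


def missing_local_port_keys(config_json: str) -> List[str]:
    """Return the classifications that are absent or lack the local-port key."""
    found = {
        entry["Classification"]: entry.get("Properties", {})
        for entry in json.loads(config_json)
    }
    return [
        name
        for name, key in REQUIRED_KEYS.items()
        if key not in found.get(name, {})
    ]
```

An empty list means both settings are present. Because the input is parsed with `json.loads`, malformed JSON (such as a missing colon) is also reported as an error.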