
OkeraFS Deployment on AWS S3 (Preview Feature)

This document describes the deployment and use of OkeraFS in AWS S3 environments. The following diagram depicts the processing flow for OkeraFS with AWS S3.

Figure: OkeraFS with AWS S3

Credential Processing

The AWS CLI requires a JSON web token (JWT) specified in one of the following ways:

  • Specify the JWT in the token field of ~/.aws/config.

  • Specify token_source = authserver in ~/.aws/config to indicate that an authentication server will be used.

  • Specify the JWT in the user's $HOME/.okera/token setting.
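Whichever option is used, it can help to inspect a token before placing it in ~/.aws/config or $HOME/.okera/token. The following Python sketch decodes the header and payload of a JWT without verifying its signature. It relies only on the standard JWT layout (three base64url segments separated by dots). The function names are illustrative and are not part of Okera or the AWS CLI, and the claims in your tokens may differ from the sample used in the usage note.

```python
import base64
import json


def decode_jwt_segment(segment):
    """Decode one base64url-encoded JWT segment (header or payload) into a dict."""
    # JWTs omit base64 padding, so restore it before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded).decode("utf-8"))


def inspect_jwt(token):
    """Return (header, payload) for a JWT. The signature is NOT verified."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    return decode_jwt_segment(parts[0]), decode_jwt_segment(parts[1])
```

For example, `header, payload = inspect_jwt(open("/home/user_a/.okera/token").read())` (a hypothetical path) returns two dictionaries. This is only a readability check; it says nothing about whether Okera will accept the token.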

The following sequence of events occurs:

  1. When the CLI starts, it calls the refresh function (refresh()) for the Okera credentials provider. This uses the user name to obtain the user’s JWT from the authentication server (or looks for the token in ~/.aws/config).

  2. After the JWT is obtained, the credentials provider connects to the Okera REST server and retrieves an AWS-like access key and secret key. These are not valid AWS credentials, but are automatically generated by Okera. They are unique to the user and conform to the alphanumeric syntax required to run the AWS S3 CLI.

    The credentials provider returns the AWS-like credentials to the AWS CLI.

  3. The S3 command or request is then routed to S3 using the Okera access proxy.

  4. The access proxy uses the AWS-like credentials to authenticate the user with Okera and verifies (using the Okera Planner) that the user is authorized to issue their command or request.

  5. If the user is authorized, the access proxy re-signs the command or request using the Okera system credentials (so the user has the permissions they need) and then sends the command or request to S3 for processing.

System Requirements

The following system requirements must be met.

  • Okera 2.9 or later must be installed.

  • When running on EMR, versions 5.2 and 6.1 are supported.

  • The AWS CLI V1 must be installed, using either Python 2.7 or Python 3.4+.

OkeraFS Installation for AWS

During installation:

  • AWS CLI client hosts receive the Okera plugin for the AWS CLI, a configuration in ~/.aws/config, and a PYTHONPATH entry for the plugin.

  • EMR hosts receive updated Okera JAR files and new Hadoop (/etc/hadoop/core-site.xml) and Spark (/usr/lib/spark/conf/spark-defaults.conf) configurations.

Note: After the installation, be sure to activate the access proxy and port. See Activate the Access Proxy and Port.

To install a new single-node cluster on an EC2 instance, see Deploy Okera on EC2. To upgrade an existing cluster, be sure to read Upgrade Okera first and then use the Okera okctl utility. Enter:

./okctl upgrade --version <okera-version>

Then enter okctl status to obtain the status of the upgrade.

The following instructions explain how to provision Amazon EMR using Okera’s odas-emr-bootstrap.sh provisioning script. Installing with this script requires some modifications to the standard EMR provisioning steps described in Amazon Web Services (AWS) EMR Integration.

  1. Follow the instructions for the EMR node bootstrap script.

  2. Follow the instructions for Setting Up Spark, but add the following property to spark-defaults:

    "spark.extraListeners":"com.okera.recordservice.spark.OkeraSparkListener"
    

  3. Add another map after the one representing spark-defaults, but this one for Hadoop’s core-site.xml:

    {
      "Classification": "core-site",
      "Properties": {
        "fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
        "fs.s3a.aws.credentials.provider": "com.okera.recordservice.hadoop.OkeraCredentialsProvider",
        "recordservice.token-provisioner": "https://<ODAS REST server host>:8083",
        "fs.s3a.connection.ssl.enabled": "true",
        "fs.s3a.s3.client.factory.impl": "com.okera.recordservice.hadoop.OkeraS3ClientFactory",
        "okerafs.default.region": "us-west-2",
        "okerafs.<mybucket>.region": "us-east-1",
        "fs.s3a.endpoint": "https://<ODAS REST server host>:5010",
        "fs.s3a.path.style.access": "true"
      }
    }
    

    Add an okerafs.<mybucket>.region property for each bucket that resides in a region other than the default. The okerafs.default.region property defines the default region. When okerafs.default.region is not defined, the default is the AWS default, us-east-1.

  4. Follow the instructions in Step 3: Set your cluster name and bootstrap scripts, but append the following arguments in the Okera libraries bootstrap script:

    --rest-server-hostports <ODAS REST server host>:8083
    --access-proxy-hostports <ODAS REST server host>:5010
    --aws-cli-autoconfig-omit-users <emr-username1>[,<emr-username2>]...
    --use-access-proxy-aws-cli
    

    The aws-cli-autoconfig-omit-users argument specifies a list of EMR host usernames for which the AWS CLI should not be configured to route through Okera for authorization. When this argument is not specified, only the root user is included in this list. If you specify this argument, be sure to include root in the list, if it is needed. The aws-cli-autoconfig-omit-users argument must be specified before the use-access-proxy-aws-cli argument.

    When the odas-emr-bootstrap.sh script runs with these parameter settings, it installs and configures the Okera AWS CLI plugin and makes the ~/.aws/config file changes necessary to integrate it with the Okera cluster. That file also provides information that the CLI needs to authenticate to Okera (see Credential Processing). If the --authserver <algorithm> arguments are passed to odas-emr-bootstrap.sh, the script sets the token_source value in ~/.aws/config to authserver, and the AWS CLI uses the authentication server as its source for the users’ JSON web tokens. The odas-emr-bootstrap.sh script also sets up some /etc/profile.d scripts that configure the Okera plugin and AWS CLI automatically for new users of a multitenant EMR cluster.
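
The spark-defaults and core-site maps from steps 2 and 3 can also be generated rather than edited by hand. The following Python sketch builds both maps and models the region fallback described in step 3 (bucket-specific region, then okerafs.default.region, then us-east-1). The function names and the host odas.example.com are hypothetical, and region_for_bucket only illustrates the documented lookup order; it is not Okera's implementation. The spark-defaults map here contains only the listener property, so the other properties from Setting Up Spark still need to be merged in.

```python
import json

REST_PORT = 8083    # Okera REST server port used in this document
PROXY_PORT = 5010   # Okera access proxy port used in this document


def build_emr_configurations(rest_host, default_region=None, bucket_regions=None):
    """Return the spark-defaults and core-site maps from steps 2 and 3 as a list."""
    core_site = {
        "fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
        "fs.s3a.aws.credentials.provider":
            "com.okera.recordservice.hadoop.OkeraCredentialsProvider",
        "recordservice.token-provisioner": "https://%s:%d" % (rest_host, REST_PORT),
        "fs.s3a.connection.ssl.enabled": "true",
        "fs.s3a.s3.client.factory.impl":
            "com.okera.recordservice.hadoop.OkeraS3ClientFactory",
        "fs.s3a.endpoint": "https://%s:%d" % (rest_host, PROXY_PORT),
        "fs.s3a.path.style.access": "true",
    }
    if default_region:
        core_site["okerafs.default.region"] = default_region
    # One okerafs.<bucket>.region entry per bucket outside the default region.
    for bucket, region in sorted((bucket_regions or {}).items()):
        core_site["okerafs.%s.region" % bucket] = region
    spark_defaults = {
        # Only the property added in step 2; merge the rest from Setting Up Spark.
        "spark.extraListeners": "com.okera.recordservice.spark.OkeraSparkListener",
    }
    return [
        {"Classification": "spark-defaults", "Properties": spark_defaults},
        {"Classification": "core-site", "Properties": core_site},
    ]


def region_for_bucket(properties, bucket):
    """Model the documented fallback: bucket region, then default, then us-east-1."""
    return properties.get(
        "okerafs.%s.region" % bucket,
        properties.get("okerafs.default.region", "us-east-1"),
    )


if __name__ == "__main__":
    configs = build_emr_configurations(
        "odas.example.com",                      # hypothetical REST server host
        default_region="us-west-2",
        bucket_regions={"mybucket": "us-east-1"},
    )
    print(json.dumps(configs, indent=2))
```

Generating the JSON this way avoids the unbalanced-bracket and missing-comma mistakes that are easy to make when editing the EMR configuration by hand.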

Manually Install the AWS CLI on an EC2 (or EMR) Instance

To install and configure the AWS CLI on an EC2 or EMR instance manually, use the code in the following sections.

In AWS CLI V1 and Python 2.7 (Amazon Linux 2)

  1. Run the following as a user with sudo access to install the CLI plugin:

    INSTALLATION=/usr/lib/okera/python27/site-packages/
    sudo mkdir -p $INSTALLATION
    sudo aws s3 cp s3://okera-release-useast/<version>/client/awscli/awscli_plugin.tar /tmp/awscli_plugin.tar
    sudo tar xf /tmp/awscli_plugin.tar -C $INSTALLATION
    sudo yum -y install python-pip
    sudo pip install -r $INSTALLATION/requirements.txt
    

  2. Set up PYTHONPATH for users on this host to make use of the CLI:

    # Escape $PYTHONPATH so it is expanded at login time, not when this file is written.
    echo "export PYTHONPATH=\$PYTHONPATH:$INSTALLATION" > /tmp/0_okera_cli_setup.sh
    sudo mv /tmp/0_okera_cli_setup.sh /etc/profile.d/0_okera_cli_setup.sh
    

In AWS CLI V1 and Python 3.6+

  1. Install a Python 3 virtual environment and a new AWS CLI V1.

    sudo amazon-linux-extras install python3
    sudo yum groupinstall -y "Development Tools"
    sudo yum install -y python3-devel
    curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
    sudo unzip awscli-bundle.zip
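    PYTHON3="/usr/bin/env python3"
    # AWS_CLI_INSTALL is not defined elsewhere in these steps; set it to your chosen install directory.
    AWS_CLI_INSTALL="<AWS CLI install directory>"
    sudo $PYTHON3 ./awscli-bundle/install -i $AWS_CLI_INSTALL -b /usr/local/bin/aws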
    

  2. Install the Okera plugin.

    INSTALLATION=/usr/lib/okera/python27/site-packages/
    sudo mkdir -p $INSTALLATION
    sudo aws s3 cp s3://okera-release-useast/<version>/client/awscli/awscli_plugin.tar /tmp/awscli_plugin.tar
    sudo tar xvf /tmp/awscli_plugin.tar -C $INSTALLATION
    # Escape $PYTHONPATH so it is expanded at login time, not when this file is written.
    echo "export PYTHONPATH=\$PYTHONPATH:$INSTALLATION" > /tmp/0_okera_cli_setup.sh
    sudo mv /tmp/0_okera_cli_setup.sh /etc/profile.d/0_okera_cli_setup.sh
    

  3. Install PyOkera and its dependencies.

    PIP="sudo $AWS_CLI_INSTALL/bin/pip"
    $PIP install -r $INSTALLATION/requirements.txt
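
After either manual installation, one way to confirm that the plugin directory is on PYTHONPATH is to ask Python whether it can locate the plugin module. The sketch below assumes the module is named okera_fs_aws, as in the [plugins] entry of ~/.aws/config; the helper name is illustrative. Run it with the same Python interpreter that the AWS CLI uses (for the Python 3 installation, that is presumably the interpreter under the AWS CLI install directory), after starting a new login shell so the /etc/profile.d script has taken effect.

```python
import importlib.util


def plugin_is_importable(module_name="okera_fs_aws"):
    """Return True if a top-level module with this name can be found on sys.path."""
    return importlib.util.find_spec(module_name) is not None

# Example: print(plugin_is_importable())
```

A result of False usually means that PYTHONPATH does not include the installation directory for the current user or interpreter.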
    

Configure the AWS CLI Plug-In Manually

If you install the AWS CLI plug-in manually (rather than on EMR using the bootstrapping script), you need to configure the AWS CLI to use the plug-in by adding the following to the $HOME/.aws/config for any users who use it. The token_source property indicates whether the user’s token should be retrieved from the Okera authserver if that is set up on the host.

# add to ~/.aws/config
[profile okera]
okera =
    proxy = https://<CDAS rest server host>:5010
    rest = https://<CDAS rest server host>:8083
    token_source = <'authserver' or empty>
    token = <user's JWT if token_source is not 'authserver'>
[plugins]
okera = okera_fs_aws

# example, AWS CLI retrieves token from authserver:
[profile okera]
okera =
    proxy = https://10.1.10.99:5010
    rest = https://10.1.10.99:8083
    token_source = authserver
[plugins]
okera = okera_fs_aws

# example, using explicit token:
[profile okera]
okera =
    proxy = https://10.1.10.99:5010
    rest = https://10.1.10.99:8083
    token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZ (etc)
[plugins]
okera = okera_fs_aws

Setting AWS_PROFILE=okera makes the AWS CLI use OkeraFS by default. Otherwise, each AWS CLI command needs the --profile okera option to route through OkeraFS.
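
A hand-edited ~/.aws/config can be checked before running any AWS CLI command. The following Python sketch reads the nested okera settings from a profile using the standard configparser module, which treats the indented lines under `okera =` as one multi-line value. The function names are illustrative, and describe_token_source assumes a precedence (authserver, then the token field, then $HOME/.okera/token) that this document does not state, so treat that ordering as a guess.

```python
import configparser


def read_okera_settings(config_text, profile="okera"):
    """Return the nested 'okera =' settings of one profile as a dict."""
    parser = configparser.RawConfigParser()
    parser.read_string(config_text)
    section = "profile %s" % profile
    if not parser.has_section(section):
        raise KeyError("no [%s] section found" % section)
    # Indented lines under 'okera =' arrive as a single newline-joined value.
    nested = parser.get(section, "okera", fallback="")
    settings = {}
    for line in nested.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def describe_token_source(settings):
    """Describe where the JWT is expected to come from (precedence is assumed)."""
    if settings.get("token_source") == "authserver":
        return "authserver"
    if settings.get("token"):
        return "token field in ~/.aws/config"
    return "$HOME/.okera/token"


def missing_endpoints(settings):
    """Return the names of required endpoint settings that are absent or empty."""
    return [name for name in ("proxy", "rest") if not settings.get(name)]
```

For example, `read_okera_settings(open(os.path.expanduser("~/.aws/config")).read())` (after `import os`) returns the proxy, rest, and token settings for the okera profile. Pass a file that contains a single [profile okera] section, since configparser rejects duplicate sections by default.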

Activate the Access Proxy and Port (Required)

After OkeraFS is installed, the access proxy must be activated and port 5010 must be opened on the REST server.

  1. Add the following configuration parameter to the YAML configuration file for the cluster:

    REST_SERVER_ENABLE_ACCESS_PROXY: true
    

  2. Enter: okctl update --config=<config file> to apply the new configuration.

  3. Patch the REST server container to open port 5010:

    NEW_PORT='{ "metadata": { "managedFields": [ { "fields": { "f:spec": { "f:ports": { "k:{\"port\":5010,\"protocol\":\"TCP\"}": { ".": null, "f:name": null, "f:nodePort": null, "f:port": null, "f:protocol": null, "f:targetPort": null } } } } } ] } }'
    
    kubectl patch svc cdas-rest-server --patch "$NEW_PORT"
    
    kubectl patch svc cdas-rest-server --type='json' -p='[{"op": "add", "path": "/spec/ports/-", "value":{ "name": "access", "nodePort": 5010, "port": 5010, "protocol": "TCP", "targetPort": 5010 }}]'
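
Once the service is patched, a simple TCP connection attempt from a client host can confirm that port 5010 is reachable. The following Python sketch uses only the standard socket module; the function name is illustrative. It checks network reachability only and does not confirm that the access proxy itself is enabled or healthy.

```python
import socket


def port_is_open(host, port=5010, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Example: port_is_open("<ODAS REST server host>")
```

If this returns False from a client host but True on the REST server host itself, the cause is more likely a security group or firewall rule than the Okera configuration.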
    

Map Okera Access Permissions

To S3 Actions

The following table maps Okera access permissions to S3 actions.

| Okera Access | Supported S3 Actions | Notes |
|---|---|---|
| ALL | All of the actions below | User can perform any of the supported S3 actions for objects/paths under the URI. |
| SELECT | GetObject, HeadObject, CopyObject, PutObject | User can read files, folders, and buckets. Read access is provided for the source when a copy action is requested (CopyObject and PutObject). Note: Verify that the correct privileges have been assigned to perform S3 actions for the URI. For example, if you intend to use a URI to create an external table, be sure that you have SELECT privileges; otherwise, the attempt to create the table will fail. See Access Levels. |
| INSERT | CompleteMultipartUpload, UploadPart, AbortMultipartUpload, CopyObject, PutObject | User can write to files, folders, and buckets. Write access is also provided for the destination when a copy action is requested (CopyObject and PutObject). |
| SHOW | GetBucketLocation, HeadBucket, ListObjects, ListObjectsV2 | User can perform metadata retrieval for files, folders, and buckets. |
| DELETE | DeleteObject | User can delete files. |
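
The same mapping can be kept as a lookup table when scripting permission checks. The following Python sketch transcribes the table above and answers the reverse question: which Okera access levels list a given S3 action. The names are illustrative. For copy actions, SELECT applies to the source and INSERT to the destination, so a result that contains both levels means both may be needed.

```python
# Transcribed from the table above; ALL is handled separately in the lookup.
ACCESS_TO_S3_ACTIONS = {
    "SELECT": {"GetObject", "HeadObject", "CopyObject", "PutObject"},
    "INSERT": {
        "CompleteMultipartUpload",
        "UploadPart",
        "AbortMultipartUpload",
        "CopyObject",
        "PutObject",
    },
    "SHOW": {"GetBucketLocation", "HeadBucket", "ListObjects", "ListObjectsV2"},
    "DELETE": {"DeleteObject"},
}


def access_levels_for(action):
    """Return the Okera access levels whose table row lists the given S3 action."""
    levels = {
        level
        for level, actions in ACCESS_TO_S3_ACTIONS.items()
        if action in actions
    }
    if levels:
        levels.add("ALL")  # ALL covers every supported action
    return levels
```

An empty result means the action does not appear in the table above.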

To AWS CLI Commands

The following table maps AWS CLI commands to Okera permissions.

| CLI Command | Okera Permissions | Equivalent S3 Actions |
|---|---|---|
| aws s3 sync pathA pathB | SELECT pathA, INSERT pathB, SHOW pathB | CopyObject, CopyObject, HeadObject |
| aws s3 cp pathA pathB | SELECT pathA, INSERT pathB | CopyObject, CopyObject |
| aws s3 mv pathA pathB | SELECT pathA, INSERT pathB, SELECT pathA, DELETE pathA | HeadObject, CopyObject, CopyObject, DeleteObject |
| aws s3api copy-object pathA pathB | SELECT copy-source, INSERT key | CopyObject, CopyObject |
| aws s3 ls pathA | SHOW pathA | ListObjects |
| aws s3api create-multipart-upload | INSERT key | CreateMultipartUpload |
| aws s3api complete-multipart-upload | INSERT key | CompleteMultipartUpload |
| aws s3api abort-multipart-upload | INSERT key | AbortMultipartUpload |
| aws s3api head-bucket --bucket pathA | SHOW bucket | HeadBucket |
| aws s3api head-object --bucket bucketA --key pathA | SHOW and SELECT pathA | HeadObject |
| aws s3api list-buckets | SHOW | ListBuckets |
| aws s3api list-multipart-uploads | SHOW | ListMultipartUploads |
| aws s3api list-objects-v2 | SHOW | ListObjectsV2 |
| aws s3api list-parts --key pathA | INSERT pathA | ListParts |
| aws s3api upload-part --key pathA | INSERT pathA | UploadPart |
| aws s3api upload-part-copy --copy-source pathA | SELECT on pathA | UploadPartCopy |
| aws s3api upload-part-copy --key pathA | INSERT pathA | UploadPartCopy |
| aws s3api delete-object | DELETE | DeleteObject |
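
When preparing a role for a particular CLI workflow, the required GRANT statements can be derived from the table above. The following Python sketch covers only the high-level aws s3 commands (sync, cp, mv, ls) and assumes both paths are S3 URIs. The helper name is hypothetical, and the statement format follows the GRANT ... ON URI syntax used in the S3 example in this document.

```python
# (privilege, which path) pairs per command, from the table above.
CLI_PERMISSIONS = {
    "sync": [("SELECT", "source"), ("INSERT", "dest"), ("SHOW", "dest")],
    "cp": [("SELECT", "source"), ("INSERT", "dest")],
    "mv": [("SELECT", "source"), ("INSERT", "dest"), ("DELETE", "source")],
    "ls": [("SHOW", "source")],
}


def grants_for(command, role, source, dest=None):
    """Return GRANT statements covering one 'aws s3 <command>' invocation."""
    paths = {"source": source, "dest": dest}
    statements = []
    for privilege, which in CLI_PERMISSIONS[command]:
        uri = paths[which]
        if uri is None:
            raise ValueError("command %r needs a destination path" % command)
        statements.append(
            "GRANT %s ON URI '%s' TO ROLE %s;" % (privilege, uri, role)
        )
    return statements
```

For example, `grants_for("ls", "emr_user_role", "s3://mybucket/mypath")` yields the single SHOW grant, which an Okera admin can then review and run.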

To Spark Actions

The following table maps Spark actions to Okera access permissions. Equivalent S3 actions are also listed.

| Spark Actions | Okera Permissions | Equivalent S3 Actions |
|---|---|---|
| spark.write.* | INSERT, SELECT, DELETE, SHOW | CopyObject, DeleteObject, GetObject, HeadObject, ListBucket |
| spark.read.* | SHOW, SELECT | HeadObject, GetObject |

S3 Bucket Role Mapping Support

OkeraFS supports the ability to assume secondary roles to read S3 data, with different roles mapped to different buckets. For more information, see Amazon S3 Bucket Role Mapping Support.

S3 Example

This section provides an S3 example.

Grant Access to an S3 URI

DROP ROLE IF EXISTS emr_user_role;
CREATE ROLE emr_user_role;
GRANT ROLE emr_user_role TO GROUP hadoop;
GRANT SHOW ON URI 's3://mybucket/mypath' TO ROLE emr_user_role;

For file access control, the permissions assignable to an S3 URI are:

  • SELECT - read permission, which allows copy-from or spark.read

  • SHOW - list permission, which allows aws s3 ls

  • DELETE - delete permission

  • INSERT - write permission

Use the Roles tab in the Okera UI to configure OkeraFS permissions.

AWS CLI Command Example

aws s3 ls s3://mybucket/mypath/  
aws s3 ls s3://mybucket/anotherpath/  # will fail with Access Denied

Add New Users as Okera/EMR Admin

./okctl users create user_a group_a       # on Okera host
useradd -m user_a                         # on EMR host

Grant Access in Okera

Grant user_a access to s3a://mybucket/user_a_data.

sudo su - user_a                          # on EMR host
aws s3 ls s3://mybucket/user_a_data/      # user_a lists the data

Spark Example

This section provides a Spark example.

1. Grant an EMR User Read Access

In this example, an Okera admin grants an EMR user read access to s3a://controlled_data and write access to s3a://myworkbucket. Then the EMR user, in spark-shell, reads from one object and writes to another.

spark-shell
val df = spark.read.csv("s3a://controlled_data/input.csv");
df.show();
df.write.csv("s3a://myworkbucket/input.csv")

2. Grant a User Access

In this step, we grant user_b access to run a query against structured data, filtering and writing the result set to a new table. We also register some data in Okera (for example, s3://controlled_data/transactions) as table source.transactions. Then in the Okera Workspace, we grant user_b read access to some part of that table.

DROP ROLE IF EXISTS analysts;
CREATE ROLE analysts;
GRANT ROLE analysts TO GROUP user_b;
GRANT SELECT ON TABLE source.transactions TO ROLE analysts;
GRANT ALL ON URI 's3://experimental/workspaces/' TO ROLE analysts;

3. Create a Destination External Table for the Result Set

spark.sql("""CREATE EXTERNAL TABLE experiments.transaction_data (
tnxid INT,
tnxdate DATE,
amount DOUBLE,
userid STRING,
ip_address STRING,
address STRING,
country STRING,
region STRING )
STORED AS PARQUET LOCATION 's3://experimental/workspaces/transaction_data'""")

4. Select Structured Data From the Source Table With a Filter

val df = spark.sql("select tnxid, tnxdate, amount, userid, ip_address, address, country, region from source.transactions where amount < 800")

5. Write to the Table With the Dataframe

df.write.mode("overwrite").parquet("s3a://experimental/workspaces/transaction_data")