
OkeraFS Default Mode Deployment

The following diagram depicts the processing flow for OkeraFS with AWS S3.

OkeraFS with AWS S3

The OkeraFS access proxy:

  1. Services a request from an S3 client that is bound for AWS S3.
  2. Validates the user's permission against Okera's policies to access the data requested.
  3. If permission is validated, modifies the request and re-signs the AWS S3 authorization header using Okera's AWS credentials.

Credential Processing

The AWS CLI requires a JSON Web Token (JWT), specified in one of the following ways:

  • Specify the JWT in the token field of ~/.aws/config.

  • Specify token_source = authserver in ~/.aws/config to indicate that an authentication server will be used.

  • Store the JWT in the user's $HOME/.okera/token file.
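
The three options can be read as a lookup order. The following is a minimal sketch of that idea, not the plugin's actual code: the function name resolve_token, the fetch_from_authserver callback, and the precedence (authserver first, then the token field, then the token file) are assumptions inferred from the configuration template shown under Configure the AWS CLI Plug-In Manually.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_token(
    okera_settings: Mapping[str, str],
    fetch_from_authserver: Callable[[], str],
    token_file: Path,
) -> Optional[str]:
    """Return the user's JWT, or None if no source is configured.

    Assumed precedence (not confirmed by Okera): authserver, then the
    `token` field of ~/.aws/config, then the token file.
    """
    if okera_settings.get("token_source", "").strip() == "authserver":
        # Hypothetical callback standing in for the authentication server.
        return fetch_from_authserver()
    token = okera_settings.get("token", "").strip()
    if token:
        return token
    if token_file.is_file():
        return token_file.read_text().strip() or None
    return None
```

A caller would pass the parsed okera settings from ~/.aws/config and Path.home() / ".okera" / "token" as the file location.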

The following sequence of events occurs:

  1. When the CLI starts, it calls the refresh function (refresh()) for the Okera credentials provider. This uses the user name to obtain the user’s JWT from the authentication server (or looks for the token in ~/.aws/config).

  2. After the JWT is obtained, the credentials provider connects to the Okera REST server and retrieves an AWS-like access key and secret key. These are not valid AWS credentials, but are automatically generated by Okera. They are unique to the user and conform to the alphanumeric syntax required to run the AWS S3 CLI.

    The credentials provider returns the AWS-like credentials to the AWS CLI.

  3. The S3 command or request is then routed to S3 using the Okera access proxy.

  4. The access proxy uses the AWS-like credentials to authenticate the user with Okera and verifies (using the Okera Policy Engine) that the user is authorized to issue their command or request.

  5. If the user is authorized, the access proxy re-signs the command or request using the Okera system credentials (so the user has the permissions they need) and then sends the command or request to S3 for processing.
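
The sequence above can be sketched as a small in-memory simulation. Everything in it is illustrative: the class OkeraSketch, the way AWS-like keys are derived from a hash, and the bare HMAC used for re-signing are assumptions made for the example, and none of them reflects Okera's real implementation. A real proxy would compute an AWS Signature Version 4 authorization header in the last step.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Credentials:
    access_key: str
    secret_key: str


class OkeraSketch:
    """In-memory stand-in for the REST server, access proxy, and policy engine."""

    def __init__(self, system_creds, tokens, grants):
        self._system = system_creds
        self._tokens = tokens        # JWT -> user name
        self._grants = grants        # user name -> set of (action, URI prefix)
        self._users_by_key = {}      # AWS-like access key -> user name

    def issue_credentials(self, jwt: str) -> Credentials:
        """Step 2: exchange a JWT for per-user, AWS-like (not AWS-valid) keys."""
        user = self._tokens[jwt]
        digest = hashlib.sha256(user.encode()).hexdigest().upper()
        creds = Credentials("OKERA" + digest[:15], digest[15:55])
        self._users_by_key[creds.access_key] = user
        return creds

    def proxy(self, creds: Credentials, action: str, uri: str) -> str:
        """Steps 4 and 5: authenticate, authorize, then re-sign."""
        user = self._users_by_key.get(creds.access_key)
        if user is None:
            raise PermissionError("unknown access key")
        allowed = any(
            action == granted and uri.startswith(prefix)
            for granted, prefix in self._grants.get(user, set())
        )
        if not allowed:
            raise PermissionError(f"{user} may not {action} {uri}")
        # Re-sign with the system secret (placeholder for a SigV4 signature).
        message = f"{action}\n{uri}".encode()
        return hmac.new(
            self._system.secret_key.encode(), message, hashlib.sha256
        ).hexdigest()


okera = OkeraSketch(
    system_creds=Credentials("SYSTEMKEY", "system-secret"),
    tokens={"jwt-for-user-a": "user_a"},
    grants={"user_a": {("list", "s3://mybucket/user_a_data/")}},
)
creds = okera.issue_credentials("jwt-for-user-a")
signature = okera.proxy(creds, "list", "s3://mybucket/user_a_data/")
```

The point of the design is that the user's keys never reach S3: they only identify the user to the proxy, and the request that leaves the proxy is signed with the system credentials.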

Installation Steps

During installation:

  • AWS CLI client hosts receive the Okera plugin for the AWS CLI, a configuration entry in ~/.aws/config, and a PYTHONPATH entry that points to the plugin.

  • EMR hosts receive updated Okera JAR files and new Hadoop (/etc/hadoop/core-site.xml) and Spark (/usr/lib/spark/conf/spark-defaults.conf) configurations.

Note: After the installation, be sure to activate the access proxy and port. See Activate the Access Proxy and Port.

To install a new single-node cluster on an EC2 instance, see Deploy Okera on EC2. To upgrade an existing cluster, be sure to read Upgrade Okera first and then use the Okera okctl utility. Enter:

./okctl upgrade --version <okera-version>

Then enter okctl status to obtain the status of the upgrade.

The following instructions explain how to provision Amazon EMR using Okera’s odas-emr-bootstrap.sh provisioning script. Using this script requires some modifications to the standard EMR provisioning steps described in Amazon Web Services (AWS) EMR Integration.

  1. Follow the instructions for the EMR node bootstrap script.

  2. Follow the instructions for Setting Up Spark, but add the following property to spark-defaults:

    "spark.extraListeners":"com.okera.recordservice.spark.OkeraSparkListener"
    

  3. Add another map after the one representing spark-defaults, but this one for Hadoop’s core-site.xml:

    "Classification": "core-site",
    "Properties": {
      "fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
      "fs.s3a.aws.credentials.provider": "com.okera.recordservice.hadoop.OkeraCredentialsProvider",
      "recordservice.token-provisioner": "https://<Okera REST server host>:8083",
      "fs.s3a.connection.ssl.enabled": "true",
      "fs.s3a.s3.client.factory.impl": "com.okera.recordservice.hadoop.OkeraS3ClientFactory",
      "okerafs.default.region": "us-west-2",
      "okerafs.<mybucket>.region": "us-east-1",
      "fs.s3a.endpoint": "https://<Okera REST server host>:5010",
      "fs.s3a.path.style.access": "true"
    }
    

    Add an okerafs.<mybucket>.region property for each bucket that resides in a region other than the default. The okerafs.default.region property defines the default region. When it is not defined, the default is the AWS default, us-east-1.

  4. Follow the instructions in Step 3: Set your cluster name and bootstrap scripts, but append the following arguments in the Okera libraries bootstrap script:

    --rest-server-hostports <Okera REST server host>:8083
    --access-proxy-hostports <Okera REST server host>:5010
    --aws-cli-autoconfig-omit-users <emr-username1>[,<emr-username2>]...
    --use-access-proxy-aws-cli
    

    The aws-cli-autoconfig-omit-users argument specifies a list of EMR host usernames for which the AWS CLI should not be configured to route through Okera for authorization. When this argument is not specified, only the root user is included in this list. If you specify this argument, be sure to include root in the list, if it is needed. The aws-cli-autoconfig-omit-users argument must be specified before the use-access-proxy-aws-cli argument.

    When the odas-emr-bootstrap.sh script runs with the --use-access-proxy-aws-cli setting and the other parameters shown above, it installs and configures the Okera AWS CLI plugin and makes the ~/.aws/config changes needed to integrate it with the Okera cluster. That file also provides the information the CLI needs to authenticate to Okera (see Credential Processing). If the --authserver <algorithm> arguments are passed to odas-emr-bootstrap.sh, token_source is set to authserver in ~/.aws/config, and the AWS CLI uses the authentication server as its source for users’ JSON Web Tokens. The script also sets up /etc/profile.d scripts that configure the Okera plugin and AWS CLI automatically for new users of a multitenant EMR cluster.
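
Two rules from the steps above are easy to get wrong: the region fallback order and the ordering of the bootstrap arguments. The following sketch restates both. The helper names resolve_region and bootstrap_args are hypothetical and are not part of any Okera tooling. The property names, ports, and flags are taken from this page.

```python
from typing import List, Mapping, Sequence

AWS_DEFAULT_REGION = "us-east-1"


def resolve_region(bucket: str, props: Mapping[str, str]) -> str:
    """Per-bucket property wins, then okerafs.default.region, then us-east-1."""
    return props.get(
        f"okerafs.{bucket}.region",
        props.get("okerafs.default.region", AWS_DEFAULT_REGION),
    )


def bootstrap_args(rest_host: str, omit_users: Sequence[str] = ()) -> List[str]:
    """Build the extra bootstrap arguments in the required order."""
    args = [
        "--rest-server-hostports", f"{rest_host}:8083",
        "--access-proxy-hostports", f"{rest_host}:5010",
    ]
    if omit_users:
        # Must appear before --use-access-proxy-aws-cli.
        args += ["--aws-cli-autoconfig-omit-users", ",".join(omit_users)]
    args.append("--use-access-proxy-aws-cli")
    return args


props = {
    "okerafs.default.region": "us-west-2",
    "okerafs.mybucket.region": "us-east-1",
}
mybucket_region = resolve_region("mybucket", props)      # per-bucket override
other_region = resolve_region("otherbucket", props)      # falls back to default
args = bootstrap_args("okera.example.com", ["root", "hadoop"])
```

Because specifying the omit-users list replaces the built-in default, the example passes root explicitly.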

Manually Install the AWS CLI on an EC2 (or EMR) Instance

To install and configure the AWS CLI on an EC2 or EMR instance manually, use the code in the following sections.

In AWS CLI V1 and Python 2.7 (Amazon Linux 2)

  1. Run the following as a user with sudo access to install the CLI plugin:

    INSTALLATION=/usr/lib/okera/python27/site-packages/
    sudo mkdir -p $INSTALLATION
    sudo aws s3 cp s3://okera-release-useast/<version>/client/awscli/awscli_plugin.tar /tmp/awscli_plugin.tar
    sudo tar xf /tmp/awscli_plugin.tar -C $INSTALLATION
    sudo yum -y install python-pip
    sudo pip install -r $INSTALLATION/requirements.txt
    

  2. Set up PYTHONPATH for users on this host to make use of the CLI:

    echo "export PYTHONPATH=\$PYTHONPATH:$INSTALLATION" > /tmp/0_okera_cli_setup.sh
    sudo mv /tmp/0_okera_cli_setup.sh /etc/profile.d/0_okera_cli_setup.sh
    

In AWS CLI V1 and Python 3.6+

  1. Install a Python 3 virtual environment and a new AWS CLI V1.

    sudo amazon-linux-extras install python3
    sudo yum groupinstall -y "Development Tools"
    sudo yum install -y python3-devel
    curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
    sudo unzip awscli-bundle.zip
    PYTHON3="/usr/bin/env python3"
    # AWS_CLI_INSTALL must be set to the directory in which to install the AWS CLI
    sudo $PYTHON3 ./awscli-bundle/install -i $AWS_CLI_INSTALL -b /usr/local/bin/aws
    

  2. Install the Okera plugin.

    INSTALLATION=/usr/lib/okera/python27/site-packages/
    sudo mkdir -p $INSTALLATION
    sudo aws s3 cp s3://okera-release-useast/<version>/client/awscli/awscli_plugin.tar /tmp/awscli_plugin.tar
    sudo tar xvf /tmp/awscli_plugin.tar -C $INSTALLATION
    echo "export PYTHONPATH=\$PYTHONPATH:$INSTALLATION" > /tmp/0_okera_cli_setup.sh
    sudo mv /tmp/0_okera_cli_setup.sh /etc/profile.d/0_okera_cli_setup.sh
    

  3. Install PyOkera and its dependencies.

    PIP="sudo $AWS_CLI_INSTALL/bin/pip"
    $PIP install -r $INSTALLATION/requirements.txt
    

Configure the AWS CLI Plug-In Manually

If you install the AWS CLI plug-in manually (rather than on EMR using the bootstrapping script), you need to configure the AWS CLI to use the plug-in by adding the following to the $HOME/.aws/config for any users who use it. The token_source property indicates whether the user’s token should be retrieved from the Okera authserver if that is set up on the host.

# add to ~/.aws/config
[profile okera]
okera =
    proxy = https://<Okera REST server host>:5010
    rest = https://<Okera REST server host>:8083
    token_source = <'authserver' or empty>
    token = <user's JWT if token_source is not 'authserver'>
[plugins]
okera = okera_fs_aws

# example, AWS CLI retrieves token from authserver:
[profile okera]
okera =
    proxy = https://10.1.10.99:5010
    rest = https://10.1.10.99:8083
    token_source = authserver
[plugins]
okera = okera_fs_aws

# example, using explicit token:
[profile okera]
okera =
    proxy = https://10.1.10.99:5010
    rest = https://10.1.10.99:8083
    token = eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZ (etc)
[plugins]
okera = okera_fs_aws

Setting AWS_PROFILE=okera causes the AWS CLI to use OkeraFS by default. Otherwise, each AWS CLI command needs the --profile okera option to route through OkeraFS.
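
A hand-edited ~/.aws/config is easy to get subtly wrong, so it can help to read the file back before using it. The following sketch does this with Python's standard configparser, which treats the indented lines under okera = as continuation lines of a single value. The function name read_okera_settings is hypothetical, and the plugin itself may parse the file differently.

```python
import configparser

EXAMPLE_CONFIG = """\
[profile okera]
okera =
    proxy = https://10.1.10.99:5010
    rest = https://10.1.10.99:8083
    token_source = authserver
[plugins]
okera = okera_fs_aws
"""


def read_okera_settings(config_text: str, profile: str = "okera") -> dict:
    """Return the nested okera settings plus the registered plugin name."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The nested block arrives as one multi-line string value.
    nested = parser.get(f"profile {profile}", "okera", fallback="")
    settings = {}
    for line in nested.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    settings["plugin"] = parser.get("plugins", "okera", fallback=None)
    return settings


settings = read_okera_settings(EXAMPLE_CONFIG)
missing = [key for key in ("proxy", "rest") if key not in settings]
```

An empty missing list and a plugin value of okera_fs_aws indicate that the file has the shape shown in the examples above.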

Activate the Access Proxy and Port (Required)

After OkeraFS is installed, the access proxy must be activated and port 5010 must be opened on the REST server.

  1. Add the following configuration parameter to the YAML configuration file for the cluster:

    REST_SERVER_ENABLE_ACCESS_PROXY: true
    

  2. Enter: okctl update --config=<config file> to apply the new configuration.

  3. Patch the REST server container to open port 5010:

    NEW_PORT='{ "metadata": { "managedFields": [ { "fields": { "f:spec": { "f:ports": { "k:{\"port\":5010,\"protocol\":\"TCP\"}": { ".": null, "f:name": null, "f:nodePort": null, "f:port": null, "f:protocol": null, "f:targetPort": null } } } } } ] } }'
    
    kubectl patch svc cdas-rest-server --patch "$NEW_PORT"
    
    kubectl patch svc cdas-rest-server --type='json' -p='[{"op": "add", "path": "/spec/ports/-", "value":{ "name": "access", "nodePort": 5010, "port": 5010, "protocol": "TCP", "targetPort": 5010 }}]'
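
The inline JSON in the second kubectl patch command is awkward to quote by hand. One option, sketched here under the assumption that Python 3.8 or later is available on the host, is to build the JSON Patch payload programmatically and print a safely quoted command. The helper name add_port_patch is hypothetical. The service name, port, and field values are the ones used above.

```python
import json
import shlex


def add_port_patch(port: int = 5010, name: str = "access") -> str:
    """Return a JSON Patch document that appends a port to /spec/ports."""
    operations = [{
        "op": "add",
        "path": "/spec/ports/-",
        "value": {
            "name": name,
            "nodePort": port,
            "port": port,
            "protocol": "TCP",
            "targetPort": port,
        },
    }]
    return json.dumps(operations)


patch = add_port_patch()
command = ["kubectl", "patch", "svc", "cdas-rest-server", "--type=json", "-p", patch]
# Prints a shell-quoted command line; it does not run kubectl.
print(shlex.join(command))
```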
    

S3 Example

This section provides an S3 example.

Grant Access to an S3 URI

DROP ROLE IF EXISTS emr_user_role;
CREATE ROLE emr_user_role;
GRANT ROLE emr_user_role TO GROUP hadoop;
GRANT SHOW ON URI 's3://mybucket/mypath' TO ROLE emr_user_role;

For file access control, the permissions assignable to an S3 URI are:

  • SELECT - read permission allows copy-from or spark.read

  • SHOW - list permission which allows aws s3 ls

  • DELETE - delete permission

  • INSERT - write permission

Use the Roles tab in the Okera UI to configure OkeraFS permissions.
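
The permission list above can be read as a mapping from file operations to required privileges. The following sketch illustrates that mapping. It is not Okera's policy engine: the operation names, the assumption that a grant on a URI covers everything beneath it, and the assumption that ALL implies the other four privileges are all made for the sake of the example.

```python
REQUIRED_PRIVILEGE = {
    "list": "SHOW",      # aws s3 ls
    "read": "SELECT",    # copy-from, spark.read
    "write": "INSERT",
    "delete": "DELETE",
}


def covers(granted_uri: str, uri: str) -> bool:
    """Assumption: a grant applies to the URI itself and to paths below it."""
    base = granted_uri.rstrip("/")
    return uri.rstrip("/") == base or uri.startswith(base + "/")


def is_allowed(operation: str, uri: str, grants) -> bool:
    """grants is an iterable of (privilege, URI) pairs held by the user's roles."""
    needed = REQUIRED_PRIVILEGE[operation]
    return any(
        privilege in (needed, "ALL") and covers(granted_uri, uri)
        for privilege, granted_uri in grants
    )


grants = [("SHOW", "s3://mybucket/mypath")]
can_list = is_allowed("list", "s3://mybucket/mypath/", grants)
can_list_other = is_allowed("list", "s3://mybucket/anotherpath/", grants)
can_read = is_allowed("read", "s3://mybucket/mypath/file.csv", grants)
```

With only the SHOW grant from the example above, listing s3://mybucket/mypath/ is allowed, while listing another path or reading a file is not.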

AWS CLI Command Example

aws s3 ls s3://mybucket/mypath/  
aws s3 ls s3://mybucket/anotherpath/  # will fail with Access Denied

Add New Users as Okera/EMR Admin

./okctl users create user_a group_a       # on Okera host
useradd -m user_a                         # on EMR host

Grant Access in Okera

Grant user_a access to s3a://mybucket/user_a_data.

sudo su - user_a                          # on EMR host
aws s3 ls s3://mybucket/user_a_data/      # user_a accesses the data

Spark Example

This section provides a Spark example.

1. Grant an EMR User Read Access

In this example, an Okera admin grants an EMR user read access to s3a://controlled_data and write access to s3a://myworkbucket. Then the EMR user, in spark-shell, reads from one object and writes to another.

spark-shell
val df = spark.read.csv("s3a://controlled_data/input.csv");
df.show();
df.write.csv("s3a://myworkbucket/input.csv")

2. Grant a User Access

In this step, we grant user_b access to run a query against structured data, filtering and writing the result set to a new table. We also register some data in Okera (for example, s3://controlled_data/transactions) as table source.transactions. Then in the Okera Workspace, we grant user_b read access to some part of that table.

DROP ROLE IF EXISTS analysts;
CREATE ROLE analysts;
GRANT ROLE analysts TO GROUP user_b;
GRANT SELECT ON TABLE source.transactions TO ROLE analysts;
GRANT ALL ON URI 's3://experimental/workspaces/' TO ROLE analysts;

3. Create a Destination External Table for the Result Set

spark.sql("""CREATE EXTERNAL TABLE experiments.transaction_data (
tnxid INT,
tnxdate DATE,
amount DOUBLE,
userid STRING,
ip_address STRING,
address STRING,
country STRING,
region STRING )
STORED AS PARQUET LOCATION 's3://experimental/workspaces/transaction_data'""")

4. Select Structured Data From the Source Table With a Filter

val df = spark.sql("select tnxid, tnxdate, amount, userid, ip_address, address, country, region from source.transactions where amount < 800")

5. Write to the Table With the DataFrame

df.write.mode("overwrite").parquet("s3a://experimental/workspaces/transaction_data")