Google Cloud Platform Dataproc Integration (Preview Feature)

This document describes how to use Okera with Google Cloud Platform (GCP) Dataproc and how to configure each of the supported services. It assumes that the Okera cluster is already running.

As part of the Dataproc setup, we will:

  • Perform a bootstrap action to download the Okera client libraries on the Dataproc cluster nodes.
  • Use a token to set appropriate user account permissions on the Dataproc cluster.
  • Configure Dataproc to run with the existing Okera cluster.

This setup is required for Spark.

Note: Dataproc version 2.0.47-debian10 is supported by the latest bootstrap scripts.

The complete list of supported components for Dataproc is:

  • Apache Spark 2 (spark-2.x)
  • Apache Spark 3 (spark3)

Prerequisites

You must have an Okera cluster deployed and running.

Bootstrap Scripts

The following sections discuss the bootstrap scripts provided for Okera. These scripts can be run on an existing Dataproc cluster or specified as part of the bootstrap actions when creating a Dataproc cluster using the GCP Console or command line tools.

The bootstrap script is located at:

gs://$OKERA_BUCKET/$VERSION/dataproc/bootstrap.sh

Bootstrap parameters are passed via Dataproc cluster metadata. See Custom Cluster Metadata.
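
For example, to run the script manually on an existing cluster, a minimal sketch follows; it uses the example release path from the Cluster Initialization Actions section and assumes the script reads its parameters from the cluster metadata described below:

# Run on each node of the existing Dataproc cluster (example release path).
gsutil cp gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh /tmp/bootstrap.sh
sudo bash /tmp/bootstrap.sh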

Configure Dataproc

Set up a Dataproc cluster in the GCP Console. See Google's Dataproc documentation, and see the sections below for details about how the Dataproc cluster should be configured.

Cluster Versioning

At this time, only Dataproc 2.0 on Debian 10 is supported.

Cluster Properties

The following cluster properties should be set.

Prefix | Key                                    | Required | Description
spark  | spark.recordservice.planner.hostports  | yes      | The host and port of the Okera Policy Engine (planner), for example 10.0.0.1:12050. This must be reachable from the network context of Dataproc. There is no default.
spark  | spark.recordservice.workers.local-port | no       | The port that the local worker runs on. This is required to run in nScale mode. You must also configure the local-worker-port bootstrap parameter. There is no default.
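
Because these are Spark configuration settings, they can also be passed when launching a Spark shell for ad hoc testing. A minimal sketch, assuming the Okera client libraries were already installed by the bootstrap script; the planner address is a placeholder:

# Placeholder planner address; replace with your Okera Policy Engine host:port.
spark-shell --conf spark.recordservice.planner.hostports=10.0.0.1:12050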

Cluster Initialization Actions

This is where the GCS path to the bootstrap script must be passed. The path to the script is:

gs://$OKERA_BUCKET/$VERSION/dataproc/bootstrap.sh

For example: gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh
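
To confirm that the script exists for your release before creating the cluster, you can list the object (this assumes the release bucket is readable from your environment):

gsutil ls gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh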

Custom Cluster Metadata

This is where bootstrapping parameters should be configured. The following parameters are available for Dataproc integration:

Parameter              | Required | Default              | Description
okera-release-bucket   | no       | okera-release-uswest | Google Cloud Storage bucket to pull Okera dependencies from.
okera-version          | yes      | --                   | Release version of Okera, for example 2.12.0.
local-worker-audit-dir | no       | /var/log/audit       | Path to write worker audit logs.
local-worker-log-dir   | no       | /var/log/watcher     | Path to write watcher logs.
local-worker-port      | no       | --                   | Port to run the local worker on. When set, the local worker is started; this is needed to run in nScale mode. You must also configure spark.recordservice.workers.local-port.
local-worker-image     | no       | quay.io/okera/cdas   | Container repository from which to pull the Okera image.
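
Putting the versioning, initialization action, cluster properties, and metadata together, a creation command might look like the following sketch; the cluster name, region, and planner address are placeholders:

# Sketch only: substitute your own cluster name, region, release, and planner address.
gcloud dataproc clusters create okera-dataproc \
    --region us-west1 \
    --image-version 2.0-debian10 \
    --initialization-actions gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh \
    --metadata okera-version=2.12.0 \
    --properties 'spark:spark.recordservice.planner.hostports=10.0.0.1:12050'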

Manage Security - Project Access

You must give the Dataproc cluster access to BigQuery. Through the GCP Console UI, this can only be done by granting full access, even though full access is not strictly needed: check the Allow API access to all Google Cloud services in the same project option when you set up security using the GCP Console. Creating a cluster via the gcloud CLI allows you to provide a more limited scope or a service account to use for access.
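
For example, when creating the cluster with the gcloud CLI, access can be narrowed like this (the scope URL is the standard BigQuery OAuth scope; the cluster name, region, and service account are placeholders):

gcloud dataproc clusters create okera-dataproc \
    --region us-west1 \
    --scopes https://www.googleapis.com/auth/bigquery \
    --service-account dataproc-okera@my-project.iam.gserviceaccount.com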

User Authentication

For clusters configured to use JWT system tokens, a token file, ~/.okera/token, must be created with the user's token.

  1. Obtain the access token from the Okera UI. See Copy Your Access Token.

  2. Copy the access token.

  3. Put the copied access token into a token file in your home directory:

    ~/.okera/token
    

    Each user using Dataproc authenticates for themselves, with one token file per user.
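
For example, from a shell on the cluster (the token value below is a placeholder for the token copied from the Okera UI):

mkdir -p ~/.okera
# Paste the access token copied from the Okera UI in place of the placeholder.
echo 'YOUR_ACCESS_TOKEN' > ~/.okera/token
# Optionally restrict the file to your user.
chmod 600 ~/.okera/token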

Spark Shell Configuration Examples

GCS Temp Tables

// Create a temporary view over a GCS-backed Okera table
spark.sqlContext.sql("CREATE TEMPORARY VIEW temp_view_gcs USING com.cerebro.recordservice.spark OPTIONS (RecordServiceTable '`$OKERA_DB`.`$OKERA_TABLE`')")

// Query the view and show the first 10 rows
val gcsDataFrame = spark.sqlContext.sql("SELECT * FROM temp_view_gcs")
gcsDataFrame.show(10)

Spark BigQuery Connector

Load Table

// Load the BigQuery table into a DataFrame and show the first 10 rows
val bqDataFrame = spark.read.format("bigquery").load("$BQ_PROJECT.$BQ_DATASET.$BQ_TABLE")
bqDataFrame.show(10)

SQL Queries

Before running SQL queries, the following configurations must be set.

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","$BQ_DATASET")

After the configurations are set, you can run queries like this:

// Run a SQL query through the BigQuery connector (requires the settings above)
val bqDataFrame = spark.read.format("bigquery").load("SELECT * FROM `$BQ_PROJECT.$BQ_DATASET.$BQ_TABLE`")
bqDataFrame.show(10)

nScale Enforcement

nScale enforcement is supported in Dataproc environments. For information, see nScale Enforcement Fleet Workers.