Google Cloud Platform Dataproc Integration (Preview Feature)

This document describes how to use Okera with Google Cloud Platform (GCP) Dataproc and how to configure each of the supported services. It assumes that the Okera cluster is already running.

As part of the Dataproc setup, we will:

  • Perform a bootstrap action to download the Okera client libraries on the Dataproc cluster nodes.
  • Use a token to set appropriate user account permissions on the Dataproc cluster.
  • Configure Dataproc to run with the existing Okera cluster.

This setup is required for Spark.

Note: Dataproc version 2.0.47-debian10 is supported by the latest bootstrap scripts.

The complete list of supported components for Dataproc is:

  • Apache Spark 2 (spark-2.x)
  • Apache Spark 3 (spark3)

Prerequisites

You must have an Okera cluster deployed and running.

Bootstrap Scripts

The following sections discuss the bootstrap scripts provided for Okera. These scripts can be run on an existing Dataproc cluster or specified as part of the bootstrap actions when creating a Dataproc cluster using the GCP Console or command line tools.

The bootstrap script is located at:

gs://$OKERA_BUCKET/$VERSION/dataproc/bootstrap.sh

Bootstrap parameters are passed via Dataproc cluster metadata. See Custom Cluster Metadata.
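
For example, to run the script manually on an existing cluster, a minimal sketch follows; it uses the example release path from the Cluster Initialization Actions section and assumes the script reads its parameters from the cluster metadata described below:

# Run on each node of the existing Dataproc cluster (example release path).
gsutil cp gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh /tmp/bootstrap.sh
sudo bash /tmp/bootstrap.sh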

Configure Dataproc

Set up a Dataproc cluster in the GCP Console. See Google's Dataproc documentation, and see the sections below for details about how the Dataproc cluster should be configured.

Cluster Versioning

At this time, only Dataproc 2.0 on Debian 10 is supported.

Cluster Properties

The following cluster properties should be set.

Prefix | Key                                    | Required | Description
spark  | spark.recordservice.planner.hostports  | yes      | The host and port of the Okera Policy Engine (planner), for example 10.0.0.1:12050. This must be reachable from the network context of Dataproc. There is no default.
spark  | spark.recordservice.workers.local-port | no       | The port that the local worker runs on. This is required to run in nScale mode. You must also configure the local-worker-port bootstrap parameter. There is no default.
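
Because these are Spark configuration settings, they can also be passed when launching a Spark shell for ad hoc testing. A minimal sketch, assuming the Okera client libraries were already installed by the bootstrap script; the planner address is a placeholder:

# Placeholder planner address; replace with your Okera Policy Engine host:port.
spark-shell --conf spark.recordservice.planner.hostports=10.0.0.1:12050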

Cluster Initialization Actions

This is where the GCS path to the bootstrap script must be passed. The path to the script is:

gs://$OKERA_BUCKET/$VERSION/dataproc/bootstrap.sh

For example: gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh
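
To confirm that the script exists for your release before creating the cluster, you can list the object (this assumes the release bucket is readable from your environment):

gsutil ls gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh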

Custom Cluster Metadata

This is where bootstrapping parameters should be configured. The following parameters are available for Dataproc integration:

Parameter              | Required | Default              | Description
okera-release-bucket   | no       | okera-release-uswest | Google Cloud Storage bucket to pull Okera dependencies from.
okera-version          | yes      | --                   | Release version of Okera, for example 2.12.0.
local-worker-audit-dir | no       | /var/log/audit       | Path to write worker audit logs.
local-worker-log-dir   | no       | /var/log/watcher     | Path to write watcher logs.
local-worker-port      | no       | --                   | Port to run the local worker on. When set, the local worker is started; this is needed to run in nScale mode. You must also configure spark.recordservice.workers.local-port.
local-worker-image     | no       | quay.io/okera/cdas   | Container repository from which to pull the Okera image.
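
Putting the versioning, initialization action, cluster properties, and metadata together, a creation command might look like the following sketch; the cluster name, region, and planner address are placeholders:

# Sketch only: substitute your own cluster name, region, release, and planner address.
gcloud dataproc clusters create okera-dataproc \
    --region us-west1 \
    --image-version 2.0-debian10 \
    --initialization-actions gs://okera-release-uswest/2.12.0/dataproc/bootstrap.sh \
    --metadata okera-version=2.12.0 \
    --properties 'spark:spark.recordservice.planner.hostports=10.0.0.1:12050'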

Manage Security - Project Access

You must give the Dataproc cluster access to BigQuery. Through the GCP Console UI, this can only be done by granting full access, even though full access is not strictly needed: check the Allow API access to all Google Cloud services in the same project option when you set up security using the GCP Console. Creating a cluster via the gcloud CLI allows you to provide a more limited scope or a service account to use for access.
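
For example, when creating the cluster with the gcloud CLI, access can be narrowed like this (the scope URL is the standard BigQuery OAuth scope; the cluster name, region, and service account are placeholders):

gcloud dataproc clusters create okera-dataproc \
    --region us-west1 \
    --scopes https://www.googleapis.com/auth/bigquery \
    --service-account dataproc-okera@my-project.iam.gserviceaccount.com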

User Authentication

For clusters configured to use JWT system tokens, a token file, ~/.okera/token, must be created with the user's token.

  1. Obtain the access token from the Okera UI. See Copy Your Access Token.

  2. Copy the access token.

  3. Put the copied access token into a token file in your home directory:

    ~/.okera/token
    

    Each user using Dataproc authenticates for themselves, with one token file per user.
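
For example, from a shell on the cluster (the token value below is a placeholder for the token copied from the Okera UI):

mkdir -p ~/.okera
# Paste the access token copied from the Okera UI in place of the placeholder.
echo 'YOUR_ACCESS_TOKEN' > ~/.okera/token
# Optionally restrict the file to your user.
chmod 600 ~/.okera/token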

Spark Shell Configuration Examples

GCS Temp Tables

// Create a temporary view over a GCS-backed Okera table
spark.sqlContext.sql("CREATE TEMPORARY VIEW temp_view_gcs USING com.cerebro.recordservice.spark OPTIONS (RecordServiceTable '`$OKERA_DB`.`$OKERA_TABLE`')")

// Query the view and show the first 10 rows
val gcsDataFrame = spark.sqlContext.sql("SELECT * FROM temp_view_gcs")
gcsDataFrame.show(10)

Spark BigQuery Connector

Load Table

// Load the BigQuery table into a DataFrame and show the first 10 rows
val bqDataFrame = spark.read.format("bigquery").load("$BQ_PROJECT.$BQ_DATASET.$BQ_TABLE")
bqDataFrame.show(10)

SQL Queries

Before running SQL queries, the following configurations must be set.

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","$BQ_DATASET")

After the configurations are set, you can run queries like this:

// Run a SQL query through the BigQuery connector (requires the settings above)
val bqDataFrame = spark.read.format("bigquery").load("SELECT * FROM `$BQ_PROJECT.$BQ_DATASET.$BQ_TABLE`")
bqDataFrame.show(10)

nScale Enforcement

nScale enforcement is supported in Dataproc environments. For information, see nScale Enforcement Fleet Workers.