Cloudera CDH Integration

This topic describes how to connect an existing CDH cluster to an existing Okera deployment.

Prerequisites

  • Okera must be running. We need the endpoints for the Okera catalog components referred to as ODAS HMS and ODAS Sentry. We also need the Okera planner endpoint referred to as ODAS Planner.
  • If the cluster is kerberized, we also need the Kerberos principal for Okera (referred to as ODAS Principal).
  • CDH (5.7+) must be running and managed by Cloudera Manager (CM). This cluster should be fully functional with Kerberos enabled (if desired) and Apache Sentry enabled. This can include any subset of the CDH components.

After these configuration changes, CDH uses the Okera catalog in place of the Hive metastore and Apache Sentry store components. Even if these components are still running, a properly configured cluster does not use them, and no clients should interact with them.

In summary, the steps are:

  1. Configure HMS clients to talk to the Okera catalog. This includes other services such as Impala and HiveServer2 as well as gateway client configs for Spark, Pig, etc.
  2. Configure Sentry clients to talk to the Okera catalog. Because clients typically do not contact this service directly, only HiveServer2 and Impala need to be updated.
  3. Configure the gateway client configs to use Okera's data access service. This provides the functionality that the RecordService service provided.

These steps can be repeated across multiple CDH clusters, allowing them to share the same metadata.

HMS Configuration Changes

The following configuration changes must be made in several places, one for each kind of HMS client:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<ODAS HMS Host:Port></value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value><ODAS Principal></value>
</property>

If the cluster is not kerberized, then the Kerberos principal is not necessary.
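
As an illustration, a filled-in version of the snippet above for a kerberized cluster might look like the following. The host name, port, principal, and realm are hypothetical placeholders, not values taken from an actual Okera deployment; substitute your own ODAS HMS endpoint and ODAS Principal.

```xml
<!-- Hypothetical values for illustration only -->
<property>
  <name>hive.metastore.uris</name>
  <!-- Assumed ODAS HMS endpoint; replace host and port with your own -->
  <value>thrift://odas-catalog.example.com:9083</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <!-- Assumed ODAS Principal; omit this property if the cluster is not kerberized -->
  <value>okera/odas-catalog.example.com@EXAMPLE.COM</value>
</property>
```

Only the property elements are pasted into a Cloudera Manager safety valve; no enclosing configuration element is needed there.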

Hive Service Configuration

Configuration settings need to be set in Hive -> Service Wide -> Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml.

This requires restarting the Hive service. To verify the settings, log in to the machine running HiveServer2 (requires root access) and check /var/run/cloudera-scm-agent/process/<latest folder for hive server2>/hive-site.xml.

The CM-generated configuration should make it very clear that these two values have been overridden.

Note: This will cause the Hive metastore server to be reported as unhealthy. This is expected and can be safely ignored. HiveServer2 should still be reported as healthy.

Hive Client Configuration

Configuration settings need to be set in Hive -> Gateway -> Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml.

This requires deploying the client configurations and restarting the dependent services. You can verify this is set properly by going to any gateway machine and looking in /etc/hive/conf/hive-site.xml.

Impala Configuration

Impala must also be configured to use the Okera catalog. Apply the same HMS settings used for the Hive service and Hive client configurations (described above) in Impala -> Impala Catalog Server -> Catalog Server Hive Advanced Configuration Snippet (Safety Valve).

Sentry Store Configuration

For Apache Sentry use, we require the following configuration settings. Again, the Kerberos principal is only required for kerberized clusters.

<property>
  <name>sentry.service.client.server.rpc-address</name>
  <value><ODAS Sentry Host></value>
</property>
<property>
  <name>sentry.service.client.server.rpc-port</name>
  <value><ODAS Sentry Port></value>
</property>
<property>
  <name>sentry.service.principal</name>
  <value><ODAS Principal></value>
</property>
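
For a cluster that is not kerberized, the same settings reduce to the address and port, as sketched below. The host name and port are hypothetical placeholders; use the ODAS Sentry endpoint that your Okera deployment exposes.

```xml
<!-- Hypothetical values for illustration only; non-kerberized cluster -->
<property>
  <name>sentry.service.client.server.rpc-address</name>
  <!-- Assumed ODAS Sentry host; note that this value has no port -->
  <value>odas-catalog.example.com</value>
</property>
<property>
  <name>sentry.service.client.server.rpc-port</name>
  <!-- Assumed ODAS Sentry port; replace with your deployment's port -->
  <value>8038</value>
</property>
<!-- sentry.service.principal is left out because Kerberos is not in use -->
```

Unlike the HMS URI, the Sentry host and port are supplied as two separate properties.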

HiveServer2 Configuration

Configuration settings should be set in Hive -> Service Wide -> Hive Service Advanced Configuration Snippet (Safety Valve) for sentry-site.xml.

This requires restarting the dependent services. These can be verified by looking in the generated configuration file for the HiveServer2 service (see HMS Configuration Changes for details).

Impala Configuration

Configuration settings should be set in: Impala -> Service Wide -> Impala Service Advanced Configuration Snippet (Safety Valve) for sentry-site.xml.

This requires restarting Impala.

Note: The Impala principal's primary (typically 'impala') must also be in the list of Okera catalog admins (env: OKERA_CATALOG_ADMINS).

RecordService Configuration

RecordService configuration settings can be set in either mapred-site.xml or yarn-site.xml, depending on which one you are using. The configuration is:

<property>
  <name>recordservice.planner.hostports</name>
  <value><ODAS Planner Host:Port></value>
</property>

This must be set in the safety valves in YARN -> Gateway -> MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml and YARN -> Gateway -> YARN Client Advanced Configuration Snippet (Safety Valve) for yarn-site.xml.

This requires redeploying the client configs. To verify the setting, log in to any gateway machine and check /etc/hadoop/conf/[mapred|yarn]-site.xml.
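
A filled-in planner setting might look like the following sketch. The host name and port are hypothetical placeholders rather than confirmed defaults; use the ODAS Planner endpoint of your own deployment.

```xml
<!-- Hypothetical values for illustration only -->
<property>
  <name>recordservice.planner.hostports</name>
  <!-- Assumed ODAS Planner endpoint in host:port form -->
  <value>odas-planner.example.com:12050</value>
</property>
```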

Client JARs

Okera publishes JARs that are API compatible with the RecordService JARs.

POM Configuration

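To use these JARs from Apache Maven, configure the Maven project object model (POM) to use our repository and version by adding the fragments below. The three dependency blocks are alternatives: keep only the one that matches your framework, because a POM can contain only a single <dependencies> element.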

  <properties>
    <recordservice.version>1.0.0-beta-9</recordservice.version>
  </properties>

  <!-- For MapReduce -->
  <dependencies>
    <dependency>
      <groupId>com.okera.recordservice</groupId>
      <artifactId>recordservice-mr</artifactId>
      <version>${recordservice.version}</version>
    </dependency>
  </dependencies>

  <!-- For Spark 1.6 -->
  <dependencies>
    <dependency>
      <groupId>com.okera.recordservice</groupId>
      <artifactId>recordservice-spark</artifactId>
      <version>${recordservice.version}</version>
    </dependency>
  </dependencies>

  <!-- For Spark 2.0 -->
  <dependencies>
    <dependency>
      <groupId>com.okera.recordservice</groupId>
      <artifactId>recordservice-spark-2.0</artifactId>
      <version>${recordservice.version}</version>
    </dependency>
  </dependencies>

  <repositories>
    <repository>
      <id>cerebro.releases.repo</id>
      <name>libs-release</name>
      <url>https://cerebro.jfrog.io/cerebro/libs-release</url>
    </repository>
    <repository>
      <id>cerebro.snapshots.repo</id>
      <name>libs-snapshot</name>
      <url>https://cerebro.jfrog.io/cerebro/libs-snapshot</url>
    </repository>
  </repositories>
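
To show how the fragments above fit together, here is a minimal sketch of a complete pom.xml for a MapReduce project. The project coordinates (com.example, recordservice-mr-example, 0.1.0) are hypothetical; the dependency, version property, and release repository are copied from the fragments above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Hypothetical project coordinates; replace with your own -->
  <groupId>com.example</groupId>
  <artifactId>recordservice-mr-example</artifactId>
  <version>0.1.0</version>

  <properties>
    <recordservice.version>1.0.0-beta-9</recordservice.version>
  </properties>

  <!-- A single dependencies element; swap the artifactId for
       recordservice-spark or recordservice-spark-2.0 as needed -->
  <dependencies>
    <dependency>
      <groupId>com.okera.recordservice</groupId>
      <artifactId>recordservice-mr</artifactId>
      <version>${recordservice.version}</version>
    </dependency>
  </dependencies>

  <repositories>
    <repository>
      <id>cerebro.releases.repo</id>
      <name>libs-release</name>
      <url>https://cerebro.jfrog.io/cerebro/libs-release</url>
    </repository>
  </repositories>
</project>
```

The snapshot repository can be added as a second repository entry in the same way if snapshot builds are needed.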

Download the JARs

All release JARs are available in S3 at the following location:

s3://okera-release-useast/<version>/client

For example:

s3://okera-release-useast/2.10.0/client