Tutorial: Administration – How To Read the Okera Process Logs¶
This self-service tutorial guides you through an administrative task: how to read the log files produced by various Okera services and their processes.
Time needed: 4 hours
Okera's software is deployed on common server systems and usually in a managed Kubernetes environment. Regardless of the runtime specifics, all processes, by default, produce some level of logging messages about their operational status. These messages are useful for administrators and developers to assist with issues or problems that may arise from many sources:
Infrastructure - Okera's platform depends heavily on the infrastructure into which it is deployed. This includes proper, working access to, among many things, the network connectivity, storage services like file systems and relational database system, virtual or dedicated bare-metal server machines, and identity and access management services.
Environmental - This includes details of the infrastructure passed down into the processes as environment variables, such as IP addresses, and database connection information. Another area of possible problems is the software configuration itself, which completes the former with more instance-specific settings and properties (for example, the LDAP/SAML/OAuth details and SSL keys and certificates required for secure operations).
User Errors - Finally, the processes also react to user input and emit log messages in response. These are messages, for instance, recording when access is denied to users based on their permissions, but also when the Okera processes cannot access the data or metadata they require.
Assuming nominal activity on the cluster, a flurry of details are logged, that match a certain pattern. When things go wrong, that pattern changes, and it is important to be able to find the proverbial needle in the haystack.
This tutorial introduces you to the log files produced by Okera's platform and how to analyze them.
Distributed Logs and Log Collection¶
As explained in the architecture overview, the Okera platform has many moving parts. It includes a set of containerized processes that can be run once or many times, for redundancy. The processes produce their logs in such a way that they are captured inside the container by a script that uploads them to the configured log file storage service. The logs are prefixed by the name of the container, and optionally a date-based partition directory, while the file names are constructed from the internal IP address of the container and the time they were created.
Note: For Okera's platform, you can assume each container type is mapped to exactly one specific service - though many services may run at the same time. Consequently, this tutorial uses "container", "process", and "service" interchangeably.
As a distributed system, user requests are sent across many of the Okera processes and services they represent. For instance, assuming a user is logged into the Okera UI sends a SQL query using the Workspace page, the following rough chain of events takes place:
- The user enters the SQL query, for example
select * from okera_sample.sampleand presses the Run button.
- The Okera UI is a browser application that sends the request to run the query to the HTTP API service inside the same container that served the application itself.
- The HTTP API sends the request to the central control plane service in another container, which validates the user's credentials (they were passed with the request), checks the user's permissions, and plans the query.
- Planning entails a call to the catalog services to resolve dataset metadata, such as view and table schemas, and the underlying storage service.
- The control plane service then determines where the query should run, and in case of Okera's own query services it selects the proper query runtime resources (that is, which processing nodes should handle each partial query task), before returning the query plan to the caller, here the frontend HTTP API service.
- The query is then executed according to its query plan by the calling process, which sends the tasks of the plan to the processing nodes, which usually run on all available nodes in the cluster.
- The partial query results are streamed back to the caller and then returned to the Okera UI for rendering.
This simplified chain of events is very common for most data-related activities from clients, though even metadata requests often traverse many services - akin to what you would expect from distantly related microservice applications. This makes it very clear that logs really are as distributed as their originating system, and that finding related log entries requires looking into many separate log files.
Note: Okera strongly advises you to aggregate the log files into a central search service, such as OpenSearch, ElasticSearch or Splunk etc. Commercial or not, these services make it possible to correlate events across many sources by narrowing down the time window and adding important search terms (such as the query ID of the above example query).
Downloading the log files to a workstation and using some form of local search (like
grep or using an advanced editor that can handle these potentially very large files efficiently) is not for the faint of heart - but with enough prowess should help in debugging any service issue.
A final note of warning: Not all logs are alike! This means that when you search for patterns in the logs, be prepared to use variations and approximations, as they may differ based on the process that emitted them. This is an unfortunate side-effect of log frameworks from different programming environments. For instance, the log lines emitted by C++ differ from Python code. While Okera's engineering team is very careful to find a common denominator, in practice not every aspect of the log frameworks can be sufficiently fine-tuned to match exactly. Often dates or log-level information differs, making it more complicated to find, for example, all lines containing errors, ignoring information events. In case of errors specifically, the context information available at runtime, such as stack traces and heap dumps, vary significantly.
Log File Structure¶
Log files and their content varies over time, with most processes going through the following stages:
Pre-Startup - Just before the main process of the container starts, the script responsible to capture the logs and upload them as mentioned, dumps (that is, prints) the environment variables set inside the container, which helps debug environmental issue.
Startup - After the main process is started by the wrapping script, it prints its own startup details, such as memory allocations or details about any additional command line parameters that were set.
Warmup - The processes set up any dependencies, such as database connection pools, create metadata structures, and load the configured resource files. Any issue at this (or the previous) stage often results in a non-operational process or complete failure of the container startup - but there are exceptions that are discussed below with red herrings.
Steady State - After everything is ramped up, the processes go into their main serving loop. Any following messages come from internal system or user interactions.
Depending on the stage the server process is in, the log messages obviously have a different meanings for administrators, and it is best to consider each of them with the same attention - an early problem might manifest itself much later!
As discussed, the logs differ in their layout across the stages as well, even in the same log file, so getting used to them is an essential trait to gain.
Here some examples for each of the stages of the logs produced by the
cerebro-planner container and service.
The following log messages show the Okera configuration parameters dumped in a sorted manner, with Kubernetes-injected ones in between (for the other registered services).
... Dumping Environment Variables: AUTOTAGGER_CONFIGURATION=/opt/okera/data/autotagger-config-wellknowns.json CACHE_MAKER_ENABLED=false CACHE_QUERIES=/opt/cache_queries.json CANARY_PORT=tcp://172.20.85.83:5010 CANARY_PORT_5010_TCP=tcp://172.20.85.83:5010 CANARY_PORT_5010_TCP_ADDR=172.20.85.83 CANARY_PORT_5010_TCP_PORT=5010 CANARY_PORT_5010_TCP_PROTO=tcp CANARY_SERVICE_HOST=172.20.85.83 CANARY_SERVICE_PORT=5010 CANARY_SERVICE_PORT_WEBPORT=5010 CATALOG_ADMINS=admin CATALOG_DB_ENGINE=postgres CATALOG_DB_HMS_DB=hms CATALOG_DB_OKERA_DB=okera CATALOG_DB_PASSWORD=**REDACTED** CATALOG_DB_SSL=true CATALOG_DB_URL=okera.cluster-cyn8yfv.us-west-2.rds.amazonaws.com:5432 CATALOG_DB_USER=okera CATALOG_DB_USERS_DB=users CATALOG_TYPE=hms ...
... Injecting 31381 as CEREBRO_PLANNER_NODE_PORT Injecting 31200 as OKERA_PRESTO_COORDINATOR_NODE_PORT Checking if worker is behind a load balancer... Workers are not running behind load balancer. watcher info: Process started. pid: 14 proc stdout: ************************************************************************ proc stdout: Total system memory: 7 GB proc stdout: ... proc stdout: Final: proc stdout: System-wide headroom: 1024 MB proc stdout: HMS JVM: 512 MB proc stdout: Planner JVM: 1024 MB proc stdout: Worker Memory: 2432 MB proc stdout: Worker JVM: 1024 MB proc stdout: Worker Native: 1408 MB proc stdout: Presto JVM: 2176 MB proc stdout: Presto Query Max Memory: 108800 MB proc stdout: Presto Query Max User Memory Per Node: 1088 MB proc stdout: Presto Query Max Total Memory Per Node: 1414 MB proc stdout: Presto Query Memory Headroom Per Node: 544 MB proc stdout: ************************************************************************ proc stdout: Untar dependencies proc stderr: *** WARNING : deprecated key derivation used. proc stderr: Using -iter or -pbkdf2 would be better. watcher info: Uploading logs: ymd=2022-10-10/h=00/component=planner/2022-10-10-00-09-25-planner-10.1.201.246-0-c372ec60-31d1-442a-b0bf-f63b5b8339d0 to s3://okera/customers/demo/logs/ watcher info: Wrote log to: okera/customers/demo/logs/ymd=2022-10-10/h=00/component=planner/2022-10-10-00-09-25-planner-10.1.201.246-0-c372ec60-31d1-442a-b0bf-f63b5b8339d0.log proc stdout: Start master is true, IS_MASTER ...
In the following messages, the process is starting up and produces its own log messages, indicating what it has been configured with, such as the database service details. It then connects to the first dependent service, here the metadata catalog, which succeeds (as tested by pinging the service internally).
... proc stdout: proc stdout: Using DB engine: postgres proc stdout: No CATALOG_DB_CONN_PARAMS set proc stdout: No CATALOG_DB_CONN_PARAMS set ... proc stdout: No need to init postgres proc stdout: Using Postgres but DB URL is set, using that: okera.cluster-cyn8y.us-west-2.rds.amazonaws.com:5432 ... proc stdout: Downloading custom jars to /opt/deps/ directory... proc stdout: Connecting to HMS @ 172.20.203.179:9083 proc stderr: Picked up JAVA_TOOL_OPTIONS: -Xss256m proc stderr: 2022-10-09 14:53:13 INFO HiveMetastoreServicePinger:178 - Using pinger timeout: 4 proc stderr: 2022-10-09 14:53:14 INFO Logger:154 - Logging configs from environment. proc stderr: 2022-10-09 14:53:14 INFO Logger:156 - Logging config for authentication: disabled proc stderr: 2022-10-09 14:53:14 INFO Logger:156 - Logging config for authorization: disabled ... proc stdout: Metastore URI: thrift://172.20.203.179:9083 proc stderr: 2022-10-09 14:53:14 INFO metastore:391 - Trying to connect to metastore with URI thrift://172.20.203.179:9083 proc stderr: 2022-10-09 14:53:14 INFO metastore:465 - Opened a connection to metastore, current connections: 1 proc stderr: 2022-10-09 14:53:14 INFO JniFrontend:297 - Enabled UDF cache with TTL 30 seconds proc stderr: 2022-10-09 14:53:14 INFO GroupMappingUtil:127 - No required groups set proc stderr: 2022-10-09 14:53:14 INFO UnixGroupsMapper:89 - Resolved groups for username 'okera': [__okera_public_group, okera, okera_secure] at 1665352394476 ms proc stderr: 2022-10-09 14:53:14 INFO GroupMappingUtil:127 - No required groups set proc stderr: 2022-10-09 14:53:14 INFO metastore:518 - Connected to metastore. proc stderr: 2022-10-09 14:53:14 INFO HiveMetastoreServicePinger:198 - HMS Service ping succeeded. proc stdout: HMS Running proc stdout: Connected to HMS proc stdout: Connecting to Sentry @ 172.20.203.179:30911 ...
Here process emits more information about the machine on which it runs, which for a Kubernetes cluster usually is a managed VM. It also prints the hostname, IP addresses and other details.
... proc stderr: I1009 14:53:18.223522 224 atomicops-internals-x86.cc:93] vendor GenuineIntel family 6 model 5 sse2 1 cmpxchg16b 1 proc stderr: I1009 14:53:18.223559 224 authentication.cc:1376] Initializing auth manager proc stderr: I1009 14:53:18.237447 224 init.cc:312] Cpu Info: proc stderr: Model: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz proc stderr: Cores: 2 proc stderr: L1 Cache: 32.00 KB proc stderr: L2 Cache: 1.00 MB proc stderr: L3 Cache: 33.00 MB proc stderr: Hardware Supports: proc stderr: ssse3 proc stderr: sse4_1 proc stderr: sse4_2 proc stderr: popcnt proc stderr: I1009 14:53:18.237488 224 init.cc:313] Disk Info: proc stderr: Num disks 2: proc stderr: nvme0n (rotational=true) proc stderr: nvme0n1p (rotational=true) proc stderr: proc stderr: I1009 14:53:18.237495 224 init.cc:314] Physical Memory: 7.48 GB proc stderr: I1009 14:53:18.237504 224 init.cc:315] OS version: Linux version 5.4.209-116.367.amzn2.x86_64 (mockbuild@ip-10-0-41-5) (gcc version 7.3.1 20180712 (Red Hat 7.3.1-13) (GCC)) #1 SMP Wed Aug 31 00:09:52 UTC 2022 proc stderr: Clock: clocksource: 'tsc', clockid_t: CLOCK_MONOTONIC proc stderr: Hostname: cerebro-planner-5776f597fd-ngg7t proc stderr: Host IP: 10.1.201.246 proc stderr: I1009 14:53:18.237509 224 init.cc:316] Process ID: 224 proc stderr: I1009 14:53:18.237531 224 init.cc:323] Initializing JNI ...
Now the process, since it is the query planner and part of the control plane, starts to create the default Okera roles, permissions, and example and system databases.
... proc stdout: Configured init files user init files proc stdout: Granting workspace access to role 'okera_public_role proc stdout: Creating safeguard policies role and granting to '__okera_public_group proc stdout: Added default init files, current set ,/opt/okera/data/okera_sample.sql,/opt/okera/data/okera_roles.sql,/opt/okera/data/okera_system/system_tables.sql,/opt/okera/data/okera_system/system_postgres.sql,/opt/okera/data/okera_sample_views.sql,/opt/okera/data/okera_system/system_views.sql,/opt/okera/data/okera_system/workspace_public.sql,/opt/okera/data/okera_system/safeguard_role_public.sql proc stdout: 10 files to execute: proc stdout: proc stdout: /opt/okera/data/okera_sample.sql proc stdout: /opt/okera/data/okera_roles.sql proc stdout: /opt/okera/data/okera_system/system_tables.sql proc stdout: /opt/okera/data/okera_system/system_postgres.sql proc stdout: /opt/okera/data/okera_sample_views.sql proc stdout: /opt/okera/data/okera_system/system_views.sql proc stdout: /opt/okera/data/okera_system/workspace_public.sql proc stdout: /opt/okera/data/okera_system/safeguard_role_public.sql proc stdout: proc stdout: init-catalog: Executing from file: /opt/okera/data/okera_sample.sql proc stderr: I1009 14:53:38.045115 613 impala-server.cc:2290] Connection from client 127.0.0.1:55302 closed, closing 1 associated session(s). Currently active connections: 0 proc stderr: I1009 14:53:38.045336 613 impala-server.cc:2301] Closed session id=b9436b3663b3b1b8:a40bfec0cc0bc1a7 proc stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR proc stderr: I1009 14:53:38.083245 617 authentication.cc:1376] Initializing auth manager proc stderr: I1009 14:53:38.085347 617 authentication.cc:1376] Initializing auth manager ...
Eventually the process ends up in the main loop and, while idle, only sees regular activity from the Kubernetes health checks, specified in the planner's manifest file:
... proc stderr: I1009 17:12:05.638563 1411 impala-server.cc:2301] Closed session id=da4ffb9b3c01f392:262cd672e0ae0c94 proc stderr: I1009 17:12:06.894057 324 RecoverPartitionsTask.java:96] Recovering partitions for: okera_system.audit_logs proc stderr: I1009 17:12:09.533171 334 RestApiMembership.java:75] Running membership check using REST proc stderr: I1009 17:12:09.605324 334 RestApiMembership.java:157] Updated membership: [firstname.lastname@example.org:30892] proc stderr: I1009 17:12:09.605394 334 impala-server.cc:1828] Clearing membership. proc stderr: I1009 17:12:09.605404 334 impala-server.cc:1856] Adding email@example.com:30892 to membership. proc stderr: I1009 17:12:17.761364 324 RecoverPartitionsTask.java:96] Recovering partitions for: okera_system.fs_audit_logs proc stderr: I1009 17:12:28.135701 324 RecoverPartitionsTask.java:96] Recovering partitions for: okera_system.analytics_audit_logs proc stderr: I1009 17:12:35.546777 636 thrift-util.cc:90] TAcceptQueueServer: accepted new socket connection, authenticating client. proc stderr: I1009 17:12:35.558442 1427 authentication.cc:437] Authenticated with user: root proc stderr: I1009 17:12:35.569437 1427 impala-server.cc:2245] New connection from client: user=root service=planner session id=a24a0692e659e458:d8d6884a13f07a92 start time=2022-10-10 00:12:35.569413000 client addr=127.0.0.1:45324 active connections=10 proc stderr: I1009 17:12:35.569803 1427 impala-server.cc:1371] RegisterQuery(): query_id=e1431cbec3190112:486b13548d1805a0 proc stderr: I1009 17:12:35.576174 1427 impala-server.cc:1421] UnregisterQuery(): query_id=e1431cbec3190112:486b13548d1805a0 ...
You can modify the default level of log messages produced by the Okera processes, although care must be taken not to overload any secondary system, such as the indexing and search services.
The log level is specified as a numeric configuration parameter
LOGGING_LEVEL, which is commonly set in the
ConfigMap manifest for the entire cluster, or per container in the
Daemonset manifests. The latter allows keeping the number of extra messages emitted restricted to just one or more containers.
Note that the numeric log levels apply directly to the main processes but are also translated into the common logging level for all Java code.
The following shows how the log level is set:
For Java, the log levels translate into the following:
||This is the default for specific Java classes.|
||Warning level logging. Default value.||
||Default for other Java classes.|
||Plan fragment level logging.||
||File level logging.||
||IO buffer level logging.||
||Row batch level logging.||
||Row level logging.||
While the Okera engineering team is taking utmost care to log only important information, at times some expected conditions are logged at nonintuitive log levels or to
stderr instead of
Cases like this are referred to as red herrings, that is, something looks suspicious but is misleading in the end. For example, the following log messages looks like something is wrong, but upon closer inspection the messages can be safely ignored.
... proc stderr: createdb: database creation failed: ERROR: database "okera" already exists proc stdout: cmd: 'createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 okera' failed with error: 'Failed to run cmd: createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 okera ret=1', continuing proc stderr: createdb: database creation failed: ERROR: database "users" already exists proc stdout: cmd: 'createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 users' failed with error: 'Failed to run cmd: createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 users ret=1', continuing proc stdout: Done creating metastore DBs for postgres proc stdout: Initializing metastore schemas ...
In this example, the metadata databases have already been created when the cluster was first started. Here, during a restart, the databases do not have to be created again.
Live Log Analysis¶
Now that we have discussed the nature of the Okera log files, it is time to analyze them while debugging operational problems. In lieu of using a proper indexing and search solution, as highly recommended earlier, we can use the regular command line tools to look at the logs of a selected container. First, we get the list of pods, and then list the log messages of a specific one:
$ kubectl get pods NAME READY STATUS RESTARTS AGE cache-maker-ddfcf6b4c-t9wpm 1/1 Running 0 117m canary-5spqq 1/1 Running 0 117m cdas-rest-server-8466b94f65-vzlpk 1/1 Running 1 (113m ago) 117m cerebro-catalog-865cd69758-w25bb 1/1 Running 0 117m cerebro-planner-58fdf94749-dsh4s 1/1 Running 0 117m cerebro-worker-7m4vx 1/1 Running 0 117m odas-identity-server-579644d849-vdp58 1/1 Running 1 (116m ago) 117m presidio-5979856cb7-x5cg2 1/1 Running 0 117m presto-coordinator-6fc8ccdd9d-zwqtv 1/1 Running 0 117m presto-worker-52rzm 1/1 Running 0 117m $ kubectl logs cerebro-planner-58fdf94749-dsh4s
A more practical approach is to add parameters that keep the logs printing while the command continues to run (
--follow), and limits the number of lines printed from the backlog of messages (
$ kubectl logs -f --tail==100 cerebro-planner-58fdf94749-dsh4s
Example: Startup Failure¶
A cluster that fails to start can be debugged by looking at the pods and the container logs. More specifically, looking at the pod status, the problem might become immediately obvious, as shown in the following example:
$ kubectl get pods NAME READY STATUS RESTARTS AGE cache-maker-86bb854f4d-sd9mf 1/1 Running 0 21m canary-w8tqb 1/1 Running 1 (11m ago) 21m cdas-rest-server-84678c4c8-bqrcf 1/1 Running 0 21m cerebro-catalog-69cf58bf7-8xv8b 1/1 Running 2 (12m ago) 21m cerebro-planner-7684ccc454-qzjrw 0/1 CrashLoopBackOff 6 (113s ago) 21m cerebro-worker-6qsmv 0/1 CrashLoopBackOff 7 (107s ago) 21m odas-identity-server-6b4cf9dcf4-vrhpz 1/1 Running 0 21m presidio-57947cc996-pkxml 1/1 Running 0 21m presto-coordinator-6765c85dc9-5642k 1/1 Running 0 21m presto-worker-gq9tg 1/1 Running 0 21m
The symptoms are that the planner and worker pods are failing and eventually end up in
CrashLoopBackOff state - which is Kubernetes notification that the pods have failed too often and are now tried only on slower schedule.
The Okera platform has a couple of internal dependencies, as shown in the architecture overview diagram, where some of the services must be started before others can run. This applies to the catalog services, which must be operational before the planner and HTTP API (REST) services can start fully.
Since, in this example, the catalog service is shown as
Running (it runs as
1/1, which means it is running as one out of one replica). Consequently, the problem must be related to something else.
The first step is to look at the logs of the pods in error:
$ kubectl logs cerebro-planner-7684ccc454-qzjrw ... proc stderr: I1011 10:54:15.370869 215 impala-server.cc:559] Initializing JWT Authenticator proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:59 - Using default user claim: sub proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:66 - Using default group claim: groups proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:100 - Using algorithms: [RSA512, HS512, RSA512] and key file(s): [/etc/secrets/JWT_PRIVATE_KEY, /etc/secrets/SYSTEM_TOKEN, /etc/okera_id_rsa.512.pub] proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:104 - Creating Public Key JWT authenticator based on file: /etc/secrets/JWT_PRIVATE_KEY and algorithm: RSA512 proc stderr: 2022-10-11 10:54:15 INFO JWTUtil:87 - Reading an RSA512 key from /etc/secrets/JWT_PRIVATE_KEY proc stderr: 2022-10-11 10:54:15 INFO SecurityKeyUtil:48 - Read key from file: /etc/secrets/JWT_PRIVATE_KEY proc stderr: 2022-10-11 10:54:15 ERROR SecurityKeyUtil:71 - Supplied public key bytes are not in the expected format. proc stderr: java.security.spec.InvalidKeySpecException: java.security.InvalidKeyException: IOException: algid parse error, not a sequence proc stderr: at sun.security.rsa.RSAKeyFactory.engineGeneratePublic(RSAKeyFactory.java:240) ~[?:1.8.0_342] proc stderr: at java.security.KeyFactory.generatePublic(KeyFactory.java:334) ~[?:1.8.0_342] proc stderr: at com.cerebro.security.SecurityKeyUtil.getPublicKey(SecurityKeyUtil.java:65) ~[cerebro-common-2.11.0.jar:?] proc stderr: at com.cerebro.security.SecurityKeyUtil.getPublicKeyFromFile(SecurityKeyUtil.java:110) ~[cerebro-common-2.11.0.jar:?] proc stderr: at com.cerebro.security.JWTUtil.buildAlgorithm(JWTUtil.java:89) ~[cerebro-common-2.11.0.jar:?] proc stderr: at com.cerebro.security.JWTAuthenticatorFactory.getAuthenticator(JWTAuthenticatorFactory.java:107) ~[cerebro-common-2.11.0.jar:?] proc stderr: at com.cloudera.impala.security.JWTTokenManager.init(JWTTokenManager.java:42) ~[impala-frontend-2.11.0.jar:?] proc stderr: at com.cloudera.impala.service.JniFrontend.lambda$initJWTAuthenticator$0(JniFrontend.java:431) ~[impala-frontend-2.11.0.jar:?] proc stderr: at com.cloudera.impala.service.JniFrontend.handle(JniFrontend.java:405) [impala-frontend-2.11.0.jar:?] proc stderr: at com.cloudera.impala.service.JniFrontend.initJWTAuthenticator(JniFrontend.java:429) [impala-frontend-2.11.0.jar:?] proc stderr: Caused by: java.security.InvalidKeyException: IOException: algid parse error, not a sequence proc stderr: at sun.security.x509.X509Key.decode(X509Key.java:397) ~[?:1.8.0_342] proc stderr: at sun.security.x509.X509Key.decode(X509Key.java:402) ~[?:1.8.0_342] proc stderr: at sun.security.rsa.RSAPublicKeyImpl.<init>(RSAPublicKeyImpl.java:122) ~[?:1.8.0_342] proc stderr: at sun.security.rsa.RSAKeyFactory.generatePublic(RSAKeyFactory.java:331) ~[?:1.8.0_342] proc stderr: at sun.security.rsa.RSAKeyFactory.engineGeneratePublic(RSAKeyFactory.java:236) ~[?:1.8.0_342] proc stderr: ... 9 more proc stderr: I1011 10:54:15.917727 215 jni-util.cc:163] java.lang.IllegalArgumentException: Unable to create a JWT authenticator. proc stderr: at com.cerebro.security.JWTAuthenticatorFactory.getAuthenticator(JWTAuthenticatorFactory.java:111) proc stderr: at com.cloudera.impala.security.JWTTokenManager.init(JWTTokenManager.java:42) proc stderr: at com.cloudera.impala.service.JniFrontend.lambda$initJWTAuthenticator$0(JniFrontend.java:431) proc stderr: at com.cloudera.impala.service.JniFrontend.handle(JniFrontend.java:405) proc stderr: at com.cloudera.impala.service.JniFrontend.initJWTAuthenticator(JniFrontend.java:429) proc stderr: Caused by: java.lang.IllegalArgumentException: Could not create specified algorithm: RSA512. Encountered error. proc stderr: at com.cerebro.security.JWTUtil.buildAlgorithm(JWTUtil.java:99) proc stderr: at com.cerebro.security.JWTAuthenticatorFactory.getAuthenticator(JWTAuthenticatorFactory.java:107) proc stderr: ... 4 more ... proc stderr: I1011 10:54:15.918684 215 status.cc:108] IllegalArgumentException: Unable to create a JWT authenticator. proc stderr: CAUSED BY: IllegalArgumentException: Could not create specified algorithm: RSA512. Encountered error. proc stderr: CAUSED BY: IOException: java.security.spec.InvalidKeySpecException: java.security.InvalidKeyException: IOException: algid parse error, not a sequence proc stderr: CAUSED BY: InvalidKeySpecException: java.security.InvalidKeyException: IOException: algid parse error, not a sequence proc stderr: CAUSED BY: InvalidKeyException: IOException: algid parse error, not a sequence proc stderr: @ 0xb47e53 impala::Status::Status() proc stderr: @ 0xf69bd6 impala::JniUtil::GetJniExceptionMsg() proc stderr: @ 0x102e3b6 impala::Frontend::InitJWTAuthenticator() proc stderr: @ 0x107b44b impala::ImpalaServer::ImpalaServer() proc stderr: @ 0x107cf9d impala::CreateImpalaServer() proc stderr: @ 0xad7173 main proc stderr: @ 0x7f187eb8109b (unknown) proc stderr: @ 0xb2aaba _start proc stderr: E1011 10:54:16.015326 215 impala-server.cc:562] Could not initialize JWTAuthenticator. Exiting. proc stdout: Waiting for ODAS to be ready at http://localhost:11050/status proc stdout: Waiting for ODAS to be ready at http://localhost:11050/status proc stdout: Waiting for ODAS to be ready at http://localhost:11050/status ... proc stderr: Exception: server (http://localhost:11050) did not get to ready status in 120 sec. watcher info: Process exited with code: 1 ... watcher info: Exiting watcher with: 1
The log starts out normally, with the container going through the different lifecycle stages. But then there is a block of error messages, which indicate that something is wrong setting up the JWT authentication. More specifically, this line is interesting:
... proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:104 - Creating Public Key JWT authenticator based on file: /etc/secrets/JWT_PRIVATE_KEY and algorithm: RSA512 ...
Why is the necessary public key pointing to the location that has private key in its name? Looking at the environment variables dump at the beginning of the log, we can see that there is an error in the cluster configuration, that is, the names for the private and public JWT keys have been swapped, each pointing to the wrong key:
... JWT_ALGORITHM=RSA512 JWT_PRIVATE_KEY=/etc/secrets/JWT_PUBLIC_KEY JWT_PUBLIC_KEY=/etc/secrets/JWT_PRIVATE_KEY ...
Since the planner and worker containers share the same image, we can safely assume that it fails with the exact same error (and that is what indeed happens, but the logs are omitted here for the sake of brevity). The fix is to swap the file names and push the config into the cluster. After the pods restart, they are healthy again.