Tutorial: Administration – How To Read the Okera Process Logs¶
This self-service tutorial guides you through an administrative task: how to read the log files produced by various Okera services and their processes.
Time needed: 4 hours
Okera's software is deployed on common server systems and usually in a managed Kubernetes environment. Regardless of the runtime specifics, all processes, by default, produce some level of logging messages about their operational status. These messages are useful for administrators and developers to assist with issues or problems that may arise from many sources:
Infrastructure - Okera's platform depends heavily on the infrastructure into which it is deployed. This includes proper, working access to, among other things, network connectivity, storage services such as file systems and relational database systems, virtual or dedicated bare-metal server machines, and identity and access management services.
Environmental - This includes details of the infrastructure passed down into the processes as environment variables, such as IP addresses and database connection information. Another area of possible problems is the software configuration itself, which complements the environment variables with more instance-specific settings and properties (for example, the LDAP/SAML/OAuth details and SSL keys and certificates required for secure operations).
User Errors - Finally, the processes also react to user input and emit log messages in response. These include, for instance, messages recording when access is denied to users based on their permissions, but also when the Okera processes themselves cannot access the data or metadata they require.
Assuming nominal activity on the cluster, a flurry of details is logged that matches a certain pattern. When things go wrong, that pattern changes, and it is important to be able to find the proverbial needle in the haystack.
This tutorial introduces you to the log files produced by Okera's platform and how to analyze them.
Distributed Logs and Log Collection¶
As explained in the architecture overview, the Okera platform has many moving parts. It includes a set of containerized processes that can be run once or many times, for redundancy. The processes produce their logs in such a way that they are captured inside the container by a script that uploads them to the configured log file storage service. The logs are prefixed by the name of the container, and optionally a date-based partition directory, while the file names are constructed from the internal IP address of the container and the time they were created.
Note: For Okera's platform, you can assume each container type is mapped to exactly one specific service - though many services may run at the same time. Consequently, this tutorial uses "container", "process", and "service" interchangeably.
As a distributed system, user requests are sent across many of the Okera processes and the services they represent. For instance, assuming a user logged into the Okera UI sends a SQL query using the Workspace page, the following rough chain of events takes place:
- The user enters the SQL query, for example `select * from okera_sample.sample`, and presses the Run button.
- The Okera UI is a browser application that sends the request to run the query to the HTTP API service inside the same container that served the application itself.
- The HTTP API sends the request to the central control plane service in another container, which validates the user's credentials (they were passed with the request), checks the user's permissions, and plans the query.
- Planning entails a call to the catalog services to resolve dataset metadata, such as view and table schemas, and the underlying storage service.
- The control plane service then determines where the query should run, and in case of Okera's own query services it selects the proper query runtime resources (that is, which processing nodes should handle each partial query task), before returning the query plan to the caller, here the frontend HTTP API service.
- The query is then executed according to its query plan by the calling process, which sends the tasks of the plan to the processing nodes, which usually run on all available nodes in the cluster.
- The partial query results are streamed back to the caller and then returned to the Okera UI for rendering.
This simplified chain of events is very common for most data-related activities from clients, and even metadata requests often traverse many services - akin to what you would expect from loosely coupled microservice applications. This makes it very clear that logs really are as distributed as their originating system, and that finding related log entries requires looking into many separate log files.
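Without a central search service, one crude but workable way to trace a request across these services is to grep every pod's recent logs for a shared identifier, such as the query ID. A minimal sketch (the query ID is a placeholder; substitute the one reported by the UI or found in a log line):

```shell
# The query ID to trace; this one is a placeholder -- substitute the ID
# reported by the UI or found in an earlier log line.
QUERY_ID="e1431cbec3190112:486b13548d1805a0"

# Print every matching log line from every pod, prefixed with the pod name
# so the originating service stays visible.
for pod in $(kubectl get pods -o name); do
  kubectl logs "$pod" --since=1h 2>/dev/null \
    | grep -F "$QUERY_ID" \
    | sed "s|^|${pod}: |"
done
```

Narrowing the time window with `--since` keeps the amount of scanned text manageable on a busy cluster.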
Note: Okera strongly advises you to aggregate the log files into a central search service, such as OpenSearch, Elasticsearch, or Splunk. Commercial or not, these services make it possible to correlate events across many sources by narrowing down the time window and adding important search terms (such as the query ID of the above example query).
Downloading the log files to a workstation and using some form of local search (like `grep`, or an advanced editor that can handle these potentially very large files efficiently) is not for the faint of heart - but with enough prowess it should help in debugging any service issue.
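For the local-search route, a recursive grep over the downloaded, date-partitioned log directories is usually enough to locate the files that mention a given query. Both the directory layout and the query ID below are examples:

```shell
# List the downloaded log files that mention a given query ID; -r recurses
# into the date-partitioned directories, -l prints only matching file names.
grep -rl "query_id=e1431cbec3190112" "logs/ymd=2022-10-10/"

# Then show the matching lines with three lines of context and line numbers:
grep -rn -C 3 "query_id=e1431cbec3190112" "logs/ymd=2022-10-10/"
```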
A final note of warning: Not all logs are alike! This means that when you search for patterns in the logs, be prepared to use variations and approximations, as they may differ based on the process that emitted them. This is an unfortunate side effect of log frameworks from different programming environments: for instance, the log lines emitted by C++ code differ from those emitted by Python code. While Okera's engineering team is very careful to find a common denominator, in practice not every aspect of the log frameworks can be sufficiently fine-tuned to match exactly. Often the date or log-level information differs, making it more complicated to find, for example, all lines containing errors while ignoring informational events. In the case of errors specifically, the context information available at runtime, such as stack traces and heap dumps, varies significantly.
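As a concrete example of such a variation, a single pattern for error lines has to cover both the glog-style level prefix used by the C++ code (`E1011 10:54:16 ...`) and the textual `ERROR` level used by the Java code. A sketch, assuming a downloaded log file named `planner.log`:

```shell
# Match glog-style C++ error lines ("E1011 10:54:16 ...") as well as
# Java log4j-style error lines ("2022-10-11 10:54:15 ERROR ...") in one pass,
# leaving INFO-level lines out; planner.log is an example file name.
grep -E '(^|[[:space:]])E[0-9]{4}[[:space:]]| ERROR ' planner.log
```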
Log File Structure¶
Log files and their content vary over time, with most processes going through the following stages:
Pre-Startup - Just before the main process of the container starts, the script responsible for capturing and uploading the logs (as mentioned above) dumps (that is, prints) the environment variables set inside the container, which helps debug environmental issues.
Startup - After the main process is started by the wrapping script, it prints its own startup details, such as memory allocations or details about any additional command line parameters that were set.
Warmup - The processes set up any dependencies, such as database connection pools, create metadata structures, and load the configured resource files. Any issue at this (or the previous) stage often results in a non-operational process or a complete failure of the container startup - but there are exceptions, discussed below as red herrings.
Steady State - After everything is ramped up, the processes go into their main serving loop. Any following messages come from internal system or user interactions.
Depending on the stage the server process is in, the log messages obviously have different meanings for administrators, and it is best to consider each of them with the same attention - an early problem might manifest itself much later!
As discussed, the log layout differs across the stages as well, even within the same log file, so becoming familiar with these variations is essential.
Here are some examples for each of these stages, taken from the logs produced by the `cerebro-planner` container and service.
The following log messages show the Okera configuration parameters dumped in sorted order, with Kubernetes-injected variables (for the other registered services) interspersed.
```
...
Dumping Environment Variables:
AUTOTAGGER_CONFIGURATION=/opt/okera/data/autotagger-config-wellknowns.json
CACHE_MAKER_ENABLED=false
CACHE_QUERIES=/opt/cache_queries.json
CANARY_PORT=tcp://172.20.85.83:5010
CANARY_PORT_5010_TCP=tcp://172.20.85.83:5010
CANARY_PORT_5010_TCP_ADDR=172.20.85.83
CANARY_PORT_5010_TCP_PORT=5010
CANARY_PORT_5010_TCP_PROTO=tcp
CANARY_SERVICE_HOST=172.20.85.83
CANARY_SERVICE_PORT=5010
CANARY_SERVICE_PORT_WEBPORT=5010
CATALOG_ADMINS=admin
CATALOG_DB_ENGINE=postgres
CATALOG_DB_HMS_DB=hms
CATALOG_DB_OKERA_DB=okera
CATALOG_DB_PASSWORD=**REDACTED**
CATALOG_DB_SSL=true
CATALOG_DB_URL=okera.cluster-cyn8yfv.us-west-2.rds.amazonaws.com:5432
CATALOG_DB_USER=okera
CATALOG_DB_USERS_DB=users
CATALOG_TYPE=hms
...
```
```
...
Injecting 31381 as CEREBRO_PLANNER_NODE_PORT
Injecting 31200 as OKERA_PRESTO_COORDINATOR_NODE_PORT
Checking if worker is behind a load balancer...
Workers are not running behind load balancer.
watcher info: Process started. pid: 14
proc stdout: ************************************************************************
proc stdout: Total system memory: 7 GB
proc stdout: ...
proc stdout: Final:
proc stdout: System-wide headroom: 1024 MB
proc stdout: HMS JVM: 512 MB
proc stdout: Planner JVM: 1024 MB
proc stdout: Worker Memory: 2432 MB
proc stdout: Worker JVM: 1024 MB
proc stdout: Worker Native: 1408 MB
proc stdout: Presto JVM: 2176 MB
proc stdout: Presto Query Max Memory: 108800 MB
proc stdout: Presto Query Max User Memory Per Node: 1088 MB
proc stdout: Presto Query Max Total Memory Per Node: 1414 MB
proc stdout: Presto Query Memory Headroom Per Node: 544 MB
proc stdout: ************************************************************************
proc stdout: Untar dependencies
proc stderr: *** WARNING : deprecated key derivation used.
proc stderr: Using -iter or -pbkdf2 would be better.
watcher info: Uploading logs: ymd=2022-10-10/h=00/component=planner/2022-10-10-00-09-25-planner-10.1.201.246-0-c372ec60-31d1-442a-b0bf-f63b5b8339d0 to s3://okera/customers/demo/logs/
watcher info: Wrote log to: okera/customers/demo/logs/ymd=2022-10-10/h=00/component=planner/2022-10-10-00-09-25-planner-10.1.201.246-0-c372ec60-31d1-442a-b0bf-f63b5b8339d0.log
proc stdout: Start master is true, IS_MASTER
...
```
In the following messages, the process is starting up and produces its own log messages, indicating what it has been configured with, such as the database service details. It then connects to the first dependent service, here the metadata catalog, which succeeds (as tested by pinging the service internally).
```
...
proc stdout:
proc stdout: Using DB engine: postgres
proc stdout: No CATALOG_DB_CONN_PARAMS set
proc stdout: No CATALOG_DB_CONN_PARAMS set
...
proc stdout: No need to init postgres
proc stdout: Using Postgres but DB URL is set, using that: okera.cluster-cyn8y.us-west-2.rds.amazonaws.com:5432
...
proc stdout: Downloading custom jars to /opt/deps/ directory...
proc stdout: Connecting to HMS @ 172.20.203.179:9083
proc stderr: Picked up JAVA_TOOL_OPTIONS: -Xss256m
proc stderr: 2022-10-09 14:53:13 INFO HiveMetastoreServicePinger:178 - Using pinger timeout: 4
proc stderr: 2022-10-09 14:53:14 INFO Logger:154 - Logging configs from environment.
proc stderr: 2022-10-09 14:53:14 INFO Logger:156 - Logging config for authentication: disabled
proc stderr: 2022-10-09 14:53:14 INFO Logger:156 - Logging config for authorization: disabled
...
proc stdout: Metastore URI: thrift://172.20.203.179:9083
proc stderr: 2022-10-09 14:53:14 INFO metastore:391 - Trying to connect to metastore with URI thrift://172.20.203.179:9083
proc stderr: 2022-10-09 14:53:14 INFO metastore:465 - Opened a connection to metastore, current connections: 1
proc stderr: 2022-10-09 14:53:14 INFO JniFrontend:297 - Enabled UDF cache with TTL 30 seconds
proc stderr: 2022-10-09 14:53:14 INFO GroupMappingUtil:127 - No required groups set
proc stderr: 2022-10-09 14:53:14 INFO UnixGroupsMapper:89 - Resolved groups for username 'okera': [__okera_public_group, okera, okera_secure] at 1665352394476 ms
proc stderr: 2022-10-09 14:53:14 INFO GroupMappingUtil:127 - No required groups set
proc stderr: 2022-10-09 14:53:14 INFO metastore:518 - Connected to metastore.
proc stderr: 2022-10-09 14:53:14 INFO HiveMetastoreServicePinger:198 - HMS Service ping succeeded.
proc stdout: HMS Running
proc stdout: Connected to HMS
proc stdout: Connecting to Sentry @ 172.20.203.179:30911
...
```
Here the process emits more information about the machine on which it runs, which for a Kubernetes cluster is usually a managed VM. It also prints the hostname, IP addresses, and other details.
```
...
proc stderr: I1009 14:53:18.223522 224 atomicops-internals-x86.cc:93] vendor GenuineIntel family 6 model 5 sse2 1 cmpxchg16b 1
proc stderr: I1009 14:53:18.223559 224 authentication.cc:1376] Initializing auth manager
proc stderr: I1009 14:53:18.237447 224 init.cc:312] Cpu Info:
proc stderr: Model: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
proc stderr: Cores: 2
proc stderr: L1 Cache: 32.00 KB
proc stderr: L2 Cache: 1.00 MB
proc stderr: L3 Cache: 33.00 MB
proc stderr: Hardware Supports:
proc stderr: ssse3
proc stderr: sse4_1
proc stderr: sse4_2
proc stderr: popcnt
proc stderr: I1009 14:53:18.237488 224 init.cc:313] Disk Info:
proc stderr: Num disks 2:
proc stderr: nvme0n (rotational=true)
proc stderr: nvme0n1p (rotational=true)
proc stderr:
proc stderr: I1009 14:53:18.237495 224 init.cc:314] Physical Memory: 7.48 GB
proc stderr: I1009 14:53:18.237504 224 init.cc:315] OS version: Linux version 5.4.209-116.367.amzn2.x86_64 (mockbuild@ip-10-0-41-5) (gcc version 7.3.1 20180712 (Red Hat 7.3.1-13) (GCC)) #1 SMP Wed Aug 31 00:09:52 UTC 2022
proc stderr: Clock: clocksource: 'tsc', clockid_t: CLOCK_MONOTONIC
proc stderr: Hostname: cerebro-planner-5776f597fd-ngg7t
proc stderr: Host IP: 10.1.201.246
proc stderr: I1009 14:53:18.237509 224 init.cc:316] Process ID: 224
proc stderr: I1009 14:53:18.237531 224 init.cc:323] Initializing JNI
...
```
Now the process, being the query planner and part of the control plane, starts to create the default Okera roles and permissions as well as the example and system databases.
```
...
proc stdout: Configured init files user init files
proc stdout: Granting workspace access to role 'okera_public_role
proc stdout: Creating safeguard policies role and granting to '__okera_public_group
proc stdout: Added default init files, current set ,/opt/okera/data/okera_sample.sql,/opt/okera/data/okera_roles.sql,/opt/okera/data/okera_system/system_tables.sql,/opt/okera/data/okera_system/system_postgres.sql,/opt/okera/data/okera_sample_views.sql,/opt/okera/data/okera_system/system_views.sql,/opt/okera/data/okera_system/workspace_public.sql,/opt/okera/data/okera_system/safeguard_role_public.sql
proc stdout: 10 files to execute:
proc stdout:
proc stdout: /opt/okera/data/okera_sample.sql
proc stdout: /opt/okera/data/okera_roles.sql
proc stdout: /opt/okera/data/okera_system/system_tables.sql
proc stdout: /opt/okera/data/okera_system/system_postgres.sql
proc stdout: /opt/okera/data/okera_sample_views.sql
proc stdout: /opt/okera/data/okera_system/system_views.sql
proc stdout: /opt/okera/data/okera_system/workspace_public.sql
proc stdout: /opt/okera/data/okera_system/safeguard_role_public.sql
proc stdout:
proc stdout: init-catalog: Executing from file: /opt/okera/data/okera_sample.sql
proc stderr: I1009 14:53:38.045115 613 impala-server.cc:2290] Connection from client 127.0.0.1:55302 closed, closing 1 associated session(s). Currently active connections: 0
proc stderr: I1009 14:53:38.045336 613 impala-server.cc:2301] Closed session id=b9436b3663b3b1b8:a40bfec0cc0bc1a7
proc stderr: WARNING: Logging before InitGoogleLogging() is written to STDERR
proc stderr: I1009 14:53:38.083245 617 authentication.cc:1376] Initializing auth manager
proc stderr: I1009 14:53:38.085347 617 authentication.cc:1376] Initializing auth manager
...
```
Eventually the process ends up in the main loop and, while idle, only sees regular activity from the Kubernetes health checks, specified in the planner's manifest file:
```
...
proc stderr: I1009 17:12:05.638563 1411 impala-server.cc:2301] Closed session id=da4ffb9b3c01f392:262cd672e0ae0c94
proc stderr: I1009 17:12:06.894057 324 RecoverPartitionsTask.java:96] Recovering partitions for: okera_system.audit_logs
proc stderr: I1009 17:12:09.533171 334 RestApiMembership.java:75] Running membership check using REST
proc stderr: I1009 17:12:09.605324 334 RestApiMembership.java:157] Updated membership: [email@example.com:30892]
proc stderr: I1009 17:12:09.605394 334 impala-server.cc:1828] Clearing membership.
proc stderr: I1009 17:12:09.605404 334 impala-server.cc:1856] Adding firstname.lastname@example.org:30892 to membership.
proc stderr: I1009 17:12:17.761364 324 RecoverPartitionsTask.java:96] Recovering partitions for: okera_system.fs_audit_logs
proc stderr: I1009 17:12:28.135701 324 RecoverPartitionsTask.java:96] Recovering partitions for: okera_system.analytics_audit_logs
proc stderr: I1009 17:12:35.546777 636 thrift-util.cc:90] TAcceptQueueServer: accepted new socket connection, authenticating client.
proc stderr: I1009 17:12:35.558442 1427 authentication.cc:437] Authenticated with user: root
proc stderr: I1009 17:12:35.569437 1427 impala-server.cc:2245] New connection from client: user=root service=planner session id=a24a0692e659e458:d8d6884a13f07a92 start time=2022-10-10 00:12:35.569413000 client addr=127.0.0.1:45324 active connections=10
proc stderr: I1009 17:12:35.569803 1427 impala-server.cc:1371] RegisterQuery(): query_id=e1431cbec3190112:486b13548d1805a0
proc stderr: I1009 17:12:35.576174 1427 impala-server.cc:1421] UnregisterQuery(): query_id=e1431cbec3190112:486b13548d1805a0
...
```
You can modify the default level of log messages produced by the Okera processes, although care must be taken not to overload any secondary system, such as the indexing and search services.
The log level is specified as a numeric configuration parameter `LOGGING_LEVEL`, which is commonly set in the `ConfigMap` manifest for the entire cluster, or per container in the `DaemonSet` manifests. The latter keeps the number of extra messages emitted restricted to just one or a few containers.
Note that the numeric log levels apply directly to the main processes but are also translated into the common logging level for all Java code.
The following shows how the log level is set:
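As a sketch of how this can look, assuming the cluster-wide settings live in a ConfigMap named `okera-config` (a placeholder; check your deployment for the actual name) and `2` is the desired level:

```shell
# Set the cluster-wide log level; "okera-config" is a placeholder ConfigMap
# name and "2" an example value -- check your deployment for the real name
# and the level that fits your debugging needs.
kubectl patch configmap okera-config --type merge \
  -p '{"data":{"LOGGING_LEVEL":"2"}}'

# Restart the affected deployment so the new value is picked up.
kubectl rollout restart deployment cerebro-planner
```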
For Java, the log levels translate into the following:
| Level | Description |
| --- | --- |
|  | This is the default for specific Java classes. |
|  | Warning level logging. Default value. |
|  | Default for other Java classes. |
|  | Plan fragment level logging. |
|  | File level logging. |
|  | IO buffer level logging. |
|  | Row batch level logging. |
|  | Row level logging. |
While the Okera engineering team takes the utmost care to log only important information, at times some expected conditions are logged at nonintuitive log levels or to `stderr` instead of `stdout`. Cases like this are referred to as red herrings: something looks suspicious but turns out to be misleading in the end. For example, the following log messages look like something is wrong, but upon closer inspection they can be safely ignored.
```
...
proc stderr: createdb: database creation failed: ERROR: database "okera" already exists
proc stdout: cmd: 'createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 okera' failed with error: 'Failed to run cmd: createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 okera ret=1', continuing
proc stderr: createdb: database creation failed: ERROR: database "users" already exists
proc stdout: cmd: 'createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 users' failed with error: 'Failed to run cmd: createdb -w -U okera --host okera.cluster-cyn8y.us-west-2.rds.amazonaws.com --port 5432 users ret=1', continuing
proc stdout: Done creating metastore DBs for postgres
proc stdout: Initializing metastore schemas
...
```
In this example, the metadata databases were already created when the cluster was first started, so during a restart they do not have to be created again.
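When scanning logs for real failures, it can help to keep a small list of known-benign patterns like the one above and filter them out; the patterns below are examples and will grow with experience:

```shell
# First pass: anything that mentions an error; second pass: drop the known
# red herrings, such as the "already exists" messages shown above.
# planner.log is an example file name for a downloaded log.
grep -E 'ERROR|failed' planner.log \
  | grep -vE 'already exists|continuing'
```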
Live Log Analysis¶
Now that we have discussed the nature of the Okera log files, it is time to analyze them while debugging operational problems. In lieu of using a proper indexing and search solution, as highly recommended earlier, we can use the regular command line tools to look at the logs of a selected container. First, we get the list of pods, and then list the log messages of a specific one:
```
$ kubectl get pods
NAME                                    READY   STATUS    RESTARTS       AGE
cache-maker-ddfcf6b4c-t9wpm             1/1     Running   0              117m
canary-5spqq                            1/1     Running   0              117m
cdas-rest-server-8466b94f65-vzlpk       1/1     Running   1 (113m ago)   117m
cerebro-catalog-865cd69758-w25bb        1/1     Running   0              117m
cerebro-planner-58fdf94749-dsh4s        1/1     Running   0              117m
cerebro-worker-7m4vx                    1/1     Running   0              117m
odas-identity-server-579644d849-vdp58   1/1     Running   1 (116m ago)   117m
presidio-5979856cb7-x5cg2               1/1     Running   0              117m
presto-coordinator-6fc8ccdd9d-zwqtv     1/1     Running   0              117m
presto-worker-52rzm                     1/1     Running   0              117m

$ kubectl logs cerebro-planner-58fdf94749-dsh4s
```
A more practical approach is to add parameters that keep the logs printing while the command continues to run (`--follow`, or `-f` for short), and limit the number of lines printed from the backlog of messages (`--tail`):

```
$ kubectl logs -f --tail=100 cerebro-planner-58fdf94749-dsh4s
```
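Once a query ID is known (for example from the UI or the audit logs), the live stream can be narrowed down to just that query's lines; the pod name and query ID below are examples from this tutorial:

```shell
# Follow the planner log but only print lines belonging to one query;
# --line-buffered makes grep flush each matching line immediately.
kubectl logs -f --tail=1000 cerebro-planner-58fdf94749-dsh4s \
  | grep --line-buffered -F 'query_id=e1431cbec3190112'
```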
Example: Startup Failure¶
A cluster that fails to start can be debugged by looking at the pods and the container logs. More specifically, looking at the pod status, the problem might become immediately obvious, as shown in the following example:
```
$ kubectl get pods
NAME                                    READY   STATUS             RESTARTS       AGE
cache-maker-86bb854f4d-sd9mf            1/1     Running            0              21m
canary-w8tqb                            1/1     Running            1 (11m ago)    21m
cdas-rest-server-84678c4c8-bqrcf        1/1     Running            0              21m
cerebro-catalog-69cf58bf7-8xv8b         1/1     Running            2 (12m ago)    21m
cerebro-planner-7684ccc454-qzjrw        0/1     CrashLoopBackOff   6 (113s ago)   21m
cerebro-worker-6qsmv                    0/1     CrashLoopBackOff   7 (107s ago)   21m
odas-identity-server-6b4cf9dcf4-vrhpz   1/1     Running            0              21m
presidio-57947cc996-pkxml               1/1     Running            0              21m
presto-coordinator-6765c85dc9-5642k     1/1     Running            0              21m
presto-worker-gq9tg                     1/1     Running            0              21m
```
The symptoms are that the planner and worker pods are failing and eventually end up in the `CrashLoopBackOff` state - Kubernetes' notification that the pods have failed too often and are now retried only on a slower schedule.
The Okera platform has a couple of internal dependencies, as shown in the architecture overview diagram, where some of the services must be started before others can run. This applies to the catalog services, which must be operational before the planner and HTTP API (REST) services can start fully.
Since, in this example, the catalog service is shown as `Running` with `1/1` ready (one out of one replica is running), the problem must be related to something else.
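The failing pods can also be found and inspected programmatically: `kubectl describe` shows the Kubernetes-level events, and `kubectl logs --previous` retrieves the log of the crashed container instance. A sketch:

```shell
# Find the pods currently in CrashLoopBackOff (the STATUS column is $3)
# and dump their recent events plus the tail of the crashed instance's log.
for pod in $(kubectl get pods --no-headers | awk '$3 == "CrashLoopBackOff" {print $1}'); do
  echo "=== $pod ==="
  kubectl describe pod "$pod" | tail -n 20
  kubectl logs --previous "$pod" | tail -n 50
done
```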
The first step is to look at the logs of the pods in error:
```
$ kubectl logs cerebro-planner-7684ccc454-qzjrw
...
proc stderr: I1011 10:54:15.370869 215 impala-server.cc:559] Initializing JWT Authenticator
proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:59 - Using default user claim: sub
proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:66 - Using default group claim: groups
proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:100 - Using algorithms: [RSA512, HS512, RSA512] and key file(s): [/etc/secrets/JWT_PRIVATE_KEY, /etc/secrets/SYSTEM_TOKEN, /etc/okera_id_rsa.512.pub]
proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:104 - Creating Public Key JWT authenticator based on file: /etc/secrets/JWT_PRIVATE_KEY and algorithm: RSA512
proc stderr: 2022-10-11 10:54:15 INFO JWTUtil:87 - Reading an RSA512 key from /etc/secrets/JWT_PRIVATE_KEY
proc stderr: 2022-10-11 10:54:15 INFO SecurityKeyUtil:48 - Read key from file: /etc/secrets/JWT_PRIVATE_KEY
proc stderr: 2022-10-11 10:54:15 ERROR SecurityKeyUtil:71 - Supplied public key bytes are not in the expected format.
proc stderr: java.security.spec.InvalidKeySpecException: java.security.InvalidKeyException: IOException: algid parse error, not a sequence
proc stderr: at sun.security.rsa.RSAKeyFactory.engineGeneratePublic(RSAKeyFactory.java:240) ~[?:1.8.0_342]
proc stderr: at java.security.KeyFactory.generatePublic(KeyFactory.java:334) ~[?:1.8.0_342]
proc stderr: at com.cerebro.security.SecurityKeyUtil.getPublicKey(SecurityKeyUtil.java:65) ~[cerebro-common-2.11.0.jar:?]
proc stderr: at com.cerebro.security.SecurityKeyUtil.getPublicKeyFromFile(SecurityKeyUtil.java:110) ~[cerebro-common-2.11.0.jar:?]
proc stderr: at com.cerebro.security.JWTUtil.buildAlgorithm(JWTUtil.java:89) ~[cerebro-common-2.11.0.jar:?]
proc stderr: at com.cerebro.security.JWTAuthenticatorFactory.getAuthenticator(JWTAuthenticatorFactory.java:107) ~[cerebro-common-2.11.0.jar:?]
proc stderr: at com.cloudera.impala.security.JWTTokenManager.init(JWTTokenManager.java:42) ~[impala-frontend-2.11.0.jar:?]
proc stderr: at com.cloudera.impala.service.JniFrontend.lambda$initJWTAuthenticator$0(JniFrontend.java:431) ~[impala-frontend-2.11.0.jar:?]
proc stderr: at com.cloudera.impala.service.JniFrontend.handle(JniFrontend.java:405) [impala-frontend-2.11.0.jar:?]
proc stderr: at com.cloudera.impala.service.JniFrontend.initJWTAuthenticator(JniFrontend.java:429) [impala-frontend-2.11.0.jar:?]
proc stderr: Caused by: java.security.InvalidKeyException: IOException: algid parse error, not a sequence
proc stderr: at sun.security.x509.X509Key.decode(X509Key.java:397) ~[?:1.8.0_342]
proc stderr: at sun.security.x509.X509Key.decode(X509Key.java:402) ~[?:1.8.0_342]
proc stderr: at sun.security.rsa.RSAPublicKeyImpl.<init>(RSAPublicKeyImpl.java:122) ~[?:1.8.0_342]
proc stderr: at sun.security.rsa.RSAKeyFactory.generatePublic(RSAKeyFactory.java:331) ~[?:1.8.0_342]
proc stderr: at sun.security.rsa.RSAKeyFactory.engineGeneratePublic(RSAKeyFactory.java:236) ~[?:1.8.0_342]
proc stderr: ... 9 more
proc stderr: I1011 10:54:15.917727 215 jni-util.cc:163] java.lang.IllegalArgumentException: Unable to create a JWT authenticator.
proc stderr: at com.cerebro.security.JWTAuthenticatorFactory.getAuthenticator(JWTAuthenticatorFactory.java:111)
proc stderr: at com.cloudera.impala.security.JWTTokenManager.init(JWTTokenManager.java:42)
proc stderr: at com.cloudera.impala.service.JniFrontend.lambda$initJWTAuthenticator$0(JniFrontend.java:431)
proc stderr: at com.cloudera.impala.service.JniFrontend.handle(JniFrontend.java:405)
proc stderr: at com.cloudera.impala.service.JniFrontend.initJWTAuthenticator(JniFrontend.java:429)
proc stderr: Caused by: java.lang.IllegalArgumentException: Could not create specified algorithm: RSA512. Encountered error.
proc stderr: at com.cerebro.security.JWTUtil.buildAlgorithm(JWTUtil.java:99)
proc stderr: at com.cerebro.security.JWTAuthenticatorFactory.getAuthenticator(JWTAuthenticatorFactory.java:107)
proc stderr: ... 4 more
...
proc stderr: I1011 10:54:15.918684 215 status.cc:108] IllegalArgumentException: Unable to create a JWT authenticator.
proc stderr: CAUSED BY: IllegalArgumentException: Could not create specified algorithm: RSA512. Encountered error.
proc stderr: CAUSED BY: IOException: java.security.spec.InvalidKeySpecException: java.security.InvalidKeyException: IOException: algid parse error, not a sequence
proc stderr: CAUSED BY: InvalidKeySpecException: java.security.InvalidKeyException: IOException: algid parse error, not a sequence
proc stderr: CAUSED BY: InvalidKeyException: IOException: algid parse error, not a sequence
proc stderr: @ 0xb47e53 impala::Status::Status()
proc stderr: @ 0xf69bd6 impala::JniUtil::GetJniExceptionMsg()
proc stderr: @ 0x102e3b6 impala::Frontend::InitJWTAuthenticator()
proc stderr: @ 0x107b44b impala::ImpalaServer::ImpalaServer()
proc stderr: @ 0x107cf9d impala::CreateImpalaServer()
proc stderr: @ 0xad7173 main
proc stderr: @ 0x7f187eb8109b (unknown)
proc stderr: @ 0xb2aaba _start
proc stderr: E1011 10:54:16.015326 215 impala-server.cc:562] Could not initialize JWTAuthenticator. Exiting.
proc stdout: Waiting for ODAS to be ready at http://localhost:11050/status
proc stdout: Waiting for ODAS to be ready at http://localhost:11050/status
proc stdout: Waiting for ODAS to be ready at http://localhost:11050/status
...
proc stderr: Exception: server (http://localhost:11050) did not get to ready status in 120 sec.
watcher info: Process exited with code: 1
...
watcher info: Exiting watcher with: 1
```
The log starts out normally, with the container going through the different lifecycle stages. But then there is a block of error messages indicating that something went wrong while setting up the JWT authentication. More specifically, this line is interesting:
```
...
proc stderr: 2022-10-11 10:54:15 INFO JWTAuthenticatorFactory:104 - Creating Public Key JWT authenticator based on file: /etc/secrets/JWT_PRIVATE_KEY and algorithm: RSA512
...
```
Why does the required public key point to a location with "private key" in its name? Looking at the environment variable dump at the beginning of the log, we can see that there is an error in the cluster configuration: the names of the private and public JWT keys have been swapped, so each points to the wrong key:
```
...
JWT_ALGORITHM=RSA512
JWT_PRIVATE_KEY=/etc/secrets/JWT_PUBLIC_KEY
JWT_PUBLIC_KEY=/etc/secrets/JWT_PRIVATE_KEY
...
```
Since the planner and worker containers share the same image, we can safely assume that the worker fails with the exact same error (and that is indeed what happens, though its logs are omitted here for brevity). The fix is to swap the file names back and push the corrected configuration into the cluster. After the pods restart, they are healthy again.
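To verify such a fix, it can be useful to inspect the effective environment inside a restarted pod; the pod name below is an example:

```shell
# Confirm the JWT-related variables inside the restarted planner pod now
# point to the matching files; the pod name is an example from this tutorial.
kubectl exec cerebro-planner-58fdf94749-dsh4s -- env | grep '^JWT_'
```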