Tutorial: Analyzing Okera Audit Logs¶
This self-service tutorial shows you how you can access and query the Okera audit logs. It explains how to write SQL queries for important information stored in the logs, which is useful when managing a cluster or when performing an audit.
Difficulty: Intermediate
Time needed: 4 hour
Introduction¶
Clients, that is, users and applications, can use any of the available Okera endpoints to issue metadata requests and query managed datasets. Every access transaction from these clients is logged by the Okera audit engine as JSON records stored in text files. These files are uploaded regularly to a configurable storage service and made available as system tables and related views, allowing them to be queried by administrators.
The following shows the audit log-related datasets in the Okera UI.
Data managers, such as data owners, stewards, or line-of-business (LOB) managers, often require insight into what users do with the data they manage. The audit log-related datasets provide a great source of information to create insights using standard SQL queries.
Audit Log Records¶
As explained in our Auditing documentation, the JSON records stored in the log file comprise a long list of columns that contain details about the request that was made.
The following shows an example where the user creates and then drops a database.
Example: Audit records showing a database being created and subsequently dropped
{
"request_time":"2020-10-30 10:05:01.437238000",
"request_id":"504717762e22e373:79a9d59cfeb9b2a5",
"session_id":"ef4b2d937103c170:8a64f031fecfeab9",
"start_unix_time":1604052301437,
"end_unix_time":1604052301455,
"auth_failure":false,
"status":"ok",
"user":"admin",
"connected_user":"root",
"client_network_address":"172.23.77.80:40374",
"client_application":"okera-rest-server",
"num_results_returned":0,
"num_results_read":0,
"bytes_scanned":0,
"server_total_time_ms":0,
"queue_time_ms":0,
"planning_time_ms":0,
"execution_time_ms":0,
"statement_type":"CREATE_DATABASE",
"for_reporting":true,
"default_db":"default",
"statement":"create database foobar",
"ae_attribute":"",
"ae_database":"",
"ae_function":"",
"ae_path":"",
"ae_role":"",
"ae_table":"",
"ae_view":""
}
...
{
"request_time":"2020-10-30 10:05:07.519814000",
"request_id":"d44cf1ac90cb9d52:7811f9cb5065a29a",
"session_id":"ca4532336b844b72:a4bbece430b38587",
...
"statement_type":"DROP_DATABASE",
...
"statement":"drop database foobar",
"ae_attribute":"",
"ae_database":"foobar",
"ae_function":"",
...
}
The second record was abbreviated for the sake of brevity. Notice though how the record is containing the actual SQL statement, as well as the statement type, which is determined when the Okera Policy Engine (formerly known as the Planner) parses the query.
The audit records shown above were taken from the Policy Engine audit log files. Okera Enforcement Fleet workers emit their own audit records to separate log files, but to the same shared storage location. It means that both Policy Engine and Enforcement Fleet worker audit records are accessible through the same system tables, mitigating the need to merge them manually.
Here are the most important statement types the records may contain, along with the source emitting them and the instigating event.
Type | Kind | Source | Instigating Event |
---|---|---|---|
PLAN |
DML | Policy Engine | Query is planned |
SCAN_START |
DML | Worker | Query task is executed |
SCAN_END |
DML | Worker | Query task completes |
INTERNAL_API |
DDL | Policy Engine | Metadata is returned to another Okera service |
DESCRIBE/DESCRIBE_DB |
DDL | Policy Engine | A client submits a DESCRIBE statement |
SHOW_{CREATE_TABLE,DBS,GRANT,ROLES,...} |
DDL | Policy Engine | A client submits a SHOW statement |
ALTER_{DATABASE,TABLE} |
DDL | Policy Engine | A client submits an ALTER statement |
CREATE_{DATABASE,TABLE,VIEW,...} |
DDL | Policy Engine | A client submits a CREATE statement |
DROP_{DATABASE,TABLE,VIEW,...} |
DDL | Policy Engine | A client submits a DROP statement |
GRANT_{PRIVILEGE,ROLE} |
DCL | Policy Engine | A client submits a GRANT statement |
REVOKE_{PRIVILEGE,ROLE} |
DCL | Policy Engine | A client submits a REVOKE statement |
{CREATE,ASSIGN,DELETE,...}_ATTRIBUTES |
DCL | Policy Engine | A client submits an ABAC-related statement |
UNKNOWN |
- | Policy Engine | A statement is not categorized or fails to parse |
If you want to see all the statement types in your cluster, execute the following query.
SELECT statement_type, count(statement_type) AS cnt
FROM okera_system.audit_logs
GROUP BY statement_type
ORDER BY cnt DESC
Note: In this tutorial, we assume you are using the Presto SQL dialect, by either selecting the "Presto" option in the Okera UI's Workspace, or by using an external SQL client, such as DBeaver or SQLWorkbench/J.
Use Cases¶
The following sections introduce you to the audit log-related datasets and their content in more detail. Practical examples guide you through common use cases for data managers.
Correlate Audit Records for Queries¶
Each query in Okera goes through two phases: the planning phase and execution phase. The planning phase parses the query in the context of the user and plans the physical execution, while applying all necessary data access policies. After the query plan is compiled, it is returned to the caller, which then uses it to subsequently execute the plan in a series of query tasks. How this happens is client-dependent, but assuming a modern compute framework such as Spark or Presto, the query tasks are executed at some level of parallelization, where executors or worker processes run the tasks in parallel using CPU multithreading. This means that from an Okera perspective, the query triggers more than one audit event: one for the planning phase, and one or more for each task in the execution phase.
Example: Audit events for a simple SQL query
{
"request_time":"2020-10-29 13:56:54.541251000",
"request_id":"704d151f52e9f5f8:24c1fa9321700000",
"session_id":"e49b9dc6af964db:60a778ce5801a186",
...
"num_results_returned":1,...
"planning_time_ms":0,
"execution_time_ms":0,
"statement_type":"PLAN",
...
"statement":"select count(*) from `marketing`.`user_account_data`",
"ae_attribute":"marketing.approved",
"ae_database":"marketing",
"ae_function":"",
"ae_path":"",
"ae_role":"",
"ae_table":"marketing.user_account_data",
"ae_view":""
}
{
"request_time":"2020-10-29 13:56:54.733802000",
"request_id":"704d151f52e9f5f8:24c1fa9321700001",
"session_id":"e49b9dc6af964db:60a778ce5801a186",
...
"statement_type":"SCAN_START",
...
"statement":"select count(*) from `marketing`.`user_account_data`",
...
}
{
"request_time":"2020-10-29 13:56:54.734082000",
"request_id":"704d151f52e9f5f8:24c1fa9321700001",
"session_id":"e49b9dc6af964db:60a778ce5801a186",
...
"statement_type":"SCAN_END",
...
"statement":"select count(*) from `marketing`.`user_account_data`",
...
}
In the example above, the tasks (just one in this example) emit two distinct events, one for the start of the read operation, and one for the end of it.
The request_id
value of the events has the following qualities:
- Its format is two hexadecimal numbers divided by a colon:
<prefix>:<postfix>
. - For a specific client query, the request ID prefix (e.g.
704d151f52e9f5f8
) is the same for all events of that query. - The postfix starts at zero in the right-most digits for the first query event and is incremented for each task (for example,
24c1fa9321700000
for thePLAN
event, and24c1fa9321700001
for both theSCAN
events of the same task) .
In other words, the SCAN
start and end events have the same request ID.
Since the end event is more interesting than the start, you can filter out that event completely.
This allows you to correlate the events and query them to see the details for each query phase.
Example: Correlate plan and query task events
SELECT request_id, statement_type, statement
FROM okera_system.steward_audit_logs
WHERE statement_type IN ('PLAN', 'SCAN_END')
ORDER BY 1
The simple sort aligns all of the request ID prefixes next to each other, and groups the query-related events together.
Note that the query is using the okera_system.steward_audit_logs
view, which implicitly is filtering the audit records for all rows for which the current user has access.
This allows for this view to be accessible by many users, possibly even everyone, as each user only sees what they are allowed to see.
Other views, such as the okera_system.reporting_audit_logs
view (shown in the next use case) would not work. While it also filters on the source of the events, it only shows Policy Engine events, but not worker-related ones.
Find Events in Time¶
Another common task for auditors, managers, and administrators is to determine who accessed datasets through one of the Okera endpoints.
This requires the use of the okera_system.audit_logs
table and rather simple SQL queries to determine what happened during a specific period of time. First we need to look at the available columns in the audit logs table:
There are a few things to note:
-
The columns that record the start (
start_unix_time
) and end time (end_unix_time
) of an operation are defined asBIGINT
. They are actually Unix epoch-based, recording the time elapsed since January 1st, 1970. For Okera, this elapsed time is given in millisecond granularity, while some clients or SQL functions may expect full seconds instead. -
The audit logs table is partitioned by date (
ymd
) so that queries can take advantage of eliminating unwanted partitions altogether. Queries can either try to rely on the Policy Engine to detect predicates well enough to be able to prune partitions automatically, or they can simply specify the date range explicitly and force the pruning. -
The user responsible for the audit record is shown in the
user
column and can be used to filter the audit records further.
For example, to find all audit records for a specific time period, write a simple SQL query combining this knowledge.
Example: Query all audit events for a specific time period
SELECT *
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
AND start_unix_time BETWEEN 1600300801197 AND 1600387206546
-- AND ae_table in ('marketing.transactions', 'creditcard.transactions_cc')
AND user NOT IN ('okera', 'admin')
Note: The query also shows a comment line that can further filter the audit records by objects that were affected in the event. Experiment with the available columns to pinpoint what you need.
The results shown on the Okera UI's Workspace page might look like this:
Converting epoch-based values into human-readable dates and back again is not easy. Use the built-in date conversion functions and some simple math to convert milliseconds to seconds (and back) where necessary.
Example: Using the built-in conversion functions to handle readable dates
SELECT
from_unixtime(cast(start_unix_time/1000 as int)) as start_time,
from_unixtime(cast(end_unix_time/1000 as int)) as end_time,
auth_failure,
status,
user,
client_application,
statement_type
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
AND start_unix_time BETWEEN to_unixtime(
from_iso8601_timestamp('2020-09-17T00:00:00Z'))*1000
AND to_unixtime(
from_iso8601_timestamp('2020-09-18T00:00:00Z'))*1000
AND user NOT IN ('okera', 'admin')
The output looks like this:
Finally, we can further simplify the query, as the okera_system.reporting_audit_logs
view automatically converts column values (this capability is built in).
For instance, start_unix_time
is available as start_time_utc
and allows you to completely drop the use of the conversion functions.
Example: Use the provided columns of the view to display and filter records
SELECT
start_time_utc,
end_time_utc,
auth_failure,
status,
user,
client_application,
statement_type
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
AND start_time_utc BETWEEN '2020-09-17' AND '2020-09-18'
AND user not in ('okera', 'admin', 'root')
Use whichever option suits you best to access the audit logs table. Either way, you should be able to select the exact set of records based on time, user, and objects affected.
Permissions Granted to Users¶
Building on the previous use case, you might also ask who was assigned specific permissions to objects in the Schema Registry during a given period of time. This can be traced by looking further into the audit record values, which show the type of the statement executed. Okera uses this to filter for the data control language (DCL) events used to grant and revoke permissions, which, for Okera, means the granting of a role to a group of users or a specific user.
Example: Filtering audit records pertaining to permission changes
SELECT
start_time_utc,
user,
statement_type,
statement
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-10-05' AND '2020-10-06'
AND (statement_type = 'GRANT_ROLE' OR statement_type = 'REVOKE_ROLE')
The output might look like this:
The following example applies some string manipulation to make the result more readable. It extracts the role and group from the statement into dedicated columns.
Example: Extracting role and group from the filtered events
SELECT
start_time_utc,
statement_type AS action,
regexp_extract(statement,
'(?i)(?:GRANT|REVOKE) ROLE `?([^;`]+)`? (?:FROM|TO) GROUP `?([^`;]+)`?[;]?', 1)
AS role,
regexp_extract(statement,
'(?i)(?:GRANT|REVOKE) ROLE `?([^;`]+)`? (?:FROM|TO) GROUP `?([^`;]+)`?[;]?', 2)
AS `group`,
`user` AS executing_user
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-10-05' AND '2020-10-06'
AND (statement_type = 'GRANT_ROLE' OR statement_type = 'REVOKE_ROLE')
The final result looks like this:
Using the stored DCL-related records, you can also reconstruct the state of a role and group at any point in time.
Access Grouped by User and Date¶
Another common question that data administrators ask is: How many times did users access a specific registered dataset?
To answer this question, we can use the same approach as before, using the audit logs table or views to filter by date range, while also grouping by the user and building an aggregate.
Example: Compute number of accesses to datasets for specific users
SELECT
ae_table AS query_table,
user AS query_user,
client_application AS app,
ymd,
count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
GROUP BY ae_table, user, client_application, ymd
ORDER BY query_table, user, client_application, ymd ASC;
The result might look like this:
The following examples widen or narrow the scope of the above query. Results of these queries are omitted for the sake of brevity.
Example: Narrow down result by adding a specific user
SELECT
ae_table AS query_table,
client_application AS app,
ymd,
count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
user IN ('janedoe', 'johndoe')
GROUP BY ae_table, user, client_application, ymd
ORDER BY query_table, user, client_application, ymd ASC;
Example: Broaden the scope by removing the application info
SELECT
ae_table AS query_table,
user AS query_user,
count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
GROUP BY ae_table, user
ORDER BY query_table, user ASC;
Example: Adding the statement the user ran
SELECT
user AS query_user,
ae_table AS query_table,
statement,
count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
GROUP BY user, ae_table, statement
ORDER BY user, query_table ASC;
Note that some rows do not have any values for the query_table
column. This is due to the nature of the commands that were run by the user: some statements do not contain a specific table, for instance, GRANT ROLE ... TO GROUP ...
statements.
This can be fixed by further filtering on the statement type, and, if desired, all the way down to just the reading of data.
Example: Filtering only read access events
SELECT
user AS query_user,
ae_table AS query_table,
statement,
count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
ymd BETWEEN '2020-09-17' AND '2020-09-18'
AND statement_type in ('PLAN')
GROUP BY user, ae_table, statement
ORDER BY user, query_table ASC;
Adding the filter for PLAN
-type events, the result is now comprised of only the audit records that were created by SELECT
statements, that is, read access to the table:
Roles, Groups and Users¶
The system database okera_system
includes other interesting datasets that are useful for data administrators, in addition to the main audit logs table, .
A whole set of tables relates to the users issuing the query (the user
column in the audit records) and their known groups, as well as roles that are associated with them.
The following diagram explains the relationship of the first set of datasets.
Note: The diagrams use the standard entity relationship diagram (ERD) notation. This serves more as a help to think about how to use them in SQL queries, but less for accuracy: in data lakes there is usually no notion of referential integrity, since there are no primary or foreign keys.
The diagram shows the okera_system.role_names
and okera_system.group_names
, that are simple lists where a distinct name of a role or group is stored with a unique ID.
The okera_system.role_group_ids
table links them together in a many-to-many relationship, connecting distinct role names with all the groups they have been grant to.
The DCL statements that create a role, group and connecting the two looks like this (but obviously this can be done via the WebUI's Roles tab as well).
Example: DCL statements creating a role and group, and connecting them
> CREATE ROLE developer_role;
> GRANT ROLE developer_role TO GROUP developer_group;
There is also a view that converts the role and group IDs to human-readable names, that is, okera_system.role_groups
. This view can be used to present a mapping list to interactive users.
The next set of datasets are related to the users that have (successfully) authenticated with one of the many Okera endpoints, and include a step called group resolution that determines the currently available groups of the user.
These groups are later matched with the groups shown in the previous set of tables, namely okera_system.group_names
.
A user commonly belongs to more than one group and not all of them may have a role assigned to them.
The following diagram shows the authentication and group lookup-related datasets.
The design is similar to the previous set of datasets: there are two simple list tables with an ID and a name, and then a table that joins the two in a many-to-many relationship, though here with an extra last_observed
field.
It records the time when a user authentication last presented the given group.
There is also a similar view, okera_system.groups_by_user
, which replaces the IDs for the groups and users with the actual text representation.
Note though, the view only shows the groups a user was authenticated with the last time they interacted with the cluster.
The additional okera_system.user_group_info
view provides a list of all groups a user ever was authenticated with, even at different times.
It omits the last_observed
column, just listing all registered user/group combinations, and does this using the text representation of both again.
You can make use of the base tables that back these views and write a query that shows you all groups of all users, together with the time they were recorded.
Example: List all groups of all users at any time
SELECT u.user_name, g.group_name, ug.last_observed
FROM okera_system.users u
JOIN okera_system.user_group_mappings ug ON ug.user_name_id = u.id
JOIN okera_system.groups g on ug.group_name_id = g.group_id
ORDER BY u.user_name, ug.last_observed
The result might look like this:
Finally, we can combine the audit logs table with this to compile a comprehensive list of all access to data, while also listing the groups that were used to execute the query.
Example: Combine audit logs with user groups
SELECT
al.start_time_utc,
u.user_name,
g.group_name,
al.statement
FROM okera_system.reporting_audit_logs al
JOIN okera_system.users u ON u.user_name = al.user
JOIN okera_system.user_group_mappings ug ON ug.user_name_id = u.id
JOIN okera_system.groups g on ug.group_name_id = g.group_id
WHERE
al.start_time_utc > (
SELECT max(ugm.last_observed)
FROM okera_system.user_group_mappings ugm
WHERE ugm.user_name_id = ug.user_name_id
AND ugm.last_observed <= al.start_time_utc
)
AND al.statement_type IN ('PLAN')
-- AND al.user IN ('janedoe', 'johndoe')
AND al.user NOT IN ('admin', 'okera')
AND al.ymd BETWEEN '2020-11-01' AND '2020-11-3'
AND g.group_name NOT IN ('__okera_public_group')
AND al.status = 'ok'
ORDER BY al.start_time_utc, u.user_name
This query combines the audit records with the groups the user had access to just before the event took place. In other words, it shows the groups a user was operating in for each read access to data.
The result might look like this:
There are many more ways you can slice and dice the audit and user-related information. This tutorial just scratched the surface of what is possible.