Tutorial: Analyzing Okera Audit Logs¶

This self-service tutorial shows you how you can access and query the Okera audit logs. It explains how to write SQL queries for important information stored in the logs, which is useful when managing a cluster or when performing an audit.

Difficulty: Intermediate
Time needed: 4 hour

Introduction¶

Clients, that is, users and applications, can use any of the available Okera endpoints to issue metadata requests and query managed datasets. Every access transaction from these clients is logged by the Okera audit engine as JSON records stored in text files. These files are uploaded regularly to a configurable storage service and made available as system tables and related views, allowing them to be queried by administrators.

The following shows the audit log-related datasets in the Okera UI.

Okera UI with audit log-related datasets

Data managers, such as data owners, stewards, or line-of-business (LOB) managers, often require insight into what users do with the data they manage. The audit log-related datasets provide a great source of information to create insights using standard SQL queries.

Audit Log Records¶

As explained in our Auditing documentation, the JSON records stored in the log file comprise a long list of columns that contain details about the request that was made.

The following shows an example where the user creates and then drops a database.

Example: Audit records showing a database being created and subsequently dropped

{
  "request_time":"2020-10-30 10:05:01.437238000",
  "request_id":"504717762e22e373:79a9d59cfeb9b2a5",
  "session_id":"ef4b2d937103c170:8a64f031fecfeab9",
  "start_unix_time":1604052301437,
  "end_unix_time":1604052301455,
  "auth_failure":false,
  "status":"ok",
  "user":"admin",
  "connected_user":"root",
  "client_network_address":"172.23.77.80:40374",
  "client_application":"okera-rest-server",
  "num_results_returned":0,
  "num_results_read":0,
  "bytes_scanned":0,
  "server_total_time_ms":0,
  "queue_time_ms":0,
  "planning_time_ms":0,
  "execution_time_ms":0,
  "statement_type":"CREATE_DATABASE",
  "for_reporting":true,
  "default_db":"default",
  "statement":"create database foobar",
  "ae_attribute":"",
  "ae_database":"",
  "ae_function":"",
  "ae_path":"",
  "ae_role":"",
  "ae_table":"",
  "ae_view":""
}
...
{
  "request_time":"2020-10-30 10:05:07.519814000",
  "request_id":"d44cf1ac90cb9d52:7811f9cb5065a29a",
  "session_id":"ca4532336b844b72:a4bbece430b38587",
  ...
  "statement_type":"DROP_DATABASE",
  ...
  "statement":"drop database foobar",
  "ae_attribute":"",
  "ae_database":"foobar",
  "ae_function":"",
  ...
}

The second record was abbreviated for the sake of brevity. Notice though how the record is containing the actual SQL statement, as well as the statement type, which is determined when the Okera Policy Engine (formerly known as the Planner) parses the query.

The audit records shown above were taken from the Policy Engine audit log files. Okera Enforcement Fleet workers emit their own audit records to separate log files, but to the same shared storage location. It means that both Policy Engine and Enforcement Fleet worker audit records are accessible through the same system tables, mitigating the need to merge them manually.

Here are the most important statement types the records may contain, along with the source emitting them and the instigating event.

Type	Kind	Source	Instigating Event
`PLAN`	DML	Policy Engine	Query is planned
`SCAN_START`	DML	Worker	Query task is executed
`SCAN_END`	DML	Worker	Query task completes
`INTERNAL_API`	DDL	Policy Engine	Metadata is returned to another Okera service
`DESCRIBE/DESCRIBE_DB`	DDL	Policy Engine	A client submits a `DESCRIBE` statement
`SHOW_{CREATE_TABLE,DBS,GRANT,ROLES,...}`	DDL	Policy Engine	A client submits a `SHOW` statement
`ALTER_{DATABASE,TABLE}`	DDL	Policy Engine	A client submits an `ALTER` statement
`CREATE_{DATABASE,TABLE,VIEW,...}`	DDL	Policy Engine	A client submits a `CREATE` statement
`DROP_{DATABASE,TABLE,VIEW,...}`	DDL	Policy Engine	A client submits a `DROP` statement
`GRANT_{PRIVILEGE,ROLE}`	DCL	Policy Engine	A client submits a `GRANT` statement
`REVOKE_{PRIVILEGE,ROLE}`	DCL	Policy Engine	A client submits a `REVOKE` statement
`{CREATE,ASSIGN,DELETE,...}_ATTRIBUTES`	DCL	Policy Engine	A client submits an ABAC-related statement
`UNKNOWN`	-	Policy Engine	A statement is not categorized or fails to parse

If you want to see all the statement types in your cluster, execute the following query.

SELECT statement_type, count(statement_type) AS cnt
FROM okera_system.audit_logs
GROUP BY statement_type
ORDER BY cnt DESC

Note: In this tutorial, we assume you are using the Presto SQL dialect, by either selecting the "Presto" option in the Okera UI's Workspace, or by using an external SQL client, such as DBeaver or SQLWorkbench/J.

Use Cases¶

The following sections introduce you to the audit log-related datasets and their content in more detail. Practical examples guide you through common use cases for data managers.

Correlate Audit Records for Queries¶

Each query in Okera goes through two phases: the planning phase and execution phase. The planning phase parses the query in the context of the user and plans the physical execution, while applying all necessary data access policies. After the query plan is compiled, it is returned to the caller, which then uses it to subsequently execute the plan in a series of query tasks. How this happens is client-dependent, but assuming a modern compute framework such as Spark or Presto, the query tasks are executed at some level of parallelization, where executors or worker processes run the tasks in parallel using CPU multithreading. This means that from an Okera perspective, the query triggers more than one audit event: one for the planning phase, and one or more for each task in the execution phase.

Example: Audit events for a simple SQL query

{
  "request_time":"2020-10-29 13:56:54.541251000",
  "request_id":"704d151f52e9f5f8:24c1fa9321700000",
  "session_id":"e49b9dc6af964db:60a778ce5801a186",
  ...
  "num_results_returned":1,...
  "planning_time_ms":0,
  "execution_time_ms":0,
  "statement_type":"PLAN",
  ...
  "statement":"select count(*) from `marketing`.`user_account_data`",
  "ae_attribute":"marketing.approved",
  "ae_database":"marketing",
  "ae_function":"",
  "ae_path":"",
  "ae_role":"",
  "ae_table":"marketing.user_account_data",
  "ae_view":""
}

{
  "request_time":"2020-10-29 13:56:54.733802000",
  "request_id":"704d151f52e9f5f8:24c1fa9321700001",
  "session_id":"e49b9dc6af964db:60a778ce5801a186",
    ...
  "statement_type":"SCAN_START",
  ...
  "statement":"select count(*) from `marketing`.`user_account_data`",
  ...
}

{
  "request_time":"2020-10-29 13:56:54.734082000",
  "request_id":"704d151f52e9f5f8:24c1fa9321700001",
  "session_id":"e49b9dc6af964db:60a778ce5801a186",
  ...
  "statement_type":"SCAN_END",
  ...
  "statement":"select count(*) from `marketing`.`user_account_data`",
  ...
}

In the example above, the tasks (just one in this example) emit two distinct events, one for the start of the read operation, and one for the end of it. The request_id value of the events has the following qualities:

Its format is two hexadecimal numbers divided by a colon: <prefix>:<postfix>.
For a specific client query, the request ID prefix (e.g. 704d151f52e9f5f8) is the same for all events of that query.
The postfix starts at zero in the right-most digits for the first query event and is incremented for each task (for example,24c1fa9321700000 for the PLAN event, and 24c1fa9321700001 for both the SCAN events of the same task) .

In other words, the SCAN start and end events have the same request ID. Since the end event is more interesting than the start, you can filter out that event completely. This allows you to correlate the events and query them to see the details for each query phase.

Example: Correlate plan and query task events

SELECT request_id, statement_type, statement 
FROM okera_system.steward_audit_logs
WHERE statement_type IN ('PLAN', 'SCAN_END') 
ORDER BY 1

The simple sort aligns all of the request ID prefixes next to each other, and groups the query-related events together.

Note that the query is using the okera_system.steward_audit_logs view, which implicitly is filtering the audit records for all rows for which the current user has access. This allows for this view to be accessible by many users, possibly even everyone, as each user only sees what they are allowed to see. Other views, such as the okera_system.reporting_audit_logs view (shown in the next use case) would not work. While it also filters on the source of the events, it only shows Policy Engine events, but not worker-related ones.

Find Events in Time¶

Another common task for auditors, managers, and administrators is to determine who accessed datasets through one of the Okera endpoints. This requires the use of the okera_system.audit_logs table and rather simple SQL queries to determine what happened during a specific period of time. First we need to look at the available columns in the audit logs table:

The columns of the audit logs table

There are a few things to note:

The columns that record the start (start_unix_time) and end time (end_unix_time) of an operation are defined as BIGINT. They are actually Unix epoch-based, recording the time elapsed since January 1st, 1970. For Okera, this elapsed time is given in millisecond granularity, while some clients or SQL functions may expect full seconds instead.
The audit logs table is partitioned by date (ymd) so that queries can take advantage of eliminating unwanted partitions altogether. Queries can either try to rely on the Policy Engine to detect predicates well enough to be able to prune partitions automatically, or they can simply specify the date range explicitly and force the pruning.
The user responsible for the audit record is shown in the user column and can be used to filter the audit records further.

For example, to find all audit records for a specific time period, write a simple SQL query combining this knowledge.

Example: Query all audit events for a specific time period

SELECT *
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
  AND start_unix_time BETWEEN 1600300801197 AND 1600387206546
  -- AND ae_table in ('marketing.transactions', 'creditcard.transactions_cc')
  AND user NOT IN ('okera', 'admin')

Note: The query also shows a comment line that can further filter the audit records by objects that were affected in the event. Experiment with the available columns to pinpoint what you need.

The results shown on the Okera UI's Workspace page might look like this:

Results of a simple audit logs table query

Converting epoch-based values into human-readable dates and back again is not easy. Use the built-in date conversion functions and some simple math to convert milliseconds to seconds (and back) where necessary.

Example: Using the built-in conversion functions to handle readable dates

SELECT
  from_unixtime(cast(start_unix_time/1000 as int)) as start_time,
  from_unixtime(cast(end_unix_time/1000 as int)) as end_time,
  auth_failure,
  status,
  user,
  client_application,
  statement_type
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
  AND start_unix_time BETWEEN to_unixtime(
    from_iso8601_timestamp('2020-09-17T00:00:00Z'))*1000
    AND to_unixtime(
    from_iso8601_timestamp('2020-09-18T00:00:00Z'))*1000
  AND user NOT IN ('okera', 'admin')

The output looks like this:

Converting dates to human-readable form

Finally, we can further simplify the query, as the okera_system.reporting_audit_logs view automatically converts column values (this capability is built in). For instance, start_unix_time is available as start_time_utc and allows you to completely drop the use of the conversion functions.

Example: Use the provided columns of the view to display and filter records

SELECT
  start_time_utc,
  end_time_utc,
  auth_failure,
  status,
  user,
  client_application,
  statement_type
FROM okera_system.reporting_audit_logs
WHERE
   ymd BETWEEN '2020-09-17' AND '2020-09-18'
   AND start_time_utc BETWEEN '2020-09-17' AND '2020-09-18'
   AND user not in ('okera', 'admin', 'root')

Use whichever option suits you best to access the audit logs table. Either way, you should be able to select the exact set of records based on time, user, and objects affected.

Permissions Granted to Users¶

Building on the previous use case, you might also ask who was assigned specific permissions to objects in the Schema Registry during a given period of time. This can be traced by looking further into the audit record values, which show the type of the statement executed. Okera uses this to filter for the data control language (DCL) events used to grant and revoke permissions, which, for Okera, means the granting of a role to a group of users or a specific user.

Example: Filtering audit records pertaining to permission changes

SELECT 
  start_time_utc,
  user,
  statement_type,
  statement
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-10-05' AND '2020-10-06'
  AND (statement_type = 'GRANT_ROLE' OR statement_type = 'REVOKE_ROLE')

The output might look like this:

Events pertaining to permission changes

The following example applies some string manipulation to make the result more readable. It extracts the role and group from the statement into dedicated columns.

Example: Extracting role and group from the filtered events

SELECT 
  start_time_utc,
  statement_type AS action,
  regexp_extract(statement, 
    '(?i)(?:GRANT|REVOKE) ROLE `?([^;`]+)`? (?:FROM|TO) GROUP `?([^`;]+)`?[;]?', 1) 
    AS role,
  regexp_extract(statement, 
    '(?i)(?:GRANT|REVOKE) ROLE `?([^;`]+)`? (?:FROM|TO) GROUP `?([^`;]+)`?[;]?', 2) 
    AS `group`,
  `user` AS executing_user
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-10-05' AND '2020-10-06'
  AND (statement_type = 'GRANT_ROLE' OR statement_type = 'REVOKE_ROLE')

The final result looks like this:

Extracting role and group names from the statement

Using the stored DCL-related records, you can also reconstruct the state of a role and group at any point in time.

Access Grouped by User and Date¶

Another common question that data administrators ask is: How many times did users access a specific registered dataset?

To answer this question, we can use the same approach as before, using the audit logs table or views to filter by date range, while also grouping by the user and building an aggregate.

Example: Compute number of accesses to datasets for specific users

SELECT
  ae_table AS query_table, 
  user AS query_user, 
  client_application AS app, 
  ymd, 
  count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
GROUP BY ae_table, user, client_application, ymd
ORDER BY query_table, user, client_application, ymd ASC;

The result might look like this:

Number of accesses by time and user

The following examples widen or narrow the scope of the above query. Results of these queries are omitted for the sake of brevity.

Example: Narrow down result by adding a specific user

SELECT
  ae_table AS query_table,
  client_application AS app,
  ymd,
  count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
  user IN ('janedoe', 'johndoe')
GROUP BY ae_table, user, client_application, ymd
ORDER BY query_table, user, client_application, ymd ASC;

Example: Broaden the scope by removing the application info

SELECT
  ae_table AS query_table, 
  user AS query_user,
  count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
GROUP BY ae_table, user
ORDER BY query_table, user ASC;

Example: Adding the statement the user ran

SELECT
  user AS query_user,
  ae_table AS query_table,
  statement,
 count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
GROUP BY user, ae_table, statement
ORDER BY user, query_table ASC;

Note that some rows do not have any values for the query_table column. This is due to the nature of the commands that were run by the user: some statements do not contain a specific table, for instance, GRANT ROLE ... TO GROUP ... statements. This can be fixed by further filtering on the statement type, and, if desired, all the way down to just the reading of data.

Example: Filtering only read access events

SELECT
  user AS query_user,
  ae_table AS query_table,
  statement,
 count(*) AS count
FROM okera_system.reporting_audit_logs
WHERE
  ymd BETWEEN '2020-09-17' AND '2020-09-18'
  AND statement_type in ('PLAN')
GROUP BY user, ae_table, statement
ORDER BY user, query_table ASC;

Adding the filter for PLAN-type events, the result is now comprised of only the audit records that were created by SELECT statements, that is, read access to the table:

Filtering read events only

Roles, Groups and Users¶

The system database okera_system includes other interesting datasets that are useful for data administrators, in addition to the main audit logs table, . A whole set of tables relates to the users issuing the query (the user column in the audit records) and their known groups, as well as roles that are associated with them. The following diagram explains the relationship of the first set of datasets.

Note: The diagrams use the standard entity relationship diagram (ERD) notation. This serves more as a help to think about how to use them in SQL queries, but less for accuracy: in data lakes there is usually no notion of referential integrity, since there are no primary or foreign keys.

ERD of roles and groups tables with alternative view

The diagram shows the okera_system.role_names and okera_system.group_names, that are simple lists where a distinct name of a role or group is stored with a unique ID. The okera_system.role_group_ids table links them together in a many-to-many relationship, connecting distinct role names with all the groups they have been grant to. The DCL statements that create a role, group and connecting the two looks like this (but obviously this can be done via the WebUI's Roles tab as well).

Example: DCL statements creating a role and group, and connecting them

> CREATE ROLE developer_role;
> GRANT ROLE developer_role TO GROUP developer_group;

There is also a view that converts the role and group IDs to human-readable names, that is, okera_system.role_groups. This view can be used to present a mapping list to interactive users.

The next set of datasets are related to the users that have (successfully) authenticated with one of the many Okera endpoints, and include a step called group resolution that determines the currently available groups of the user. These groups are later matched with the groups shown in the previous set of tables, namely okera_system.group_names. A user commonly belongs to more than one group and not all of them may have a role assigned to them. The following diagram shows the authentication and group lookup-related datasets.

ERD of users and groups tables with alternative views

The design is similar to the previous set of datasets: there are two simple list tables with an ID and a name, and then a table that joins the two in a many-to-many relationship, though here with an extra last_observed field. It records the time when a user authentication last presented the given group.

There is also a similar view, okera_system.groups_by_user, which replaces the IDs for the groups and users with the actual text representation. Note though, the view only shows the groups a user was authenticated with the last time they interacted with the cluster.

The additional okera_system.user_group_info view provides a list of all groups a user ever was authenticated with, even at different times. It omits the last_observed column, just listing all registered user/group combinations, and does this using the text representation of both again.

You can make use of the base tables that back these views and write a query that shows you all groups of all users, together with the time they were recorded.

Example: List all groups of all users at any time

SELECT u.user_name, g.group_name, ug.last_observed 
FROM okera_system.users u 
JOIN okera_system.user_group_mappings ug ON ug.user_name_id = u.id 
JOIN okera_system.groups g on ug.group_name_id = g.group_id 
ORDER BY u.user_name, ug.last_observed

The result might look like this:

List all groups of all users at any time

Finally, we can combine the audit logs table with this to compile a comprehensive list of all access to data, while also listing the groups that were used to execute the query.

Example: Combine audit logs with user groups

SELECT 
  al.start_time_utc, 
  u.user_name, 
  g.group_name, 
  al.statement
FROM okera_system.reporting_audit_logs al
JOIN okera_system.users u ON u.user_name = al.user
JOIN okera_system.user_group_mappings ug ON ug.user_name_id = u.id
JOIN okera_system.groups g on ug.group_name_id = g.group_id 
WHERE 
  al.start_time_utc > (
    SELECT max(ugm.last_observed) 
    FROM okera_system.user_group_mappings ugm
    WHERE ugm.user_name_id = ug.user_name_id
      AND ugm.last_observed <= al.start_time_utc 
  )
  AND al.statement_type IN ('PLAN')
-- AND al.user IN ('janedoe', 'johndoe')
  AND al.user NOT IN ('admin', 'okera')
  AND al.ymd BETWEEN '2020-11-01' AND '2020-11-3'
  AND g.group_name NOT IN ('__okera_public_group')
  AND al.status = 'ok'
ORDER BY al.start_time_utc, u.user_name

This query combines the audit records with the groups the user had access to just before the event took place. In other words, it shows the groups a user was operating in for each read access to data.

The result might look like this:

Combine audit logs with user groups

There are many more ways you can slice and dice the audit and user-related information. This tutorial just scratched the surface of what is possible.