Privacy and Security Functions¶

This document provides information about Okera's privacy and security functions. To see examples of these functions used in policies see Permission Examples.

The following table summarizes Okera's privacy functions.

Transformation Type	Function	Description
AES decryption	`aes_decrypt()`	Decrypts Okera strings using Advanced Encryption Standard (AES). This does not work with strings from other data sources. This can be restricted for use by Okera admins only.
AES encryption	`aes_encrypt()`	Encrypts Okera strings using Advanced Encryption Standard (AES). This does not work with strings from other data sources. Use of this function is restricted to Okera admins only.
Differential Privacy	`diff_privacy()`	Enables differential privacy for numeric values.
Hash (FNV)	`fnv_hash()`	Hashes the given value using the FNV algorithm. Maintains referential integrity.
Format preserving masking	`fp_random`	Replaces original value with a random format preserving value. Does not maintain referential integrity.
Tokenization - format preserving	`fp_ref_tokenize()`	Replaces value with a format preserving token that maintains referential integrity. Note `tokenize()` only maintains referential integrity for `STRING` and `BIGINT` types, whereas `fp_ref_tokenize()` does it for all types.
Hash (Murmur)	`hash()`	Hashes the given value using the Murmur algorithm. Maintains referential integrity.
Mask	`mask()`	Completely masks a value.
Mask ccn	`mask_ccn()`	Masks everything except the last 4 characters, useful for credit card numbers.
Random	`nfp_random`	Replaces original value with a random non-format preserving value. Does not maintain referential integrity.
Tokenization - non-format preserving	`nfp_ref_tokenize()`	Replaces value with a non-format preserving token that maintains referential integrity.
Null	`null()`	Replaces original value with null.
PHI age	`phi_age()`	Compliant with the HIPAA safe-harbor standard, this caps ages shown to 90. If a person is over 90 years old, 90 is shown.
PHI date	`phi_date()`	Compliant with the HIPAA safe-harbor standard, this rounds the date to 1/1 (January 1) of the current year.
PHI date of birth	`phi_dob()`	Compliant with the HIPAA safe-harbor standard, this rounds the date to 1/1 (January 1) of a person's birth year, unless the person is over 90. If a person is over 90, the birth year is the year at which they turned 90.
PHI zip code	`phi_zip3()`	Compliant with the HIPAA safe-harbor standard, this shows only 3 digits and is fully redaacted (000) for very small zipcodes. In addition, five-digit numeric zip codes are supported (however, if the zip codes are in string format, this limitation does not apply).
Random ccn	`random_ccn()`	Generates a random credit card number.
Hash (SHA2)	`sha2()`	Hashes the given value using the SHA2 algorithm. Maintains referential integrity.
Tokenization - format preserving	`tokenize()`	Replaces value with a format preserving token that maintains referential integrity. Note `tokenize()` only maintains referential integrity for `STRING` and `BIGINT` types, whereas `fp_ref_tokenize()` does it for all types.
Zero	`zero()`	Replaces original value with an empty value.

Prioritization of Transformations¶

You can prioritize transformations when multiple transformations are applied to a single column. If you use autotagging, multiple tags can be assigned to a single column, each with its own transformations. Multiple transformations applied to a single column make the column inaccessible with a conflict.

To resolve this problem, prioritize the transformation functions. The prioritization applies across the entire system. Wherever multiple transformations occur for any column, the prioritization is applied.

To prioritize transformations, add the TRANSFORM_UDF_PRIORITIES configuration property to your configuration settings. Values for this property are the transformation function names, specified in sequence, separated by commas. The last function name in the list has priority over all the preceding functions in the list as well as priority over functions not included in the list.

Note: You do not need to specify all available function names in the configuration property. Functions that are not listed are given the lowest priority. However, if multiple unlisted functions are applied to a single column, a column conflict (errors) will occur because the functions have not been prioritized.

In the following example, sha2 has the highest priority and will be applied to the column data, when it is requested for a column with any other transformation function. Likewise, tokenize will be given priority and applied to the column data when it is requested for a column with null or mask or any other transformation function except sha2. Finally, fp_random (not in the list) will be given lower priority than null, mask, tokenize, or sha2. If fp_random and another unlisted transformation function are specified for a column, a column conflict will occur.

TRANSFORM_UDF_PRIORITIES: null,mask,tokenize,sha2

Valid transformation function names that can be specified for TRANSFORM_UDF_PRIORITIES are shown in the following table. Their default priority is also shown (if you do not override that priority in TRANSFORM_UDF_PRIORITIES). Higher priorities override lower priorities (for example, mask transformations override zero transformations when applied to the same column).

Function Name	Default Priority
`aes_decrypt`	6
`aes_encrypt`	7
`diff_privacy`	13
`fnv_hash`	5
`fp_random`	18
`fp_ref_tokenize`	9
`hash`	4
`mask`	11
`mask_ccn`	12
`nfp_random`	19
`nfp_ref_tokenize`	10
`null`	1
`phi_age`	14
`phi_date`	15
`phi_dob`	16
`phi_zip3`	17
`random_ccn`	20
`sha2`	3
`tokenize`	8
`zero`	2

Descriptions of each of these transformation functions are provided above.

Restrict Use of Privacy Functions¶

You can restrict use of Okera's privacy functions and user-defined functions (UDFs) to Okera administrators only. To activate this feature, add the RESTRICTED_UDFS configuration parameter to your Okera configuration file. Valid values for this parameter are a comma-separated list of function names. Use of any functions listed in the parameter require administrator privileges. In the following example, the aes_decrypt and nfp_ref_tokenize privacy functions can only be used by administrators.

RESTRICTED_UDFS: aes_decrypt,nfp_ref_tokenize

Masking¶

Masking, or suppression, is an anonymization technique where parts of the data are completely hidden with random characters or other data.

mask(<STRING | INT | DATE>, INT <start>, INT <length>)

Type	Datatypes	Masking behavior
String	STRING, TEXT, VARCHAR	Format-preserving, replaces alphanumeric characters with X. You may optionally specify a character limit starting at position `<start>` (1-based) and ending after `<length>` characters.
Numerical	BIGINT, DECIMAL, INT, FLOAT	Uses a zero value
Date & Time	DATE, TIMESTAMP	Uses epoch time zero. 01.01.1970 00:00:00.000000

SQL Example: Using the mask() function

> SELECT mask("5000 4000 3000 1000", 6, 4);
5000 XXXX 3000 1000

> SELECT mask("5000 4000 3000 1000", 6, 9);
5000 XXXXXXXXX 1000

Mask ccn¶

STRING mask_ccn(STRING)

Masks all but that last four digits of a given credit card number.

SQL Example: Using the mask_ccn() function

> SELECT mask_ccn("5000 4000 3000 1000")
XXXXXXXXXXXXXXX1000

> SELECT mask_ccn("5000400030001000")
XXXXXXXXXXXX1000

Random ccn¶

STRING random_ccn()

Returns a random credit card number, divided by dashes.

SQL Example: Using the random_ccn() function

> SELECT random_ccn();
5580-5293-0763-8227

> SELECT random_ccn();
3484-8051-1437-2629

Random¶

nfp_random(<primitive_type>)

Applies a random, non-format preserving transformation to the given value. Does not maintain referential integrity and values will always be different, even in the same table.

SQL Example: Using the nfp_random() function

> SELECT nfp_random('Jane Smith');
6947408657753046000

> SELECT nfp_random('Jane Smith');
7021543741362542000

Format-Preserving Masking¶

STRING fp_random(STRING)

Applies a format-preserving anonymization to the given text. Does not maintain referential integrity and values will always be different, even in the same table.

SQL Example: Using the fp_random() function

> SELECT fp_random('Jane Smith');
Hubq Bntql

> SELECT fp_random('5500 0000 0000 0004');
1467 1315 1117 7939

Here's what an example column with the same values would look like:

Name	With fp_random transformation
Kory Camacho	Fhxr Vyduylz
Kory Camacho	Impe Wngubgs

Null¶

null()

Replace values with null.

> SELECT null("5000 4000 3000 1000");
null

Zero¶

zero()

Replace values with a blank value.

> SELECT zero("5000 4000 3000 1000");
>

Tokenization¶

Tokenization is pseudonymization technique that substitutes a value that may be sensitive, with a non-sensitive token. Okera offers different tokenization functions depending on if you want format-preservation, or to maintain referential integrity. Referential integrity means the resulting token is always the same for a given value across tables, so that analytical correlations (such as joins) are still possible.

Note: Format Preserving indicates whether the type and general format of the data will be preserved. Referential Integrity indicates whether tokenized columns can be used for joins across tables. Stable Results indicates whether queries against the same data by the same user on the same table will produce the same results.

Function	Format Preserving	Referential Integrity	Stable Results
`tokenize()`	Yes	STRING & BIGINT types only	Yes
`fp_ref_tokenize`	Yes	Yes	Yes
`fp_noref_tokenize`	Yes	No	Yes
`nfp_ref_tokenize`	No	Yes	Yes
`nfp_noref_tokenize`	No	No	Yes

Note: Users manually executing any tokenization function in a query will not see the same result as if executed inside a view or as a transformation function in a policy. This removes the possibility of a user working backwards and "undoing" the anonymization.

Configure Referential Integrity¶

By default, functions that preserve referential integrity produce the same output for all users, and without change over time when executed inside a view or as a transformation function in a policy.

This behavior is configurable in two ways:

Pass explicit modifier arguments to the function. For example, the following view definition results in every user seeing different results when they query <db>.<view>, and those results change every hour:
```
CREATE VIEW <db>.<view> SELECT
fp_ref_tokenize(<col>, signed_user(true), signed_period_sec(3600)
FROM <db>.<table>
```
Change the system defaults:
- TOKENIZATION_DEFAULT_SHARING_UNIQUE_PER_USER - when set to true the system will default to unique values per user when executed inside a view or as a transformation function in a policy. The default is false.
- TOKENIZATION_DEFAULT_PERIOD_SEC - when set to 0 the results will change on every invocation. When set to a positive integer K, the results will change every K seconds. The default is -1, which disables the period change.

See an example of a tokenization policy in Permission Examples.

Format-Preserving Tokenization¶

With Referential Integrity¶

STRING tokenize(STRING) BIGINT tokenize(BIGINT)

fp_ref_tokenize(<primitive_type>)

Note: The default tokenize() function only maintains referential integrity for STRING and BIGINT data types. This is because referential integrity on low cardinality types such as boolean may be too easy to reverse identify. If you wish to maintain referential integrity across all data types, use the fp_ref_tokenize() function.

Here is an example using the tokenize() function:

> SELECT tokenize("125.1.10.34")
847.9.84.92

In the example, referential integrity is preserved, so joins are possible across tables:

Name	With `fp_ref_tokenize()`
Elaine Wilkerson	Xdvvys Qaybengxv
Kory Camacho	Avxj Yptqnhf
Kory Camacho	Avxj Yptqnhf

Without Referential Integrity¶

fp_noref_tokenize(<primitive_type>)

Here is an example using the fp_noref_tokenize() function:

> SELECT fp_noref_tokenize("125.1.10.34")
1529789670085611000

In this example, although values are stable inside each table, they are different across tables, which means referential integrity is not maintained :

Base Table 01

Name	With `fp_noref_tokenize()`
Elaine Wilkerson	Qbkxvm Ozumpmkvg
Kory Camacho	Hqos Vfjrxam
Kory Camacho	Hqos Vfjrxam

Base Table 02

Name	With fp_noref_tokenize()
Elaine Wilkerson	Pqwzvw Mtplysmj
Kory Camacho	Kbxw Abichog
Kory Camacho	Kbxw Abichog

Non-Format-Preserving Tokenization¶

With Referential Integrity¶

nfp_ref_tokenize(<primitive_type>)

Example: Using the nfp_ref_tokenize() function

> SELECT nfp_ref_tokenize("125.1.10.34")
7501817423217723000

Take the following example, note that referential integrity is preserved, so joins are possible across datasets, but format is not preserved:

Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Referential integrity is preserved so joins are possible across tables:

Name	With nfp_ref_tokenize()
Elaine Wilkerson	1732252590825781000
Kory Camacho	259850744261204400
Kory Camacho	259850744261204400

Without Referential Integrity¶

nfp_noref_tokenize(<primitive_type>)

Here is an example using the nfp_noref_tokenize() function:

> SELECT nfp_noref_tokenize("125.1.10.34")
1529789670085611000

In this example, format is not preserved because text characters have been replaced with numerical ones. Although the tokenized values are stable inside each table, they are different across tables, which means referential integrity is not maintained:

Table 01

Name	With nfp_noref_tokenize()
Elaine Wilkerson	8407409921477382000
Kory Camacho	3042039274287704600
Kory Camacho	3042039274287704600

Table 02

Name	With nfp_ref_tokenize()
Elaine Wilkerson	1732252590825781000
Kory Camacho	259850744261204400
Kory Camacho	259850744261204400

Hash¶

hash()

Hashing returns a fixed size token output from an input of any size according to a mathematical algorithm. Hashed values are irreversible and not format preserving, but they maintain referential integrity. Hashed values are not unique per user and every user will see the same hash for a given value.

SHA2¶

BIGINT sha2(<primitive_type>)

Applies the SHA2 hash algorithm to the input value. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.

SQL Example: Using the sha2() function

> SELECT fnv_hash("125.1.10.34")
6088146908894507000

FNV Hash¶

BIGINT fnv_hash(<primitive_type>)

This hash function applies a FNV based hash algorithm to the input value. Similar to MD5 but non-cryptographic and faster. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.

SQL Example: Using the fnv_hash() function

> SELECT fnv_hash("125.1.10.34")
-7425713518659373000

Differential Privacy¶

<numeric_type> diff_privacy(<numeric_type> <value>, INT <noise>)

Allows to add a statistical noise (or jitter) to a numeric value, while retaining the same properties when aggregated. For example, it is possible to add +/- 10 years to the individual values of a column containing birthdays, but when calculating the average age across all records you still get close to the same result.

Careful attention has been taken to ensure that this is not susceptible to repeated queries. For example a query like SELECT salary WHERE id = 1234 will always return the same value, not a normally distributed random value around the true one. Otherwise, a malicious user could simply rerun queries and average results. This is implemented by making sure the "same query" always returns the same noise, that is, you get no additional information from repeated runs.

Example: Using the diff_privacy() function

> SELECT diff_privacy(50, 10);
54

Custom User-Defined Functions (UDF)¶

If you wish to create your own functions for security, see Extending Okera.