Privacy and Security Functions

This document is for administrators and data stewards who are looking for information about Okera's privacy and security functions. To see examples of these functions used in policies see our policy examples guide.

Privacy functions summary

Transform Function Description
Mask mask() Completely masks a value, crossing out alpha-numerical characters.
Mask ccn mask_ccn() Masks everything except the last 4 characters, useful for credit card numbers.
Random ccn random_ccn() Generates a random credit card number.
Null null() Replaces original value with null.
Random nfp_random Replaces original value with a random non-format preserving value. Does not maintain referential integrity.
Format preserving masking fp_random Replaces original value with a random format preserving value. Does not maintain referential integrity.
Tokenization - format preserving tokenize() fp_ref_tokenize() Replaces value with a format preserving token that maintains referential integrity. Note tokenize() only maintains referential integrity for STRING and BIGINT types, whereas fp_ref_tokenize() does it for all types.
Tokenization - non-format preserving nfp_ref_tokenize() Replaces value with a non-format preserving token that maintains referential integrity.
Hash (Murmur) hash() Hashes the given value using the Murmur algorithm. Maintains referential integrity.
Hash (SHA2) sha2() Hashes the given value using the SHA2 algorithm. Maintains referential integrity.
Hash (FNV) fnv_hash() Hashes the given value using the FNV algorithm. Maintains referential integrity.
Zero zero() Replaces original value with an empty value.
Differential Privacy diff_privacy() Enables differential privacy for numeric values.

Masking

Masking or suppression is anonymization technique where parts of the data are completely hidden with random characters or other data.

Mask

STRING mask(STRING, INT <start>, INT <length>)

Completely masks the characters in a given string value. You can provide optional arguments for the character <start> and total <length> of masked characters.

SQL Example: Using the mask() function

> SELECT mask("5000 4000 3000 1000");
XXXX XXXX XXXX XXXX

> SELECT mask("5000 4000 3000 1000", 6, 9);
5000 XXXXXXXXX 1000

Mask ccn

STRING mask_ccn(STRING)

Masks all but that last four digits of a given credit card number.

SQL Example: Using the mask_ccn() function

> SELECT mask_ccn("5000 4000 3000 1000")
XXXXXXXXXXXXXXX1000

> SELECT mask_ccn("5000400030001000")
XXXXXXXXXXXX1000

Random ccn

STRING random_ccn()

Returns a random credit card number, divided by dashes.

SQL Example: Using the random_ccn() function

> SELECT random_ccn();
5580-5293-0763-8227

> SELECT random_ccn();
3484-8051-1437-2629

Random

nfp_random(<primitive_type>)

Applies a random, non-format preserving transformation to the given value. Does not maintain referential integrity and values will always be different, even in the same table.

SQL Example: Using the nfp_random() function

> SELECT nfp_random('Jane Smith');
6947408657753046000

> SELECT nfp_random('Jane Smith');
7021543741362542000

Format preserving masking

STRING fp_random(STRING)

Applies a format-preserving anonymization to the given text. Does not maintain referential integrity and values will always be different, even in the same table.

SQL Example: Using the fp_random() function

> SELECT fp_random('Jane Smith');
Hubq Bntql

> SELECT fp_random('5500 0000 0000 0004');
1467 1315 1117 7939

Here's what an example column with the same values would look like:

Name With fp_random transformation
Kory Camacho Fhxr Vyduylz
Kory Camacho Impe Wngubgs

Null

null()

Replace values with null.

> SELECT null("5000 4000 3000 1000");
null

Zero

zero()

Replace values with a blank value.

> SELECT zero("5000 4000 3000 1000");
>

Tokenization

Tokenization is pseudonymization technique that substitutes a value that may be sensitive, with a non-sensitive token. Okera offers different tokenization functions depending on if you want format-preservation, or to maintain referential integrity. Referential integrity means the resulting token is always the same for a given value across tables, so that analytical correlations i.e joins are still possible.

Note

  • Format Preserving means the type and general format of the data will be preserved.
  • Referential Integrity means that tokenized columns can be used for joins across tables.
  • Stable Results means that queries against the same data by the same user on the same table will produce the same results.
Function Format Preserving Referential Integrity Stable Results
tokenize() Yes STRING & BIGINT types only Yes
fp_ref_tokenize Yes Yes Yes
fp_noref_tokenize Yes No Yes
nfp_ref_tokenize No Yes Yes
nfp_noref_tokenize No No Yes

Note

Users manually executing any tokenization function in a query will not see the same result as if executed inside a view or as a transformation function in a policy. This removes the possibility of a user working backwards and "undoing" the anonymization.

Configuring referential integrity

By default, functions that preserve referential integrity will produce the same output for all users, and without changing over time when executed inside a view or as a transformation function in a policy.

This behavior is configurable in two ways:

  • Passing explicit modifier arguments to the function. For example, the following view definition:
CREATE VIEW <db>.<view> SELECT
fp_ref_tokenize(<col>, signed_user(true), signed_period_sec(3600)
FROM <db>.<table>

will result in every user seeing different results when querying <db>.<view>, and those results will change every hour.

  • Changing the system defaults:
    • TOKENIZATION_DEFAULT_SHARING_UNIQUE_PER_USER - when set to true the system will default to unique values per user when executed inside a view or as a transformation function in a policy. The default is false.
    • TOKENIZATION_DEFAULT_PERIOD_SEC - when set to 0 the results will change on every invocation. When set to a positive integer K, the results will change every K seconds. The default is -1, which disables the period change.

See an example of a tokenization policy here.

Format preserving tokenization

With referential integrity

STRING tokenize(STRING) BIGINT tokenize(BIGINT)

fp_ref_tokenize(<primitive_type>)

Note

The default tokenize() function only maintains for STRING and BIGINT data types. This is because referential integrity on low cardinality types such as boolean may be too easy to reverse identify. If you do wish to maintain referential integrity across all data types, please use the fp_ref_tokenize() function.

Example: Using the tokenize() function

> SELECT tokenize("125.1.10.34")
847.9.84.92

Take the following example, note that referential integrity is preserved, so joins are possible across tables:

Name With fp_ref_tokenize()
Elaine Wilkerson Xdvvys Qaybengxv
Kory Camacho Avxj Yptqnhf
Kory Camacho Avxj Yptqnhf

Without referential integrity

fp_noref_tokenize(<primitive_type>)

Example: Using the fp_noref_tokenize() function

> SELECT fp_noref_tokenize("125.1.10.34")
1529789670085611000

Take the following example. Although values are stable inside each table, they are different across tables, which means referential integrity is not maintained :

Base table 01

Name With fp_noref_tokenize()
Elaine Wilkerson Qbkxvm Ozumpmkvg
Kory Camacho Hqos Vfjrxam
Kory Camacho Hqos Vfjrxam

Base table 02

Name With fp_noref_tokenize()
Elaine Wilkerson Pqwzvw Mtplysmj
Kory Camacho Kbxw Abichog
Kory Camacho Kbxw Abichog

Non format preserving tokenization

With referential integrity

nfp_ref_tokenize(<primitive_type>)

Example: Using the nfp_ref_tokenize() function

> SELECT nfp_ref_tokenize("125.1.10.34")
7501817423217723000

Take the following example, note that referential integrity is preserved, so joins are possible across datasets, but format is not preserved:

Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Referential integrity is preserved so joins are possible across tables:

Name With nfp_ref_tokenize()
Elaine Wilkerson 1732252590825781000
Kory Camacho 259850744261204400
Kory Camacho 259850744261204400

Without referential integrity

nfp_noref_tokenize(<primitive_type>)

Example: Using the nfp_noref_tokenize() function

> SELECT nfp_noref_tokenize("125.1.10.34")
1529789670085611000

Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Although the tokenized values are stable inside each table, they are different across tables, which means referential integrity is not maintained:

Table 01

Name With nfp_noref_tokenize()
Elaine Wilkerson 8407409921477382000
Kory Camacho 3042039274287704600
Kory Camacho 3042039274287704600

Table 02

Name With nfp_ref_tokenize()
Elaine Wilkerson 1732252590825781000
Kory Camacho 259850744261204400
Kory Camacho 259850744261204400

Hash

hash()

Hashing returns a fixed size token output from an input of any size according to a mathematical algorithm. Hashed values are irreversible and not format preserving, but they maintain referential integrity. Hashed values are not unique per user and every user will see the same hash for a given value.

SHA2

BIGINT sha2(<primitive_type>)

Applies the SHA2 hash algorithm to the input value. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.

SQL Example: Using the sha2() function

> SELECT fnv_hash("125.1.10.34")
6088146908894507000

FNV Hash

BIGINT fnv_hash(<primitive_type>)

This hash function applies a FNV based hash algorithm to the input value. Similar to MD5 but non-cryptographic and faster. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.

SQL Example: Using the fnv_hash() function

> SELECT fnv_hash("125.1.10.34")
-7425713518659373000

Differential Privacy

<numeric_type> diff_privacy(<numeric_type> <value>, INT <noise>)

Allows to add a statistical noise (or jitter) to a numeric value, while retaining the same properties when aggregated. For example, it is possible to add +/- 10 years to the individual values of a column containing birthdays, but when calculating the average age across all records you still get close to the same result.

Careful attention has been taken to ensure that this is not susceptible to repeated queries. For example a query like SELECT salary WHERE id = 1234 will always return the same value, not a normally distributed random value around the true one. Otherwise, a malicious user could simply rerun queries and average results. This is implemented by making sure the "same query" always returns the same noise, that is, you get no additional information from repeated runs.

Example: Using the diff_privacy() function

> SELECT diff_privacy(50, 10);
54

Custom user defined functions

If you wish to create your own functions for security, see Extending ODAS.