Privacy and Security Functions¶
This document is for administrators and data stewards who are looking for information about Okera's privacy and security functions. To see examples of these functions used in policies see our policy examples guide.
Privacy functions summary¶
Transform | Function | Description |
---|---|---|
Mask | mask() |
Completely masks a value, crossing out alpha-numerical characters. |
Mask ccn | mask_ccn() |
Masks everything except the last 4 characters, useful for credit card numbers. |
Random ccn | random_ccn() |
Generates a random credit card number. |
Null | null() |
Replaces original value with null. |
Random | nfp_random |
Replaces original value with a random non-format preserving value. Does not maintain referential integrity. |
Format preserving masking | fp_random |
Replaces original value with a random format preserving value. Does not maintain referential integrity. |
Tokenization - format preserving | tokenize() fp_ref_tokenize() |
Replaces value with a format preserving token that maintains referential integrity. Note tokenize() only maintains referential integrity for STRING and BIGINT types, whereas fp_ref_tokenize() does it for all types. |
Tokenization - non-format preserving | nfp_ref_tokenize() |
Replaces value with a non-format preserving token that maintains referential integrity. |
Hash (Murmur) | hash() |
Hashes the given value using the Murmur algorithm. Maintains referential integrity. |
Hash (SHA2) | sha2() |
Hashes the given value using the SHA2 algorithm. Maintains referential integrity. |
Hash (FNV) | fnv_hash() |
Hashes the given value using the FNV algorithm. Maintains referential integrity. |
Zero | zero() |
Replaces original value with an empty value. |
Differential Privacy | diff_privacy() |
Enables differential privacy for numeric values. |
Masking¶
Masking or suppression is anonymization technique where parts of the data are completely hidden with random characters or other data.
Mask¶
STRING mask(STRING, INT <start>, INT <length>)
Completely masks the characters in a given string value. You can provide optional arguments for the character <start>
and total <length>
of masked characters.
SQL Example: Using the mask()
function
> SELECT mask("5000 4000 3000 1000");
XXXX XXXX XXXX XXXX
> SELECT mask("5000 4000 3000 1000", 6, 9);
5000 XXXXXXXXX 1000
Mask ccn¶
STRING mask_ccn(STRING)
Masks all but that last four digits of a given credit card number.
SQL Example: Using the mask_ccn()
function
> SELECT mask_ccn("5000 4000 3000 1000")
XXXXXXXXXXXXXXX1000
> SELECT mask_ccn("5000400030001000")
XXXXXXXXXXXX1000
Random ccn¶
STRING random_ccn()
Returns a random credit card number, divided by dashes.
SQL Example: Using the random_ccn()
function
> SELECT random_ccn();
5580-5293-0763-8227
> SELECT random_ccn();
3484-8051-1437-2629
Random¶
nfp_random(<primitive_type>)
Applies a random, non-format preserving transformation to the given value. Does not maintain referential integrity and values will always be different, even in the same table.
SQL Example: Using the nfp_random()
function
> SELECT nfp_random('Jane Smith');
6947408657753046000
> SELECT nfp_random('Jane Smith');
7021543741362542000
Format preserving masking¶
STRING fp_random(STRING)
Applies a format-preserving anonymization to the given text. Does not maintain referential integrity and values will always be different, even in the same table.
SQL Example: Using the fp_random()
function
> SELECT fp_random('Jane Smith');
Hubq Bntql
> SELECT fp_random('5500 0000 0000 0004');
1467 1315 1117 7939
Here's what an example column with the same values would look like:
Name | With fp_random transformation |
---|---|
Kory Camacho | Fhxr Vyduylz |
Kory Camacho | Impe Wngubgs |
Null¶
null()
Replace values with null.
> SELECT null("5000 4000 3000 1000");
null
Zero¶
zero()
Replace values with a blank value.
> SELECT zero("5000 4000 3000 1000");
>
Tokenization¶
Tokenization is pseudonymization technique that substitutes a value that may be sensitive, with a non-sensitive token. Okera offers different tokenization functions depending on if you want format-preservation, or to maintain referential integrity. Referential integrity means the resulting token is always the same for a given value across tables, so that analytical correlations i.e joins are still possible.
Note
Format Preserving
means the type and general format of the data will be preserved.Referential Integrity
means that tokenized columns can be used for joins across tables.Stable Results
means that queries against the same data by the same user on the same table will produce the same results.
Function | Format Preserving | Referential Integrity | Stable Results |
---|---|---|---|
tokenize() |
Yes | STRING & BIGINT types only | Yes |
fp_ref_tokenize |
Yes | Yes | Yes |
fp_noref_tokenize |
Yes | No | Yes |
nfp_ref_tokenize |
No | Yes | Yes |
nfp_noref_tokenize |
No | No | Yes |
Note
Users manually executing any tokenization function in a query will not see the same result as if executed inside a view or as a transformation function in a policy. This removes the possibility of a user working backwards and "undoing" the anonymization.
Configuring referential integrity¶
By default, functions that preserve referential integrity will produce the same output for all users, and without changing over time when executed inside a view or as a transformation function in a policy.
This behavior is configurable in two ways:
- Passing explicit modifier arguments to the function. For example, the following view definition:
CREATE VIEW <db>.<view> SELECT
fp_ref_tokenize(<col>, signed_user(true), signed_period_sec(3600)
FROM <db>.<table>
will result in every user seeing different results when querying <db>.<view>
, and those results will change every hour.
- Changing the system defaults:
TOKENIZATION_DEFAULT_SHARING_UNIQUE_PER_USER
- when set totrue
the system will default to unique values per user when executed inside a view or as a transformation function in a policy. The default isfalse
.TOKENIZATION_DEFAULT_PERIOD_SEC
- when set to0
the results will change on every invocation. When set to a positive integerK
, the results will change everyK
seconds. The default is-1
, which disables the period change.
See an example of a tokenization policy here.
Format preserving tokenization¶
With referential integrity¶
STRING tokenize(STRING)
BIGINT tokenize(BIGINT)
fp_ref_tokenize(<primitive_type>)
Note
The default tokenize() function only maintains referential integrity for STRING
and BIGINT
data types.
This is because referential integrity on low cardinality types such as boolean
may be too easy to reverse identify.
If you do wish to maintain referential integrity across all data types, please use the fp_ref_tokenize()
function.
Example: Using the tokenize()
function
> SELECT tokenize("125.1.10.34")
847.9.84.92
Take the following example, note that referential integrity is preserved, so joins are possible across tables:
Name | With fp_ref_tokenize() |
---|---|
Elaine Wilkerson | Xdvvys Qaybengxv |
Kory Camacho | Avxj Yptqnhf |
Kory Camacho | Avxj Yptqnhf |
Without referential integrity¶
fp_noref_tokenize(<primitive_type>)
Example: Using the fp_noref_tokenize()
function
> SELECT fp_noref_tokenize("125.1.10.34")
1529789670085611000
Take the following example. Although values are stable inside each table, they are different across tables, which means referential integrity is not maintained :
Base table 01
Name | With fp_noref_tokenize() |
---|---|
Elaine Wilkerson | Qbkxvm Ozumpmkvg |
Kory Camacho | Hqos Vfjrxam |
Kory Camacho | Hqos Vfjrxam |
Base table 02
Name | With fp_noref_tokenize() |
---|---|
Elaine Wilkerson | Pqwzvw Mtplysmj |
Kory Camacho | Kbxw Abichog |
Kory Camacho | Kbxw Abichog |
Non format preserving tokenization¶
With referential integrity¶
nfp_ref_tokenize(<primitive_type>)
Example: Using the nfp_ref_tokenize()
function
> SELECT nfp_ref_tokenize("125.1.10.34")
7501817423217723000
Take the following example, note that referential integrity is preserved, so joins are possible across datasets, but format is not preserved:
Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Referential integrity is preserved so joins are possible across tables:
Name | With nfp_ref_tokenize() |
---|---|
Elaine Wilkerson | 1732252590825781000 |
Kory Camacho | 259850744261204400 |
Kory Camacho | 259850744261204400 |
Without referential integrity¶
nfp_noref_tokenize(<primitive_type>)
Example: Using the nfp_noref_tokenize()
function
> SELECT nfp_noref_tokenize("125.1.10.34")
1529789670085611000
Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Although the tokenized values are stable inside each table, they are different across tables, which means referential integrity is not maintained:
Table 01
Name | With nfp_noref_tokenize() |
---|---|
Elaine Wilkerson | 8407409921477382000 |
Kory Camacho | 3042039274287704600 |
Kory Camacho | 3042039274287704600 |
Table 02
Name | With nfp_ref_tokenize() |
---|---|
Elaine Wilkerson | 1732252590825781000 |
Kory Camacho | 259850744261204400 |
Kory Camacho | 259850744261204400 |
Hash¶
hash()
Hashing returns a fixed size token output from an input of any size according to a mathematical algorithm. Hashed values are irreversible and not format preserving, but they maintain referential integrity. Hashed values are not unique per user and every user will see the same hash for a given value.
SHA2¶
BIGINT sha2(<primitive_type>)
Applies the SHA2 hash algorithm to the input value. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.
SQL Example: Using the sha2()
function
> SELECT fnv_hash("125.1.10.34")
6088146908894507000
FNV Hash¶
BIGINT fnv_hash(<primitive_type>)
This hash function applies a FNV based hash algorithm to the input value. Similar to MD5 but non-cryptographic and faster. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.
SQL Example: Using the fnv_hash()
function
> SELECT fnv_hash("125.1.10.34")
-7425713518659373000
Differential Privacy¶
<numeric_type> diff_privacy(<numeric_type> <value>, INT <noise>)
Allows to add a statistical noise (or jitter) to a numeric value, while retaining the same properties when aggregated. For example, it is possible to add +/- 10 years to the individual values of a column containing birthdays, but when calculating the average age across all records you still get close to the same result.
Careful attention has been taken to ensure that this is not susceptible to repeated queries.
For example a query like SELECT salary WHERE id = 1234
will always return the same value, not a normally distributed random value around the true one.
Otherwise, a malicious user could simply rerun queries and average results.
This is implemented by making sure the "same query" always returns the same noise, that is, you get no additional information from repeated runs.
Example: Using the diff_privacy()
function
> SELECT diff_privacy(50, 10);
54
Custom user defined functions¶
If you wish to create your own functions for security, see Extending ODAS.