Privacy and Security Functions¶
This document provides information about Okera's privacy and security functions. To see examples of these functions used in policies see Permission Examples.
The following table summarizes Okera's privacy functions.
Transformation Type | Function | Description |
---|---|---|
AES decryption | aes_decrypt() |
Decrypts Okera strings using Advanced Encryption Standard (AES). This does not work with strings from other data sources. This can be restricted for use by Okera admins only. |
AES encryption | aes_encrypt() |
Encrypts Okera strings using Advanced Encryption Standard (AES). This does not work with strings from other data sources. Use of this function is restricted to Okera admins only. |
Differential Privacy | diff_privacy() |
Enables differential privacy for numeric values. |
Hash (FNV) | fnv_hash() |
Hashes the given value using the FNV algorithm. Maintains referential integrity. |
Format preserving masking | fp_random |
Replaces original value with a random format preserving value. Does not maintain referential integrity. |
Tokenization - format preserving | fp_ref_tokenize() |
Replaces value with a format preserving token that maintains referential integrity. Note tokenize() only maintains referential integrity for STRING and BIGINT types, whereas fp_ref_tokenize() does it for all types. |
Hash (Murmur) | hash() |
Hashes the given value using the Murmur algorithm. Maintains referential integrity. |
Mask | mask() |
Completely masks a value. |
Mask ccn | mask_ccn() |
Masks everything except the last 4 characters, useful for credit card numbers. |
Random | nfp_random |
Replaces original value with a random non-format preserving value. Does not maintain referential integrity. |
Tokenization - non-format preserving | nfp_ref_tokenize() |
Replaces value with a non-format preserving token that maintains referential integrity. |
Null | null() |
Replaces original value with null. |
PHI age | phi_age() |
Compliant with the HIPAA safe-harbor standard, this caps ages shown to 90. If a person is over 90 years old, 90 is shown. |
PHI date | phi_date() |
Compliant with the HIPAA safe-harbor standard, this rounds the date to 1/1 (January 1) of the current year. |
PHI date of birth | phi_dob() |
Compliant with the HIPAA safe-harbor standard, this rounds the date to 1/1 (January 1) of a person's birth year, unless the person is over 90. If a person is over 90, the birth year is the year at which they turned 90. |
PHI zip code | phi_zip3() |
Compliant with the HIPAA safe-harbor standard, this shows only 3 digits and is fully redaacted (000) for very small zipcodes. In addition, five-digit numeric zip codes are supported (however, if the zip codes are in string format, this limitation does not apply). |
Random ccn | random_ccn() |
Generates a random credit card number. |
Hash (SHA2) | sha2() |
Hashes the given value using the SHA2 algorithm. Maintains referential integrity. |
Tokenization - format preserving | tokenize() |
Replaces value with a format preserving token that maintains referential integrity. Note tokenize() only maintains referential integrity for STRING and BIGINT types, whereas fp_ref_tokenize() does it for all types. |
Zero | zero() |
Replaces original value with an empty value. |
Prioritization of Transformations¶
You can prioritize transformations when multiple transformations are applied to a single column. If you use autotagging, multiple tags can be assigned to a single column, each with its own transformations. Multiple transformations applied to a single column make the column inaccessible with a conflict.
To resolve this problem, prioritize the transformation functions. The prioritization applies across the entire system. Wherever multiple transformations occur for any column, the prioritization is applied.
To prioritize transformations, add the TRANSFORM_UDF_PRIORITIES
configuration property to your configuration settings. Values for this property are the transformation function names, specified in sequence, separated by commas. The last function name in the list has priority over all the preceding functions in the list as well as priority over functions not included in the list.
Note: You do not need to specify all available function names in the configuration property. Functions that are not listed are given the lowest priority. However, if multiple unlisted functions are applied to a single column, a column conflict (errors) will occur because the functions have not been prioritized.
In the following example, sha2
has the highest priority and will be applied to the column data, when it is requested for a column with any other transformation function. Likewise, tokenize
will be given priority and applied to the column data when it is requested for a column with null
or mask
or any other transformation function except sha2
. Finally, fp_random
(not in the list) will be given lower priority than null
, mask
, tokenize
, or sha2
. If fp_random
and another unlisted transformation function are specified for a column, a column conflict will occur.
TRANSFORM_UDF_PRIORITIES: null,mask,tokenize,sha2
Valid transformation function names that can be specified for TRANSFORM_UDF_PRIORITIES
are shown in the following table. Their default priority is also shown (if you do not override that priority in TRANSFORM_UDF_PRIORITIES
). Higher priorities override lower priorities (for example, mask
transformations override zero
transformations when applied to the same column).
Function Name | Default Priority |
---|---|
aes_decrypt |
6 |
aes_encrypt |
7 |
diff_privacy |
13 |
fnv_hash |
5 |
fp_random |
18 |
fp_ref_tokenize |
9 |
hash |
4 |
mask |
11 |
mask_ccn |
12 |
nfp_random |
19 |
nfp_ref_tokenize |
10 |
null |
1 |
phi_age |
14 |
phi_date |
15 |
phi_dob |
16 |
phi_zip3 |
17 |
random_ccn |
20 |
sha2 |
3 |
tokenize |
8 |
zero |
2 |
Descriptions of each of these transformation functions are provided above.
Restrict Use of Privacy Functions¶
You can restrict use of Okera's privacy functions and user-defined functions (UDFs) to Okera administrators only. To activate this feature, add the RESTRICTED_UDFS
configuration parameter to your Okera configuration file. Valid values for this parameter are a comma-separated list of function names. Use of any functions listed in the parameter require administrator privileges. In the following example, the aes_decrypt
and nfp_ref_tokenize
privacy functions can only be used by administrators.
RESTRICTED_UDFS: aes_decrypt,nfp_ref_tokenize
Masking¶
Masking, or suppression, is an anonymization technique where parts of the data are completely hidden with random characters or other data.
mask(<STRING | INT | DATE>, INT <start>, INT <length>)
Type | Datatypes | Masking behavior |
---|---|---|
String | STRING, TEXT, VARCHAR | Format-preserving, replaces alphanumeric characters with X. You may optionally specify a character limit starting at position <start> (1-based) and ending after <length> characters. |
Numerical | BIGINT, DECIMAL, INT, FLOAT | Uses a zero value |
Date & Time | DATE, TIMESTAMP | Uses epoch time zero. 01.01.1970 00:00:00.000000 |
SQL Example: Using the mask()
function
> SELECT mask("5000 4000 3000 1000", 6, 4);
5000 XXXX 3000 1000
> SELECT mask("5000 4000 3000 1000", 6, 9);
5000 XXXXXXXXX 1000
Mask ccn¶
STRING mask_ccn(STRING)
Masks all but that last four digits of a given credit card number.
SQL Example: Using the mask_ccn()
function
> SELECT mask_ccn("5000 4000 3000 1000")
XXXXXXXXXXXXXXX1000
> SELECT mask_ccn("5000400030001000")
XXXXXXXXXXXX1000
Random ccn¶
STRING random_ccn()
Returns a random credit card number, divided by dashes.
SQL Example: Using the random_ccn()
function
> SELECT random_ccn();
5580-5293-0763-8227
> SELECT random_ccn();
3484-8051-1437-2629
Random¶
nfp_random(<primitive_type>)
Applies a random, non-format preserving transformation to the given value. Does not maintain referential integrity and values will always be different, even in the same table.
SQL Example: Using the nfp_random()
function
> SELECT nfp_random('Jane Smith');
6947408657753046000
> SELECT nfp_random('Jane Smith');
7021543741362542000
Format-Preserving Masking¶
STRING fp_random(STRING)
Applies a format-preserving anonymization to the given text. Does not maintain referential integrity and values will always be different, even in the same table.
SQL Example: Using the fp_random()
function
> SELECT fp_random('Jane Smith');
Hubq Bntql
> SELECT fp_random('5500 0000 0000 0004');
1467 1315 1117 7939
Here's what an example column with the same values would look like:
Name | With fp_random transformation |
---|---|
Kory Camacho | Fhxr Vyduylz |
Kory Camacho | Impe Wngubgs |
Null¶
null()
Replace values with null.
> SELECT null("5000 4000 3000 1000");
null
Zero¶
zero()
Replace values with a blank value.
> SELECT zero("5000 4000 3000 1000");
>
Tokenization¶
Tokenization is pseudonymization technique that substitutes a value that may be sensitive, with a non-sensitive token. Okera offers different tokenization functions depending on if you want format-preservation, or to maintain referential integrity. Referential integrity means the resulting token is always the same for a given value across tables, so that analytical correlations (such as joins) are still possible.
Note:
Format Preserving
indicates whether the type and general format of the data will be preserved.Referential Integrity
indicates whether tokenized columns can be used for joins across tables.Stable Results
indicates whether queries against the same data by the same user on the same table will produce the same results.
Function | Format Preserving | Referential Integrity | Stable Results |
---|---|---|---|
tokenize() |
Yes | STRING & BIGINT types only | Yes |
fp_ref_tokenize |
Yes | Yes | Yes |
fp_noref_tokenize |
Yes | No | Yes |
nfp_ref_tokenize |
No | Yes | Yes |
nfp_noref_tokenize |
No | No | Yes |
Note: Users manually executing any tokenization function in a query will not see the same result as if executed inside a view or as a transformation function in a policy. This removes the possibility of a user working backwards and "undoing" the anonymization.
Configure Referential Integrity¶
By default, functions that preserve referential integrity produce the same output for all users, and without change over time when executed inside a view or as a transformation function in a policy.
This behavior is configurable in two ways:
-
Pass explicit modifier arguments to the function. For example, the following view definition results in every user seeing different results when they query
<db>.<view>
, and those results change every hour:CREATE VIEW <db>.<view> SELECT fp_ref_tokenize(<col>, signed_user(true), signed_period_sec(3600) FROM <db>.<table>
-
Change the system defaults:
TOKENIZATION_DEFAULT_SHARING_UNIQUE_PER_USER
- when set totrue
the system will default to unique values per user when executed inside a view or as a transformation function in a policy. The default isfalse
.TOKENIZATION_DEFAULT_PERIOD_SEC
- when set to0
the results will change on every invocation. When set to a positive integerK
, the results will change everyK
seconds. The default is-1
, which disables the period change.
See an example of a tokenization policy in Permission Examples.
Format-Preserving Tokenization¶
With Referential Integrity¶
STRING tokenize(STRING)
BIGINT tokenize(BIGINT)
fp_ref_tokenize(<primitive_type>)
Note: The default
tokenize()
function only maintains referential integrity forSTRING
andBIGINT
data types. This is because referential integrity on low cardinality types such asboolean
may be too easy to reverse identify. If you wish to maintain referential integrity across all data types, use thefp_ref_tokenize()
function.
Here is an example using the tokenize()
function:
> SELECT tokenize("125.1.10.34")
847.9.84.92
In the example, referential integrity is preserved, so joins are possible across tables:
Name | With fp_ref_tokenize() |
---|---|
Elaine Wilkerson | Xdvvys Qaybengxv |
Kory Camacho | Avxj Yptqnhf |
Kory Camacho | Avxj Yptqnhf |
Without Referential Integrity¶
fp_noref_tokenize(<primitive_type>)
Here is an example using the fp_noref_tokenize()
function:
> SELECT fp_noref_tokenize("125.1.10.34")
1529789670085611000
In this example, although values are stable inside each table, they are different across tables, which means referential integrity is not maintained :
Base Table 01
Name | With fp_noref_tokenize() |
---|---|
Elaine Wilkerson | Qbkxvm Ozumpmkvg |
Kory Camacho | Hqos Vfjrxam |
Kory Camacho | Hqos Vfjrxam |
Base Table 02
Name | With fp_noref_tokenize() |
---|---|
Elaine Wilkerson | Pqwzvw Mtplysmj |
Kory Camacho | Kbxw Abichog |
Kory Camacho | Kbxw Abichog |
Non-Format-Preserving Tokenization¶
With Referential Integrity¶
nfp_ref_tokenize(<primitive_type>)
Example: Using the nfp_ref_tokenize()
function
> SELECT nfp_ref_tokenize("125.1.10.34")
7501817423217723000
Take the following example, note that referential integrity is preserved, so joins are possible across datasets, but format is not preserved:
Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Referential integrity is preserved so joins are possible across tables:
Name | With nfp_ref_tokenize() |
---|---|
Elaine Wilkerson | 1732252590825781000 |
Kory Camacho | 259850744261204400 |
Kory Camacho | 259850744261204400 |
Without Referential Integrity¶
nfp_noref_tokenize(<primitive_type>)
Here is an example using the nfp_noref_tokenize()
function:
> SELECT nfp_noref_tokenize("125.1.10.34")
1529789670085611000
In this example, format is not preserved because text characters have been replaced with numerical ones. Although the tokenized values are stable inside each table, they are different across tables, which means referential integrity is not maintained:
Table 01
Name | With nfp_noref_tokenize() |
---|---|
Elaine Wilkerson | 8407409921477382000 |
Kory Camacho | 3042039274287704600 |
Kory Camacho | 3042039274287704600 |
Table 02
Name | With nfp_ref_tokenize() |
---|---|
Elaine Wilkerson | 1732252590825781000 |
Kory Camacho | 259850744261204400 |
Kory Camacho | 259850744261204400 |
Hash¶
hash()
Hashing returns a fixed size token output from an input of any size according to a mathematical algorithm. Hashed values are irreversible and not format preserving, but they maintain referential integrity. Hashed values are not unique per user and every user will see the same hash for a given value.
SHA2¶
BIGINT sha2(<primitive_type>)
Applies the SHA2 hash algorithm to the input value. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.
SQL Example: Using the sha2()
function
> SELECT fnv_hash("125.1.10.34")
6088146908894507000
FNV Hash¶
BIGINT fnv_hash(<primitive_type>)
This hash function applies a FNV based hash algorithm to the input value. Similar to MD5 but non-cryptographic and faster. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.
SQL Example: Using the fnv_hash()
function
> SELECT fnv_hash("125.1.10.34")
-7425713518659373000
Differential Privacy¶
<numeric_type> diff_privacy(<numeric_type> <value>, INT <noise>)
Allows to add a statistical noise (or jitter) to a numeric value, while retaining the same properties when aggregated. For example, it is possible to add +/- 10 years to the individual values of a column containing birthdays, but when calculating the average age across all records you still get close to the same result.
Careful attention has been taken to ensure that this is not susceptible to repeated queries.
For example a query like SELECT salary WHERE id = 1234
will always return the same value, not a normally distributed random value around the true one.
Otherwise, a malicious user could simply rerun queries and average results.
This is implemented by making sure the "same query" always returns the same noise, that is, you get no additional information from repeated runs.
Example: Using the diff_privacy()
function
> SELECT diff_privacy(50, 10);
54
Custom User-Defined Functions (UDF)¶
If you wish to create your own functions for security, see Extending Okera.