Skip to content

Privacy and Security Functions

This document provides information about Okera's privacy and security functions. To see examples of these functions used in policies see Permission Examples.

The following table summarizes Okera's privacy functions.

Transformation Type Function Description
AES decryption aes_decrypt() Decrypts Okera strings using Advanced Encryption Standard (AES). This does not work with strings from other data sources. This can be restricted for use by Okera admins only.
AES encryption aes_encrypt() Encrypts Okera strings using Advanced Encryption Standard (AES). This does not work with strings from other data sources. Use of this function is restricted to Okera admins only.
Differential Privacy diff_privacy() Enables differential privacy for numeric values.
Hash (FNV) fnv_hash() Hashes the given value using the FNV algorithm. Maintains referential integrity.
Format preserving masking fp_random Replaces original value with a random format preserving value. Does not maintain referential integrity.
Tokenization - format preserving fp_ref_tokenize() Replaces value with a format preserving token that maintains referential integrity. Note tokenize() only maintains referential integrity for STRING and BIGINT types, whereas fp_ref_tokenize() does it for all types.
Hash (Murmur) hash() Hashes the given value using the Murmur algorithm. Maintains referential integrity.
Mask mask() Completely masks a value.
Mask ccn mask_ccn() Masks everything except the last 4 characters, useful for credit card numbers.
Random nfp_random Replaces original value with a random non-format preserving value. Does not maintain referential integrity.
Tokenization - non-format preserving nfp_ref_tokenize() Replaces value with a non-format preserving token that maintains referential integrity.
Null null() Replaces original value with null.
PHI age phi_age() Compliant with the HIPAA safe-harbor standard, this caps ages shown to 90. If a person is over 90 years old, 90 is shown.
PHI date phi_date() Compliant with the HIPAA safe-harbor standard, this rounds the date to 1/1 (January 1) of the current year.
PHI date of birth phi_dob() Compliant with the HIPAA safe-harbor standard, this rounds the date to 1/1 (January 1) of a person's birth year, unless the person is over 90. If a person is over 90, the birth year is the year at which they turned 90.
PHI zip code phi_zip3() Compliant with the HIPAA safe-harbor standard, this shows only 3 digits and is fully redaacted (000) for very small zipcodes. In addition, five-digit numeric zip codes are supported (however, if the zip codes are in string format, this limitation does not apply).
Random ccn random_ccn() Generates a random credit card number.
Hash (SHA2) sha2() Hashes the given value using the SHA2 algorithm. Maintains referential integrity.
Tokenization - format preserving tokenize() Replaces value with a format preserving token that maintains referential integrity. Note tokenize() only maintains referential integrity for STRING and BIGINT types, whereas fp_ref_tokenize() does it for all types.
Zero zero() Replaces original value with an empty value.

Prioritization of Transformations

You can prioritize transformations when multiple transformations are applied to a single column. If you use autotagging, multiple tags can be assigned to a single column, each with its own transformations. Multiple transformations applied to a single column make the column inaccessible with a conflict.

To resolve this problem, prioritize the transformation functions. The prioritization applies across the entire system. Wherever multiple transformations occur for any column, the prioritization is applied.

To prioritize transformations, add the TRANSFORM_UDF_PRIORITIES configuration property to your configuration settings. Values for this property are the transformation function names, specified in sequence, separated by commas. The last function name in the list has priority over all the preceding functions in the list as well as priority over functions not included in the list.

Note: You do not need to specify all available function names in the configuration property. Functions that are not listed are given the lowest priority. However, if multiple unlisted functions are applied to a single column, a column conflict (errors) will occur because the functions have not been prioritized.

In the following example, sha2 has the highest priority and will be applied to the column data, when it is requested for a column with any other transformation function. Likewise, tokenize will be given priority and applied to the column data when it is requested for a column with null or mask or any other transformation function except sha2. Finally, fp_random (not in the list) will be given lower priority than null, mask, tokenize, or sha2. If fp_random and another unlisted transformation function are specified for a column, a column conflict will occur.

TRANSFORM_UDF_PRIORITIES: null,mask,tokenize,sha2

Valid transformation function names that can be specified for TRANSFORM_UDF_PRIORITIES are shown in the following table. Their default priority is also shown (if you do not override that priority in TRANSFORM_UDF_PRIORITIES). Higher priorities override lower priorities (for example, mask transformations override zero transformations when applied to the same column).

Function Name Default Priority
aes_decrypt 6
aes_encrypt 7
diff_privacy 13
fnv_hash 5
fp_random 18
fp_ref_tokenize 9
hash 4
mask 11
mask_ccn 12
nfp_random 19
nfp_ref_tokenize 10
null 1
phi_age 14
phi_date 15
phi_dob 16
phi_zip3 17
random_ccn 20
sha2 3
tokenize 8
zero 2

Descriptions of each of these transformation functions are provided above.

Restrict Use of Privacy Functions

You can restrict use of Okera's privacy functions and user-defined functions (UDFs) to Okera administrators only. To activate this feature, add the RESTRICTED_UDFS configuration parameter to your Okera configuration file. Valid values for this parameter are a comma-separated list of function names. Use of any functions listed in the parameter require administrator privileges. In the following example, the aes_decrypt and nfp_ref_tokenize privacy functions can only be used by administrators.

RESTRICTED_UDFS: aes_decrypt,nfp_ref_tokenize

Masking

Masking, or suppression, is an anonymization technique where parts of the data are completely hidden with random characters or other data.

mask(<STRING | INT | DATE>, INT <start>, INT <length>)

Type Datatypes Masking behavior
String STRING, TEXT, VARCHAR Format-preserving, replaces alphanumeric characters with X. You may optionally specify a character limit starting at position <start> (1-based) and ending after <length> characters.
Numerical BIGINT, DECIMAL, INT, FLOAT Uses a zero value
Date & Time DATE, TIMESTAMP Uses epoch time zero. 01.01.1970 00:00:00.000000

SQL Example: Using the mask() function

> SELECT mask("5000 4000 3000 1000", 6, 4);
5000 XXXX 3000 1000

> SELECT mask("5000 4000 3000 1000", 6, 9);
5000 XXXXXXXXX 1000

Mask ccn

STRING mask_ccn(STRING)

Masks all but that last four digits of a given credit card number.

SQL Example: Using the mask_ccn() function

> SELECT mask_ccn("5000 4000 3000 1000")
XXXXXXXXXXXXXXX1000

> SELECT mask_ccn("5000400030001000")
XXXXXXXXXXXX1000

Random ccn

STRING random_ccn()

Returns a random credit card number, divided by dashes.

SQL Example: Using the random_ccn() function

> SELECT random_ccn();
5580-5293-0763-8227

> SELECT random_ccn();
3484-8051-1437-2629

Random

nfp_random(<primitive_type>)

Applies a random, non-format preserving transformation to the given value. Does not maintain referential integrity and values will always be different, even in the same table.

SQL Example: Using the nfp_random() function

> SELECT nfp_random('Jane Smith');
6947408657753046000

> SELECT nfp_random('Jane Smith');
7021543741362542000

Format-Preserving Masking

STRING fp_random(STRING)

Applies a format-preserving anonymization to the given text. Does not maintain referential integrity and values will always be different, even in the same table.

SQL Example: Using the fp_random() function

> SELECT fp_random('Jane Smith');
Hubq Bntql

> SELECT fp_random('5500 0000 0000 0004');
1467 1315 1117 7939

Here's what an example column with the same values would look like:

Name With fp_random transformation
Kory Camacho Fhxr Vyduylz
Kory Camacho Impe Wngubgs

Null

null()

Replace values with null.

> SELECT null("5000 4000 3000 1000");
null

Zero

zero()

Replace values with a blank value.

> SELECT zero("5000 4000 3000 1000");
>

Tokenization

Tokenization is pseudonymization technique that substitutes a value that may be sensitive, with a non-sensitive token. Okera offers different tokenization functions depending on if you want format-preservation, or to maintain referential integrity. Referential integrity means the resulting token is always the same for a given value across tables, so that analytical correlations (such as joins) are still possible.

Note: Format Preserving indicates whether the type and general format of the data will be preserved. Referential Integrity indicates whether tokenized columns can be used for joins across tables. Stable Results indicates whether queries against the same data by the same user on the same table will produce the same results.

Function Format Preserving Referential Integrity Stable Results
tokenize() Yes STRING & BIGINT types only Yes
fp_ref_tokenize Yes Yes Yes
fp_noref_tokenize Yes No Yes
nfp_ref_tokenize No Yes Yes
nfp_noref_tokenize No No Yes

Note: Users manually executing any tokenization function in a query will not see the same result as if executed inside a view or as a transformation function in a policy. This removes the possibility of a user working backwards and "undoing" the anonymization.

Configure Referential Integrity

By default, functions that preserve referential integrity produce the same output for all users, and without change over time when executed inside a view or as a transformation function in a policy.

This behavior is configurable in two ways:

  1. Pass explicit modifier arguments to the function. For example, the following view definition results in every user seeing different results when they query <db>.<view>, and those results change every hour:

    CREATE VIEW <db>.<view> SELECT
    fp_ref_tokenize(<col>, signed_user(true), signed_period_sec(3600)
    FROM <db>.<table>
    
  2. Change the system defaults:

    • TOKENIZATION_DEFAULT_SHARING_UNIQUE_PER_USER - when set to true the system will default to unique values per user when executed inside a view or as a transformation function in a policy. The default is false.
    • TOKENIZATION_DEFAULT_PERIOD_SEC - when set to 0 the results will change on every invocation. When set to a positive integer K, the results will change every K seconds. The default is -1, which disables the period change.

See an example of a tokenization policy in Permission Examples.

Format-Preserving Tokenization

With Referential Integrity

STRING tokenize(STRING) BIGINT tokenize(BIGINT)

fp_ref_tokenize(<primitive_type>)

Note: The default tokenize() function only maintains referential integrity for STRING and BIGINT data types. This is because referential integrity on low cardinality types such as boolean may be too easy to reverse identify. If you wish to maintain referential integrity across all data types, use the fp_ref_tokenize() function.

Here is an example using the tokenize() function:

> SELECT tokenize("125.1.10.34")
847.9.84.92

In the example, referential integrity is preserved, so joins are possible across tables:

Name With fp_ref_tokenize()
Elaine Wilkerson Xdvvys Qaybengxv
Kory Camacho Avxj Yptqnhf
Kory Camacho Avxj Yptqnhf

Without Referential Integrity

fp_noref_tokenize(<primitive_type>)

Here is an example using the fp_noref_tokenize() function:

> SELECT fp_noref_tokenize("125.1.10.34")
1529789670085611000

In this example, although values are stable inside each table, they are different across tables, which means referential integrity is not maintained :

Base Table 01

Name With fp_noref_tokenize()
Elaine Wilkerson Qbkxvm Ozumpmkvg
Kory Camacho Hqos Vfjrxam
Kory Camacho Hqos Vfjrxam

Base Table 02

Name With fp_noref_tokenize()
Elaine Wilkerson Pqwzvw Mtplysmj
Kory Camacho Kbxw Abichog
Kory Camacho Kbxw Abichog

Non-Format-Preserving Tokenization

With Referential Integrity

nfp_ref_tokenize(<primitive_type>)

Example: Using the nfp_ref_tokenize() function

> SELECT nfp_ref_tokenize("125.1.10.34")
7501817423217723000

Take the following example, note that referential integrity is preserved, so joins are possible across datasets, but format is not preserved:

Take the following example. Format is not preserved as text characters have been replaced with numerical ones. Referential integrity is preserved so joins are possible across tables:

Name With nfp_ref_tokenize()
Elaine Wilkerson 1732252590825781000
Kory Camacho 259850744261204400
Kory Camacho 259850744261204400

Without Referential Integrity

nfp_noref_tokenize(<primitive_type>)

Here is an example using the nfp_noref_tokenize() function:

> SELECT nfp_noref_tokenize("125.1.10.34")
1529789670085611000

In this example, format is not preserved because text characters have been replaced with numerical ones. Although the tokenized values are stable inside each table, they are different across tables, which means referential integrity is not maintained:

Table 01

Name With nfp_noref_tokenize()
Elaine Wilkerson 8407409921477382000
Kory Camacho 3042039274287704600
Kory Camacho 3042039274287704600

Table 02

Name With nfp_ref_tokenize()
Elaine Wilkerson 1732252590825781000
Kory Camacho 259850744261204400
Kory Camacho 259850744261204400

Hash

hash()

Hashing returns a fixed size token output from an input of any size according to a mathematical algorithm. Hashed values are irreversible and not format preserving, but they maintain referential integrity. Hashed values are not unique per user and every user will see the same hash for a given value.

SHA2

BIGINT sha2(<primitive_type>)

Applies the SHA2 hash algorithm to the input value. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.

SQL Example: Using the sha2() function

> SELECT fnv_hash("125.1.10.34")
6088146908894507000

FNV Hash

BIGINT fnv_hash(<primitive_type>)

This hash function applies a FNV based hash algorithm to the input value. Similar to MD5 but non-cryptographic and faster. The resulting value is not reversible, but for a specific input the resulting value is always the same for all users and therefore maintains referential integrity for analytical correlations.

SQL Example: Using the fnv_hash() function

> SELECT fnv_hash("125.1.10.34")
-7425713518659373000

Differential Privacy

<numeric_type> diff_privacy(<numeric_type> <value>, INT <noise>)

Allows to add a statistical noise (or jitter) to a numeric value, while retaining the same properties when aggregated. For example, it is possible to add +/- 10 years to the individual values of a column containing birthdays, but when calculating the average age across all records you still get close to the same result.

Careful attention has been taken to ensure that this is not susceptible to repeated queries. For example a query like SELECT salary WHERE id = 1234 will always return the same value, not a normally distributed random value around the true one. Otherwise, a malicious user could simply rerun queries and average results. This is implemented by making sure the "same query" always returns the same noise, that is, you get no additional information from repeated runs.

Example: Using the diff_privacy() function

> SELECT diff_privacy(50, 10);
54

Custom User-Defined Functions (UDF)

If you wish to create your own functions for security, see Extending Okera.