Linkage Schema

As CLKs are usually used for privacy preserving linkage, it is important that participating organisations agree on how raw personally identifiable information is encoded to create the CLKs. The linkage schema allows putting more emphasis on particular features and provides a basic level of data validation.

We call the configuration of how to create CLKs a linkage schema. The organisations agree on a linkage schema to ensure that their respective CLKs have been created in the same way.

This aims to be an open standard such that different client implementations could take the schema and create identical CLKs given the same data (and secret keys).

The linkage schema is a detailed description of exactly how to carry out the encoding operation, along with any configuration for the low level hashing itself.

The format of the linkage schema is defined in a separate JSON Schema specification document - schemas/v3.json.

Earlier versions of the linkage schema will continue to work; internally, they are converted to the latest version (currently v3).

Basic Structure

A linkage schema consists of three parts:

  • version, the version number of the linkage schema.
  • clkConfig, CLK-wide configuration, independent of features.
  • features, an array of configurations, each specific to an individual feature.
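
As a rough illustration of this three-part structure, the following Python sketch checks that a schema dictionary carries the three top-level keys. The helper name check_top_level is hypothetical and not part of clkhash; the authoritative validation remains the JSON Schema specification document.

```python
import json

REQUIRED_KEYS = {"version", "clkConfig", "features"}


def check_top_level(schema: dict) -> list:
    """Return a sorted list of required top-level keys that are missing."""
    return sorted(REQUIRED_KEYS - set(schema))


# A deliberately tiny schema; real schemas configure every feature.
minimal = json.loads("""
{
  "version": 3,
  "clkConfig": {"l": 1024, "kdf": {"type": "HKDF"}},
  "features": [{"identifier": "INDEX", "ignored": true}]
}
""")

missing = check_top_level(minimal)
print(missing or "all top-level parts present")
```

This only inspects the outermost keys; it says nothing about whether the nested sections are well formed.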

Example Schema

{
  "version": 3,
  "clkConfig": {
    "l": 1024,
    "kdf": {
      "type": "HKDF",
      "hash": "SHA256",
      "salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
      "info": "",
      "keySize": 64
    }
  },
  "features": [
    {
      "identifier": "INDEX",
      "ignored": true
    },
    {
      "identifier": "NAME freetext",
      "format": {
        "type": "string",
        "encoding": "utf-8",
        "case": "mixed",
        "minLength": 3
      },
      "hashing": {
        "comparison": {
          "type": "ngram",
          "n": 2
        },
        "strategy": {
            "bitsPerFeature": 100
        },
        "hash": {"type": "doubleHash"}
      }
    },
    {
      "identifier": "DOB YYYY/MM/DD",
      "format": {
        "type": "date",
        "description": "Numbers separated by slashes, in the year, month, day order",
        "format": "%Y/%m/%d"
      },
      "hashing": {
        "comparison": {
          "type": "ngram",
          "n": 1,
          "positional": true
        },
        "strategy": {
          "bitsPerFeature": 200
        },
        "hash": {"type": "doubleHash"}
      }
    },
    {
      "identifier": "GENDER M or F",
      "format": {
        "type": "enum",
        "values": ["M", "F"]
      },
      "hashing": {
        "comparison": {
          "type": "ngram",
          "n": 1
        },
        "strategy": {
          "bitsPerFeature": 400
        },
        "hash": {"type": "doubleHash"}
      }
    }
  ]
}

A more advanced example can be found here.

Schema Components

Version

Integer value which describes the version of the linkage schema.

clkConfig

Describes the general construction of the CLK.

name | type | optional | description
l | integer | no | the length of the CLK in bits
kdf | KDF | no | defines the key derivation function used to generate individual secrets for each feature derived from the master secret
xorFolds | integer | yes | number of XOR folds (as proposed in [Schnell2016])
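
To give an intuition for xorFolds, the sketch below folds a bit list in half and XORs the two halves, once per fold. It assumes that every fold halves the current length; the function name xor_fold is hypothetical and this is not the clkhash implementation.

```python
def xor_fold(bits: list, folds: int) -> list:
    """Fold a bit list in half `folds` times, XOR-ing the two halves."""
    for _ in range(folds):
        if len(bits) % 2:
            raise ValueError("length must be even to fold")
        half = len(bits) // 2
        bits = [a ^ b for a, b in zip(bits[:half], bits[half:])]
    return bits


clk = [1, 0, 1, 1, 0, 0, 1, 0]
print(xor_fold(clk, 1))       # [1, 0, 0, 1]
print(len(xor_fold(clk, 2)))  # 2
```

Under this assumption, a CLK with l = 1024 and one fold would end up 512 bits long.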

KDF

We currently only support HKDF (for a basic description, see https://en.wikipedia.org/wiki/HKDF).

name | type | optional | description
type | string | no | must be set to “HKDF”
hash | enum | yes | hash function used by HKDF, either “SHA256” or “SHA512”
salt | string | yes | base64 encoded bytes
info | string | yes | base64 encoded bytes
keySize | integer | yes | size of the generated keys in bytes
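
The following sketch shows how these parameters could feed an HKDF computation as specified in RFC 5869, using only the Python standard library. The function names, and the idea of cutting one long HKDF output into one key per feature, are assumptions made for illustration; clkhash may derive its keys differently.

```python
import base64
import hashlib
import hmac


def hkdf(master: bytes, salt: bytes, info: bytes, length: int,
         hash_name: str = "sha256") -> bytes:
    """HKDF (RFC 5869): extract a pseudorandom key, then expand it."""
    digest_size = hashlib.new(hash_name).digest_size
    if length > 255 * digest_size:
        raise ValueError("requested output is too long for HKDF")
    if not salt:
        salt = bytes(digest_size)  # RFC 5869 default: a string of zeros
    prk = hmac.new(salt, master, hash_name).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]),
                         hash_name).digest()
        okm += block
        counter += 1
    return okm[:length]


def feature_keys(master: bytes, salt_b64: str, info_b64: str,
                 key_size: int, num_features: int) -> list:
    """Assumed scheme: slice one HKDF output into a key per feature."""
    salt = base64.b64decode(salt_b64)
    info = base64.b64decode(info_b64)
    okm = hkdf(master, salt, info, key_size * num_features)
    return [okm[i * key_size:(i + 1) * key_size]
            for i in range(num_features)]


keys = feature_keys(b"master secret", "", "", key_size=64, num_features=3)
print([len(k) for k in keys])
```

Because the derivation is deterministic, two organisations with the same master secret and the same kdf section obtain the same keys.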

features

A feature is either described by a featureConfig or, alternatively, ignored by the clkhash library by defining an ignoreFeature section.

ignoreFeature

If defined, then clkhash will ignore this feature.

name | type | optional | description
identifier | string | no | the name of the feature
ignored | boolean | no | has to be set to true
description | string | yes | free text, ignored by clkhash

featureConfig

Each feature is configured by:

  • identifier, the human readable name. E.g. "First Name".
  • description, a human readable description of this feature.
  • format, describes the expected format of the values of this feature
  • hashing, configures the hashing

name | type | optional | description
identifier | string | no | the name of the feature
description | string | yes | free text, ignored by clkhash
hashing | hashingConfig | no | configures feature specific hashing parameters
format | one of: textFormat, textPatternFormat, numberFormat, dateFormat, enumFormat | no | describes the expected format of the feature values

hashingConfig

name | type | optional | description
comparison | one of: n-gram comparison, exact comparison, numeric comparison | no | specifies the comparison technique for this feature
strategy | one of: BitsPerTokenStrategy, BitsPerFeatureStrategy | no | the strategy for assigning bits to the encoding
hash | one of: DoubleHash, BlakeHash | yes | specifies the hash function for inserting bits into the Bloom filter; defaults to BlakeHash
missingValue | missingValue | yes | defines how missing values are handled

Strategies

A strategy defines how often a token is inserted into the Bloom filter.

BitsPerTokenStrategy

Insert every token bitsPerToken number of times.

name | type | optional | description
bitsPerToken | integer | no | max number of indices per token

BitsPerFeatureStrategy

Same number of insertions for each value of this feature, irrespective of the actual number of tokens. The number of filter insertions for a token is computed by dividing bitsPerFeature equally amongst the tokens.

name | type | optional | description
bitsPerFeature | integer | no | max number of indices per feature
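
The difference between the two strategies can be sketched as follows. Both functions return the number of insertions for each token of a single feature value. The function names are hypothetical, and the way a remainder is spread over the first tokens is an assumption; the text above only states that bitsPerFeature is divided equally.

```python
def per_token_strategy(num_tokens: int, bits_per_token: int) -> list:
    """Every token gets the same fixed number of insertions."""
    return [bits_per_token] * num_tokens


def per_feature_strategy(num_tokens: int, bits_per_feature: int) -> list:
    """Share a fixed budget of insertions across all tokens."""
    if num_tokens == 0:
        return []
    base, remainder = divmod(bits_per_feature, num_tokens)
    # Assumption: the first `remainder` tokens receive one extra insertion.
    return [base + 1] * remainder + [base] * (num_tokens - remainder)


print(per_token_strategy(3, 30))      # [30, 30, 30]
print(per_feature_strategy(7, 100))   # [15, 15, 14, 14, 14, 14, 14]
```

With BitsPerTokenStrategy a longer value sets more bits overall, whereas BitsPerFeatureStrategy keeps the total per value constant regardless of its length.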

Hash

Describes and configures the hash that is used to encode the n-grams.

Choose one of:

DoubleHash

Double hashing, as described in [Schnell2011].

name | type | optional | description
type | string | no | must be set to “doubleHash”
prevent_singularity | boolean | yes | see discussion in https://github.com/data61/clkhash/issues/33
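
The general idea of double hashing is that the i-th Bloom filter index of a token is computed as (h1 + i * h2) mod l from two independent hash values. The sketch below illustrates this under stated assumptions: the use of HMAC-SHA1 and HMAC-MD5, the big-endian conversion, and the function name are choices made for this example and are not confirmed details of clkhash.

```python
import hashlib
import hmac


def double_hash_indices(token: str, key1: bytes, key2: bytes,
                        k: int, l: int) -> list:
    """Return k Bloom filter indices in [0, l) for one token."""
    data = token.encode("utf-8")
    # Two keyed hashes of the same token (hash choice is an assumption).
    h1 = int.from_bytes(hmac.new(key1, data, hashlib.sha1).digest(), "big")
    h2 = int.from_bytes(hmac.new(key2, data, hashlib.md5).digest(), "big")
    return [(h1 + i * h2) % l for i in range(k)]


indices = double_hash_indices("ab", b"key-one", b"key-two", k=10, l=1024)
print(indices)
```

Consecutive indices differ by a constant step of h2 mod l, which is why only two hash computations are needed per token.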

BlakeHash

The default option.

name | type | optional | description
type | string | no | must be set to “blakeHash”

missingValue

Data sets are not always complete – they can contain missing values. If specified, then clkhash will not check the format for these missing values, and will optionally replace the sentinel with the replaceWith value.

name | type | optional | description
sentinel | string | no | the sentinel value indicates missing data, e.g. ‘Null’, ‘N/A’, ‘’, …
replaceWith | string | yes | specifies the value clkhash should use instead of the sentinel value
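
A minimal sketch of the described behaviour: a value equal to the sentinel is flagged as missing, so that format checks can be skipped, and is swapped for replaceWith when one is given. The function name and the returned tuple are hypothetical and only serve this illustration.

```python
from typing import Optional, Tuple


def resolve_missing(value: str, sentinel: str,
                    replace_with: Optional[str] = None) -> Tuple[str, bool]:
    """Return (value to encode, whether the value was missing)."""
    if value == sentinel:
        # Missing: optionally substitute the configured replacement.
        return (value if replace_with is None else replace_with, True)
    return (value, False)


print(resolve_missing("N/A", sentinel="N/A", replace_with=""))
print(resolve_missing("Alice", sentinel="N/A", replace_with=""))
```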

n-gram comparison

Approximate string matching with n-gram tokenization. Also see the API docs for NgramComparison

name | type | optional | description
type | string | no | has to be ‘ngram’
n | integer | no | the ‘n’ in n-gram
positional | boolean | yes | positional n-grams also contain the position of the n-gram within the string
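
The following sketch illustrates n-gram tokenization with and without positions. Padding the value with n-1 spaces on each side and writing positional tokens as a 1-based index followed by the n-gram are assumptions made for this example; consult the NgramComparison API docs for the actual token format.

```python
def ngrams(value: str, n: int, positional: bool = False) -> list:
    """Split a string into overlapping n-grams."""
    # Assumption: pad so that the first and last characters are
    # covered by as many n-grams as the inner characters.
    padded = " " * (n - 1) + value + " " * (n - 1)
    tokens = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if positional:
        tokens = [f"{pos} {tok}" for pos, tok in enumerate(tokens, start=1)]
    return tokens


print(ngrams("abc", 2))
print(ngrams("1989", 1, positional=True))
```

Positional unigrams suit values such as dates, where the same digit in a different place should not count as a match.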

exact comparison

Exact string matching. Also see the API docs for ExactComparison

name | type | optional | description
type | string | no | has to be ‘exact’

numeric comparison

Numerical comparison of integers or floating point numbers, such that the distance between two numbers relates to the similarity of the produced tokens. Also see the API docs for NumericComparison.

textFormat

name | type | optional | description
type | string | no | has to be “string”
encoding | enum | yes | one of “ascii”, “utf-8”, “utf-16”, “utf-32”. Default is “utf-8”.
case | enum | yes | one of “upper”, “lower”, “mixed”.
minLength | integer | yes | positive integer describing the minimum length of the input string.
maxLength | integer | yes | positive integer describing the maximum length of the input string.
description | string | yes | free text, ignored by clkhash.
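
To make the meaning of these fields concrete, the sketch below checks a value against a textFormat-like set of constraints and collects any problems. It is an illustration only: the function name, the messages, and the interpretation of “upper” and “lower” as "the value is unchanged by upper- or lower-casing" are assumptions, not the clkhash validation code.

```python
from typing import Optional


def check_text(value: str, encoding: str = "utf-8", case: str = "mixed",
               min_length: Optional[int] = None,
               max_length: Optional[int] = None) -> list:
    """Return a list of human readable problems (empty if none)."""
    problems = []
    try:
        value.encode(encoding)
    except UnicodeEncodeError:
        problems.append(f"not representable in {encoding}")
    if case == "upper" and value != value.upper():
        problems.append("expected upper case")
    if case == "lower" and value != value.lower():
        problems.append("expected lower case")
    if min_length is not None and len(value) < min_length:
        problems.append(f"shorter than {min_length}")
    if max_length is not None and len(value) > max_length:
        problems.append(f"longer than {max_length}")
    return problems


print(check_text("Al", min_length=3))
print(check_text("Alice", min_length=3))
```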

textPatternFormat

name | type | optional | description
type | string | no | has to be “string”
encoding | enum | yes | one of “ascii”, “utf-8”, “utf-16”, “utf-32”. Default is “utf-8”.
pattern | string | no | a regular expression describing the input format.
description | string | yes | free text, ignored by clkhash.
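
A pattern check could look like the sketch below. It assumes that the whole value has to match the regular expression and that Python's re syntax applies; both are assumptions, and the four-digit postcode pattern is a made-up example.

```python
import re


def matches_pattern(value: str, pattern: str) -> bool:
    """True if the entire value matches the regular expression."""
    return re.fullmatch(pattern, value) is not None


postcode = r"[0-9]{4}"  # hypothetical pattern: exactly four digits
print(matches_pattern("2000", postcode))
print(matches_pattern("20000", postcode))
```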

numberFormat

name | type | optional | description
type | string | no | has to be “integer”
minimum | integer | yes | integer describing the lower bound of the input values.
maximum | integer | yes | integer describing the upper bound of the input values.
description | string | yes | free text, ignored by clkhash.

dateFormat

A date is described by an ISO C89 compatible strftime() format string. For example, the format string for the internet date format described in RFC 3339 would be ‘%Y-%m-%d’. The clkhash library converts the given date to the ‘%Y%m%d’ representation for hashing, as fill characters like ‘-’ or ‘/’ do not add to the uniqueness of an entity.

name | type | optional | description
type | string | no | has to be “date”
format | string | no | ISO C89 compatible format string, e.g. for 1989-11-09 the format is ‘%Y-%m-%d’
description | string | yes | free text, ignored by clkhash.

The following subset contains the most useful format codes:

directive | meaning | example
%Y | Year with century as a decimal number | 1984, 3210, 0001
%y | Year without century, zero-padded | 00, 09, 99
%m | Month as a zero-padded decimal number | 01, 12
%d | Day of the month, zero-padded | 01, 25, 31
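
The conversion to the ‘%Y%m%d’ representation can be reproduced with Python's datetime module, which accepts the same format directives. The function name normalise_date is hypothetical; the sketch only shows the effect of the conversion, not how clkhash implements it.

```python
from datetime import datetime


def normalise_date(value: str, fmt: str) -> str:
    """Parse a date with the schema's format and drop fill characters."""
    return datetime.strptime(value, fmt).strftime("%Y%m%d")


print(normalise_date("1989-11-09", "%Y-%m-%d"))  # 19891109
print(normalise_date("1989/11/09", "%Y/%m/%d"))  # 19891109
```

A value that does not match the format string, such as "09.11.1989" with the format ‘%Y-%m-%d’, makes strptime raise a ValueError.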

enumFormat

name | type | optional | description
type | string | no | has to be “enum”
values | array | no | an array of items of type “string”
description | string | yes | free text, ignored by clkhash.