Tutorial for CLI tool anonlink-client¶
For this tutorial we are going to process a data set for private linkage with clkhash, using the command line tool anonlink. Note that you can also use the Python API.
The Python package recordlinkage has a tutorial linking data sets in the clear; we will try to duplicate that in a privacy-preserving setting.
First install anonlink-client, recordlinkage and a few data science tools (pandas and numpy).
$ pip install -U anonlink-client recordlinkage numpy pandas
[1]:
import json
import numpy as np
import pandas as pd
import itertools
[2]:
import recordlinkage
from recordlinkage.datasets import load_febrl4
Data Exploration¶
First we have a look at the dataset.
[3]:
dfA, dfB = load_febrl4()
dfA.head()
[3]:
| rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|
| rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
| rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
| rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
| rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
| rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |
Note that for computing this linkage we will not use the social security id column or the rec_id index.
[4]:
dfA.columns
[4]:
Index(['given_name', 'surname', 'street_number', 'address_1', 'address_2',
'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id'],
dtype='object')
[5]:
dfA.to_csv('PII_a.csv')
Hashing Schema Definition¶
A hashing schema instructs clkhash how to treat each column when generating CLKs. A detailed description of the hashing schema can be found in the API docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation.
[6]:
with open("../_static/febrl_schema_v3_overweight.json") as f:
print(f.read())
{
"version": 3,
"clkConfig": {
"l": 1024,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 64 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 300}, "hash": {"type": "doubleHash"} }
},
{
"identifier": "surname",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 64 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 300}, "hash": {"type": "doubleHash"} }
},
{
"identifier": "street_number",
"format": { "type": "integer" },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 300}, "missingValue": {"sentinel": ""} }
},
{
"identifier": "address_1",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 300} }
},
{
"identifier": "address_2",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 300} }
},
{
"identifier": "suburb",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 300} }
},
{
"identifier": "postcode",
"format": { "type": "integer", "minimum": 100, "maximum": 9999 },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 300} }
},
{
"identifier": "state",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 300} }
},
{
"identifier": "date_of_birth",
"format": { "type": "integer" },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 300}, "missingValue": {"sentinel": ""} }
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
Validate the schema¶
The command line tool can check that the linkage schema is valid:
[7]:
!anonlink validate-schema "../_static/febrl_schema_v3_overweight.json"
schema is valid
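The same check can be done from the Python API. Below is a minimal sketch; it assumes clkhash's schema.from_json_file helper, which raises a SchemaError if the schema does not validate:

```python
from clkhash import schema

# Load the linkage schema; from_json_file validates it and raises a
# SchemaError if it is malformed (assumed clkhash helper).
with open("../_static/febrl_schema_v3_overweight.json") as f:
    linkage_schema = schema.from_json_file(f)
print("schema is valid")
```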
Hash the data¶
We can now hash our Personally Identifiable Information (PII) data from the CSV file using our defined linkage schema. We must provide a secret to this command; the same secret has to be used by both parties hashing data. For this toy example we will use the secret ‘secret’. For real data, make sure that the secret contains enough entropy, as knowledge of this secret is sufficient to reconstruct the PII information from a CLK!
Also, do not share this secret with anyone, except the other participating party.
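For real deployments, you might generate a shared secret with Python's standard secrets module and exchange it out-of-band — a minimal sketch:

```python
import secrets

# Generate a high-entropy shared secret. Both data providers must use the
# same value; keep it away from the linkage service and any third party.
shared_secret = secrets.token_hex(32)  # 256 bits of entropy
print(shared_secret)
```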
[8]:
# NBVAL_IGNORE_OUTPUT
!anonlink hash "PII_a.csv" secret "../_static/febrl_schema_v3_overweight.json" "clks_a.json"
CLK data written to clks_a.json
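The same hashing can also be done via the Python API instead of the CLI — a sketch, assuming a recent clkhash where clk.generate_clk_from_csv takes a single secret along with the schema loaded earlier:

```python
from clkhash import clk

# Hash the PII CSV into CLKs (assumed clkhash API; mirrors `anonlink hash`).
with open("PII_a.csv") as f:
    clks = clk.generate_clk_from_csv(f, "secret", linkage_schema)
print(len(clks), "CLKs generated")
```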
Inspect the output¶
clkhash has hashed the PII, creating a Cryptographic Longterm Key for each entity. The stats output below shows that the mean popcount (number of bits set) is quite high (about 944 out of 1024), which can affect accuracy.
You can reduce the popcount by modifying the ‘strategy’ for the different fields. This allows you to tune the contribution of a column to the CLK, and can be used to de-emphasise columns which are less suitable for linkage (e.g. information that changes frequently).
[9]:
# NBVAL_IGNORE_OUTPUT
!anonlink describe "clks_a.json"
--------------------------------------------------------------------------------------------------------------------------
| popcounts |
--------------------------------------------------------------------------------------------------------------------------
593| o
562| o
531| o
500| o o
469| o o
437| o oo
406| o oo
375| oooo o
344| oooooo
313| oooooo
281| o oooooo
250| oooooooo
219| oooooooo
188| oooooooo
157| o ooooooooo
125| o ooooooooooo
94| o oo ooooooooooooo
63| ooooooooooooooooooooo
32| oooooooooooooooooooooooo
1| o o o ooooooooooooooooooooooooooooooooooooooooooo
------------------------------------------------------------
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
3 3 3 4 4 5 5 6 6 7 7 8 8 9 9 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7
0 4 9 4 9 4 9 3 8 3 8 3 8 2 7 2 7 2 7 1 6 1 6 1 6 0 5 0 5 0
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8 6 5 3 1 0 8 6 5 3 1 0 8 6 5 3 1 0 8 6 5 3 1 0 8 6 5 3 1
-------------------------
| Summary |
-------------------------
| observations: 5000 |
| min value: 830.000000 |
| mean : 944.245800 |
| max value: 975.000000 |
-------------------------
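You can reproduce these popcount statistics without the CLI by decoding the CLKs directly — a short sketch relying on each CLK being a base64-encoded bit vector (as shown later in this tutorial):

```python
import base64
import json

# Decode each base64 CLK and count its set bits (the popcount).
with open("clks_a.json") as f:
    clks = json.load(f)["clks"]
popcounts = [bin(int.from_bytes(base64.b64decode(c), "big")).count("1")
             for c in clks]
print("mean popcount: {:.1f} of 1024 bits".format(sum(popcounts) / len(popcounts)))
```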
First, we will reduce the value of bits_per_feature for each feature.
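Rather than editing the JSON by hand, such a variant could also be generated programmatically — a sketch with a hypothetical set_bits_per_feature helper (the tutorial simply ships the pre-built schema, printed below):

```python
import json

def set_bits_per_feature(in_path, out_path, bits):
    """Hypothetical helper: write a copy of a v3 linkage schema with every
    non-ignored feature's bitsPerFeature replaced by `bits`."""
    with open(in_path) as f:
        schema_dict = json.load(f)
    for feature in schema_dict["features"]:
        if not feature.get("ignored", False):
            feature["hashing"]["strategy"]["bitsPerFeature"] = bits
    with open(out_path, "w") as f:
        json.dump(schema_dict, f, indent=2)

set_bits_per_feature("../_static/febrl_schema_v3_overweight.json",
                     "febrl_schema_v3_reduced_generated.json", 200)
```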
[10]:
with open("../_static/febrl_schema_v3_reduced.json") as f:
print(f.read())
{
"version": 3,
"clkConfig": {
"l": 1024,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 64 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200}, "hash": {"type": "doubleHash"} }
},
{
"identifier": "surname",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 64 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200}, "hash": {"type": "doubleHash"} }
},
{
"identifier": "street_number",
"format": { "type": "integer" },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 200}, "missingValue": {"sentinel": ""} }
},
{
"identifier": "address_1",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200} }
},
{
"identifier": "address_2",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200} }
},
{
"identifier": "suburb",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200} }
},
{
"identifier": "postcode",
"format": { "type": "integer", "minimum": 100, "maximum": 9999 },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 200} }
},
{
"identifier": "state",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 3 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200} }
},
{
"identifier": "date_of_birth",
"format": { "type": "integer" },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 200}, "missingValue": {"sentinel": ""} }
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
[11]:
# NBVAL_IGNORE_OUTPUT
!anonlink hash "PII_a.csv" secret "../_static/febrl_schema_v3_reduced.json" "clks_a.json"
CLK data written to clks_a.json
And now we will modify the bits_per_feature values again, this time de-emphasising the contribution of the address-related columns.
[12]:
with open("../_static/febrl_schema_v3_final.json") as f:
print(f.read())
{
"version": 3,
"clkConfig": {
"l": 1024,
"kdf": {
"type": "HKDF",
"hash": "SHA256",
"info": "c2NoZW1hX2V4YW1wbGU=",
"salt": "SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==",
"keySize": 64
}
},
"features": [
{
"identifier": "rec_id",
"ignored": true
},
{
"identifier": "given_name",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 64 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200}, "hash": {"type": "doubleHash"} }
},
{
"identifier": "surname",
"format": { "type": "string", "encoding": "utf-8", "maxLength": 64 },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 200}, "hash": {"type": "doubleHash"} }
},
{
"identifier": "street_number",
"format": { "type": "integer" },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 100}, "missingValue": {"sentinel": ""} }
},
{
"identifier": "address_1",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 100} }
},
{
"identifier": "address_2",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 100} }
},
{
"identifier": "suburb",
"format": { "type": "string", "encoding": "utf-8" },
"hashing": { "comparison": {"type": "ngram", "n": 2}, "strategy": {"bitsPerFeature": 100} }
},
{
"identifier": "postcode",
"format": { "type": "integer", "minimum": 50, "maximum": 9999 },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 100} }
},
{
"identifier": "state",
"format": { "type": "string", "encoding": "utf-8"},
"hashing": {"comparison": {"type": "ngram", "n": 2, "positional": true}, "strategy": {"bitsPerFeature": 100}, "missingValue": {"sentinel": ""}
}
},
{
"identifier": "date_of_birth",
"format": { "type": "integer" },
"hashing": { "comparison": {"type": "ngram", "n": 1, "positional": true}, "strategy": {"bitsPerFeature": 200}, "missingValue": {"sentinel": ""} }
},
{
"identifier": "soc_sec_id",
"ignored": true
}
]
}
[13]:
# NBVAL_IGNORE_OUTPUT
!anonlink hash "PII_a.csv" secret "../_static/febrl_schema_v3_final.json" "clks_a.json"
CLK data written to clks_a.json
Great, now approximately half the bits are set in each CLK.
Each CLK is serialized in a JSON-friendly base64 format:
[14]:
# If you have jq tool installed:
#!jq .clks[0] clks_a.json
import json
json.load(open("clks_a.json"))['clks'][0]
[14]:
'eliv99lhdvGu27399h/5bV+NHSvr+Yf/EObeO/+32f9RsWvu/0Y1f3Jvyvj+12pp9De18P9dSA8/3xztXqiTXvt/+pFVb3+vVeRiR3+Z//X3v9XzE/9/u/X//6P9qMumsbnl+f1y9U93ON+99f6Pf5WX13zR/nN/0/9yo//v2Hk='
Hash data set B¶
Now we hash the second dataset using the same secret and the same schema.
[15]:
# NBVAL_IGNORE_OUTPUT
dfB.to_csv("PII_b.csv")
!anonlink hash "PII_b.csv" secret "../_static/febrl_schema_v3_final.json" "clks_b.json"
CLK data written to clks_b.json
Find matches between the two sets of CLKs¶
We have generated two sets of CLKs which represent entity information in a privacy-preserving way. The more similar two CLKs are, the more likely it is that they represent the same entity.
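anonlink measures the similarity between two CLKs with the Sørensen–Dice coefficient of their bit vectors. A minimal sketch of that comparison, operating directly on the base64-serialized CLKs from the JSON files:

```python
import base64
import json

def dice_similarity(clk_a_b64, clk_b_b64):
    # Sørensen–Dice coefficient: 2 * |A & B| / (|A| + |B|) over set bits.
    a = int.from_bytes(base64.b64decode(clk_a_b64), "big")
    b = int.from_bytes(base64.b64decode(clk_b_b64), "big")
    common = bin(a & b).count("1")
    return 2 * common / (bin(a).count("1") + bin(b).count("1"))

clks_a = json.load(open("clks_a.json"))["clks"]
clks_b = json.load(open("clks_b.json"))["clks"]
print(dice_similarity(clks_a[0], clks_b[0]))
```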
For this task we will use the entity service, which is provided by Data61. The necessary steps are as follows:
- The analyst creates a new project with the output type ‘groups’. They will receive a set of credentials from the server.
- The analyst then distributes the update_tokens to the participating data providers.
- The data providers then individually upload their respective CLKs.
- The analyst can create runs with various thresholds (and other settings).
- After the entity service has successfully computed the mapping, it can be accessed by providing the result_token.
First we check the status of the entity service:
[16]:
# NBVAL_IGNORE_OUTPUT
SERVER = 'https://anonlink-client-ci.easd.data61.xyz/'
!anonlink status --server={SERVER}
{"project_count": 3, "rate": 2205592, "status": "ok"}
The analyst creates a new project on the entity service by providing the hashing schema and result type. The server returns a set of credentials which provide access to the further steps of the project.
[17]:
# NBVAL_IGNORE_OUTPUT
!anonlink create-project --server={SERVER} --schema "../_static/febrl_schema_v3_final.json" --output "credentials.json" --type "groups" --name "tutorial"
Project created
The returned credentials contain:
- a project_id, which identifies the project
- a result_token, which gives access to the result, once computed
- update_tokens, one for each provider, which allow uploading CLKs.
[18]:
# NBVAL_IGNORE_OUTPUT
credentials = json.load(open("credentials.json", 'rt'))
print(json.dumps(credentials, indent=4))
{
"project_id": "e3603d3ff21199bc76e441f1fc21fa352b7b723a42566af2",
"result_token": "aa5dd1bc498c24faff7108f0912192e7576cf38e20a07be4",
"update_tokens": [
"e3c98e45c3b50c7a23e2ba89e9d13071f9903bc0bb1bac09",
"c82bca998aec6c2c5b04d5d39e3dde9d496dddbb5fed57d7"
]
}
Uploading the CLKs to the entity service¶
Each party individually uploads its respective CLKs to the entity service. They need to provide the project_id, which identifies the correct project, and an update_token.
By default, data is uploaded to the object store (if this feature is enabled in the entity service), since it provides a more reliable upload mechanism, especially for large datasets. You can also upload to the entity service directly by adding the flag --to_entityservice to the upload command. A message will be printed to tell users whether the encodings and blocks were uploaded to the object store:
Anonlink client: Uploading to the external object store - MINIO
or to the entity service:
Anonlink client: Uploading to entity service
Note that the progress bar is only available when uploading to the object store. Uploading to the entity service is performed with requests, which does not support streaming uploads yet.
[19]:
# NBVAL_IGNORE_OUTPUT
!anonlink upload \
--project="{credentials['project_id']}" \
--apikey="{credentials['update_tokens'][0]}" \
--output "upload_a.json" \
--server="{SERVER}" \
"clks_a.json"
uploads
Anonlink client: Uploading to the external object store - MINIO
Upload clks_a.json: |####################| 0.84 MB/0.84 MB 100% [elapsed: 00:00 left: 00:00, 1.22 MB/sec]
[20]:
# NBVAL_IGNORE_OUTPUT
!anonlink upload \
--project="{credentials['project_id']}" \
--apikey="{credentials['update_tokens'][1]}" \
--output "upload_b.json" \
--server="{SERVER}" \
"clks_b.json"
uploads
Anonlink client: Uploading to the external object store - MINIO
Upload clks_b.json: |####################| 0.84 MB/0.84 MB 100% [elapsed: 00:00 left: 00:00, 1.59 MB/sec]
Now that the CLK data has been uploaded, the analyst can create one or more runs. Here we will start by calculating a mapping with a threshold of 0.9:
[21]:
# NBVAL_IGNORE_OUTPUT
!anonlink create --verbose \
--server="{SERVER}" \
--output "run_info.json" \
--threshold=0.9 \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--name="CLI tutorial run A"
Connecting to Entity Matching Server: https://anonlink-client-ci.easd.data61.xyz/
[22]:
# NBVAL_IGNORE_OUTPUT
run_info = json.load(open("run_info.json", 'rt'))
run_info
[22]:
{'name': 'CLI tutorial run A',
'notes': 'Run created by anonlink-client 0.1.2',
'run_id': '7f0e4570d6d36286cd78f3f5b325adbf1d5e024f82df3803',
'threshold': 0.9}
Results¶
Now, after some delay (depending on the size of the data), we can fetch the results. This can be done with anonlink:
[23]:
!anonlink results --watch \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--run="{run_info['run_id']}" \
--server="{SERVER}" \
--output results.txt
State: running
Stage (2/3): compute similarity scores
Progress: 100.00%
State: running
Stage (2/3): compute similarity scores
Progress: 100.00%
State: running
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[24]:
def extract_matches(file):
    with open(file, 'rt') as f:
        results = json.load(f)['groups']
    # each entry in `results` looks like this: [[0, 4039], [1, 2689]].
    # The format is [[dataset_id, row_id], [dataset_id, row_id]].
    # As we only have two parties in this example, we can remove the dataset_ids.
    # Also, turning the solution into a set will make it easier to assess the
    # quality of the matching.
    found_matches = set((a, b) for ((_, a), (_, b)) in results)
    print('The service linked {} entities.'.format(len(found_matches)))
    return found_matches

found_matches = extract_matches('results.txt')
found_matches = extract_matches('results.txt')
The service linked 4051 entities.
Let’s investigate some of those matches and the overall matching quality. In this case we have the ground truth so we can compute the precision and recall.
Fortunately, the febrl4 datasets contain record ids which tell us the correct linkages. Using this information we are able to create a set of the true matches.
[25]:
# rec_id in dfA has the form 'rec-1070-org'. We only want the number. Additionally, as we are
# interested in the position of the records, we create a new index which contains the row numbers.
dfA_ = dfA.rename(lambda x: x[4:-4], axis='index').reset_index()
dfB_ = dfB.rename(lambda x: x[4:-6], axis='index').reset_index()
# now we can merge dfA_ and dfB_ on the record_id.
a = pd.DataFrame({'ida': dfA_.index, 'rec_id': dfA_['rec_id']})
b = pd.DataFrame({'idb': dfB_.index, 'rec_id': dfB_['rec_id']})
dfj = a.merge(b, on='rec_id', how='inner').drop(columns=['rec_id'])
# and build a set of the corresponding row numbers.
true_matches = set((row[0], row[1]) for row in dfj.itertuples(index=False))
[26]:
def describe_matching_quality(found_matches, show_examples=False):
    if show_examples:
        print('idx_a, idx_b, rec_id_a, rec_id_b')
        print('---------------------------------------------')
        for a_i, b_i in itertools.islice(found_matches, 10):
            print('{:3}, {:6}, {:>15}, {:>15}'.format(a_i+1, b_i+1, a.iloc[a_i]['rec_id'], b.iloc[b_i]['rec_id']))
        print('---------------------------------------------')
    tp = len(found_matches & true_matches)
    fp = len(found_matches - true_matches)
    fn = len(true_matches - found_matches)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print('Precision: {:.2f}, Recall: {:.2f}'.format(precision, recall))
[27]:
# NBVAL_IGNORE_OUTPUT
describe_matching_quality(found_matches, True)
idx_a, idx_b, rec_id_a, rec_id_b
---------------------------------------------
3170, 259, 3730, 3730
733, 2003, 4239, 4239
1685, 3323, 2888, 2888
4550, 3627, 4216, 4216
1875, 2991, 4391, 4391
3928, 2377, 3493, 3493
4928, 4656, 276, 276
334, 945, 4848, 4848
2288, 4331, 3491, 3491
4088, 2454, 1850, 1850
---------------------------------------------
Precision: 1.00, Recall: 0.81
Precision tells us how many of the found matches are actual matches. The score of 1.0 means that we did perfectly in this respect; however, recall, the measure of how many of the actual matches were correctly identified, is quite low at only 81%.
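Plugging in the numbers from this run: the service found 4051 matches, all of them correct, out of 5000 true matches in the febrl4 ground truth:

```python
# Worked numbers for the threshold-0.9 run above.
tp = 4051          # found matches that are true matches
fp = 0             # found matches that are not true matches
fn = 5000 - tp     # true matches the service missed
print(tp / (tp + fp))  # precision = 1.00
print(tp / (tp + fn))  # recall ≈ 0.81
```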
Let’s go back and create another run with a threshold value of 0.8.
[28]:
# NBVAL_IGNORE_OUTPUT
!anonlink create --verbose \
--server="{SERVER}" \
--output "run_info.json" \
--threshold=0.8 \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--name="CLI tutorial run B"
run_info = json.load(open('run_info.json', 'rt'))
Connecting to Entity Matching Server: https://anonlink-client-ci.easd.data61.xyz/
[29]:
!anonlink results --watch \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--run="{run_info['run_id']}" \
--server="{SERVER}" \
--output results.txt
State: running
Stage (2/3): compute similarity scores
Progress: 100.00%
State: running
Stage (2/3): compute similarity scores
Progress: 100.00%
State: running
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[30]:
found_matches = extract_matches('results.txt')
describe_matching_quality(found_matches)
The service linked 4962 entities.
Precision: 1.00, Recall: 0.99
Great, for this threshold value we get a precision of 100% and a recall of 99%.
The explanation is that when the information about an entity differs slightly between the two datasets (e.g. spelling errors, abbreviations, missing values, …), the corresponding CLKs will differ in some number of bits as well. For the datasets in this tutorial the perturbations are such that only about 80% of the CLK pairs derived from matching records overlap by more than 90% (the first threshold), whereas 99% of all matching pairs overlap by more than 80%.
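You can sanity-check this relationship locally, without the service — a sketch reusing the dice_similarity helper and the clks_a/clks_b lists from the earlier sketch, plus the true_matches set built above (the fractions should roughly track the recall values, although the service's solver additionally enforces one-to-one matching):

```python
# Fraction of true matching pairs whose CLK similarity clears each threshold.
for threshold in (0.9, 0.8, 0.72):
    above = sum(dice_similarity(clks_a[i], clks_b[j]) >= threshold
                for i, j in true_matches)
    print('threshold {}: {:.1%} of true pairs above'.format(
        threshold, above / len(true_matches)))
```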
If we keep reducing the threshold value, we will start to observe mistakes in the found matches – the precision decreases (if an entry in dataset A has no match in dataset B, but we keep reducing the threshold, eventually a comparison with an entry in B will score above the threshold, leading to a false match). At the same time the recall will keep increasing for a while, as a lower threshold allows more of the actual matches to be found. However, as our example dataset only contains matches (every entry in A has a match in B), this phenomenon cannot be observed here. With the threshold 0.72 we identify all matches but one correctly (at the cost of a longer execution time).
[31]:
# NBVAL_IGNORE_OUTPUT
!anonlink create --verbose \
--server="{SERVER}" \
--output "run_info.json" \
--threshold=0.72 \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--name="CLI tutorial run B"
run_info = json.load(open("run_info.json", 'rt'))
Connecting to Entity Matching Server: https://anonlink-client-ci.easd.data61.xyz/
[32]:
!anonlink results --watch \
--project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--run="{run_info['run_id']}" \
--server="{SERVER}" \
--output results.txt
State: running
Stage (2/3): compute similarity scores
State: running
Stage (2/3): compute similarity scores
State: running
Stage (2/3): compute similarity scores
Progress: 100.00%
State: running
Stage (3/3): compute output
State: completed
Stage (3/3): compute output
Downloading result
Received result
[33]:
found_matches = extract_matches('results.txt')
describe_matching_quality(found_matches)
The service linked 4998 entities.
Precision: 1.00, Recall: 1.00
It is important to choose a threshold appropriate for the amount of perturbation present in the data.
Feel free to go back to the CLK generation and experiment with how different settings affect the matching quality.
Cleanup¶
Finally, to remove the results from the service, either delete the individual runs, or remove the uploaded data and all runs by deleting the entire project.
[34]:
# NBVAL_IGNORE_OUTPUT
# Deleting a run
!anonlink delete --project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--run="{run_info['run_id']}" \
--server="{SERVER}"
Run deleted
[35]:
# NBVAL_IGNORE_OUTPUT
# Deleting a project
!anonlink delete-project --project="{credentials['project_id']}" \
--apikey="{credentials['result_token']}" \
--server="{SERVER}"
Project deleted