Pseudonymization
The Imagen and c-VEDA projects use two levels of pseudonymization. Acquisition centres use first-level PSC1 identifiers to pseudonymize datasets before sending them to the databank. PSC2 identifiers are then used to further pseudonymize datasets before sending them to research scientists.
This section describes the generation of PSC1 and PSC2 identifiers for the c-VEDA project.
The intermediate and final files listed in this section are stored under /cveda/databank/framework/psc.
We first used the Python script cveda_databank/psc/cveda_generate_psc1.py to generate 10-digit codes such that:
- the first 3 digits are 0, followed by a non-zero digit,
- the Damerau–Levenshtein distance between any two codes is at least 3.
cveda_generate_psc1.py > psc1-10-digit-3-zero.txt
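The two constraints above can be illustrated with a minimal rejection-sampling sketch. This is not the content of cveda_generate_psc1.py: the function names (`osa_distance`, `random_code`, `generate_codes`) are hypothetical, and the distance used here is the optimal string alignment variant of the Damerau–Levenshtein distance, which may differ from the variant the actual script uses.

```python
import random


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def random_code(rng: random.Random, zeros: int = 3, length: int = 10) -> str:
    """Return `zeros` leading zeros, one non-zero digit, then random digits."""
    head = "0" * zeros + str(rng.randint(1, 9))
    tail = "".join(str(rng.randint(0, 9)) for _ in range(length - zeros - 1))
    return head + tail


def generate_codes(count: int, min_distance: int = 3, seed: int = 0) -> list[str]:
    """Keep a random candidate only if it is far enough from every kept code."""
    rng = random.Random(seed)
    codes: list[str] = []
    while len(codes) < count:
        candidate = random_code(rng)
        if all(osa_distance(candidate, c) >= min_distance for c in codes):
            codes.append(candidate)
    return codes
```

A minimum distance of 3 means that a single typing error (one wrong, missing, extra, or swapped digit) can never turn one valid code into another valid code.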
We assigned batches of the above 10-digit codes to c-VEDA centres by prepending a 2-digit code specific to each centre, as described in the following table, resulting in 12-digit PSC1 codes.
ID | CENTRE | # PSC1 2016 | # PSC1 2018 |
---|---|---|---|
11 | PGIMER | 1000 | 500 |
12 | IMPHAL | 750 | 400 |
13 | KOLKATA | 1600 | |
14 | RISHIVALLEY | 1200 | |
15 | MYSORE | 1500 | 250 |
16 | NIMHANS | 1500 | 500 |
17 | SJRI | 1950 | |
18 | MUMBAI | 500 | |
We used the following Unix shell commands to extract each batch and prepend the centre code:
sed -n -e '1,1000{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_PGIMER_2016-07-05.txt
sed -n -e '1001,1750{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_IMPHAL_2016-07-05.txt
sed -n -e '1751,3350{s/^/13/p}' psc1-10-digit-3-zero.txt > PSC1_KOLKATA_2016-07-05.txt
sed -n -e '3351,4550{s/^/14/p}' psc1-10-digit-3-zero.txt > PSC1_RISHIVALLEY_2016-07-05.txt
sed -n -e '4551,6050{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSORE_2016-07-05.txt
sed -n -e '6051,7550{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2016-07-05.txt
sed -n -e '7551,9500{s/^/17/p}' psc1-10-digit-3-zero.txt > PSC1_SJRI_2016-07-05.txt
sed -n -e '9501,10000{s/^/18/p}' psc1-10-digit-3-zero.txt > PSC1_MUMBAI_2016-07-05.txt
sed -n -e '10001,10500{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_PGIMER_2018-06-23.txt
sed -n -e '10501,10900{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_IMPHAL_2018-06-23.txt
sed -n -e '10901,11150{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSORE_2018-06-23.txt
sed -n -e '11151,11650{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2018-06-23.txt
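The line ranges in the sed commands above follow directly from the batch sizes in the table. The sketch below recomputes them as a cross-check; it is an illustration only, with hypothetical names (`BATCHES`, `batch_ranges`, `assign`), and it is not part of the original procedure.

```python
from itertools import accumulate

# (centre code, centre name, date in file name, batch size), in allocation order,
# copied from the table and file names above.
BATCHES = [
    ("11", "PGIMER", "2016-07-05", 1000),
    ("12", "IMPHAL", "2016-07-05", 750),
    ("13", "KOLKATA", "2016-07-05", 1600),
    ("14", "RISHIVALLEY", "2016-07-05", 1200),
    ("15", "MYSORE", "2016-07-05", 1500),
    ("16", "NIMHANS", "2016-07-05", 1500),
    ("17", "SJRI", "2016-07-05", 1950),
    ("18", "MUMBAI", "2016-07-05", 500),
    ("11", "PGIMER", "2018-06-23", 500),
    ("12", "IMPHAL", "2018-06-23", 400),
    ("15", "MYSORE", "2018-06-23", 250),
    ("16", "NIMHANS", "2018-06-23", 500),
]


def batch_ranges(batches):
    """Return (prefix, centre, date, first, last) with 1-based inclusive line numbers."""
    ends = list(accumulate(size for *_, size in batches))
    starts = [1] + [end + 1 for end in ends[:-1]]
    return [
        (prefix, centre, date, first, last)
        for (prefix, centre, date, _), first, last in zip(batches, starts, ends)
    ]


def assign(codes, batches):
    """Map each output file name to its list of 12-digit PSC1 codes."""
    return {
        f"PSC1_{centre}_{date}.txt": [prefix + code for code in codes[first - 1:last]]
        for prefix, centre, date, first, last in batch_ranges(batches)
    }
```

Here `codes` stands for the lines of psc1-10-digit-3-zero.txt, read in file order. Batches never overlap because each one starts on the line after the previous batch ends.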
We used the Python script cveda_databank/psc/cveda_generate_psc2.py to generate 10-digit codes such that:
- the first 2 digits are 0, followed by a non-zero digit,
- the Damerau–Levenshtein distance between any two codes is at least 3,
- the PSC2 codes of the Imagen project (file psc2-imagen.txt below) are treated as existing codes and not reused in c-VEDA.
We let this script run until it had produced more than 100,000 10-digit codes, then killed it:
cat psc1-10-digit-3-zero.txt psc2-imagen.txt | cveda_generate_psc2.py > psc2-10-digit-2-zero.tmp
Then we discarded codes containing sequences of 4 repeating digits and kept exactly 100,000 10-digit PSC2 codes:
cat psc2-10-digit-2-zero.tmp | egrep -v '(1111|2222|3333|4444|5555|6666|7777|8888|9999|0000)' | shuf | head -100000 > psc2-10-digit-2-zero.txt
rm psc2-10-digit-2-zero.tmp
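The same filtering step can be sketched in Python. A back-reference replaces the explicit list of ten patterns. The names `FOUR_REPEATS` and `keep_codes` are hypothetical, and this is an equivalent illustration rather than the procedure that was actually run.

```python
import random
import re

# Any digit followed by three more copies of itself, e.g. "1111" or "0000".
FOUR_REPEATS = re.compile(r"(\d)\1{3}")


def keep_codes(candidates, count, seed=None):
    """Drop codes with 4 identical consecutive digits, shuffle, keep at most `count`."""
    kept = [code for code in candidates if not FOUR_REPEATS.search(code)]
    random.Random(seed).shuffle(kept)
    return kept[:count]
```

Like `head`, the slice returns fewer than `count` codes if not enough candidates survive the filter, which is why the generator was left running well past 100,000 codes.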
Since there are 10,000 PSC1 identifiers, shuffle the 100,000 10-digit PSC2 codes so that the conversion table cannot easily be inferred, keep 10,000 of them, and prepend “00” to obtain 12-digit codes:
shuf psc2-10-digit-2-zero.txt | sed -n -e '1,10000{s/^/00/p}' > psc2.tmp
Also prepare a temporary file with the existing 10,000 PSC1 codes:
cat PSC1_*2016-07-05.txt | sort > psc1.tmp
Finally create the conversion table and delete temporary files:
paste -d ',' psc1.tmp psc2.tmp | sort > psc2psc_2016-07-12.txt
rm psc1.tmp psc2.tmp
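The pairing performed by the commands above can be summarised in a short Python sketch: sorted PSC1 codes are matched line by line with randomly drawn PSC2 codes. The function name `build_conversion_table` and its parameters are hypothetical, and the sketch assumes both inputs are already loaded as lists of strings.

```python
import random


def build_conversion_table(psc1_codes, psc2_codes, seed=None):
    """Pair each 12-digit PSC1 code with a randomly drawn, "00"-prefixed PSC2 code."""
    psc1_sorted = sorted(psc1_codes)
    pool = list(psc2_codes)
    if len(pool) < len(psc1_sorted):
        raise ValueError("not enough PSC2 codes for the PSC1 codes")
    # Sampling without replacement is equivalent to shuffling and taking the first n.
    chosen = random.Random(seed).sample(pool, len(psc1_sorted))
    return sorted(f"{psc1},00{psc2}" for psc1, psc2 in zip(psc1_sorted, chosen))
```

Because the PSC2 codes are drawn at random, the order of the PSC1 codes reveals nothing about which PSC2 code each one maps to.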
We provide a list of valid identifiers to help end-users detect and investigate possible identifier errors.
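As an illustration of how such a list might be used, the sketch below checks a PSC1 identifier and reports why it is rejected. The function `check_psc1` and its diagnostic strings are hypothetical, and the centre codes are the ones listed in the table above.

```python
import re

# Centre codes 11 to 18, as listed in the table above.
CENTRE_PREFIXES = {"11", "12", "13", "14", "15", "16", "17", "18"}
PSC1_PATTERN = re.compile(r"\d{12}")


def check_psc1(code, valid_codes):
    """Return a short diagnostic for a candidate 12-digit PSC1 identifier."""
    if not PSC1_PATTERN.fullmatch(code):
        return "malformed: expected exactly 12 digits"
    if code[:2] not in CENTRE_PREFIXES:
        return "unknown centre code"
    if code not in valid_codes:
        return "not in the list of valid identifiers"
    return "valid"
```

Passing `valid_codes` as a set keeps each lookup fast even with thousands of identifiers.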