Pseudonymization

Dimitri Papadopoulos Orfanos edited this page Sep 23, 2019 · 21 revisions

The Imagen and c-VEDA projects use two levels of pseudonymization. Acquisition centres use first-level PSC1 identifiers to pseudonymize datasets before sending them to the databank. PSC2 identifiers are then used to further pseudonymize datasets before they are sent to research scientists.

This section describes the generation of PSC1 and PSC2 identifiers for the c-VEDA project.

The intermediate and final files listed in this section are stored under /cveda/databank/framework/psc.

PSC1 identifiers generation

10-digit code generation

We first generated 10-digit codes with the Python script cveda_databank/psc/cveda_generate_psc1.py:

cveda_generate_psc1.py > psc1-10-digit-3-zero.txt

12-digit code creation and assignment to centres

We assigned batches of the above 10-digit codes to c-VEDA centres by prepending a 2-digit code specific to each centre, as described in the following table, resulting in 12-digit PSC1 codes.

ID  CENTRE       # PSC1 2016  # PSC1 2018
11  PGIMER       1000         500
12  IMPHAL       750          400
13  KOLKATA      1600         -
14  RISHIVALLEY  1200         -
15  MYSORE       1500         250
16  NIMHANS      1500         500
17  SJRI         1950         -
18  MUMBAI       500          -

We used Unix shell commands for that:

sed -n -e '1,1000{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_PGIMER_2016-07-05.txt
sed -n -e '1001,1750{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_IMPHAL_2016-07-05.txt
sed -n -e '1751,3350{s/^/13/p}' psc1-10-digit-3-zero.txt > PSC1_KOLKATA_2016-07-05.txt
sed -n -e '3351,4550{s/^/14/p}' psc1-10-digit-3-zero.txt > PSC1_RISHIVALLEY_2016-07-05.txt
sed -n -e '4551,6050{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSORE_2016-07-05.txt
sed -n -e '6051,7550{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2016-07-05.txt
sed -n -e '7551,9500{s/^/17/p}' psc1-10-digit-3-zero.txt > PSC1_SJRI_2016-07-05.txt
sed -n -e '9501,10000{s/^/18/p}' psc1-10-digit-3-zero.txt > PSC1_MUMBAI_2016-07-05.txt

sed -n -e '10001,10500{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_PGIMER_2018-06-23.txt
sed -n -e '10501,10900{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_IMPHAL_2018-06-23.txt
sed -n -e '10901,11150{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSORE_2018-06-23.txt
sed -n -e '11151,11650{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2018-06-23.txt
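
The line ranges in the sed commands above are cumulative sums of the batch sizes in the table. As a cross-check, here is a minimal Python sketch that recomputes them and prepends the centre prefix. It is only an illustration, not part of the c-VEDA tooling: the names BATCHES_2016, BATCHES_2018, line_ranges and assign are hypothetical.

```python
from itertools import accumulate

# (prefix, centre, number of codes), in the order of the table above.
BATCHES_2016 = [
    ("11", "PGIMER", 1000), ("12", "IMPHAL", 750),
    ("13", "KOLKATA", 1600), ("14", "RISHIVALLEY", 1200),
    ("15", "MYSORE", 1500), ("16", "NIMHANS", 1500),
    ("17", "SJRI", 1950), ("18", "MUMBAI", 500),
]
BATCHES_2018 = [
    ("11", "PGIMER", 500), ("12", "IMPHAL", 400),
    ("15", "MYSORE", 250), ("16", "NIMHANS", 500),
]


def line_ranges(batches, first_line=1):
    """Return (prefix, centre, first, last) with 1-based inclusive line numbers."""
    ends = accumulate(count for _, _, count in batches)
    ranges = []
    start = first_line
    for (prefix, centre, _), end in zip(batches, ends):
        last = first_line - 1 + end
        ranges.append((prefix, centre, start, last))
        start = last + 1
    return ranges


def assign(codes, batches, first_line=1):
    """Map each centre to its 12-digit codes (2-digit prefix + 10-digit code)."""
    return {
        centre: [prefix + code for code in codes[first - 1:last]]
        for prefix, centre, first, last in line_ranges(batches, first_line)
    }
```

Calling line_ranges(BATCHES_2018, first_line=10001) starts the 2018 batches right after the 10,000 codes assigned in 2016, which matches the second group of sed commands.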

PSC2 identifiers generation

10-digit code generation

We used the Python script cveda_databank/psc/cveda_generate_psc2.py to generate 10-digit codes such that:

  • the first 2 digits are 0, followed by a non-zero digit,
  • the Damerau–Levenshtein distance between codes is at least 3,
  • the PSC2 codes of the Imagen project (file psc2-imagen.txt below) are taken into account as existing codes and not re-used in c-VEDA.
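
The constraints listed above can be illustrated with a short Python sketch. This is not the code of cveda_generate_psc2.py, which is not reproduced on this page. The function names (osa_distance, has_valid_prefix, is_acceptable) are hypothetical, and the sketch assumes the optimal string alignment variant of the Damerau–Levenshtein distance; the actual script may use the unrestricted variant.

```python
from typing import Iterable


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # swap of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def has_valid_prefix(code: str) -> bool:
    """True for 10-digit codes that start with "00" followed by a non-zero digit."""
    return (
        len(code) == 10
        and code.isascii()
        and code.isdigit()
        and code.startswith("00")
        and code[2] != "0"
    )


def is_acceptable(candidate: str, existing: Iterable[str], min_distance: int = 3) -> bool:
    """Check the prefix rule and the minimum distance to every existing code."""
    return has_valid_prefix(candidate) and all(
        osa_distance(candidate, other) >= min_distance for other in existing
    )
```

In this sketch, existing would hold every code already accepted plus the codes read from standard input, such as the Imagen PSC2 codes. Comparing each candidate against all existing codes is quadratic in the number of codes, which is consistent with a long-running script.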

We let this script run until it had produced more than 100,000 10-digit codes, then killed it:

cat psc1-10-digit-3-zero.txt psc2-imagen.txt | cveda_generate_psc2.py > psc2-10-digit-2-zero.tmp

Then we discarded codes containing a run of 4 identical digits and kept exactly 100,000 10-digit PSC2 codes, picked at random:

grep -E -v '(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999)' psc2-10-digit-2-zero.tmp | shuf | head -n 100000 > psc2-10-digit-2-zero.txt
rm psc2-10-digit-2-zero.tmp
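
For readers more comfortable with Python than with shell pipelines, the filtering step above could be sketched as follows. The names FOUR_REPEATS and filter_and_sample are hypothetical. Unlike the shell pipeline, this sketch raises an error when fewer eligible codes remain than requested.

```python
import random
import re

# Any digit followed by three more copies of itself, e.g. "7777".
FOUR_REPEATS = re.compile(r"(\d)\1{3}")


def filter_and_sample(codes, keep, rng=random):
    """Drop codes with 4 identical consecutive digits, then draw `keep` at random."""
    eligible = [code for code in codes if not FOUR_REPEATS.search(code)]
    if len(eligible) < keep:
        raise ValueError(f"only {len(eligible)} eligible codes, {keep} requested")
    # Sampling without replacement plays the role of `shuf | head`.
    return rng.sample(eligible, keep)
```

The backreference \1 replaces the explicit list of ten alternatives used in the grep pattern.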

PSC1–PSC2 conversion table

Since we had 10,000 PSC1 identifiers, we drew 10,000 of the 100,000 10-digit PSC2 codes at random and prepended “00” to obtain 12-digit codes. Shuffling ensures that the conversion table cannot easily be inferred:

shuf psc2-10-digit-2-zero.txt | sed -n -e '1,10000{s/^/00/p}' > psc2.tmp

We also prepared a temporary file with the 10,000 existing PSC1 codes, sorted:

cat PSC1_*2016-07-05.txt | sort > psc1.tmp

Finally, we created the conversion table and deleted the temporary files:

paste -d ',' psc1.tmp psc2.tmp | sort > psc2psc_2016-07-12.txt
rm psc1.tmp psc2.tmp 
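
The three shell steps above amount to pairing each sorted PSC1 code with a randomly drawn PSC2 code. A minimal Python sketch of the same idea follows. The function name build_conversion_table is hypothetical, and the sketch has not been used to produce the actual table.

```python
import random


def build_conversion_table(psc1_codes, psc2_codes_10_digit, rng=random):
    """Pair each 12-digit PSC1 code with a randomly drawn 12-digit PSC2 code."""
    psc1_sorted = sorted(psc1_codes)
    pool = list(psc2_codes_10_digit)
    if len(pool) < len(psc1_sorted):
        raise ValueError("not enough PSC2 codes for the PSC1 codes")
    # Random draw without replacement: no PSC2 code is assigned twice.
    chosen = rng.sample(pool, len(psc1_sorted))
    # Prepend "00" to turn 10-digit codes into 12-digit PSC2 identifiers.
    return [f"{psc1},00{psc2}" for psc1, psc2 in zip(psc1_sorted, chosen)]
```

Each returned string has the same PSC1,PSC2 layout as a line produced by paste -d ','. Because the PSC1 codes are sorted before pairing, the resulting lines are already ordered by PSC1.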

Valid PSC2 identifiers

We provide a list of valid identifiers to help end-users detect and investigate possible identifier errors.
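
As an illustration of how such a list might be used, the sketch below checks an identifier against a set of valid PSC2 codes and reports why it is rejected. The function name check_psc2 and the diagnostic messages are hypothetical. The sketch assumes 12-digit PSC2 identifiers, as produced in the conversion table section.

```python
def check_psc2(identifier, valid_ids):
    """Return None if `identifier` is valid, else a short diagnostic string."""
    if identifier in valid_ids:
        return None
    if not (identifier.isascii() and identifier.isdigit()):
        return "contains non-digit characters"
    if len(identifier) != 12:
        return f"has {len(identifier)} digits instead of 12"
    return "well-formed but not in the list of valid identifiers"
```

Given the minimum distance of 3 between generated codes, a single mistyped digit or a swap of two adjacent digits in a valid code should never yield another valid code, so such errors end up in the last category.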