Pseudonymization

The Imagen and c-VEDA projects use two levels of pseudonymization. Acquisition centres use first-level PSC1 identifiers to pseudonymize datasets before sending them to the databank. PSC2 identifiers are then used to pseudonymize the datasets further before they are sent to research scientists.

This section describes the generation of PSC1 and PSC2 identifiers for the c-VEDA project.

The intermediate and final files listed in this section are stored under /cveda/databank/framework/psc.

PSC1 identifiers generation

10-digit code generation

We first used the Python script cveda_databank/psc/cveda_generate_psc1.py to generate 10-digit codes such that:

the first 3 digits are 0, followed by a non-zero digit,
the Damerau–Levenshtein distance between codes is at least 3.

cveda_generate_psc1.py > psc1-10-digit-3-zero.txt
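The two constraints above can be illustrated with a minimal sketch. It is not the content of cveda_generate_psc1.py, which is not reproduced here: the function names, the rejection-sampling strategy and the use of the optimal string alignment variant of the Damerau–Levenshtein distance are all assumptions made for illustration.

```python
import random


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment variant of the Damerau-Levenshtein distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def random_code(rng: random.Random, zeros: int = 3, length: int = 10) -> str:
    """Return `zeros` leading zeros, one non-zero digit, then random digits."""
    head = "0" * zeros + str(rng.randint(1, 9))
    tail = "".join(str(rng.randint(0, 9)) for _ in range(length - len(head)))
    return head + tail


def generate_codes(count, min_distance=3, zeros=3, length=10, seed=None):
    """Hypothetical generator: keep a candidate only if it is far from all kept codes."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < count:
        candidate = random_code(rng, zeros, length)
        if all(osa_distance(candidate, code) >= min_distance for code in accepted):
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    for code in generate_codes(5, seed=0):
        print(code)
```

A minimum distance of 3 means that a single typing error (one wrong, missing, extra or swapped digit) can never turn one valid code into another. Each candidate is compared with every accepted code, so the cost of this sketch grows quadratically with the number of codes.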

12-digit code creation and assignment to centres

We assigned batches of the above 10-digit codes to c-VEDA centres by prepending a 2-digit code specific to each centre, as described in the following table, resulting in 12-digit PSC1 codes.

ID	CENTRE	# PSC1 2016	# PSC1 2018
11	PGIMER	1000	500
12	IMPHAL	750	400
13	KOLKATA	1600
14	RISHIVALLEY	1200
15	MYSORE	1500	250
16	NIMHANS	1500	500
17	SJRI	1950
18	MUMBAI	500

We used the following Unix shell commands to do so. The first group creates the 2016 batches and the second group the 2018 batches:

sed -n -e '1,1000{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_PGIMER_2016-07-05.txt
sed -n -e '1001,1750{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_IMPHAL_2016-07-05.txt
sed -n -e '1751,3350{s/^/13/p}' psc1-10-digit-3-zero.txt > PSC1_KOLKATA_2016-07-05.txt
sed -n -e '3351,4550{s/^/14/p}' psc1-10-digit-3-zero.txt > PSC1_RISHIVALLEY_2016-07-05.txt
sed -n -e '4551,6050{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSORE_2016-07-05.txt
sed -n -e '6051,7550{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2016-07-05.txt
sed -n -e '7551,9500{s/^/17/p}' psc1-10-digit-3-zero.txt > PSC1_SJRI_2016-07-05.txt
sed -n -e '9501,10000{s/^/18/p}' psc1-10-digit-3-zero.txt > PSC1_MUMBAI_2016-07-05.txt

sed -n -e '10001,10500{s/^/11/p}' psc1-10-digit-3-zero.txt > PSC1_PGIMER_2018-06-23.txt
sed -n -e '10501,10900{s/^/12/p}' psc1-10-digit-3-zero.txt > PSC1_IMPHAL_2018-06-23.txt
sed -n -e '10901,11150{s/^/15/p}' psc1-10-digit-3-zero.txt > PSC1_MYSORE_2018-06-23.txt
sed -n -e '11151,11650{s/^/16/p}' psc1-10-digit-3-zero.txt > PSC1_NIMHANS_2018-06-23.txt
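The line ranges in the sed commands above follow directly from the batch sizes in the table. The sketch below recomputes them and applies the centre prefixes in Python. The function and variable names are hypothetical; only the prefixes, centre names and counts come from the table.

```python
# Batch sizes copied from the table above: (prefix, centre, number of codes).
BATCHES_2016 = [
    ("11", "PGIMER", 1000),
    ("12", "IMPHAL", 750),
    ("13", "KOLKATA", 1600),
    ("14", "RISHIVALLEY", 1200),
    ("15", "MYSORE", 1500),
    ("16", "NIMHANS", 1500),
    ("17", "SJRI", 1950),
    ("18", "MUMBAI", 500),
]
BATCHES_2018 = [
    ("11", "PGIMER", 500),
    ("12", "IMPHAL", 400),
    ("15", "MYSORE", 250),
    ("16", "NIMHANS", 500),
]


def line_ranges(batches, first_line=1):
    """Yield (prefix, centre, start, end) with 1-based inclusive line numbers."""
    start = first_line
    for prefix, centre, count in batches:
        end = start + count - 1
        yield prefix, centre, start, end
        start = end + 1


def assign(codes, batches, first_line=1):
    """Return {centre: [12-digit codes]} by prepending each centre prefix."""
    return {
        centre: [prefix + code for code in codes[start - 1:end]]
        for prefix, centre, start, end in line_ranges(batches, first_line)
    }


if __name__ == "__main__":
    for prefix, centre, start, end in line_ranges(BATCHES_2016):
        print(f"{centre}: lines {start}-{end}, prefix {prefix}")
    for prefix, centre, start, end in line_ranges(BATCHES_2018, first_line=10001):
        print(f"{centre}: lines {start}-{end}, prefix {prefix}")
```

The 2018 batches start at line 10001 because the 2016 batches consumed the first 10,000 codes of psc1-10-digit-3-zero.txt.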

PSC2 identifiers generation

10-digit code generation

We used the Python script cveda_databank/psc/cveda_generate_psc2.py to generate 10-digit codes such that:

the first 2 digits are 0, followed by a non-zero digit,
the Damerau–Levenshtein distance between codes is at least 3,
the PSC2 codes of the Imagen project (file psc2-imagen.txt below) are taken into account as existing codes and not re-used in c-VEDA.

We let this script run until it had produced more than 100,000 10-digit codes, then killed it:

cat psc1-10-digit-3-zero.txt psc2-imagen.txt | cveda_generate_psc2.py > psc2-10-digit-2-zero.tmp

Then we discarded codes containing a run of 4 identical digits and kept exactly 100,000 10-digit PSC2 codes:

cat psc2-10-digit-2-zero.tmp | egrep -v '(1111|2222|3333|4444|5555|6666|7777|8888|9999|0000)' | shuf | head -100000 > psc2-10-digit-2-zero.txt
rm psc2-10-digit-2-zero.tmp
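As an illustration of the filtering step, the sketch below does in Python what the egrep, shuf and head pipeline does. The function names are hypothetical, and the regular expression uses a backreference instead of listing the ten patterns. For strings made only of digits the two forms match the same codes.

```python
import random
import re

# A digit followed by three more copies of itself, e.g. "7777".
FOUR_REPEATS = re.compile(r"(\d)\1{3}")


def has_four_repeats(code: str) -> bool:
    """Return True if the code contains a run of 4 identical digits."""
    return FOUR_REPEATS.search(code) is not None


def select_codes(codes, keep, rng=None):
    """Drop codes with 4 repeated digits, shuffle, and keep at most `keep` codes."""
    rng = rng or random.Random()
    kept = [code for code in codes if not has_four_repeats(code)]
    rng.shuffle(kept)
    return kept[:keep]


if __name__ == "__main__":
    sample = ["0011112345", "0012345678", "0098765432"]
    print(select_codes(sample, keep=2))
```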

PSC1–PSC2 conversion table

Since we have 10,000 PSC1 identifiers, shuffle the 100,000 10-digit PSC2 codes, keep the first 10,000, and prepend “00” to obtain 12-digit codes. The shuffle ensures that the conversion table cannot easily be inferred:

shuf psc2-10-digit-2-zero.txt | sed -n -e '1,10000{s/^/00/p}' > psc2.tmp

Also prepare a temporary file with the existing 10,000 PSC1 codes:

cat PSC1_*2016-07-05.txt | sort > psc1.tmp

Finally create the conversion table and delete temporary files:

paste -d ',' psc1.tmp psc2.tmp | sort > psc2psc_2016-07-12.txt
rm psc1.tmp psc2.tmp
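The three steps above (random selection of PSC2 codes, sorted PSC1 codes, pairing) can be sketched in Python as follows. The function name and signature are hypothetical. The sketch uses the standard random module, which is not a cryptographic generator. Passing random.SystemRandom() as rng would be one way to draw from the operating system's entropy source instead.

```python
import random


def build_conversion_table(psc1_codes, psc2_codes, rng=None):
    """Pair each sorted 12-digit PSC1 code with a randomly drawn PSC2 code.

    psc2_codes are 10-digit codes; "00" is prepended to make them 12 digits.
    Returns sorted "PSC1,PSC2" lines, like the output of paste | sort.
    """
    rng = rng or random.Random()
    psc1_sorted = sorted(psc1_codes)
    if len(psc2_codes) < len(psc1_sorted):
        raise ValueError("not enough PSC2 codes for the PSC1 codes")
    # sample() draws without replacement and in random order,
    # which plays the role of shuf followed by keeping the first lines.
    chosen = rng.sample(list(psc2_codes), len(psc1_sorted))
    return sorted(f"{p1},00{p2}" for p1, p2 in zip(psc1_sorted, chosen))


if __name__ == "__main__":
    psc1 = ["110001234567", "120007654321"]
    psc2 = ["0012345678", "0098765432", "0056781234"]
    for line in build_conversion_table(psc1, psc2):
        print(line)
```

Because the PSC2 codes are drawn at random, the position of a PSC1 code in its sorted list gives no information about the PSC2 code it is paired with.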

Valid PSC2 identifiers

We provide a list of valid identifiers to help end-users detect and investigate possible identifier errors.
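A minimal sketch of how an end-user might check identifiers against such a list is shown below. The format of the list is not described here, so the sketch assumes one 12-digit identifier per line. The function names and the three status labels are hypothetical.

```python
import re

# Assumption: a PSC2 identifier is exactly 12 ASCII digits.
PSC2_FORMAT = re.compile(r"[0-9]{12}")


def load_valid(lines):
    """Build a set of identifiers from an iterable of lines, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def check_psc2(identifier, valid):
    """Classify an identifier as 'malformed', 'valid' or 'unknown'."""
    if not PSC2_FORMAT.fullmatch(identifier):
        return "malformed"
    return "valid" if identifier in valid else "unknown"


if __name__ == "__main__":
    valid = load_valid(["000012345678\n", "000098765432\n", "\n"])
    for candidate in ["000012345678", "000012345679", "12345"]:
        print(candidate, check_psc2(candidate, valid))
```

An "unknown" result for a well-formed identifier is the case worth investigating: since the underlying 10-digit codes were generated at least 3 edits apart, a single typing error in a valid identifier cannot produce another valid identifier.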
