To enable quality test data in lower environments, there is a need bring down datasets from production. However, the datasets contain Personally identifiable information (PII) and would need to be masked before copying them over. A cofigurable app was required for this purpose.
- Delimited files
- by column names
- by column positions
- Fixed-width files by column positions
- Oracle Relational Tables (generates masked delimited file)
- Character Substitution - Random
- Character Substitution - Deterministic
- Character Shuffling - Random
- Character Shuffling - Deterministic
- Configure Oracle keys (if Oracle masking is required).
- Setup metadata for masking.
Seven Python files:
application.py
: Main application script.db_oracle.py
: Module | Oracle connectivity.mask.py
: Module | Retrieves metadata for masking.metadata.py
: Module | Retrieves metadata for masking.traverse_file_dl.py
: Module | Traverses delimited files.traverse_file_fw.py
: Module | Traverses fixed-width files.traverse_table.py
: Module | Traverses Oracle relational tables.
If Oracle masking functionality is to be leveraged, create an Oracle key file in the format specified below:
Format of oracle.key file:
[
{
"USER": "<PLACEHOLDER>",
"PASSWORD": "<PLACEHOLDER>",
"CONNECT_STRING": "<PLACEHOLDER>"
}
]
Three JSONs:
metadata_file_dl.json
: Metadata for delimited files.metadata_file_fw.json
: Metadata for fixed-width files.metadata_table.json
: Metadata for relational tables.
Update JSONs and furnish the details of objects to be masked. Samples provided below:
Delimited Files [mask by column names]: metadata_file_dl.json
[
{
"file_name": "sampledata.csv",
"delimiter": ",",
"header_present": "Yes",
"header_column_count": "2",
"trailer_present": "Yes",
"trailer_column_count": "2",
"date_format": "MM/DD/YYYY",
"mask_by_column_name": "Yes",
"mask_by_column_position": "No",
"masking":
{
"columns":
[
{ "name": "street", "position":0, "type": "Shuffle" },
{ "name": "city", "position":0, "type": "SubstitutionChar" },
{ "name": "zip", "position":0, "type": "ShuffleDet" },
{ "name": "email", "position":0, "type": "SubstitutionChar" },
{ "name": "telno", "position":0, "type": "SubstitutionChar" }
]
}
}
]
Delimited Files [mask by column positions]: metadata_file_dl.json
[
{
"file_name": "sampledata_pos.csv",
"delimiter": ",",
"header_present": "Yes",
"header_column_count": "2",
"trailer_present": "Yes",
"trailer_column_count": "2",
"date_format": "MM/DD/YYYY",
"mask_by_column_name": "No",
"mask_by_column_position": "Yes",
"masking":
{
"columns":
[
{ "name": "", "position": "2", "type": "Shuffle" },
{ "name": "", "position": "3", "type": "SubstitutionChar" },
{ "name": "", "position": "6", "type": "Shuffle" }
]
}
}
]
Fixed-width Files: metadata_file_fw.json
[
{
"file_name": "sampledata_fw.dat",
"header_present": "Yes",
"header_column_count": "2",
"trailer_present": "Yes",
"trailer_column_count": "2",
"date_format": "MM/DD/YYYY",
"record_length": 47,
"masking":
{
"columns":
[
{ "position_start": 2, "position_end": 10, "type": "Shuffle" },
{ "position_start": 26, "position_end": 33, "type": "SubstitutionChar" }
]
}
}
]
Oracle Tables: metadata_table.json
[
{
"table_name": "zmt_collections",
"schema": "PY",
"filter": "WHERE PERIOD = '201805'",
"masking":
{
"columns":
[
{ "name": "COLLECTION_ID", "position":0, "type": "Shuffle" },
{ "name": "TITLE", "position":0, "type": "SubstitutionChar" }
]
}
}
]
- Additional masking routines
- Detailed Statistics
- Visualization & Notifications