Noni is a Data Anonymization Tool that enables creating an anonymized database from an existing database using synthetic data.
Its main use case is creating secure development databases from existing data, for example when a company wants to give third parties a database for development purposes without disclosing the real data.
It is composed of:
- A database spec extractor that builds a specification file from the data characteristics and database structure
- A database builder that takes the specification file as input, creates the tables, and generates similar data
Currently, only Postgres databases without custom types are supported, but support for other SQL databases can be implemented.
Noni requires an external HTTP API that provides semantic classification. One such provider is a SATO fork, which is available here. Download the pretrained model and follow the install instructions to run it.
SATO uses an older Python version, so to make dependency management easier an Open Container Image is available in this repository. It avoids the need to install a specific Python version and create a virtual environment. For more information on SATO, see the original paper here.
A single command installation is pending.
Noni consists of two main Python applications: the extractor and the generator.
The extractor loads database information from environment variables. See the scripts/extract.sh script for a reference on how to run the extraction.
The connection string for the output database must be in the OUTPUT_DATABASE_URL environment variable. Make sure the database exists before starting the generator.
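Before starting the generator it can help to confirm that the variable is set and names a database. The following is a minimal sketch, not part of Noni: the function name `read_output_database_url` is hypothetical, and it assumes a URL-style Postgres connection string such as `postgresql://user:password@host:5432/dbname`, which the generator may or may not require.

```python
import os
from urllib.parse import urlsplit


def read_output_database_url(env=os.environ):
    """Return (host, port, database) parsed from OUTPUT_DATABASE_URL.

    Hypothetical helper for illustration; raises RuntimeError if the
    variable is missing or does not look like a Postgres URL.
    """
    raw = env.get("OUTPUT_DATABASE_URL")
    if not raw:
        raise RuntimeError("OUTPUT_DATABASE_URL is not set")

    parts = urlsplit(raw)
    # Assumption: the scheme is "postgres" or "postgresql", optionally
    # followed by a driver suffix such as "+psycopg2".
    base_scheme = parts.scheme.split("+")[0]
    if base_scheme not in ("postgres", "postgresql"):
        raise RuntimeError(f"unexpected scheme: {parts.scheme!r}")

    database = parts.path.lstrip("/")
    if not database:
        raise RuntimeError("OUTPUT_DATABASE_URL does not name a database")

    return parts.hostname, parts.port, database
```

This only checks the shape of the string. It does not connect to the server, so it cannot confirm that the database actually exists.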
To run the generator, run the main.py script from the command line, passing the JSON file generated by the extraction plus the --data and --structure parameters.
These flags toggle whether data and/or the database structure are written during this generator run.
cd noni/generator
python main.py ../extractor/output.json --structure --data
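To illustrate how the two flags behave as independent toggles, here is a hypothetical sketch of such a command line using `argparse`. It is not the actual contents of main.py; the positional argument name `spec` and the help texts are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse a generator-style command line (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of the generator's command line"
    )
    # Path to the JSON specification produced by the extractor.
    parser.add_argument("spec", help="JSON file generated by the extraction")
    # Each flag defaults to False, so a step runs only when its flag is passed.
    parser.add_argument("--structure", action="store_true",
                        help="write the database structure")
    parser.add_argument("--data", action="store_true",
                        help="write the generated data")
    return parser.parse_args(argv)
```

Under this sketch, passing only `--structure` would create the tables and leave them empty, and a later run with only `--data` would fill them.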
SATO can be replaced with any API that receives CSV files and returns a JSON list of the semantic types of the columns, as long as the types are restricted to the type78 list of types.
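As a rough illustration of what a replacement classifier could look like, the sketch below returns one label per CSV column and wraps that in a small HTTP handler from the Python standard library. Everything here is an assumption rather than the SATO interface: the raw CSV request body, the header row, the port, the names `classify_csv`, `ClassifierHandler` and `serve`, and the toy heuristic. The labels `year`, `age`, `code` and `name` are believed to be part of the type78 list, but check them against the list the extractor expects.

```python
import csv
import io
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def guess_type(values):
    """Toy heuristic standing in for a real semantic classifier."""
    non_empty = [v.strip() for v in values if v.strip()]
    if non_empty and all(v.isdecimal() for v in non_empty):
        numbers = [int(v) for v in non_empty]
        if all(1000 <= n <= 2999 for n in numbers):
            return "year"
        if all(0 <= n <= 120 for n in numbers):
            return "age"
        return "code"
    return "name"


def classify_csv(csv_text):
    """Return one semantic type per column (assumes a header row)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    columns = [[row[i] for row in data if i < len(row)]
               for i in range(len(header))]
    return [guess_type(column) for column in columns]


class ClassifierHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the CSV arrives as the raw request body.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        payload = json.dumps(classify_csv(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Run the sketch server until interrupted."""
    HTTPServer((host, port), ClassifierHandler).serve_forever()
```

Calling `serve()` starts the server. A real replacement would need to match whatever request format and endpoint the extractor actually uses.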