This repository contains the data and notebooks for the Short paper submission 'Detection of Redacted Text in Legal Documents' to the 2023 Edition of the TPDL conference.
To be able to run the code and experiments in this repository, follow these steps:
-
Install Anaconda:
- Visit the Anaconda website and download the installer for your operating system.
- Follow the installation instructions provided for your specific OS.
-
Clone this repository:
git clone https://github.com/RubenvanHeusden/TPDLTextRedaction.git
- Navigate to the project directory:
cd TPDLTextRedaction
-
Create a new Anaconda environment:
Open a terminal (or Anaconda Prompt on Windows) and run the following command, which installs the requirements according to the environment file we provide:
conda env create -f environment.yml
-
Activate the environment:
conda activate text_redaction_env
-
Alternative using pip If you prefer using pip, you can also install the environment using the requirements file we supplied
pip install -r requirements.txt
-
Running Jupyter Notebook: If you haven't worked with Jupyter Notebook yet, you should set up jupyter so that you can select the right kernel and work with the packages we just installed.
ipython kernel install --name "text_redaction_env" --user
-
Additional Requirements If you want to run the demo and convert PDF files yourself, you should first install
poppler
on your system. -
Running the Demo To run the demo locally you need Poppler, and then you can run the command below from the
scripts
folder:python run_redaction_detector.py --pdf_path ../examples/Woo-verzoek_Geredigeerd.pdf --output_path demo_output.pdf
Which will run the demo on the pdf in theexamples
directory and put an annotated version like the one on the live demo in the root folder of the directory. We have also included an enhancement to the algorithm with the--exclude_tables
argument, which will useimage2table
to detect tables in pages, and skip these for redaction detection, as these often confuse the algorithm and can cause too much text to be labelled as redacted text. This is still disabled by default to be equal to the version in the paper, but can be enabled by passing--exclude_tables True
.
notebooks/
: Contains Jupyter Notebook files.Experiments.ipynb
: Notebook containing the main experiments and explanation of the algorithm.
datasets/
: Contains source code files.data.csv
: Contains the data with the labels of the different pagesimages
: contains the PNG images of the pages.gold_standard.json
: Contains the json file of the manual annotations from the research.
scripts
: Directory containing scripts to run the detection algorithm automatically on a variety of inputs, with automatic PNG conversion, as well as a file containing the algorithm as a class for convenienceexamples
: contains an example pdf for the demo