Terms Generator is a Python program designed to extract specific terms and their definitions from input files (PDF, DOCX, TXT). It cross-references these terms against dictionaries provided in JSON or CSV format and generates an output file containing the matched terms and their definitions in a specified format (DOCX, HTML, or TXT).
Useful for when you have to create a list of terms used for a document, you can easily make and use custom dictionaries!
I would like to thank CyberVeilig Nederland for creating the cybersecurity dictionary that helped inspire this project.
- Python 3.6 or higher
-
Clone the repository:
git clone https://github.com/Lennaert89/TermsGenerator.git cd TermsGenerator
-
Create a virtual environment:
python -m venv venv
-
Activate the virtual environment:
- On Windows:
.\venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
- On Windows:
-
Install the required packages:
pip install -r requirements.txt
The program can be run from the command line with the following syntax:
python term_extractor.py <input_file> <dict_files> [--output_format <format>] [--output_file <output>] [--log <log_file>] [--verbosity <level>] [--language <lang>]
input_path
: The input file to scan (DOCX, PDF, TXT) or a directory to recursively scan.dict_paths
: One or more dictionary files to use (CSV, JSON), or a directory in which all .json and .csv files will be used as dictionaries.--interactive
: Run the program in interactive mode.--output_format
: The output format (docx
,html
,txt
,md
). Default isdocx
.--output_file
: The output file name. If not specified, defaults tooutput.<format>
.--log
: The log file to write to. If not specified, logs will only be printed to the terminal.--verbosity
: The verbosity level of logging. Default isINFO
.--language
: The language for the 'Also see' text (en
for English,nl
for Dutch). Default isen
.
To use the interactive mode, run the script with the --interactive flag:
python term_extractor.py --interactive
You will be prompted to enter the required inputs interactively.
- For a PDF input file and JSON dictionary file, with output in DOCX format:
python term_extractor.py input.pdf dictionary.json --output_format docx --output_file output.docx --verbosity DEBUG
- For a DOCX input file and CSV dictionary file, with output in HTML format:
python term_extractor.py input.docx dictionary.csv --output_format html --output_file output.html --log my_log_file.log --verbosity INFO
- For a TXT input file and multiple dictionary files, with output in TXT format:
python term_extractor.py input.txt dictionary1.json dictionary2.csv --output_format txt --output_file output.txt --verbosity WARNING
- For a PDF input file and JSON dictionary file, with output in DOCX format and Dutch language for references:
python TermExtractor.py input.pdf dictionary.json --output_format docx --output_file output.docx --language nl
Create a JSON file with an array of dictionaries, each containing the fields word
, meaning
, and reference
. For example:
[
{
"word": "Access Control List (ACL)",
"meaning": "A list specifying who can access and perform operations on a resource.",
"reference": "Authorization, Permissions"
},
{
"word": "Account hijacking",
"meaning": "The act of gaining unauthorized access to someone else's account.",
"reference": "Identity theft, Phishing"
}
]
Create a CSV file with columns word
, meaning
, and reference
. For example:
word,meaning,reference
Access Control List (ACL),A list specifying who can access and perform operations on a resource,Authorization
Account hijacking,The act of gaining unauthorized access to someone else's account,Identity theft
Feel free to customize the dictionaries with your own terms and definitions. Enjoy using Terms Generator!