convert_doc_to_txt
Valerio Arnaboldi edited this page Apr 5, 2018
This script converts documents from PDF or CAS format to plain text. This is helpful when the same set of documents is used to train or test classifiers more than once. Because tpclassifier converts documents to text files internally, converting them once beforehand and importing the resulting text files into the classifiers avoids repeating that step and can save a lot of time.
The script converts a single input file and prints the text it contains to standard output. To save the result to a file, redirect the output:
python3 convert_doc_to_txt.py -f pdf path/to/input/file > /path/to/output.txt
The option -f defines the input file format and can be "pdf", "cas_pdf", or "cas_xml".
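Since the script handles one file per invocation, converting a whole directory means calling it once per file. The following is a minimal sketch of such a wrapper, not part of tpclassifier itself. The names `build_command`, `convert_directory`, and `SUPPORTED_FORMATS` are hypothetical, and the sketch assumes that `python3` is on the PATH and that `convert_doc_to_txt.py` is in the current working directory.

```python
import subprocess
from pathlib import Path

# Values of the -f option listed on this page.
SUPPORTED_FORMATS = ("pdf", "cas_pdf", "cas_xml")


def build_command(input_path, file_format, script="convert_doc_to_txt.py"):
    """Build the argument list for one conversion (hypothetical helper).

    Mirrors the documented command:
        python3 convert_doc_to_txt.py -f <format> <input file>
    The default script location is an assumption; adjust it as needed.
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {file_format!r}")
    return ["python3", script, "-f", file_format, str(input_path)]


def convert_directory(input_dir, output_dir, file_format="pdf", pattern="*.pdf"):
    """Convert every file matching `pattern` and write one .txt file per input."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for input_path in sorted(Path(input_dir).glob(pattern)):
        output_path = output_dir / (input_path.stem + ".txt")
        # Passing the open file as stdout plays the role of the shell
        # redirection "> /path/to/output.txt".
        with open(output_path, "wb") as out:
            subprocess.run(
                build_command(input_path, file_format),
                stdout=out,
                check=True,  # raise CalledProcessError if a conversion fails
            )
        written.append(output_path)
    return written
```

Calling `convert_directory("papers", "papers_txt")` would write `papers_txt/<name>.txt` for each `papers/<name>.pdf`. For CAS input, pass the matching format and a suitable glob pattern, for example `file_format="cas_xml"` with a pattern that matches how those files are named locally.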