convert_doc_to_txt

Valerio Arnaboldi edited this page Apr 5, 2018 · 1 revision

This script converts documents from PDF or CAS format to plain text. This is helpful when the same set of documents is used to train or test classifiers more than once. Since tpclassifier internally converts documents to text files, converting them once beforehand and importing the resulting text files into the classifiers can save a lot of time.

Convert a file to txt

The script converts a single input file and prints the text it contains to standard output. To save the result to a file, redirect the output.

python3 convert_doc_to_txt.py -f pdf path/to/input/file > /path/to/output.txt

The option -f defines the input file format and can be "pdf", "cas_pdf", or "cas_xml".
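
Because the script handles one file per invocation, converting a whole directory means calling it once per file. The following is a minimal sketch of such a wrapper, not part of the script itself. It assumes that convert_doc_to_txt.py is in the current working directory and that python3 is on the PATH; the helper names build_command and convert_directory are hypothetical.

```python
import subprocess
from pathlib import Path

# Formats accepted by the -f option, as listed above.
SUPPORTED_FORMATS = ("pdf", "cas_pdf", "cas_xml")


def build_command(input_file, fmt, script="convert_doc_to_txt.py"):
    """Return the argument list for one conversion.

    The default script location is an assumption; adjust it to your checkout.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["python3", script, "-f", fmt, str(input_file)]


def convert_directory(in_dir, out_dir, fmt="pdf", pattern="*.pdf"):
    """Convert every file in in_dir matching pattern; write one .txt per input."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob(pattern)):
        dst = out_dir / (src.stem + ".txt")
        # The script prints to standard output, so stdout is redirected
        # to the target file, mirroring the shell ">" redirection.
        with open(dst, "wb") as out:
            subprocess.run(build_command(src, fmt), stdout=out, check=True)
        written.append(dst)
    return written
```

For example, convert_directory("papers", "papers_txt") would write papers_txt/NAME.txt for each papers/NAME.pdf. The glob pattern for CAS inputs depends on how those files are named, so pass pattern explicitly in that case.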
