This repository has been archived by the owner on Jun 8, 2024. It is now read-only.

pdf_to_text.py (#608)
* Program to convert pdfs to text files

Supports command line input
text file does not have to be specified if user just wants to see the text

Signed-off-by: jooossshhhh <[email protected]>

* fixed format

Signed-off-by: jooossshhhh <[email protected]>

jooossshhhh authored Dec 16, 2022
1 parent 67247a8 commit a561a83
Showing 3 changed files with 70 additions and 0 deletions.
18 changes: 18 additions & 0 deletions Docs_Format_conversion_Scripts/pdf_to_text/README.md
@@ -0,0 +1,18 @@
# PDF to Text Converter

This tool takes a PDF file as input and writes its text to a text file. The text is also printed to stdout. If no output path is given, the text is only printed.

## Requirements
- [PyPDF2](https://pypi.org/project/PyPDF2/)

## Usage

### Convert PDF to Text file
```bash
python3 pdf_to_text.py -p <PATH TO PDF> -o <PATH FOR OUTPUT TEXT>
```

e.g.
```bash
python3 pdf_to_text.py -p /home/username/Documents/sample.pdf -o /home/username/Documents/sample.txt
```
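
Because `-o` is optional in the script, leaving it out means the text is only printed. A minimal sketch of how the two options parse, re-declaring the parser from `pdf_to_text.py` purely for illustration (the file names are placeholders, not files shipped with this commit):

```python
import argparse

# Re-declaration of the two options from pdf_to_text.py, for illustration only
parser = argparse.ArgumentParser(description='A program to convert PDF to text')
parser.add_argument('-p', '--path', type=str, required=True)
parser.add_argument('-o', '--output', type=str, required=False)

# Placeholder file names; passing a list avoids reading the real command line
with_output = parser.parse_args(['-p', 'sample.pdf', '-o', 'sample.txt'])
print_only = parser.parse_args(['-p', 'sample.pdf'])

print(with_output.path, with_output.output)
print(print_only.output)  # None, so the script skips the file-writing step
```

Note that the flag must be written as `-p` with no space after the dash; otherwise argparse does not recognize it as the required path option.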
51 changes: 51 additions & 0 deletions Docs_Format_conversion_Scripts/pdf_to_text/pdf_to_text.py
@@ -0,0 +1,51 @@
import argparse

import PyPDF2

parser = argparse.ArgumentParser(
    description='A program to convert PDF to text'
)
parser.add_argument(
    '-p',
    '--path',
    type=str,
    help='The full path of the PDF to convert',
    required=True
)
parser.add_argument(
    '-o',
    '--output',
    type=str,
    help='Output text file name. If not specified, the text is only printed',
    required=False
)

args = parser.parse_args()
path = args.path
text_file = args.output

pdf_text = []

# Open the PDF in binary mode; the with-block closes it even if an error occurs
with open(path, 'rb') as pdf_file:
    # PdfReader, .pages and extract_text() are the PyPDF2 >= 2.0 names for
    # PdfFileReader, numPages/getPage() and extractText(), which PyPDF2 3.0 removed
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Extract the text of each page and append it to the list
    for page_num, page in enumerate(pdf_reader.pages, start=1):
        # Page text plus a footer with the page number and a separator line
        text = page.extract_text() + '\n\nPage ' + str(page_num) + '\n' + '*' * 80 + '\n'
        pdf_text.append(text)
        print(text)

if text_file:
    # Write each page from the list to the output text file
    with open(text_file, 'w', encoding='utf-8') as f:
        for page in pdf_text:
            f.write(page)
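
The per-page footer and the file-writing step of the script can be tried without a PDF or PyPDF2 installed. The following is a rough sketch under that assumption: `format_page` and `write_pages` are hypothetical helper names that do not appear in the commit, and the sample strings stand in for text extracted from real pages.

```python
import os
import tempfile


def format_page(text, page_num):
    """Return the page text followed by a page-number footer and a separator line."""
    return text + '\n\nPage ' + str(page_num) + '\n' + '*' * 80 + '\n'


def write_pages(page_texts, text_file):
    """Format each page (numbered from 1) and write them all to text_file."""
    formatted = [format_page(t, n) for n, t in enumerate(page_texts, start=1)]
    with open(text_file, 'w', encoding='utf-8') as f:
        f.writelines(formatted)
    return formatted


# Stand-in strings instead of text extracted from a real PDF
sample_pages = ['First page text', 'Second page text']
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'sample.txt')
    formatted = write_pages(sample_pages, out_path)
    with open(out_path, encoding='utf-8') as f:
        written = f.read()

print(written)
```

Splitting the formatting from the extraction in this way is only one possible design; it keeps the footer logic testable on plain strings, while the script itself does everything inline.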

@@ -0,0 +1 @@
PyPDF2
