This repository has been archived by the owner on Jun 8, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Program to convert PDFs to text files

  Supports command-line input. The output text file does not have to be specified if the user just wants to see the text.

  Signed-off-by: jooossshhhh <[email protected]>

* Fixed format

  Signed-off-by: jooossshhhh <[email protected]>

Signed-off-by: jooossshhhh <[email protected]>
1 parent 67247a8, commit a561a83
Showing 3 changed files with 70 additions and 0 deletions.
@@ -0,0 +1,18 @@
# PDF to Text Converter

This tool takes a PDF file as input and prints its text to stdout. If an output path is given, the text is also written to a text file.

## Requirements
- [PyPDF2](https://pypi.org/project/PyPDF2/)

## Usage

### Convert PDF to Text file
```bash
python3 pdf_to_text.py -p <PATH TO PDF> -o <PATH FOR OUTPUT TEXT>
```

e.g.
```bash
python3 pdf_to_text.py -p /home/username/Documents/sample.pdf -o /home/username/Documents/sample.txt
```
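
As a rough illustration of the two flags in the usage line, the sketch below parses a sample command line with `argparse` and shows that `-o` can be left out. The `build_parser` helper is a hypothetical name that is not part of this commit; only the flag names are taken from the usage line.

```python
import argparse


def build_parser():
    # Hypothetical helper: mirrors the -p/--path and -o/--output flags from the usage line.
    parser = argparse.ArgumentParser(description="A program to convert PDF to text")
    parser.add_argument("-p", "--path", type=str, required=True,
                        help="Full path of the PDF to convert")
    parser.add_argument("-o", "--output", type=str, required=False,
                        help="Output text file; if omitted, the text is only printed")
    return parser


# With -o: both attributes are set.
args = build_parser().parse_args(["-p", "sample.pdf", "-o", "sample.txt"])
print(args.path, args.output)

# Without -o: args.output is None, so a script could skip writing a file.
args_no_output = build_parser().parse_args(["-p", "sample.pdf"])
print(args_no_output.output)
```

Passing an explicit argument list to `parse_args` keeps the sketch runnable without touching the real command line.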
@@ -0,0 +1,51 @@
import PyPDF2
import argparse

parser = argparse.ArgumentParser(
    description='A program to convert PDF to text'
)
parser.add_argument(
    '-p',
    '--path',
    type=str,
    help='The full path of the PDF to convert',
    required=True
)
parser.add_argument(
    '-o',
    '--output',
    type=str,
    help='Output text file name. If not specified, the text is only printed',
    required=False
)

args = parser.parse_args()
path = args.path
text_file = args.output


# open the PDF in binary mode
pdfFileObj = open(path, 'rb')

# create reader object (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# get the number of pages in the PDF
pages = len(pdfReader.pages)

pdfText = []
# extract text from each page and append it to the list
for page_num in range(pages):
    pageObj = pdfReader.pages[page_num]
    # page text plus a footer (page number and a row of asterisks) for cleaner output
    text = pageObj.extract_text() + '\n\nPage ' + str(page_num + 1) + '\n' + '*' * 80 + '\n'
    pdfText.append(text)
    print(text)
if text_file:
    # write each page from the list to the text file
    with open(text_file, 'w', encoding="utf-8") as f:
        for page in pdfText:
            f.write(page)
# close the PDF file object
pdfFileObj.close()
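
The page-footer formatting and the optional file write in the script can be pulled into small functions that are testable without a PDF. This is a minimal sketch, not the commit's code: `format_page` and `write_pages` are hypothetical names, and the page texts are passed in as plain strings instead of coming from PyPDF2.

```python
import os
import tempfile


def format_page(text, page_num):
    # Append the same footer the script uses: page number plus a row of 80 asterisks.
    return text + "\n\nPage " + str(page_num + 1) + "\n" + "*" * 80 + "\n"


def write_pages(page_texts, text_file=None):
    # Format every page; write to text_file only when a path is given.
    formatted = [format_page(t, i) for i, t in enumerate(page_texts)]
    if text_file:
        with open(text_file, "w", encoding="utf-8") as f:
            f.writelines(formatted)
    return formatted


# Usage with stand-in page texts (no PDF needed).
out_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
pages = write_pages(["first page", "second page"], out_path)
print(pages[0])
```

Separating formatting from I/O this way would let the extraction step stay a thin wrapper around the PDF reader.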
@@ -0,0 +1 @@
PyPDF2