Probate Parsing Solution #8

Open · wants to merge 2 commits into master
28 changes: 28 additions & 0 deletions wills/README.md
## Prerequisites and how to run the project

### Prerequisites
To run the project, the following libraries are required:

- [Tesseract](https://github.com/tesseract-ocr/tesseract) - Install it with ```sudo apt-get install tesseract-ocr```. To call Tesseract from Python, also install the ```pytesseract``` wrapper with ```pip3 install pytesseract```.
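
  A quick, optional sanity check that both the binary and the wrapper are installed (this uses pytesseract's ```get_tesseract_version``` helper):

  ``` python
  import pytesseract

  # Prints the installed Tesseract version; raises TesseractNotFoundError
  # if the tesseract binary is not on the PATH.
  print(pytesseract.get_tesseract_version())
  ```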

- [Pillow](https://pillow.readthedocs.io/en/5.1.x/) - To install Python's imaging library, run ```pip3 install pillow```.

- [Tkinter](https://docs.python.org/3/library/tkinter.html) - Python's binding to the Tk GUI toolkit. The NER results are displayed as a graph; to view them, install Tkinter with ```sudo apt-get install python3-tk```.

- [Natural Language Toolkit (NLTK)](http://www.nltk.org/) - Named Entity Recognition is run with NLTK, the standard Python library for natural language processing. Install it with ```pip3 install nltk```.
- You also need the standard tokenizer, tagger and chunker resources, namely ```punkt```, ```averaged_perceptron_tagger```, ```maxent_ne_chunker``` and ```words```. To download them, start a Python shell in your terminal and run:
``` python
import nltk
nltk.download('punkt')
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
```
- [Stanford NER Library](https://nlp.stanford.edu/software/CRF-NER.shtml#Download) - Download the 7-class recognizer from the link and move it to the ```/usr/bin``` folder. Make sure Java (a JRE or JDK) is installed, since NLTK's wrapper runs the tagger as a Java process.

### Execute and Run
To run the NLTK standard NER tagger, type ```python3 nltk_ner.py -i <path to image here>``` in the terminal.

To run the Stanford NER tagger, type ```python3 stanford_ner.py -i <path to image here>``` in the terminal.
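Both scripts look up the crop rectangle for each will entry in ```data/gold/extracted_wills.csv```: column 0 holds the image's base name and columns 2-5 hold the x1, y1, x2, y2 pixel coordinates (column 1 is skipped by the scripts). A minimal sketch of a row in that assumed layout (the sample values are illustrative):

``` python
import csv
import io

# One row per will entry: image base name, a column the scripts skip,
# then the crop rectangle x1, y1, x2, y2 in pixels (illustrative values).
sample = "ms_001,1,120,340,1460,980\n"

for row in csv.reader(io.StringIO(sample)):
    filename = row[0]
    x1, y1, x2, y2 = (int(v) for v in row[2:6])
    print(filename, (x1, y1), (x2, y2))
```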

#### As output, all the will entries in the scanned image are retrieved, with their named entities classified.
57 changes: 57 additions & 0 deletions wills/nltk_ner.py
from PIL import Image
import pytesseract
import argparse
import cv2
import os
import csv
import nltk

ocr = []


def processLanguage(contentArray):
try:
tokenized = nltk.word_tokenize(contentArray)
tagged = nltk.pos_tag(tokenized)
print(tagged)

namedEnt = nltk.ne_chunk(tagged)
namedEnt.draw()

except Exception as e:
print(str(e))


ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
help="path to input image to be OCR'd")
args = vars(ap.parse_args())

reader = csv.reader(open('data/gold/extracted_wills.csv'))
for row in reader:
filename = args["image"].split('/')[-1].split('.')[0]
if row[0] == filename:

# extract the (x,y) rectangular coordinates for each entry in the probate book
x1 = int(row[2])
y1 = int(row[3])
x2 = int(row[4])
y2 = int(row[5])

# load the example image and convert it to grayscale and crop
image = cv2.imread(args["image"])
image = image[y1:y2, x1:x2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename1 = "{}.png".format(os.getpid())
cv2.imwrite(filename1, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename1))
print(text)
os.remove(filename1)
# apply nltk entity recognizer to the data extracted by ocr
processLanguage(text)
46 changes: 46 additions & 0 deletions wills/ocr.py
from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
help="type of preprocessing to be done")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

cv2.imshow("Image", gray)

# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
gray = cv2.threshold(gray, 0, 255,
cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)

# show the output images
# cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)
51 changes: 51 additions & 0 deletions wills/stanford_ner.py
from PIL import Image
import pytesseract
import argparse
import cv2
import os
import csv
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

ocr = []

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
help="path to input image to be OCR'd")
args = vars(ap.parse_args())

reader = csv.reader(open('data/gold/extracted_wills.csv'))
for row in reader:
filename = args["image"].split('/')[-1].split('.')[0]
if row[0] == filename:

# extract the (x,y) rectangular coordinates for each entry in the probate book
x1 = int(row[2])
y1 = int(row[3])
x2 = int(row[4])
y2 = int(row[5])

# load the example image and convert it to grayscale and crop
image = cv2.imread(args["image"])
image = image[y1:y2, x1:x2]
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename1 = "{}.png".format(os.getpid())
cv2.imwrite(filename1, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename1))
print(text)
os.remove(filename1)

# apply stanford entity recognizer to the data extracted by ocr

st = StanfordNERTagger('/usr/bin/stanford-ner-2018-02-27/classifiers/english.muc.7class.distsim.crf.ser.gz', '/usr/bin/stanford-ner-2018-02-27/stanford-ner.jar', encoding='utf-8')

tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

print(classified_text)
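
For downstream use, note that `st.tag` returns a flat list of (token, label) pairs, one per token; regrouping consecutive same-label tokens into whole entities is left to the caller. A minimal sketch of that regrouping (the sample pairs are illustrative, not real script output):

```python
# Labels come from the 7-class model: LOCATION, PERSON, ORGANIZATION,
# MONEY, PERCENT, DATE, TIME, with O for tokens outside any entity.
classified = [("John", "PERSON"), ("Smith", "PERSON"),
              ("of", "O"), ("London", "LOCATION")]

# Merge runs of consecutive tokens that share a non-O label.
entities, current = [], None
for token, label in classified:
    if label == "O":
        current = None
        continue
    if current and current[1] == label:
        current[0].append(token)
    else:
        current = [[token], label]
        entities.append(current)

print([(" ".join(tokens), label) for tokens, label in entities])
# [('John Smith', 'PERSON'), ('London', 'LOCATION')]
```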