Add functionality to extract PDF text from specific regions #62

Open: wants to merge 9 commits into base: master

Extract pdf text by areas
Pavlos Melissinos committed Jan 11, 2021
commit 5e2a817134aec2100c259734b178656c11cbd1bd
19 changes: 18 additions & 1 deletion src/pdfboxing/text.clj
@@ -1,10 +1,27 @@
 (ns pdfboxing.text
   (:require [pdfboxing.common :as common])
-  (:import org.apache.pdfbox.text.PDFTextStripper))
+  (:import (org.apache.pdfbox.text PDFTextStripper
+                                   PDFTextStripperByArea)
+           (java.awt Rectangle)))

 (defn extract
   "get text from a PDF document"
   [pdfdoc]
   (with-open [doc (common/obtain-document pdfdoc)]
     (-> (PDFTextStripper.)
         (.getText doc))))
+
+(defn- area-text [doc {:keys [x y w h page-number] :as area}]
+  (let [page-number (or page-number 0)
+        rectangle (Rectangle. x y w h)
+        pdpage (.getPage doc page-number)
+        textstripper (doto (PDFTextStripperByArea.)
+                       (.addRegion "region" rectangle)
+                       (.extractRegions pdpage))]
+    (.getTextForRegion textstripper "region")))
+
+(defn extract-by-areas
+  "get text from a specified area of a PDF document"
+  [pdfdoc areas]
Owner

Hey @PavlosMelissinos

Can you tell me what your thinking was here?

Why is pdfdoc an argument on its own and areas is a map?

Why can't it all go into a map?

My thinking is that if you're passing a map around, where all the arguments are in the map, you don't have to think about the position of your arguments.

Thanks
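For illustration, the all-in-one-map style described above might look roughly like this. It is only a sketch: `extract-by-areas*` and the `:pdfdoc`/`:areas` keys are hypothetical names, and `describe-area` is a stand-in for the real PDFBox call so the snippet runs on its own.

```clojure
(defn describe-area
  "Placeholder for area-text: reports what would be extracted
   instead of touching PDFBox."
  [pdfdoc {:keys [x y w h page-number]}]
  {:pdfdoc pdfdoc
   :page   (or page-number 0) ; same default as area-text in this commit
   :rect   [x y w h]})

(defn extract-by-areas*
  "Hypothetical single-map signature: every argument is looked up by key."
  [{:keys [pdfdoc areas]}]
  (mapv #(describe-area pdfdoc %) areas))

;; The caller does not have to remember argument order:
(extract-by-areas* {:areas  [{:x 150 :y 100 :w 260 :h 40}]
                    :pdfdoc "test/pdfs/multi-page.pdf"})
```

With this shape the document and the coordinates carry equal weight in the call, which is the trade-off under discussion here.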

Author

I think it's clearer this way. extract-by-areas is an operation on a PDF document and the coordinates are just parameters. Sure, they're crucial, but they don't have the same weight as the actual document.

I don't have very strong feelings about this though, it's your library 🙂

Owner

I started off using mostly rest arguments for the functions in the library.

Then I accepted some PRs which used strict arity.

Let me think about this for a bit and see what option/approach to take, because once this is merged it'll be good to provide the least amount of surprise.

Thanks

Author

Oh I see. Yeah I could make it variadic if you'd prefer that. That would be consistent with split-pdf and other functions!
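A rough sketch of what the variadic version could look like, assuming "variadic" here means keyword rest arguments. The names `extract-by-areas-kw` and `extract-by-areas-positional` and the `:input` key are made up for illustration, and the positional function is a stub that only echoes its arguments so the snippet runs without PDFBox.

```clojure
(defn extract-by-areas-positional
  "Stand-in for the two-argument extract-by-areas from this commit.
   It only tags each area with the document it would be read from."
  [pdfdoc areas]
  (mapv #(assoc % :pdfdoc pdfdoc) areas))

(defn extract-by-areas-kw
  "Hypothetical keyword-rest wrapper around the positional version."
  [& {:keys [input areas]}]
  (extract-by-areas-positional input areas))

;; Called with keyword/value pairs rather than fixed positions:
(extract-by-areas-kw :input "test/pdfs/multi-page.pdf"
                     :areas [{:x 380 :y 500 :w 27 :h 23 :page-number 4}])
```

Keeping the positional function and adding a thin keyword wrapper would let both calling styles coexist while the decision is pending.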

+  (with-open [doc (common/obtain-document pdfdoc)]
+    (doall (map #(area-text doc %) areas))))
17 changes: 16 additions & 1 deletion test/pdfboxing/text_test.clj
@@ -1,9 +1,24 @@
 (ns pdfboxing.text-test
   (:require [clojure.test :refer [deftest is]]
-            [pdfboxing.text :refer [extract]]))
+            [pdfboxing.text :refer [extract extract-by-areas]]))

 (def line-separator (System/getProperty "line.separator"))

 (deftest text-extraction
   (is (= (str "Hello, this is pdfboxing.text" line-separator)
          (extract "test/pdfs/hello.pdf"))))
+
+(deftest text-extract-by-areas
+  (let [areas [{:x 150
+                :y 100
+                :w 260
+                :h 40
+                :page-number 0}
+               {:x 380
+                :y 500
+                :w 27
+                :h 23
+                :page-number 4}]]
+    (is (= ["Clojure 1.6 Cheat Sheet (v21)\n"
+            "*ns*\n"]
+           (extract-by-areas "test/pdfs/multi-page.pdf" areas)))))