how to extract text from 2 column pdf file #890

bugm · 2023-05-24T03:50:00Z


Hello guys, suppose I have a PDF page that looks like this:

I want to extract the text in this order:

4. Results and discussion
4.1. Experimental setup and simulation environment
To test the performance of the prototype, an experiment setup
for the biaxial-pendulum vibration energy harvester is established
and the schematic is shown in Fig. 6. The experimental setup
mainly includes a six-DOF platform, an inertial measurement unit
4.4. Loaded results from unidirectional excitation
The unidirectional excitation experiments under loaded condi
tions are carried out to examine the loaded performance of the
prototype. Fig. 9 shows the output voltage waveforms of the energy
harvester when the excitation frequency is set to 1 Hz, 1.5 Hz and
2 Hz and the excitation amplitude is set to 0.05 m and 0.07 m,
respectively. The peak values in Fig. 9 (a), (b), and (c) are 1.7 V,10.1 V

That is, I want to extract the left column's text from top to bottom, and then the right column's text from top to bottom. I tried the page.extract_text() method but could not find a way to get this result. I think I could split the characters myself by comparing each char's x coordinate with the middle x coordinate of the page. Before doing that, I would like to know whether there is a built-in function in pdfplumber, or any other easier way, to solve this?
Thanks a lot!
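
The middle-x idea above can be sketched with cropping instead of per-char filtering. This is a minimal sketch, not a built-in pdfplumber feature: it assumes the gutter sits exactly at the horizontal middle of the page, and the helper names `split_bbox_into_columns`, `extract_two_column_text`, and `extract_from_file` are made up for illustration. Only `page.bbox`, `page.crop(...)`, `.extract_text()`, and `pdfplumber.open(...)` come from pdfplumber.

```python
def split_bbox_into_columns(bbox):
    """Split (x0, top, x1, bottom) into left and right halves at the middle x."""
    x0, top, x1, bottom = bbox
    mid = (x0 + x1) / 2  # assumption: the column gutter is at the page centre
    return (x0, top, mid, bottom), (mid, top, x1, bottom)


def extract_two_column_text(page):
    """Return the left column's text followed by the right column's text."""
    left_bbox, right_bbox = split_bbox_into_columns(page.bbox)
    # Crop each half and extract it on its own, so lines are not merged across columns.
    parts = [page.crop(bbox).extract_text() or "" for bbox in (left_bbox, right_bbox)]
    return "\n".join(parts)


def extract_from_file(path):
    """Apply the column split to every page of the PDF at `path`."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return [extract_two_column_text(page) for page in pdf.pages]
```

If the gutter is off-centre, the `mid` value would need to be adjusted or measured from the page rather than assumed.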

jsvine · 2023-05-24T17:23:08Z

Maintainer

Hi @bugm, and thanks for your interest in this library. Does this guidance here help?:

3 replies

bugm May 25, 2023
Author

@jsvine
Yes, it helps. Thanks for your answer.
By the way, I have run into pages that contain both single-column and two-column text. For example, the abstract is set in one column and the body text below it in two columns, so I have to find the top of the two-column part.
I am now trying different values for top and, for each one, running text extraction on a crop of the middle 3% of the page to see whether it contains any text. Do you have any suggestions?
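
A rough sketch of the scan described above, under a few assumptions: the strip is centred on the page, the step size of 5 points is arbitrary, and nothing centred (such as a page number) sits below the two-column text. The function name `find_two_column_top` and its parameters are hypothetical; `page.bbox`, `page.crop(...)`, and `.extract_text()` are pdfplumber calls.

```python
def find_two_column_top(page, strip_frac=0.03, step=5):
    """Return the smallest `top` for which the centre strip below it has no text.

    Returns None if every tested strip still contains text.
    """
    x0, page_top, x1, page_bottom = page.bbox
    mid = (x0 + x1) / 2
    half = (x1 - x0) * strip_frac / 2  # half the width of the centre strip
    top = page_top
    while top < page_bottom:
        # Strip runs from the candidate `top` down to the bottom of the page.
        strip = page.crop((mid - half, top, mid + half, page_bottom))
        if not (strip.extract_text() or "").strip():
            return top
        top += step
    return None
```

A centred footer would keep the strip non-empty almost to the bottom of the page, so in that case the lower bound of the crop would have to be moved above the footer.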

jsvine May 25, 2023
Maintainer

Interesting! Another possible approach would be to use .extract_words(...), and to examine the position of words that overlap with the center.
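
One way this could look, as a sketch only: the helper names below are hypothetical, and the code assumes that single-column lines have at least one word spanning the page's middle x coordinate while two-column lines never do. The `x0`, `x1`, and `bottom` keys are the ones found on the word dicts returned by `.extract_words()`.

```python
def center_crossing_words(words, mid_x):
    """Words whose horizontal extent spans the page's middle x coordinate."""
    return [w for w in words if w["x0"] < mid_x < w["x1"]]


def two_column_top_from_words(words, mid_x):
    """Bottom edge of the lowest centre-crossing word, or None if there is none."""
    crossing = center_crossing_words(words, mid_x)
    return max(w["bottom"] for w in crossing) if crossing else None


def two_column_region(page):
    """Crop `page` to the part below the last centre-crossing word."""
    x0, top, x1, bottom = page.bbox
    split = two_column_top_from_words(page.extract_words(), (x0 + x1) / 2)
    # No crossing words: treat the whole page as two-column.
    return page if split is None else page.crop((x0, split, x1, bottom))
```

A single-column line can happen to have a space exactly at the centre, so the value found this way is best treated as a lower bound for where the two-column part starts.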

bugm May 26, 2023
Author

Fine, I will try it. Thanks a lot!
