Reproducible example with tests - difficulty delineating Q&As in transcript due to redactions #822
MaxPowerWasTaken
started this conversation in
Ask for help with specific PDFs
-
Update: I just placed a 300-point bounty on this question at Stack Overflow, in case anyone here is interested in that sort of thing.
-
I posted the question below on Stack Overflow (https://stackoverflow.com/q/75528441/1870832), but wanted to raise it here as well:
I wrote the following Python script as a reproducible example of my current PDF-parsing hangup. It:
Running the Python code below generates the following output:
Your mission, should you choose to accept it: get this passage10 test to pass without breaking one of the previous tests. I'm hoping there's a clever regex or other modification in extract_q_a_locations below that will do the trick, but I'm open to any solution that passes all these tests, as I chose these test passages deliberately.
A little background on this transcript text, in case it's not as fun for you to read as it is for me: sometimes a passage starts with a "Q" or "A", and sometimes it starts with a name (e.g. "Ms. Cheney."). The failing test, for passage 10, covers a question asked by a staff member whose name is redacted. The only way I've managed to get that test to pass has inadvertently broken one of the other tests, because not all redactions indicate the start of a question. (Note: in the PDF/OCR library I'm using, pdfplumber, redacted text usually shows up as just a bunch of extra spaces.)
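To make the three kinds of passage openers concrete, here is a minimal sketch. It is not the script from this post and not a fix for passage 10: the names `classify_line` and `find_passage_starts`, the regexes, the four-space threshold, and the sample lines are all hypothetical, made up for illustration. The redaction rule is the naive one described above, so it flags every leading run of spaces, including redactions that do not start a question.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical patterns for illustration; not the extract_q_a_locations logic.
# 1. A line opening with a bare "Q" or "A" marker followed by text.
#    Caveat: this also matches an ordinary sentence that begins with "A ".
QA_MARKER = re.compile(r"^\s*(?P<marker>[QA])\s+\S")
# 2. A line opening with a titled speaker name, e.g. "Ms. Cheney."
NAMED_SPEAKER = re.compile(r"^\s*(?:Mr|Ms|Mrs|Dr)\.\s+[A-Z][A-Za-z'\-]+\.")
# 3. Assumption: a redacted name appears as 4+ leading spaces before the text.
REDACTED_SPEAKER = re.compile(r"^\s{4,}\S")


def classify_line(line: str) -> Optional[str]:
    """Return 'Q', 'A', 'name', 'redacted?' or None for a single line."""
    qa = QA_MARKER.match(line)
    if qa:
        return qa.group("marker")
    if NAMED_SPEAKER.match(line):
        return "name"
    if REDACTED_SPEAKER.match(line):
        # Only a guess: not every redaction marks the start of a question.
        return "redacted?"
    return None


def find_passage_starts(text: str) -> List[Tuple[int, str]]:
    """Return (line_index, kind) for every line that looks like a passage start."""
    starts = []
    for index, line in enumerate(text.splitlines()):
        kind = classify_line(line)
        if kind is not None:
            starts.append((index, kind))
    return starts


# Made-up sample lines, not taken from the transcript.
sample = "\n".join([
    "Q  Did you attend the meeting?",
    "A  Yes, I did.",
    "Ms. Cheney. Thank you for joining us.",
    "          Could you state your role?",
    "and then we moved on.",
])

print(find_passage_starts(sample))
```

The "redacted?" label is deliberately tentative: telling a redacted questioner apart from a redaction in the middle of an answer needs more context than a single line provides, which is the difficulty described above.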
Code below: