Enhanced handling of line breaks in pdf #127

zkn365 · 2024-09-06T12:13:01Z

I use the following function to improve the handling of line breaks in PDF text after it has been converted to Markdown. I hope it can be considered for the next release. Thanks!

import re

def remove_pdf_newlines(text):
    # Convert Windows-style newlines to Unix style
    text = text.replace('\r\n', '\n')
    # Merge a line break into a space when the previous line does not end with
    # a period, question mark, or exclamation point and the next line starts
    # with a letter. A preceding newline is excluded as well, so the blank
    # line between paragraphs is not collapsed.
    text = re.sub(r'(?<![.!?\n])\n(?=[a-zA-Z])', ' ', text)
    # Normalize paragraph breaks to exactly one blank line
    text = re.sub(r'\n\s*\n', '\n\n', text)
    # Remove trailing whitespace characters from lines
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text.strip()
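
A short usage sketch, with the function repeated so the snippet runs on its own. The sample text is made up for illustration and is not taken from a real PDF. In this sketch the lookbehind also excludes a preceding newline, so that the blank line between paragraphs is kept.

```python
import re

def remove_pdf_newlines(text):
    # Normalize Windows-style newlines
    text = text.replace('\r\n', '\n')
    # Join wrapped lines: the newline must not follow sentence-ending
    # punctuation or another newline, and must be followed by a letter
    text = re.sub(r'(?<![.!?\n])\n(?=[a-zA-Z])', ' ', text)
    # Collapse any run of blank lines into a single paragraph break
    text = re.sub(r'\n\s*\n', '\n\n', text)
    # Strip trailing spaces and tabs on every line
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text.strip()

# Hypothetical sample: two paragraphs, each hard-wrapped mid-sentence
sample = (
    "This sentence was wrapped\r\n"
    "by the PDF layout.\r\n"
    "\r\n"
    "A new paragraph\r\n"
    "starts here.  \r\n"
)

cleaned = remove_pdf_newlines(sample)
print(cleaned)
```

The wrapped lines inside each paragraph are joined with a space, while the blank line between the two paragraphs stays in place. Lines that end with sentence punctuation are left unmerged, so a paragraph whose sentences each sit on their own line keeps those breaks.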
