Commit
Merge pull request #229 from WycliffeAssociates/preserve-poetry-formatted-text

Preserve text formatted as poetry in STET
linearcombination authored Oct 23, 2024
2 parents 8281b47 + c36fceb commit 6076d14
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions backend/document/stet/stet.py
@@ -71,9 +71,12 @@ def split_chapter_into_verses(chapter: USFMChapter) -> dict[str, str]:
     verse_number = re.search(r'<sup class="versemarker">(\d+)</sup>', verse_span)
     if verse_number:
         verse_number_ = verse_number.group(1)
-        # Remove all <sup> and <div> tags and their content from the verse text
+        # Remove all <sup> tags and their content from the verse text
         verse_text = re.sub(r"<sup.*?>.*?</sup>", "", verse_span)
-        verse_text = re.sub(r"<div.*?>.*?</div>", "", verse_text)
+        # Remove <div> tags that do not have class matching "poetry-<integer>"
+        verse_text = re.sub(
+            r"<div(?!.*class=\"poetry-(\d+)\").*?>.*?</div>", "", verse_text
+        )
         # Remove the remaining HTML tags and strip extra spaces
         verse_text = re.sub(r"<.*?>", "", verse_text).strip()
         logger.debug("verse_number: %s, verse_text: %s", verse_number_, verse_text)
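
To make the effect of this change concrete, the sketch below runs the old and new substitutions on small hand-written verse spans. The sample markup and the `footnote` class are assumptions for illustration only; the HTML that STET actually receives may differ. The helper names `old_verse_text`, `new_verse_text`, and `tight_verse_text` are hypothetical and do not appear in the repository. The last helper is one possible tightening, not part of this commit: because the lookahead `.*` in the committed pattern scans the rest of the string, a non-poetry `<div>` that precedes a poetry `<div>` is left in place, and restricting the lookahead to the opening tag with `[^>]*` avoids that.

```python
import re

# Hypothetical verse spans; the class name "footnote" is an assumption.
POETRY_ONLY = '<sup class="versemarker">1</sup><div class="poetry-1">Blessed is the one</div>'
NOTE_ONLY = '<sup class="versemarker">2</sup>who walks<div class="footnote">a note</div>'
MIXED = (
    '<sup class="versemarker">3</sup>'
    '<div class="footnote">a note</div>'
    '<div class="poetry-1">in the law</div>'
)


def _finish(text: str) -> str:
    # Remove the remaining HTML tags and strip extra spaces (as in the commit).
    return re.sub(r"<.*?>", "", text).strip()


def old_verse_text(verse_span: str) -> str:
    """Before the commit: every <div> is removed together with its content."""
    text = re.sub(r"<sup.*?>.*?</sup>", "", verse_span)
    text = re.sub(r"<div.*?>.*?</div>", "", text)
    return _finish(text)


def new_verse_text(verse_span: str) -> str:
    """After the commit: the pattern is copied verbatim from the diff."""
    text = re.sub(r"<sup.*?>.*?</sup>", "", verse_span)
    text = re.sub(r"<div(?!.*class=\"poetry-(\d+)\").*?>.*?</div>", "", text)
    return _finish(text)


def tight_verse_text(verse_span: str) -> str:
    """Assumed variant (not in the commit): lookahead limited to the opening tag."""
    text = re.sub(r"<sup.*?>.*?</sup>", "", verse_span)
    text = re.sub(r"<div(?![^>]*class=\"poetry-\d+\")[^>]*>.*?</div>", "", text)
    return _finish(text)


if __name__ == "__main__":
    for name, span in [("poetry", POETRY_ONLY), ("note", NOTE_ONLY), ("mixed", MIXED)]:
        print(name, repr(old_verse_text(span)), repr(new_verse_text(span)),
              repr(tight_verse_text(span)))
```

On these samples the old code drops the poetry text entirely, while the committed pattern keeps it. The mixed case is where the committed pattern and the tightened variant differ, so it may be worth checking whether real verse spans ever contain a non-poetry `<div>` ahead of a poetry one.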