Bug: split_sentence does not seem to handle newlines well #60

thiswillbeyourgithub · 2024-09-14T19:56:26Z

Hi,

I was just playing around with split_sentence and noticed this:

In [16]: split_sentence("This is a test\nAnd here's another one", "en", 25)
Out[16]: ["This is a test And here's", 'another one']

In [17]: split_sentence("This is a test.And here's another one", "en", 25)
Out[17]: ['This is a test.', "And here's another one"]

Given that I use markdown bullet points a lot, I often have lines that end with no punctuation.

What do you think about automatically replacing each newline with a period when it doesn't already follow a punctuation mark?
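To make the suggestion concrete, here is a minimal sketch of that pre-processing step, meant to run before the text reaches split_sentence. The helper name `newlines_to_periods` and the `END_PUNCT` set are hypothetical and not part of the project; which characters count as "already punctuated" is an assumption that would probably need tuning per language.

```python
# Characters that count as "already punctuated" (an assumption; adjust as needed).
END_PUNCT = ".!?:;,"


def newlines_to_periods(text: str) -> str:
    """Hypothetical helper: join lines with spaces, adding a period to any
    line (except the last) that does not already end in punctuation."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    fixed = [
        line if line.endswith(tuple(END_PUNCT)) else line + "."
        for line in lines[:-1]
    ]
    # The last line is not followed by a newline, so it is left untouched.
    fixed.append(lines[-1])
    return " ".join(fixed)


print(newlines_to_periods("This is a test\nAnd here's another one"))
# This is a test. And here's another one
```

The result is close to the second input in the report above, the one that was split at the sentence boundary, so the splitter should have an explicit boundary to work with.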

Also, there's no environment variable to set the text length for the splitter, right? I think lowering it would also reduce my VRAM needs. Any opinion on this?


matatonic · 2024-09-14T20:03:51Z

Good problem to know about, thanks. I'll consider this when updating to better support markdown generation.

Re: #56

thiswillbeyourgithub · 2024-09-20T11:33:02Z

Maybe a simple fix would be to first pass the text through pysbd instead of split_sentence, and only pass sentences that are longer than some limit to split_sentence.
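Roughly, the two-stage idea could look like the sketch below. The function `split_for_tts` and its parameters are hypothetical names used only for illustration. The segmenter and the long-sentence splitter are passed in as callables so the sketch does not assume how the project imports either of them.

```python
from typing import Callable, List


def split_for_tts(
    text: str,
    segment: Callable[[str], List[str]],
    split_long: Callable[[str, int], List[str]],
    max_len: int,
) -> List[str]:
    """Hypothetical two-stage splitter: sentence boundaries first, then a
    length-based split only for sentences that exceed max_len."""
    chunks: List[str] = []
    for sentence in segment(text):
        sentence = sentence.strip()
        if not sentence:
            continue
        if len(sentence) <= max_len:
            chunks.append(sentence)
        else:
            chunks.extend(split_long(sentence, max_len))
    return chunks


# Example wiring (assumes pysbd is installed and split_sentence is importable):
#   import pysbd
#   seg = pysbd.Segmenter(language="en", clean=False)
#   chunks = split_for_tts(
#       text, seg.segment, lambda s, n: split_sentence(s, "en", n), max_len=250
#   )
```

With this arrangement split_sentence only ever sees single sentences, so a missing period at a line break matters only if pysbd also misses the boundary. That would still need checking on bullet-point input.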

I discovered pysbd through another of your repos, so I'm also curious why you used it in some places but not here.

matatonic · 2024-09-20T21:24:05Z

I did have a version with pysbd instead, but found no major difference, except that split_sentence was perhaps better for some languages. So why include the extra dependency? Anyway, I'm probably going to restore it after I look more deeply into this problem.

matatonic added the "bug" label on Sep 14, 2024
