Bug: split_sentence does not seem to handle newlines well #60

Open
thiswillbeyourgithub opened this issue Sep 14, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@thiswillbeyourgithub

thiswillbeyourgithub commented Sep 14, 2024

Hi,

I was just playing around with split_sentence and noticed this:

In [16]: split_sentence("This is a test\nAnd here's another one", "en", 25)
Out[16]: ["This is a test And here's", 'another one']

In [17]: split_sentence("This is a test.And here's another one", "en", 25)
Out[17]: ['This is a test.', "And here's another one"]

Given that I use markdown bullet points a lot, I often have lines that end with no punctuation.

What do you think about automatically replacing a newline with a period when it does not already follow a punctuation mark?
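To make the idea concrete, here is a minimal sketch of that preprocessing step. The helper name `terminate_lines` and the punctuation set are my own assumptions, not anything that exists in the project; the result would then be passed to split_sentence as usual.

```python
# Hypothetical helper, not part of the project: append a period to every
# non-empty line whose last visible character is not already punctuation.
# The set of characters treated as punctuation is an assumption.
PUNCTUATION = ".!?:;,…"


def terminate_lines(text: str) -> str:
    fixed = []
    for line in text.splitlines():
        stripped = line.rstrip()
        # Leave blank lines alone; only terminate lines that have content.
        if stripped and stripped[-1] not in PUNCTUATION:
            stripped += "."
        fixed.append(stripped)
    return "\n".join(fixed)


cleaned = terminate_lines("This is a test\nAnd here's another one")
```

This only guarantees that each line ends with a punctuation mark; whether the splitter then breaks at that point still depends on how split_sentence treats the newline itself.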

Also, there's no env variable to set the text length for the splitter, right? I think lowering it would also reduce my VRAM needs. Any opinion on this?

@matatonic
Owner

Good problem to know about, thanks. I'll consider this when updating to better support markdown generation.

Re: #56

@matatonic matatonic added the bug Something isn't working label Sep 14, 2024
@thiswillbeyourgithub
Author

Maybe a simple fix would be to first pass the text through pysbd instead of split_sentence, and only pass sentences that are longer than some limit to split_sentence.
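Roughly what I have in mind, as a sketch only. The function `segment_then_split` and the two stand-in callables are hypothetical; in practice `segment` would be something like `pysbd.Segmenter(language="en", clean=False).segment` and `split_long` would wrap split_sentence with the language and length already filled in.

```python
import textwrap
from typing import Callable, List


def segment_then_split(
    text: str,
    segment: Callable[[str], List[str]],
    split_long: Callable[[str], List[str]],
    max_len: int,
) -> List[str]:
    """Segment into sentences first, then split only the ones that are too long."""
    chunks: List[str] = []
    for sentence in segment(text):
        sentence = sentence.strip()
        if not sentence:
            continue
        if len(sentence) <= max_len:
            chunks.append(sentence)
        else:
            chunks.extend(split_long(sentence))
    return chunks


# Stand-ins so the sketch runs without extra dependencies (assumptions only):
# a line-based segmenter and a word-wrapping splitter.
def naive_segment(text: str) -> List[str]:
    return text.splitlines()


def naive_split(sentence: str) -> List[str]:
    return textwrap.wrap(sentence, width=25)


long_line = "This one is a much longer line that needs splitting"
result = segment_then_split("Short line\n" + long_line, naive_segment, naive_split, 25)
```

With this shape, short bullet lines pass through untouched and never get merged with the next line, while the length limit is still enforced on anything too long.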

I discovered pysbd through another of your repos, so I'm also curious why you used it in some places but not here.

@matatonic
Owner

I did have a version with pysbd instead, but found no major difference, except that split_sentence was perhaps better for some languages. So why include the extra dependency? Anyway, I'm probably going to restore it after I look more deeply into this problem.
