feat: upgrade from xx_sent_ud_sm to SaT #74

Merged: lsorber merged 1 commit into main from ls-sat on Jan 5, 2025

lsorber (Member) commented on Jan 2, 2025

Changes:

  1. Replace spaCy's xx_sent_ud_sm with wtpsplit's Segment any Text (SaT) model for sentence splitting.
  2. Expose a way to pass known sentence boundary probabilities to the sentence splitter. The default method treats each Markdown heading in the document as a single contiguous sentence.
  3. Add wtpsplit-lite as a dependency.
  4. Remove spaCy as a dependency and simplify installation instructions.
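
To illustrate change 2, here is a minimal sketch of how known boundary probabilities could override a model's predictions so that a Markdown heading is never split. It is not the code from this PR: every function name is hypothetical, the per-character probability layout is an assumption, and `toy_model_probas` is a stand-in for a real SaT model.

```python
from __future__ import annotations

import re

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6} .*$", re.MULTILINE)


def heading_boundary_probas(doc: str) -> list[float | None]:
    """Hypothetical helper: known per-character boundary probabilities.

    Entry i is the probability that a sentence ends at character i,
    or None where nothing is known and the model should decide.
    """
    probas: list[float | None] = [None] * len(doc)
    for match in HEADING.finditer(doc):
        start, end = match.span()
        # No sentence may end inside the heading...
        for i in range(start, end - 1):
            probas[i] = 0.0
        # ...one must end on its last character...
        probas[end - 1] = 1.0
        # ...and one must end just before the heading starts.
        if start > 0:
            probas[start - 1] = 1.0
    return probas


def toy_model_probas(doc: str) -> list[float]:
    """Stand-in for a real model (assumption): every period looks like a boundary."""
    return [0.9 if char == "." else 0.1 for char in doc]


def merge_probas(model: list[float], known: list[float | None]) -> list[float]:
    """Known probabilities take precedence over the model's predictions."""
    return [m if k is None else k for m, k in zip(model, known)]


def split_sentences(doc: str, probas: list[float], threshold: float = 0.5) -> list[str]:
    """Cut the document after every character whose probability reaches the threshold."""
    sentences, start = [], 0
    for i, proba in enumerate(probas):
        if proba >= threshold:
            sentences.append(doc[start : i + 1])
            start = i + 1
    if start < len(doc):
        sentences.append(doc[start:])
    return sentences


if __name__ == "__main__":
    doc = "# Notes by Dr. Smith\nIt works. Done."
    model = toy_model_probas(doc)
    print(split_sentences(doc, model))  # The toy model splits the heading after "Dr."
    merged = merge_probas(model, heading_boundary_probas(doc))
    print(split_sentences(doc, merged))  # The heading stays in one piece
```

Using `None` for "unknown" keeps the two sources separable: the model is consulted only where the document structure says nothing.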

lsorber self-assigned this on Jan 2, 2025
lsorber merged commit 41525c8 into main on Jan 5, 2025
2 checks passed
lsorber deleted the ls-sat branch on Jan 5, 2025, 15:18