Add a WordIndices struct #264

Open
bbqsrc opened this issue Mar 21, 2019 · 2 comments
Labels
A: lib-api Library API C: segmentation Unicode Text Segmentation enhancement Enhancements to existing features

Comments

@bbqsrc

bbqsrc commented Mar 21, 2019

We have Words, WordBounds and WordBoundIndices but not WordIndices, and for a tokeniser for a spellchecker I'm working on, this would be extremely nice. 😄

@behnam
Member

behnam commented Apr 1, 2019

Thanks for filing this, @bbqsrc.

We haven't spent much time on the string-level API yet, which is why it isn't extensive. No objections to adding WordIndices: as always, PRs are welcome!

Also, IMHO we should try to come up with better naming for these as a higher-level API. A WordIterator in this case may actually emit whitespace-only or punctuation tokens, which are not words, per se.

Any ideas/suggestions are welcome! :)

@behnam behnam added enhancement Enhancements to existing features C: segmentation Unicode Text Segmentation A: lib-api Library API labels Apr 1, 2019
@projektir

There also seems to be some disagreement between the doc on the Words iterator and what it actually does. The doc says that the Words iterator should return only alphanumeric substrings, but Words actually returns all the substrings, and the alphanumeric part is accomplished by a filter that happens to be applied in all the tests and examples.
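To make the "all substrings plus a filter" behaviour concrete, here is a minimal std-only sketch. `toy_bounds` is a hypothetical stand-in for the crate's boundary iterator (it simply splits on alphanumeric/non-alphanumeric transitions and is not the UAX #29 algorithm), and `words_via_filter` is likewise an illustrative name, not part of the library.

```rust
/// Hypothetical stand-in for a word-boundary iterator: splits `text` into
/// maximal runs of alphanumeric and non-alphanumeric characters.
/// NOTE: this is NOT UAX #29 segmentation; it only mimics the output shape.
fn toy_bounds(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev: Option<bool> = None;
    for (i, ch) in text.char_indices() {
        let kind = ch.is_alphanumeric();
        if let Some(p) = prev {
            // A change of character class closes the current token.
            if p != kind {
                out.push(&text[start..i]);
                start = i;
            }
        }
        prev = Some(kind);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

/// The predicate applied in the tests and examples, as described above:
/// keep a token only if it contains at least one alphanumeric character.
fn has_alphanumeric(s: &str) -> bool {
    s.chars().any(|c| c.is_alphanumeric())
}

/// "Words" as it behaves today in this reading: every token is produced
/// first, and the alphanumeric restriction is only a filter on top.
fn words_via_filter(text: &str) -> Vec<&str> {
    toy_bounds(text)
        .into_iter()
        .filter(|s| has_alphanumeric(s))
        .collect()
}
```

With this shape, the unfiltered stream still contains the punctuation and whitespace tokens; only the caller-supplied predicate removes them.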

It would perhaps be beneficial for performance reasons to have a separate iterator that filters for alphanumeric characters from the beginning? To summarize the interfaces:

These use the current WordBounds iterator:

  • WordBoundIndices, to emit all the tokens including whitespace, along with their indices
  • WordBounds, just the tokens

These would require a new iterator (that I'm interested in contributing):

  • WordIndices, as @bbqsrc suggested, to emit only alphanumeric tokens and their indices
  • Words, just the alphanumeric tokens

Words would also drop its filter argument. Words is already an iterator and it seems trivial for users to add a .filter() on top.
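As a rough sketch of the layering proposed above, the adapter below wraps any iterator of `(byte offset, token)` pairs and yields only tokens containing an alphanumeric character. The struct shape, its `new` constructor, and the generic parameter are assumptions for illustration only; the real iterator would presumably be built on the crate's own boundary iterator rather than a generic wrapper.

```rust
/// Hypothetical adapter: takes (byte offset, token) pairs, such as a
/// word-bound-indices iterator might emit, and keeps only word-like tokens.
struct WordIndices<I> {
    inner: I,
}

impl<'a, I> WordIndices<I>
where
    I: Iterator<Item = (usize, &'a str)>,
{
    fn new(inner: I) -> Self {
        WordIndices { inner }
    }
}

impl<'a, I> Iterator for WordIndices<I>
where
    I: Iterator<Item = (usize, &'a str)>,
{
    type Item = (usize, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        // Skip whitespace-only and punctuation-only tokens; offsets are
        // passed through untouched, so they still index the original text.
        self.inner
            .find(|(_, tok)| tok.chars().any(char::is_alphanumeric))
    }
}

/// The index-free variant is then a projection of the indexed one.
fn words<'a, I>(inner: I) -> impl Iterator<Item = &'a str>
where
    I: Iterator<Item = (usize, &'a str)>,
{
    WordIndices::new(inner).map(|(_, tok)| tok)
}
```

Because the offsets come straight from the underlying iterator, a spellchecker can map each reported word back to its position in the source text without re-scanning.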
