Add a `WordIndices` struct #264

bbqsrc · 2019-03-21T14:09:22Z

We have Words, WordBounds and WordBoundIndices but not WordIndices, and for a tokeniser for a spellchecker I'm working on, this would be extremely nice. 😄


behnam · 2019-04-01T19:33:53Z

Thanks for filing this, @bbqsrc.

We haven't spent much time on the string-level API yet, which is why it isn't extensive. No objections to adding WordIndices: as always, PRs are welcome!

Also, IMHO we should try to come up with better naming for these as a higher-level API. A WordIterator in this case may actually emit white-space-only or punctuation tokens, which are not words, per se.

Any ideas/suggestions are welcome! :)

projektir · 2019-07-30T15:39:37Z

There also seems to be some disagreement between the doc on the Words iterator and what it actually does. The doc says that the Words iterator should return only alphanumeric substrings, but Words actually returns all the substrings, and the alphanumeric part is accomplished by a filter that happens to be applied in all the tests and examples.

It would perhaps be beneficial for performance reasons to have a separate iterator that filters for alphanumeric characters from the beginning? To summarize the interfaces:

These use the current WordBounds iterator:

- WordBoundIndices, to emit all the tokens, including whitespace, along with their indices
- WordBounds, just the tokens

These would require a new iterator (that I'm interested in contributing):

- WordIndices, as @bbqsrc suggested, to emit only alphanumeric tokens and their indices
- Words, just the alphanumeric tokens

Words would also drop its filter argument. Words is already an iterator and it seems trivial for users to add a .filter() on top.
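To make the proposal concrete, here is a minimal sketch of what a WordIndices adapter could look like. It is not the crate's implementation: the struct layout, the `new` signature, and the choice to wrap any iterator of `(usize, &str)` pairs are all assumptions made for illustration. The "contains at least one alphanumeric character" test mirrors the filter described above as being used in the tests and examples.

```rust
/// Hypothetical adapter (not part of the library): wraps an iterator of
/// `(byte_offset, token)` pairs and yields only the tokens that contain
/// at least one alphanumeric character.
pub struct WordIndices<I> {
    inner: I,
}

impl<I> WordIndices<I> {
    /// Assumed constructor; a real API might take `&str` directly instead.
    pub fn new(inner: I) -> Self {
        WordIndices { inner }
    }
}

impl<'a, I> Iterator for WordIndices<I>
where
    I: Iterator<Item = (usize, &'a str)>,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Advance past whitespace-only and punctuation-only tokens,
        // returning the next token that looks like a word.
        self.inner
            .by_ref()
            .find(|(_, token)| token.chars().any(char::is_alphanumeric))
    }
}
```

Assuming WordBoundIndices yields `(usize, &str)` pairs, it could be plugged in as `WordIndices::new(WordBoundIndices::new(text))`. A user wanting different filtering could still chain `.filter()` on WordBoundIndices themselves, which is the flexibility the comment above refers to.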

behnam added the enhancement, C: segmentation, and A: lib-api labels · 2019-04-01
