Make tests pass #11

Open
wants to merge 1 commit into
base: languagetools

kant2002

No description provided.

@dchaplinsky
Contributor

Let's add more tests for tokenize_text/tokenize_sents one day.

Also, the current sentence tokenization algorithm is very naive, and it is actually used the wrong way round.
The tokenizer expects to receive a sentence and split it into words.
So the correct scheme is going to look like this:
the segmentor (choppa was the plan) breaks the text into sentences,
then each sentence is fed into the tokenizer.

What is currently implemented is the reverse scheme: the whole text is tokenized, then the results of the tokenization are segmented into sentences. That has to be thoroughly tested (you might use the choppa tests to see whether the reverse scheme works sensibly).
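
To make the difference between the two schemes concrete, here is a rough sketch. The `segment` and `tokenize` helpers are hypothetical regex stand-ins written for illustration only; they are not the tokenize_uk or choppa implementations, and the real rules are far more involved.

```python
import re


def segment(text):
    """Hypothetical naive segmentor: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Hypothetical naive tokenizer: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def sentences_then_words(text):
    """Intended scheme: segment the text first, then tokenize each sentence."""
    return [tokenize(sentence) for sentence in segment(text)]


def words_then_sentences(text):
    """Reverse scheme: tokenize the whole text, then cut the token stream into sentences."""
    sentences, current = [], []
    for token in tokenize(text):
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sample in ("Привіт, світ! Це тест.", "Ціна 3.5 грн."):
        print(sentences_then_words(sample))
        print(words_then_sentences(sample))
```

With these toy helpers the two schemes agree on simple input but can diverge as soon as a full stop is not a sentence boundary (a decimal number, for instance), which is the kind of case the tests would need to cover.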

Another option is to use the old implementation from v1 for the segmentor, which can be found here: https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokenize_uk.py#L57

The ideal solution is to finish the segmentor, of course, but that work got stuck because of the differences between the regex APIs of Java and Python.

Anyway, thank you a ton for looking into this, hopefully we can get this baby shipped one day.

@kant2002
Author

If you noticed, I changed the tox configuration to exclude Python 2.x.
My question: is this target still valuable to you? Do you expect that any reasonable user still uses Python 2.x nowadays?

For the 3.x branch, it is not so clear to me what the proper minimum version should be.

@dchaplinsky
Contributor

No, it's not

@kant2002
Author

What about the minimum Python 3 version? I set it to 3.6 just in case, even though I would not use it myself.

@dchaplinsky
Contributor

3.6 is fine. Feel free to drop it if it causes trouble.

It was fun to learn that 3.6 is still by far the most popular version: https://w3techs.com/technologies/history_details/pl-python/3
