
Missing Training Script? #3

Open
ghost opened this issue Feb 11, 2019 · 13 comments
Labels
enhancement New feature or request

Comments

@ghost

ghost commented Feb 11, 2019

Hey, I read your blog post about profanity-check, so I've seen the code there, but I'm wondering whether you have a separate file for training? And/or one for validation or "benchmarking"?

If so, I'd love to see those in the repo. :)

@vzhou842 added the enhancement label Feb 11, 2019
@vzhou842
Owner

Hey, thanks for the comment. I do have all of that code but unfortunately it's a bit scattered and not really in good shape to be uploaded to the repo. If anyone else is interested in seeing this, please comment on this issue! I'll clean up my code and upload it if a few people want to see it.

@ghost
Author

ghost commented Feb 11, 2019

Welp, it's something I'd be interested in playing with, potentially contributing towards, if you do ever get around to sharing it. :)

@alexandrduduka

alexandrduduka commented Mar 24, 2019

@vzhou842, thank you for your awesome job, it's really admirable!
I would be interested in seeing the code as well. Is the piece of code mentioned in the article enough to retrain the model? I want to feed it more data and change the requirements a bit (I need to check not only for profanity, but for some other things too).
I would also like to ask: the profanity-filter library claims to use deep analysis to catch misspellings. Do you think it would be possible to apply that approach on top of your library to improve precision at the cost of speed? As far as I understand, it shouldn't be possible, since you don't maintain an explicit blacklist, so there is nothing to convert, but maybe there is another way I'm not seeing? Having a larger dataset with popular misspellings doesn't seem to fully resolve the issue either, since there are just too many ways to misspell each word.
Sorry if my questions are profane, I'm just getting started with machine learning :-)

@vzhou842
Owner

@alexandrduduka this library is based on scikit-learn's LinearSVC class, so I'd recommend playing with that if you want to reproduce something similar.

As far as improving precision goes, there are lots of ways to do that (all of which would come at the cost of speed). That's too big of a question for me to answer concisely, but basically you'd have to use more complex / powerful models and possibly use better / more data preprocessing.
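Since the official training script isn't in the repo, here is a minimal sketch of the kind of pipeline the suggestion above points at: a bag-of-words vectorizer feeding scikit-learn's LinearSVC, wrapped in CalibratedClassifierCV so it can emit probabilities. The tiny corpus, the labels, and the `predict_prob` helper are illustrative assumptions for this thread, not the actual profanity-check training code or dataset.

```python
# Hedged sketch, NOT the author's training script: a CountVectorizer +
# LinearSVC pipeline, as suggested above. All data below is placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Toy corpus: label 1 = offensive, 0 = clean (illustrative examples only).
texts = [
    "have a great day",
    "thanks for your help",
    "what a lovely idea",
    "you absolute idiot",
    "shut up you fool",
    "what a stupid moron",
]
labels = [0, 0, 0, 1, 1, 1]

# Bag-of-words features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# LinearSVC has no predict_proba, so wrap it in CalibratedClassifierCV
# to get probability estimates (cv=3 only because the toy set is tiny).
model = CalibratedClassifierCV(LinearSVC(), cv=3)
model.fit(X, labels)

def predict_prob(text):
    """Probability that `text` is offensive, per this toy model."""
    return model.predict_proba(vectorizer.transform([text]))[0][1]
```

To retrain with your own data (as several commenters want), you would swap in your own `texts`/`labels` and drop the small `cv` value; everything else stays the same.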

@adarsa

adarsa commented Apr 17, 2019

@vzhou842 Thank you for the model. Would like to see the script for training and benchmarking you have presented.
Looking forward to being able to contribute, extend this.

@vshestopalov

Interested.

@vaibhavvi-dev

@vzhou842 I would like to see the model training script as well.

@yasersakkaf

I am interested too in seeing the script.
Doesn't matter how it's written.

@vaibhavvi-dev

Can we translate this dataset to another language, e.g. Japanese?
Is there a good option for doing that?

@Abhi-algo8

I would love to see the training code please :)

@jgentil

jgentil commented May 27, 2020

I definitely want to see it.

@ishanjoshi02

Yes. This would be helpful.

Would like to train the model against my own abusive words.

@doctor-henry

Definitely will be helpful.
