Search for only English pages #173

Open
MilesAheadAlso opened this issue Feb 20, 2018 · 1 comment

Comments

@MilesAheadAlso

I'm trying to do sentiment analysis, and sentiment is obviously very language-dependent, even dialect-dependent (Scottish vs. English), so I'd like to retrieve only pages with a certain language identifier. I'm not sure this is even possible, but I thought I'd ask.

@nlch

nlch commented Mar 7, 2018

As far as I know, there's no direct way to do that. You could search for pages using specific English keywords, but that's no guarantee you won't get false positives...

The best approach is either to make a list of pages you KNOW are in English or, starting from some set of pages, to run a language detection package like textcat, cld2 or cld3 on the posts of those pages.
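To make the second option concrete, here is a rough Python sketch of the filtering step. Everything in it is an assumption for illustration: `naive_detect` is a toy stopword counter standing in for a real detector (cld2, cld3 or textcat would take its place), and `filter_by_language`, `share_in_language` and the stopword lists are hypothetical names, not part of any package. Posts are assumed to be plain strings you have already retrieved.

```python
from typing import Callable, Dict, List, Optional, Set

# Tiny, hypothetical stopword lists. A stand-in only: a real language
# detection package would replace this whole lookup.
STOPWORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "es": {"el", "la", "y", "es", "de", "que", "en", "los"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
}


def naive_detect(text: str) -> Optional[str]:
    """Guess a language code by counting stopword hits; None if nothing matches."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=lambda lang: scores[lang])
    return best if scores[best] > 0 else None


def filter_by_language(
    posts: List[str],
    detect: Callable[[str], Optional[str]] = naive_detect,
    target: str = "en",
) -> List[str]:
    """Keep only the posts whose detected language equals `target`."""
    return [post for post in posts if detect(post) == target]


def share_in_language(
    posts: List[str],
    detect: Callable[[str], Optional[str]] = naive_detect,
    target: str = "en",
) -> float:
    """Fraction of a page's posts detected as `target` (0.0 for an empty page)."""
    if not posts:
        return 0.0
    return len(filter_by_language(posts, detect, target)) / len(posts)


if __name__ == "__main__":
    sample_posts = [
        "The cat is in the garden and it is happy",
        "El gato está en la casa y es de los vecinos",
        "Der Hund ist nicht das Problem und die Katze",
    ]
    print(filter_by_language(sample_posts))
    print(share_in_language(sample_posts))
```

You could keep a page only if `share_in_language` is above some cutoff you pick yourself. Note that a detector like this works at the language level, so it is unlikely to separate dialects such as Scottish vs English.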

Hope that helps.
