cál too high up? #3

Open
eoghanmurray opened this issue Jun 28, 2019 · 2 comments

Comments

@eoghanmurray

Sorry, I just wanted to register a further issue, although I know this is an old repository.
I'm wondering why cál is so high up the list, as 'kale/cabbage' doesn't seem to merit such a high position.

Anyhow, it's probably time I dived into creating a similar word frequency list from the source texts, as then I'll be able to investigate this myself!

@eoghanmurray
Author

Another one is comhalta, which I presume is so high because the corpus contained a large number of legislative/legal texts.

I've since acquired Liostaí Bhreacadh (book, https://www.breacadh.ie/), which covers the top 500 words and divides them into spoken vs. written language.

Maybe a link to that would be appropriate on the front page?

@michmech
Owner

"Cál" is so high up probably because the New Corpus for Ireland has incorrectly lemmatized some occurrences of "cáil" as a form of "cál", whereas most of the time "cáil" is actually either its own lemma (a noun meaning "reputation", "famousness") or a non-standard compound of "cá bhfuil" ("cáil tú?" = "cá bhfuil tú?" = "where are you?").

The high score of "comhalta" is probably explainable as you say.
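
To illustrate the lemmatization effect described above, here is a minimal sketch of how mapping "cáil" onto the lemma "cál" would inflate the count for "cál". The token list, both lemma tables, and the `lemma_frequencies` helper are hypothetical and made up for illustration; they are not taken from the New Corpus for Ireland or from this repository. The sketch also simplifies by mis-lemmatizing every occurrence of "cáil", whereas the real corpus presumably does so only for some.

```python
from collections import Counter

# Hypothetical toy corpus of surface tokens (not real corpus data).
tokens = ["cáil", "cál", "cáil", "comhalta", "cáil", "cál", "cáil"]

# Hypothetical lemma tables: the faulty one folds "cáil" into "cál",
# the corrected one treats "cáil" as its own lemma.
faulty_lemmas = {"cáil": "cál", "cál": "cál", "comhalta": "comhalta"}
corrected_lemmas = {"cáil": "cáil", "cál": "cál", "comhalta": "comhalta"}


def lemma_frequencies(tokens, lemma_table):
    """Count lemmas, falling back to the token itself when it is unmapped."""
    return Counter(lemma_table.get(tok, tok) for tok in tokens)


faulty = lemma_frequencies(tokens, faulty_lemmas)
corrected = lemma_frequencies(tokens, corrected_lemmas)

# Compare how often the lemma "cál" is counted under each table.
print("faulty:   ", faulty.most_common())
print("corrected:", corrected.most_common())
```

With the faulty table, "cál" absorbs every occurrence of "cáil" and rises to the top of the toy list; with the corrected table, the two lemmas are counted separately.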
