Skip to content

Stopwords for 50 languages in JSON format

Notifications You must be signed in to change notification settings

6/stopwords-json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

65 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

stopwords-json Build Status npm Bower

Stopwords for various languages in JSON format. Per Wikipedia:

Stop words are words which are filtered out prior to, or after, processing of natural language data [...] these are some of the most common, short function words, such as the, is, at, which, and on.

You can use all stopwords with stopwords-all.json (keyed by language ISO 639-1 code), or see the below table for individual language stopword files.

Languages

There are a total of 50 supported languages:

Language Stopword count Filename
Afrikaans 51 af.json
Arabic 162 ar.json
Armenian 45 hy.json
Basque 98 eu.json
Bengali 116 bn.json
Breton 126 br.json
Bulgarian 259 bg.json
Catalan 218 ca.json
Chinese 542 zh.json
Croatian 179 hr.json
Czech 346 cs.json
Danish 101 da.json
Dutch 275 nl.json
English 570 en.json
Esperanto 173 eo.json
Estonian 35 et.json
Finnish 772 fi.json
French 606 fr.json
Galician 160 gl.json
German 596 de.json
Greek 75 el.json
Hausa 39 ha.json
Hebrew 194 he.json
Hindi 225 hi.json
Hungarian 781 hu.json
Indonesian 355 id.json
Irish 109 ga.json
Italian 619 it.json
Japanese 109 ja.json
Korean 679 ko.json
Latin 49 la.json
Latvian 161 lv.json
Marathi 99 mr.json
Norwegian 172 no.json
Persian 332 fa.json
Polish 260 pl.json
Portuguese 408 pt.json
Romanian 282 ro.json
Russian 539 ru.json
Slovak 110 sk.json
Slovenian 446 sl.json
Somalia 30 so.json
Southern Sotho 31 st.json
Spanish 577 es.json
Swahili 74 sw.json
Swedish 401 sv.json
Thai 115 th.json
Turkish 279 tr.json
Yoruba 60 yo.json
Zulu 29 zu.json

Sources

License and Copyright

Copyright (c) 2017 Peter Graham, contributors. Released under the Apache-2.0 license.