This webpage provides a collection of datasets for benchmarking word similarity and relatedness. The datasets are described in:
Kliegr, Tomáš, and Ondřej Zamazal. Antonyms are similar: Towards paradigmatic association approach to rating similarity in SimLex-999 and WordSim-353. Data & Knowledge Engineering 115 (2018): 174-193.
Link to the paper: Antonyms are similar: Towards paradigmatic association approach to rating similarity in SimLex-999 and WordSim-353
- simlex999cs.csv: SimLex-999 word pairs reannotated according to the original SimLex-999 guidelines - CZECH version.
- wordlex999cs.csv: SimLex-999 word pairs reannotated according to the WordSim353 guidelines - CZECH version.
- winlex999cs.csv: SimLex-999 word pairs reannotated according to the word interchangeability guidelines - CZECH version.
- wordsim353crowd.csv: WordSim353 word pairs reannotated according to the original WordSim353 guidelines using crowdsourcing.
- win353.csv: WordSim353 word pairs reannotated according to the word interchangeability guidelines.
- explicitsim353.csv: WordSim353 word pairs reannotated according to the explicit similarity guidelines.
- win353cs.csv: WordSim353 word pairs reannotated according to the word interchangeability guidelines - CZECH version.
- searchkeys_automatic_wordsim353.csv: Automatic mappings - WordSim353
- searchkeys_crowdsourced_wordsim353.csv: Crowdsourced mappings - WordSim353
- searchkeys_automatic_simlex666.csv: Automatic mappings - SimLex-666
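A common way to use rating files like the ones listed above is to compute the Spearman rank correlation between the human ratings and the scores produced by a model. The following is a minimal sketch of that evaluation using only the Python standard library. The column names `word1`, `word2` and `rating`, the comma delimiter, and the toy values are assumptions made for illustration; check the header of the actual CSV file and adjust the parameters accordingly.

```python
import csv
import io
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(csv_file, score_fn,
             word1_col="word1", word2_col="word2", rating_col="rating"):
    """Correlate score_fn(word1, word2) with the human ratings in csv_file.

    The column names are hypothetical defaults, not taken from the datasets.
    """
    human, model = [], []
    for row in csv.DictReader(csv_file):
        human.append(float(row[rating_col]))
        model.append(score_fn(row[word1_col], row[word2_col]))
    return spearman(human, model)


# Toy in-memory data standing in for a real file; all values are made up.
sample = io.StringIO(
    "word1,word2,rating\n"
    "old,new,1.0\n"
    "smart,intelligent,9.0\n"
    "cup,mug,8.0\n"
)
toy_scores = {
    ("old", "new"): 0.2,
    ("smart", "intelligent"): 0.9,
    ("cup", "mug"): 0.7,
}
rho = evaluate(sample, lambda a, b: toy_scores[(a, b)])
print(f"Spearman correlation on the toy data: {rho:.3f}")
```

To run this on one of the files, pass an opened file object instead of `sample` (for example `open("win353.csv", newline="", encoding="utf-8")`, assuming the file is in the working directory) and replace the toy lookup with the similarity function of the model being evaluated.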
These datasets are licensed under a Creative Commons Attribution 4.0 International License.
The English SimLex-999 word pairs and instructions are credited to
Hill, Felix, Roi Reichart, and Anna Korhonen. "Simlex-999: Evaluating semantic models with (genuine) similarity estimation." Computational Linguistics 41.4 (2015): 665-695.
The English WordSim-353 word pairs and instructions are credited to
Finkelstein, Lev, et al. "Placing search in context: The concept revisited." Proceedings of the 10th international conference on World Wide Web. ACM, 2001.
The Czech WordSim-353 word pairs and instructions are credited to
Cinková, Silvie. "WordSim353 for Czech." In International Conference on Text, Speech, and Dialogue, pp. 190-197. Springer, Cham, 2016.
Agirre, Eneko, et al. "A study on similarity and relatedness using distributional and wordnet-based approaches." Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.
More details are available at the WIN-353 website.