Commit

Add new NLP datasets and LLM datasets
PhilipMay committed Dec 30, 2023
1 parent cadf053 commit 7373c97
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion source/machine-learning/nlp-datasets.md
@@ -41,7 +41,17 @@
- <http://www.romanklinger.de/scare/>
- More Data: <https://github.com/WladimirSidorenko/CGSA>

## Text Corpus

### Multilingual Text Corpus

- [RedPajama-Data-v2](https://www.together.ai/blog/redpajama-data-v2)
- Wikipedia (multiple languages): <https://huggingface.co/datasets/wikimedia/wikipedia>

## LLM Datasets

- Function calling: <https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2>

## More

- <https://github.com/UKPLab/sentence-transformers/issues/747#issuecomment-776993279>