A library of foundational NLP algorithms implemented from scratch in PyTorch.
- Documentation
- Installation
- Features
- Examples
- Contributing
- Acknowledgements
- About Me
- Lessons Learned
- License
- Feedback
Install with pip
pip install ScratchNLP
Clone the repo
gh repo clone shanmukh05/scratch_nlp
Install dependencies
pip install -r requirements.txt
- Algorithms
  - Bag of Words
  - Ngram
  - TF-IDF
  - Hidden Markov Model
  - Word2Vec
  - GloVe
  - RNN (Many to One)
  - LSTM (One to Many)
  - GRU (Many to Many Synced)
  - Seq2Seq + Attention (Many to Many)
  - Transformer
  - BERT
  - GPT-2
- Tokenization
  - Byte Pair Encoding
  - WordPiece Tokenizer
- Metrics
  - BLEU
  - ROUGE (-N, -L, -S)
  - Perplexity
  - METEOR
  - CIDEr
- Datasets
  - IMDB Reviews Dataset
  - Flickr Dataset
  - NLTK POS Datasets (treebank, brown, conll2000)
  - SQuAD QA Dataset
  - Genius Lyrics Dataset
  - LAMBADA Dataset
  - Wiki en Dataset
  - English to Telugu Translation Dataset
- Tasks
  - Sentiment Classification
  - POS Tagging
  - Image Captioning
  - Machine Translation
  - Question Answering
  - Text Generation
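To give a flavour of what "from scratch" means for the simplest algorithm listed above, here is a minimal Bag of Words sketch in plain Python. It is an illustration only, not the code used in this library: the function names `build_vocab` and `bag_of_words`, the whitespace tokenization, and the two sample sentences are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to a column index, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Return a count vector for `doc`; tokens outside `vocab` are ignored."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Hypothetical sample documents, made up for this illustration
docs = ["the movie was great", "the plot was thin and the acting was flat"]
vocab = build_vocab(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Each document becomes a vector with one slot per vocabulary word, holding that word's count. Word order is discarded, which is the limitation the sequence models in the list are meant to address.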
Algorithm | Task | Tokenization | Output | Dataset |
---|---|---|---|---|
BOW | Text Representation | Preprocessed words | | IMDB Reviews |
Ngram | Text Representation | Preprocessed words | | IMDB Reviews |
TF-IDF | Text Representation | Preprocessed words | | IMDB Reviews |
HMM | POS Tagging | Preprocessed words | | NLTK Treebank |
Word2Vec | Text Representation | Preprocessed words | | IMDB Reviews |
GloVe | Text Representation | Preprocessed words | | IMDB Reviews |
RNN | Sentiment Classification | Preprocessed words | | IMDB Reviews |
LSTM | Image Captioning | Preprocessed words | | Flickr 8k |
GRU | POS Tagging | Preprocessed words | | NLTK Treebank, Brown, Conll2000 |
Seq2Seq + Attention | Machine Translation | Tokenization | | English to Telugu Translation |
Transformer | Lyrics Generation | BytePairEncoding | | Genius Lyrics |
BERT | NSP Pretraining, QA Finetuning | WordPiece | | Wiki en, SQuAD v1 |
GPT-2 | Sentence Completion | BytePairEncoding | | LAMBADA |
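Run training and inference directly through import
import yaml
from scratch_nlp.src.core.gpt import gpt

# Path to the YAML config for the chosen algorithm
config_path = "<config_path>"

with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Use a distinct name so the imported `gpt` module is not shadowed
model = gpt.GPT(config_dict)
model.run()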
Run through CLI
cd src
python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
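The three flags in the command above can be read with `argparse` roughly as sketched below. This is an assumption about how such a CLI might be wired up, not a copy of `main.py`: the `build_parser` helper, the choice to make every flag required, and the example values (`configs/gpt.yaml`, `gpt`, `output/gpt`) are all hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the CLI example."""
    parser = argparse.ArgumentParser(description="Run an algorithm from a config file")
    # Marking the flags as required is an assumption made for this sketch
    parser.add_argument("--config_path", required=True, help="Path to the YAML config file")
    parser.add_argument("--algo", required=True, help="Name of the algorithm to run")
    parser.add_argument("--log_folder", required=True, help="Folder where outputs are written")
    return parser


# Hypothetical argument values, passed explicitly instead of reading sys.argv
args = build_parser().parse_args(
    ["--config_path", "configs/gpt.yaml", "--algo", "gpt", "--log_folder", "output/gpt"]
)
print(args.algo, args.config_path, args.log_folder)
```

Passing the argument list explicitly keeps the sketch runnable in a notebook or test. A real entry point would call `parse_args()` with no arguments so the values come from the command line.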
Contributions are always welcome!
See CONTRIBUTING.md for ways to get started.
I referred to many online resources while creating this project and am listing them all in RESOURCES.md. Thanks to everyone who created those blogs, code, and datasets 😊.
Thanks to the CS224N course, which gave me the motivation to start this project.
I am Shanmukha Sainath, an AI Engineer at KLA Corporation. I completed my Bachelor's degree in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications.
Most of what is in this project was new to me. My major learnings from creating it are:
- NLP Algorithms
- Research paper Implementation
- Designing Project structure
- Documentation
- GitHub pages
- PIP packaging
If you have any feedback, please reach out to me at [email protected]