Scratch NLP 🧠

A library of foundational NLP algorithms implemented from scratch using PyTorch.

Table of Contents 📋

Documentation
Installation
Features
Examples
Contributing
Acknowledgements
About Me
Lessons Learned
License
Feedback

Documentation 📝

Installation ⬇️

Install using pip

   pip install ScratchNLP

Install Manually for development

Clone the repo

  gh repo clone shanmukh05/scratch_nlp

Install dependencies

  pip install -r requirements.txt

Features 🛠️

Algorithms
- Bag of Words
- Ngram
- TF-IDF
- Hidden Markov Model
- Word2Vec
- GloVe
- RNN (Many to One)
- LSTM (One to Many)
- GRU (Many to Many Synced)
- Seq2Seq + Attention (Many to Many)
- Transformer
- BERT
- GPT-2
Tokenization
- Byte Pair Encoding
- WordPiece Tokenizer
Metrics
- BLEU
- ROUGE (-N, -L, -S)
- Perplexity
- METEOR
- CIDEr
Datasets
- IMDB Reviews Dataset
- Flickr Dataset
- NLTK POS Datasets (treebank, brown, conll2000)
- SQuAD QA Dataset
- Genius Lyrics Dataset
- LAMBADA Dataset
- Wiki en Dataset
- English to Telugu Translation Dataset
Tasks
- Sentiment Classification
- POS Tagging
- Image Captioning
- Machine Translation
- Question Answering
- Text Generation
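
As a rough illustration of what "from scratch" means for the text representation algorithms listed above, here is a minimal sketch of Bag of Words and TF-IDF in plain Python. It is not the library's implementation: the function names `bag_of_words` and `tf_idf`, the whitespace tokenization, and the unsmoothed `log(N / df)` weighting are all assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def tf_idf(vectors):
    """Weight counts by term frequency times log(N / document frequency)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    # Document frequency: number of documents that contain each term
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    weighted = []
    for vec in vectors:
        total = sum(vec) or 1  # guard against empty documents
        weighted.append(
            [(c / total) * math.log(n_docs / df[j]) if c else 0.0
             for j, c in enumerate(vec)]
        )
    return weighted


# Hypothetical two-document corpus, used only for illustration
docs = ["the movie was great", "the movie was boring"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

Terms that occur in every document (here "the", "movie", "was") receive a weight of zero, while the words that distinguish the two reviews keep a positive weight.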

Implementation Details

Algorithm	Task	Tokenization	Output	Dataset
BOW	Text Representation	Preprocessed words	Text Label, Vector npy files; Top K Vocab Frequency Histogram png; Vocab frequency csv; Wordcloud png	IMDB Reviews
Ngram	Text Representation	Preprocessed words	Text Label, Vector npy files; Top K Vocab Frequency Histogram png; Top K ngrams Piechart png; Vocab frequency csv; Wordcloud png	IMDB Reviews
TF-IDF	Text Representation	Preprocessed words	Text Label, Vector npy files; TF PCA Pairplot png; TF-IDF PCA Pairplot png; IDF csv	IMDB Reviews
HMM	POS Tagging	Preprocessed words	Data Analysis png (sent len, POS tags count); Emission Matrix TSNE html; Emission Matrix csv; Test Predictions conf matrix, clf report png; Transition Matrix csv, png	NLTK Treebank
Word2Vec	Text Representation	Preprocessed words	Best Model pt; Training History json; Word Embeddings TSNE html	IMDB Reviews
GloVe	Text Representation	Preprocessed words	Best Model pt; Training History json; Word Embeddings TSNE html; Top K Co-occurrence Matrix png	IMDB Reviews
RNN	Sentiment Classification	Preprocessed words	Best Model pt; Training History json; Word Embeddings TSNE html; Confusion Matrix png; Training History png	IMDB Reviews
LSTM	Image Captioning	Preprocessed words	Best Model pt; Training History json; Word Embeddings TSNE html; Training History png	Flickr 8k
GRU	POS Tagging	Preprocessed words	Best Model pt; Training History json; Word Embeddings TSNE html; Confusion Matrix png; Test predictions csv; Training History png	NLTK Treebank, Brown, Conll2000
Seq2Seq + Attention	Machine Translation	Tokenization	Best Model pt; Training History json; Source, Target Word Embeddings TSNE html; Test predictions csv; Training History png	English to Telugu Translation
Transformer	Lyrics Generation	BytePairEncoding	Best Model pt; Training History json; Token Embeddings TSNE html; Test predictions csv; Training History png	Genius Lyrics
BERT	NSP Pretraining, QA Finetuning	WordPiece	Best Model pt (pretrain, finetune); Training History json (pretrain, finetune); Token Embeddings TSNE html; Finetune Test predictions csv; Training History png (pretrain, finetune)	Wiki en, SQuAD v1
GPT-2	Sentence Completion	BytePairEncoding	Best Model pt; Training History json; Token Embeddings TSNE html; Test predictions csv; Training History png	LAMBADA
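
The Transformer and GPT-2 rows above rely on Byte Pair Encoding. As a hedged sketch of the core idea (repeatedly merging the most frequent adjacent symbol pair), the snippet below learns merge rules from a tiny word list. The function name `learn_bpe_merges` is hypothetical, and the example omits end-of-word markers and byte-level handling, so it should not be read as the library's tokenizer.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Represent each word as a tuple of symbols, starting from characters
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties resolve to the first pair seen (insertion order)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


# Hypothetical toy corpus, used only for illustration
merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
```

With this toy corpus the first two merges build the shared prefix: `l` and `o` are fused first, then `lo` and `w`.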

Examples 🌟

Run training and inference directly through import

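import yaml
from scratch_nlp.src.core.gpt import gpt

config_path = "<config_path>"  # path to your YAML config file

# Load the training/inference configuration
with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Use a distinct variable name so the imported gpt module is not shadowed
model = gpt.GPT(config_dict)
model.run()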

Run through CLI

  cd src
  python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
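
For readers curious how the three flags in the command above could be consumed, here is a minimal `argparse` sketch. It only mirrors the flag names shown in the command; the `build_parser` helper, the help strings, the choice to make every flag required, and the sample values (including the `configs/gpt.yaml` path) are assumptions, and the actual `main.py` may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the three flags shown in the CLI command."""
    parser = argparse.ArgumentParser(description="Run a ScratchNLP algorithm")
    parser.add_argument("--config_path", required=True, help="Path to the YAML config")
    parser.add_argument("--algo", required=True, help="Algorithm name")
    parser.add_argument("--log_folder", required=True, help="Output folder")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere;
# the values below are placeholders, not files shipped with the project
args = build_parser().parse_args(
    ["--config_path", "configs/gpt.yaml", "--algo", "gpt", "--log_folder", "output"]
)
```

Passing an explicit list to `parse_args` keeps the example independent of the real command line, which also makes this kind of parser easy to unit test.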

Contributing 🤝

Contributions are always welcome!

See CONTRIBUTING.md for ways to get started.

Acknowledgements 💡

I referred to many online resources while building this project and am collecting them in RESOURCES.md. Thanks to everyone who created those blogs, code, and datasets 😊.

Thanks to the CS224N course, which gave me the motivation to start this project.

About Me 👤

I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's degree in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications.

Connect with me

Lessons Learned 📌

Most of what went into this project was new to me. These are my major learnings from building it:

- NLP Algorithms
- Research paper Implementation
- Designing Project structure
- Documentation
- GitHub pages
- PIP packaging

License ⚖️

Feedback 📣

If you have any feedback, please reach out to me at [email protected]
