Tutorials and in-depth analyses of Natural Language Processing (NLP) techniques and applied NLP
ADD PROJECT DESCRIPTION + TWO LINES ABOUT MLJC
You can either get a local copy by downloading this repo, or use Google Colaboratory by pasting in the link to the notebook (.ipynb file) of your choice.
Install Miniconda
Please go to the Anaconda website. Download and install the latest Miniconda version for Python 3.8 for your operating system.
wget <http:// link to miniconda>
sh <miniconda*.sh>
Download This Repo
git clone https://github.com/MachineLearningJournalClub/LearningNLP
Setup Conda Environment
IN THE END WE CAN SETUP A CONDA ENVIRONMENT AND EXPORT REQUIREMENTS (NEEDED LIBRARIES)
Change directory (cd) into the LearningNLP folder, then type:
# cd LearningNLP
conda env create -f environment.yml
conda activate LNLP
- Sentiment Analysis with Logistic Regression
- Sentiment Analysis with Naive Bayes
- Word Vectorizing (CountVectorizer in Scikit-learn)
- Some Explainability Methods
- Dataset: ArXiv from Kaggle
- Binary classification: Scikit-learn's CountVectorizer + TfidfTransformer
- Explainability Methods: LIME, SHAP
Useful references for explainability methods:
- LIME: "Why Should I Trust You?": Explaining the Predictions of Any Classifier
- SHAP: A Unified Approach to Interpreting Model Predictions
- Adversarial attacks (have you heard of them?), i.e. how to fool algorithms: Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
- Open Questions for you:
- How to deal with multiclass problems?
- Try to develop binary classification with abstracts instead of titles
- Try to develop the same pipeline with spaCy
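The binary classification step listed above (CountVectorizer + TfidfTransformer) can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the titles and labels are made-up toy data rather than the ArXiv dataset, and the choice of logistic regression as the final classifier is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for paper titles. The labels (1 = physics, 0 = computer science)
# are invented for illustration and are not taken from the ArXiv dataset.
titles = [
    "quantum entanglement in superconducting qubits",
    "dark matter halos and galaxy rotation curves",
    "neutrino oscillations in dense matter",
    "gravitational waves from binary black holes",
    "deep neural networks for image classification",
    "efficient graph algorithms for shortest paths",
    "transformer language models for machine translation",
    "reinforcement learning for robot control",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

clf = Pipeline([
    ("counts", CountVectorizer()),    # raw token counts per title
    ("tfidf", TfidfTransformer()),    # reweight counts by inverse document frequency
    ("model", LogisticRegression()),  # linear binary classifier (assumed choice)
])
clf.fit(titles, labels)

new_titles = ["black holes and dark matter", "neural networks for translation"]
predictions = clf.predict(new_titles)
probabilities = clf.predict_proba(new_titles[:1])  # class probabilities for the first title
print(predictions, probabilities)
```

Wrapping the steps in a `Pipeline` keeps the vectorizer vocabulary and the classifier fitted together, which is also convenient when passing a single prediction function to explainability tools.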
- Bias & Fairness in NLP (Ethics and Machine Learning)
- Gender Framing (in Political Tweets)
- Political Party Prediction
- Topic Modeling - Latent Dirichlet Allocation (LDA)
We'd like to introduce some ethical concerns in ML, and especially in NLP. The idea is to start a long-term project on Bias & Fairness in Machine Learning: intrinsic problems in our data can create inequalities in the real world. (Have you watched "Coded Bias" on Netflix?)
- Dataset: we created a dataset by scraping tweets from some US politicians
- Preprocessing: pandas, nltk, gensim
- Binary classification: Scikit-learn's CountVectorizer + TfidfTransformer
- Topic Modeling with Latent Dirichlet Allocation (LDA) + visualization. Some educational material on LDA: L. Serrano, part 1 on LDA; L. Serrano, part 2, How to train LDA
In the following two notebooks we focus on a Kaggle competition: the CommonLit Readability Prize
- Exploratory Data Analysis
You can directly run it on Kaggle
- Pretrained Word2Vec model, feature extraction
- Dimensionality Reduction and visualization with UMAP
- Naive Word2Vec Augmentation
- Global Vectors for word representations (GloVe), Stanford NLP
- fastText, skip-gram vs CBOW
- Bias in Word Embeddings (Gender + Ethnic Stereotypes) with WEFE
- Bias in Word Embeddings: What causes it?
- Understanding Bias in Word Embeddings, ICML paper + code
- Employing The Word Embedding Fairness Evaluation Framework (WEFE): WEAT, (RIPA?)
- Debiasing Word Embeddings: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", code
- Biasing a simple model: how can we deliberately bias a model by injecting biased information into it? What can we learn from this? How is this useful for debiasing?
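To make the WEAT metric mentioned above more concrete, here is a hand-written sketch of its effect size in NumPy. It does not use the WEFE API; the 2-dimensional vectors are hypothetical toy "embeddings", and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus mean similarity to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of target sets X and Y, in standard-deviation units."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    # Standard deviation over all target words; ddof=1 is an assumption of this sketch.
    pooled_std = np.std(s_X + s_Y, ddof=1)
    return float((np.mean(s_X) - np.mean(s_Y)) / pooled_std)

# Hypothetical toy vectors: A and X lean towards the first axis, B and Y towards the second.
A = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]  # attribute set A
B = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]  # attribute set B
X = [np.array([0.8, 0.2]), np.array([0.7, 0.3])]  # target set X
Y = [np.array([0.2, 0.8]), np.array([0.3, 0.7])]  # target set Y

effect = weat_effect_size(X, Y, A, B)
print(f"WEAT effect size: {effect:.3f}")
```

A positive value means the words in X are, on average, closer to A than the words in Y are. With real embeddings, the four sets would be word lists such as career/family terms and male/female names.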
In the following two notebooks we focus on a Kaggle competition: the CommonLit Readability Prize
- Data Augmentation
In the following notebooks (in this GitHub repo) we outline our solution for the CommonLit Readability Prize
- Finetuning Sentence Transformers models (Roberta family) in PyTorch
- Possible strategies for data augmentation
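One possible naive augmentation strategy is to swap some words for close neighbours in an embedding space. The sketch below is only an illustration and not necessarily what the notebooks do: `NEIGHBOURS` is a hypothetical hand-written table standing in for nearest neighbours from a pretrained Word2Vec model, and `augment` is an illustrative helper name.

```python
import random

# Hypothetical nearest-neighbour table. In practice these candidates would come
# from a pretrained embedding model rather than being written by hand.
NEIGHBOURS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}

def augment(sentence, neighbours, replace_prob=0.5, seed=None):
    """Return a copy of the sentence with some known words swapped for a neighbour."""
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    out = []
    for token in sentence.split():
        candidates = neighbours.get(token.lower())
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(token)  # unknown words are kept unchanged
    return " ".join(out)

original = "the quick dog is happy"
augmented = augment(original, NEIGHBOURS, replace_prob=1.0, seed=0)
print(original, "->", augmented)
```

Because replacements keep the sentence length and word order, the original label (here, a readability score) is simply reused for the augmented text. Whether that is appropriate depends on how much the swapped words change the difficulty of the passage.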
See the open issues for a list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See LICENSE for more information.
Simone Azeglio - email : [email protected] - linkedin
Luca Bottero - email : [email protected] - linkedin
Marina Rizzi - email : - linkedin
Alessio Borriero - email : [email protected] - linkedin
Micol Olocco - email : - linkedin
Project Link: https://github.com/MachineLearningJournalClub/LearningNLP