ML6 NLP Quick Tips

Current content:

  • Multilingual Sentence Embeddings (21/01/2021): Gives an overview of various current multilingual sentence embedding techniques and tools, and how they compare given various sequence lengths.

  • spaCy 3.0 (01/02/2021): spaCy 3.0 has just been released, and in this tip we'll have a look at some of the new features. We'll train a German NER model and streamline the end-to-end pipeline using the brand new spaCy projects!

  • Compact transformers (26/02/2021): Bigger isn't always better. In this tip we look at some compact BERT-based models that provide a nice balance between computational resources and model accuracy.

  • Keyword Extraction with pke (18/03/2021): The KEYNG (read king) is dead, long live the KEYNG! In this tip we look at pke, an alternative to Gensim for keyword extraction.

  • Explainable transformers using SHAP (22/04/2021): BERT, explain yourself! 📖 Up until recently, language model predictions have lacked transparency. In this tip we look at SHAP, a way to explain the predictions of your latest transformer-based models.

  • Transformer-based Data Augmentation (18/06/2021): Ever struggled with having a limited non-English NLP dataset for a project? Fear not, data augmentation to the rescue ⛑️ In this week's tip, we look at backtranslation 🔀 and contextual word embedding insertions as data augmentation techniques for multilingual NLP.

  • Long range transformers (14/07/2021): Above and beyond the 512! 🏅 In this week's tip, we look at novel long-range transformer architectures and compare them against the well-known RoBERTa model.

  • Neural Keyword Extraction (10/09/2021): Neural Keyword Extraction 🧠 In this week's tip, we look at neural keyword extraction methods and how they compare to classical methods.

  • HuggingFace Optimum (12/10/2021): HuggingFace Optimum Quantization ✂️ In this week's tip, we take a look at the new HuggingFace Optimum package to check out some model quantization techniques.

  • Text Augmentation using large-scale LMs and prompt engineering (25/11/2021): Typically, the more data we have, the better performance we can achieve 🤙. However, it is sometimes difficult and/or expensive to annotate a large amount of training data 😞. In this tip, we leverage three large-scale LMs (GPT-3, GPT-J and GPT-Neo) to generate very realistic samples from a very small dataset.

  • Gender debiasing of datasets using CDA (25/01/2022): A lot of large language models are trained on web text. However, this means that unintended biases can sneak into your model's behaviour 😞. In this tip, we'll look at how to alleviate this bias using Counterfactual Data Augmentation ⚖️.

  • GPT2 Quantization using ONNXRuntime (19/04/2022): Large language models are costly to run. In this notebook, we leverage ONNXRuntime to quantize our Dutch GPT2 model and run it more efficiently 💰.
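
To give a feel for one of the techniques listed above, here is a minimal sketch of the idea behind Counterfactual Data Augmentation: every sentence that contains a gendered word gets a copy with that word swapped, so both variants appear in the training data. This is not the code from the notebook. The names `GENDER_PAIRS`, `counterfactual` and `augment` are hypothetical, and the word list is deliberately tiny. Ambiguous words such as "her" (which can map to "him" or "his") are left out on purpose, since handling them would likely need part-of-speech information.

```python
import re

# Hypothetical, deliberately tiny swap list (an assumption for illustration).
# A real project would need a curated lexicon and handling of ambiguous words.
GENDER_PAIRS = {
    "he": "she",
    "she": "he",
    "man": "woman",
    "woman": "man",
    "father": "mother",
    "mother": "father",
}

_TOKEN = re.compile(r"\b\w+\b")


def counterfactual(sentence: str) -> str:
    """Return the sentence with every listed gendered word swapped."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = GENDER_PAIRS.get(word.lower())
        if replacement is None:
            return word  # not a gendered word we know about
        # Preserve a leading capital letter, e.g. "He" -> "She".
        return replacement.capitalize() if word[0].isupper() else replacement

    return _TOKEN.sub(swap, sentence)


def augment(sentences: list[str]) -> list[str]:
    """Keep each original sentence and add its counterfactual if it differs."""
    augmented = []
    for sentence in sentences:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


if __name__ == "__main__":
    for line in augment(["He is a father.", "It rains."]):
        print(line)
```

Sentences without any listed word pass through unchanged, so the augmented dataset only grows where a swap actually happened.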