Author: Lorenzo Balzani
balzanilo[at]icloud[dot]com
This code was originally written as the outcome of a challenge for an NLP-focused internship.
Given a dataset in which each sentence (by human judgment) reveals some sensitive personal information about a person, write an ML algorithm that clusters the data into different types of sensitive personal information and automatically assigns a label to each cluster.
Use a Jupyter notebook and Python 3.x.
Hints:
- focus on SOTA approaches; avoid outdated techniques when a better alternative exists.
- write clean code; it will be evaluated.
- write a conclusion summarizing your findings.
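As a rough illustration of the kind of SOTA pipeline these hints point to, the sketch below runs BERTopic (the library used later in this notebook) on a public stand-in corpus. It is not necessarily the exact configuration used here, and the 20 Newsgroups data merely substitutes for the sensitive-sentence dataset:

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in corpus: the real sensitive-sentence dataset is not bundled here,
# so 20 Newsgroups is used purely to show the shape of the pipeline.
docs = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes")
).data[:2000]

# BERTopic embeds each document, clusters the embeddings (HDBSCAN by default),
# and derives a keyword-based label for each cluster automatically.
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Inspect the discovered clusters and their automatically generated labels.
print(topic_model.get_topic_info().head())
```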
I chose Google Colab because it provides a free GPU, so the notebook assumes one is available. Without a GPU, some packages will not work (cuML) or will run more slowly (BERTopic); in that case, install the standard (CPU) versions of those packages instead.
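A quick way to confirm the runtime actually has a GPU before installing the GPU-specific packages is a check like the one below (PyTorch is pre-installed on Colab; adapt it if you use a different stack):

```python
import torch

# Report whether a CUDA-capable GPU is visible (on Colab: Runtime ->
# Change runtime type -> GPU). Without one, fall back to the standard
# CPU package versions as noted above.
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected: use the standard (CPU) package versions.")
```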
The visualizations (both interactive and static) may not render properly in the saved notebook. Therefore, you might want to re-run the notebook to regenerate them.
If you prefer not to re-train the model, download the sensitive-text-clustering model from the Hugging Face Hub, as sketched below.
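A minimal sketch of the download, assuming the `huggingface_hub` package is installed; the repo id is a placeholder because the hosting account is not specified here:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace <username> with the account hosting the
# sensitive-text-clustering model on the Hugging Face Hub.
local_dir = snapshot_download(repo_id="<username>/sensitive-text-clustering")
print(f"Model files downloaded to: {local_dir}")

# Recent BERTopic versions can also load a model directly from the Hub, e.g.:
#   from bertopic import BERTopic
#   topic_model = BERTopic.load("<username>/sensitive-text-clustering")
```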