school-website-classification

Pet project: a comparison of different algorithms for school-website classification, plus text clustering and a Streamlit app

The final visualization, in the form of a Streamlit web service, is here


This pet project grew out of a small closed hackathon at the Summer School (ML&Text track, 2023); for that reason I have published only the model weights, without the training data.

The training dataset contained 3630 texts from archived websites. The texts are heavily polluted with escape sequences, HTML markup and punctuation. Each text is labeled according to whether it comes from a school website: 1 if yes, 0 if not.
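The cleaning done in the notebooks is not reproduced here, but a minimal sketch of the kind of preprocessing such polluted texts need might look like this (the exact rules used in the project may differ):

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML markup, escape sequences and punctuation, then normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML tags
    text = re.sub(r"&[a-z]+;|&#\d+;", " ", text)   # drop HTML entities like &nbsp;
    text = re.sub(r"[\n\r\t]+", " ", text)         # drop escape symbols
    text = re.sub(r"[^\w\s]", " ", text)           # drop punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text("<p>Hello,&nbsp;World!\n</p>"))
```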

What did I do?

  • using 5-fold cross-validation, I compared:
    • several classical classification algorithms:
      • Logistic Regression
      • Naive Bayes
      • KNN (k-nearest neighbours)
      • Linear SVM (support vector machine)
    • different vectorization models:
      • BOW (Bag of Words), with and without q-grams
      • TF-IDF, with and without q-grams
    • preprocessed and non-preprocessed data
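The grid of comparisons can be sketched with scikit-learn pipelines; the toy texts and the F1 scoring choice below are illustrative assumptions, not the project's actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the school / non-school texts
texts = ["school timetable for pupils", "buy cheap phones online",
         "our school library news", "casino bonus codes",
         "teachers and classes page", "discount electronics shop"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

vectorizers = {"bow": CountVectorizer(), "tfidf": TfidfVectorizer()}
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "nb": MultinomialNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svm": LinearSVC(),
}

# Mean F1 over 5 folds for every vectorizer/classifier pair
scores = {}
for v_name, vec in vectorizers.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(vec, clf)
        scores[f"{v_name}+{c_name}"] = cross_val_score(
            pipe, texts, labels, cv=5, scoring="f1").mean()

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```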

  • I checked whether text preprocessing improves model quality

  • Many words in the data are split into parts by spaces, so I checked whether q-grams can improve the metrics
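The intuition behind the q-gram check can be illustrated with character n-grams in scikit-learn (the `char_wb` analyzer and the 3-5 range below are my assumptions, not the project's exact configuration): word-level features see nothing in common between an intact word and its space-broken version, while character q-grams still overlap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["school website", "scho ol web site"]  # second text simulates words broken by spaces

word_vec = TfidfVectorizer(analyzer="word")
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

# Cosine similarity between the two texts under each representation
word_sim = cosine_similarity(word_vec.fit_transform(docs))[0, 1]
char_sim = cosine_similarity(char_vec.fit_transform(docs))[0, 1]
print(word_sim, char_sim)  # character q-grams recover the lost overlap
```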

  • I did a frequency analysis of the data and compared it with TF-IDF features

  • I tried clustering the non-school websites:

    • I estimated the optimal number of clusters by several methods:
      • AgglomerativeClustering with automatic determination of the number of clusters
      • a dendrogram of the clusters
      • comparing the silhouette score for different numbers of clusters
    • I reduced the dimensionality of the vector spaces with IncrementalPCA and made an interactive scatter plot
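The silhouette-based selection and the IncrementalPCA projection can be sketched as follows; synthetic blobs stand in for the actual text vectors, so the numbers are purely illustrative:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import IncrementalPCA
from sklearn.metrics import silhouette_score

# Stand-in for the vectorized non-school texts
X, _ = make_blobs(n_samples=200, centers=4, n_features=20, random_state=0)

# Compare silhouette scores for different numbers of clusters
sil = {k: silhouette_score(X, AgglomerativeClustering(n_clusters=k).fit_predict(X))
       for k in range(2, 8)}
best_k = max(sil, key=sil.get)

labels = AgglomerativeClustering(n_clusters=best_k).fit_predict(X)

# Reduce dimensionality to 2D for the interactive scatter plot
coords = IncrementalPCA(n_components=2).fit_transform(X)
print(best_k, coords.shape)
```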

  • I visualized all this work in a Streamlit web service (ru), where you can use the different pre-trained models to determine whether a text comes from a school website and inspect the metrics

The streamlit_app folder contains the requirements, the web-service code, the training script and 32 classification models with different configurations.
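A web service built on saved scikit-learn pipelines typically round-trips them through joblib; the file name below is hypothetical, not one of the repository's actual model files:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train and save a tiny pipeline, roughly the way a training script might
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(["school timetable", "online casino"], [1, 0])
path = os.path.join(tempfile.gettempdir(), "tfidf_logreg.joblib")  # hypothetical name
joblib.dump(pipe, path)

# The web service can then load a pretrained model and classify new text
model = joblib.load(path)
print(model.predict(["school library news"]))
```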

The research folder contains my notebooks, where I tested hypotheses, did the frequency analysis and word clustering, and explained my decisions, problems, possible solutions and a few issues I couldn't fix.
