school-website-classification

Pet project: a comparison of different algorithms for school-website classification, plus text clustering and a Streamlit app

The final visualization, in the form of a Streamlit web service, is here


This pet project grew out of a small closed hackathon at the Summer School (ML&Text track, 2023); for that reason I have published only the model weights, without the training data.

The training dataset contained 3630 texts from archived websites. The texts are heavily polluted with escape sequences, HTML markup and punctuation. Each text is labeled according to whether it comes from a school website: 1 if yes, 0 if not.
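The cleaning done in the notebooks is not reproduced here, but a minimal sketch of the kind of preprocessing such polluted texts need might look like this (the exact rules used in the project may differ):

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML markup, escape sequences and punctuation, then normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)           # drop HTML tags
    text = re.sub(r"&[a-z]+;|&#\d+;", " ", text)   # drop HTML entities like &nbsp;
    text = re.sub(r"[\n\r\t]+", " ", text)         # drop escape symbols
    text = re.sub(r"[^\w\s]", " ", text)           # drop punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text("<p>Hello,&nbsp;World!\n</p>"))
```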

What did I do?

  • using 5-fold cross-validation, I compared:
    • several classical classification algorithms:
      • Logistic Regression
      • Naive Bayes
      • KNN (k-nearest neighbours)
      • Linear SVM (support vector machine)
    • different vectorization models:
      • BOW (Bag of Words), with and without q-grams
      • TF-IDF, with and without q-grams
    • preprocessed and non-preprocessed data
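The grid of comparisons can be sketched with scikit-learn pipelines; the toy texts and the F1 scoring choice below are illustrative assumptions, not the project's actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the school / non-school texts
texts = ["school timetable for pupils", "buy cheap phones online",
         "our school library news", "casino bonus codes",
         "teachers and classes page", "discount electronics shop"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

vectorizers = {"bow": CountVectorizer(), "tfidf": TfidfVectorizer()}
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "nb": MultinomialNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svm": LinearSVC(),
}

# Mean F1 over 5 folds for every vectorizer/classifier pair
scores = {}
for v_name, vec in vectorizers.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(vec, clf)
        scores[f"{v_name}+{c_name}"] = cross_val_score(
            pipe, texts, labels, cv=5, scoring="f1").mean()

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```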

  • I checked whether text preprocessing improves model quality

  • Many words in the data are split into parts by spaces, so I checked whether q-grams can improve the metrics
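The intuition behind the q-gram check can be illustrated with character n-grams in scikit-learn (the `char_wb` analyzer and the 3-5 range below are my assumptions, not the project's exact configuration): word-level features see nothing in common between an intact word and its space-broken version, while character q-grams still overlap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["school website", "scho ol web site"]  # second text simulates words broken by spaces

word_vec = TfidfVectorizer(analyzer="word")
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

# Cosine similarity between the two texts under each representation
word_sim = cosine_similarity(word_vec.fit_transform(docs))[0, 1]
char_sim = cosine_similarity(char_vec.fit_transform(docs))[0, 1]
print(word_sim, char_sim)  # character q-grams recover the lost overlap
```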

  • I did a frequency analysis of the data and compared it with TF-IDF features

  • I tried clustering the non-school websites:

    • I estimated the optimal number of clusters by several methods:
      • AgglomerativeClustering with automatic determination of the number of clusters
      • a dendrogram of the clusters
      • comparing the silhouette score for different numbers of clusters
    • I reduced the dimensionality of the vector spaces with IncrementalPCA and made an interactive scatter plot
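The silhouette-based selection and the IncrementalPCA projection can be sketched as follows; synthetic blobs stand in for the actual text vectors, so the numbers are purely illustrative:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import IncrementalPCA
from sklearn.metrics import silhouette_score

# Stand-in for the vectorized non-school texts
X, _ = make_blobs(n_samples=200, centers=4, n_features=20, random_state=0)

# Compare silhouette scores for different numbers of clusters
sil = {k: silhouette_score(X, AgglomerativeClustering(n_clusters=k).fit_predict(X))
       for k in range(2, 8)}
best_k = max(sil, key=sil.get)

labels = AgglomerativeClustering(n_clusters=best_k).fit_predict(X)

# Reduce dimensionality to 2D for the interactive scatter plot
coords = IncrementalPCA(n_components=2).fit_transform(X)
print(best_k, coords.shape)
```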

  • I visualized all this work in a Streamlit web service (ru), where you can use the different pre-trained models to determine whether a text comes from a school website and inspect the metrics

The streamlit_app folder contains the requirements, the web-service code, the training script and 32 classification models with different configurations.
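A web service built on saved scikit-learn pipelines typically round-trips them through joblib; the file name below is hypothetical, not one of the repository's actual model files:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train and save a tiny pipeline, roughly the way a training script might
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(["school timetable", "online casino"], [1, 0])
path = os.path.join(tempfile.gettempdir(), "tfidf_logreg.joblib")  # hypothetical name
joblib.dump(pipe, path)

# The web service can then load a pretrained model and classify new text
model = joblib.load(path)
print(model.predict(["school library news"]))
```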

The research folder contains my notebooks, where I tested hypotheses, did the frequency analysis and word clustering, and explained my decisions, problems, possible solutions and a few issues I couldn't fix.
