(last updated 12/2 8am ET)
Refer to CIS_520_Final_Project_Report.pdf for a summary of the methods and findings of this project.
The Messenger group chat the members created for this project was initially called the DP Squad; the name Squid was inspired by Squad.
The members of this project team have extensive connections with Penn's student newspaper, The Daily Pennsylvanian (DP). Every member has at some point worked on the staff of the paper's data analytics department, and one member ran the department in 2019. During our time at the DP, we saw the potential of machine learning to improve the operations of the paper, which is increasingly moving toward a digital-first format as it struggles to find a viable business model in the twenty-first century. As it becomes less common for students to pick up a copy of the paper while moving around campus, the DP is striving to build an online readership base in order to remain relevant to students and continue earning revenue from advertisers. Doing so requires the paper to understand its audience and predict the performance of its articles, which is where machine learning models come in. With a model that predicts how news articles will perform, the DP can shape its social media posting strategy, curate its daily and weekly newsletters, and decide how to position articles on its website. This would help the paper increase its pageviews, keeping its work a part of the campus conversation and attracting more revenue opportunities.
Classification:
- Classify articles into view-count quintiles (ranked wrt articles published in the same month); see the label-construction sketch after the regression tasks below
Regression:
- Predict an article's views percentile wrt articles published in the same month (reasoning: our ultimate purpose is to rank articles published within a certain timeframe to decide what goes into the headline slot and what doesn't; David told us the model should be useful for ranking)
- Predict an article's raw view count (reasoning: some months may simply have weaker articles overall, so predicting actual views makes sense too)
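For concreteness, here is a minimal pandas sketch of how both targets could be constructed; the DataFrame and its column names ('views', 'published_at') are illustrative assumptions, not the project's actual schema.

```python
import pandas as pd

# Toy data standing in for the real article table
df = pd.DataFrame({
    "views": [120, 950, 40, 300, 720, 15],
    "published_at": ["2019-01-05", "2019-01-20", "2019-01-28",
                     "2019-02-02", "2019-02-14", "2019-02-21"],
})
df["published_at"] = pd.to_datetime(df["published_at"])
month = df["published_at"].dt.to_period("M")

# Regression target: percentile of views within the publication month
df["views_percentile"] = df.groupby(month)["views"].rank(pct=True) * 100

# Classification target: quintile label (0 = bottom 20%, 4 = top 20%)
df["views_quintile"] = pd.cut(df["views_percentile"],
                              bins=[0, 20, 40, 60, 80, 100],
                              labels=False, include_lowest=True)
```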
Data: https://drive.google.com/drive/folders/14UH75BSa7sFZ17ZiX-kGvvOdsZJLeGYx?usp=sharing
Deep Learning
- train-test-split.ipynb - merges the formatted content and views data, runs the train/test split (stratified by year and month of publication), and creates train.csv and test.csv
- Word Embeddings.ipynb - trains Word2Vec word embeddings on the content data; outputs 'word2vec_train2.txt', which is used for the DL models (see the sketch after this list)
- DL_Quintile_Classification.ipynb - initial deep learning notebook: loads the existing word embeddings to build an embedding matrix, trains an RNN model to classify articles into quintiles, evaluates the model (confusion matrix), and extracts explainable insights using Eli5
- DL_Percentile_Classification.ipynb - regression deep learning model that predicts an article's views percentile (wrt articles published in that month)
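The notebook itself isn't reproduced here, but the core of the Word2Vec training step in Word Embeddings.ipynb presumably looks something like this sketch; the tokenization and all hyperparameters are assumptions.

```python
from gensim.models import Word2Vec

# Toy tokenized articles standing in for the real training corpus
tokenized_articles = [
    ["penn", "students", "protest", "tuition", "increase"],
    ["quakers", "win", "season", "opener", "at", "the", "palestra"],
]

# Hyperparameters are placeholders; note that gensim < 4.0 calls the
# `vector_size` argument `size` instead.
w2v = Word2Vec(sentences=tokenized_articles, vector_size=100,
               window=5, min_count=1, workers=4, seed=520)

# Save in the plain-text word2vec format that the DL notebooks load
w2v.wv.save_word2vec_format("word2vec_train2.txt")
```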
Deep Learning Results so far
Model specification: GRU (32 units, 0.2 dropout), 0.25 dropout in embedding
- Classification: ~39.1% accuracy on the 5-quintile classification; 75.9% off-by-one accuracy (predicted quintile within one of the true quintile)
- Regression on percentiles: mean absolute error of 16.6 percentile points
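For reference, here is a Keras sketch consistent with the specification above; the vocabulary size, embedding dimension, and the use of SpatialDropout1D to implement the embedding dropout are our assumptions, not the exact notebook code.

```python
import numpy as np
from tensorflow.keras import layers, models, initializers

VOCAB_SIZE, EMB_DIM = 20000, 100  # assumed dimensions

# In the real notebook this matrix is built from word2vec_train2.txt;
# random values are used here just to keep the sketch self-contained.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMB_DIM))

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM, trainable=False,
                     embeddings_initializer=initializers.Constant(embedding_matrix)),
    layers.SpatialDropout1D(0.25),          # "0.25 dropout in embedding"
    layers.GRU(32, dropout=0.2),            # GRU (32 units, 0.2 dropout)
    layers.Dense(5, activation="softmax"),  # 5 quintile classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

def off_by_one_accuracy(y_true, y_pred_classes):
    """Counts a prediction as correct if it lands in the true quintile
    or an adjacent one (the 75.9% figure above)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred_classes)) <= 1)
```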
Deep Learning: To look into next
- Using titles as inputs instead of the full article content
- Regression on actual article views
- Using an LSTM (training overnight)
- Using other word embedding methods, e.g. FastText (12/3 or 12/4)
- Using pre-trained language models, e.g. BERT, ELMo (12/3 or 12/4)
- Training DL models on the residuals from the supervised learning step (after accounting for seasonality, time spent in headlines, and other factors)
Supervised Learning
Variables to use: topic probabilities (from LDA) and/or bag-of-words (count or TF-IDF vectorizer), metadata (tags, duration on front page), temporal features (day published, month, year), author
Methods: ridge regression / logistic regression (simplest baseline; see the sketch below), RF, GB, SVM (with hyperparameter tuning)
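As a concrete example of the simplest baseline in this list, a TF-IDF bag-of-words fed into ridge regression with tuned regularization might look like the sketch below; the feature dimensions and the alpha grid are illustrative, not tuned values.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Bag-of-words (TF-IDF) -> ridge regression on the percentile target
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english")),
    ("ridge", Ridge()),
])
search = GridSearchCV(pipeline,
                      param_grid={"ridge__alpha": [0.1, 1.0, 10.0, 100.0]},
                      scoring="neg_mean_absolute_error", cv=5)
# search.fit(train_texts, train_percentiles)  # raw article text + targets
```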
Unsupervised Learning
- LDA_worksheet.ipynb - conducts LDA and NMF to get topics
To do
- Retraining LDA on the training data (see Google Drive)
- Selecting the best number of topics (k) using topic coherence (see the sketch below)
- Obtaining topic probabilities for every article (to pass to Peter & James for supervised learning)
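A hedged gensim sketch of the coherence-based selection of k and the extraction of per-article topic probabilities; the toy corpus and the candidate range for k are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized documents standing in for the real training corpus
texts = [["tuition", "protest", "students", "penn"],
         ["basketball", "quakers", "palestra", "win"],
         ["tuition", "budget", "students", "aid"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Score candidate topic counts by c_v coherence and keep the best k
scores = {}
for k in (2, 3):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   random_state=520, passes=5)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()
best_k = max(scores, key=scores.get)

# Per-article topic probabilities to hand off for supervised learning
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=best_k,
               random_state=520, passes=5)
doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0)
              for bow in corpus]
```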
Ideas in General
- How can we incorporate the time-varying effect of different topics on article views?
- Idea: run a rolling-window sample (3 months) and plot how the fitted parameters change over time (we are especially interested in how the topic probabilities or keyword importances change over time); a sketch follows below
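A minimal sketch of the rolling-window idea, assuming topic probabilities as features and a ridge model; all data here is synthetic, and the window logic is the point.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(520)
month_idx = rng.integers(0, 12, size=300)      # publication month per article
X = rng.random((300, 5))                       # e.g. 5 LDA topic probabilities
y = X @ np.array([30, -10, 5, 0, 15]) + rng.normal(0, 5, 300)

# Refit on each 3-month window and record the coefficient vector
coef_paths = {}
for start in range(12 - 2):                    # windows [start, start + 2]
    mask = (month_idx >= start) & (month_idx <= start + 2)
    model = Ridge(alpha=1.0).fit(X[mask], y[mask])
    coef_paths[start + 2] = model.coef_        # indexed by window end month

coef_df = pd.DataFrame(coef_paths).T           # rows: windows, cols: topics
# coef_df.plot() shows how each topic's estimated effect drifts over time
```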