Similar to Facebook, WeChat has an information-flow section that allows users to see their friends' posts and updates. It is not surprising that plenty of influencer accounts are active on WeChat's information flow, trying to spread news, provide information, promote products and services, and so on.
I am interested in whether I can predict the number of "likes" on the content created by influencers. In particular, I am interested in the performance of features extracted with NLP techniques. There is surprisingly little data-science research or insight on this question. In addition, I am also interested in whether textual features can help a machine classify whether a post or article from an influencer contains advertisement.
I scraped 5,517,352 articles from influencer accounts on WeChat. Depending on the task, I used subsets of the dataset of different sizes for the analysis. A sample of the dataset is shown below.
I used the number of likes on the articles and the ad labels as the dependent variables for the analysis. There are several existing features on each article that can be used as predictors:
clicksCount: number of clicks on the article
originalFlag: whether the article was originally created (i.e., not shared) by the account
orderNum: WeChat allows influencers to upload several articles as a group each day, and the influencer decides the order of the articles within the group. Usually, the first article receives the most views.
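To make the schema concrete, here are a few hypothetical rows (all values are made up for illustration):

```python
import pandas as pd

# Hypothetical rows illustrating the schema described above; values are invented.
sample = pd.DataFrame({
    "clicksCount":  [10523, 842, 3301],   # number of clicks on the article
    "originalFlag": [1, 0, 1],            # 1 = originally created, 0 = shared
    "orderNum":     [1, 2, 1],            # position of the article in the day's group
    "likeCount":    [312, 17, 95],        # dependent variable: number of likes
})
print(sample)
```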
For predicting the likes count, I used mean squared error (MSE) and mean absolute percentage error (MAPE) as the evaluation metrics of the regression task. For the ad classification task, I mainly used prediction accuracy as the evaluation metric. Since I am interested in both false positive and false negative cases and the labels are relatively balanced, accuracy is a reasonable metric to describe the prediction results.
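For reference, the two regression metrics take only a few lines of NumPy. Note that MAPE blows up when the true values are at or near zero, which is likely why the daily baselines below report astronomically large MAPE values:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.
    Explodes when y_true contains values at or near zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```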
To get a general understanding of the content posted by the influencer accounts, I conducted topic modeling using Latent Dirichlet Allocation (LDA) (see appendix 1). Given the large computing resources required, the LDA code was run on my school's GPU server.
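A minimal sketch of this kind of LDA pipeline using gensim (the full code is in appendix 1; the hyperparameters here are illustrative, and `docs` is assumed to hold the tokenized articles):

```python
from gensim import corpora
from gensim.models import LdaModel

# docs: list of tokenized articles, e.g. [["word1", "word2", ...], ...]
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)  # drop rare/ubiquitous tokens
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=20, passes=5, random_state=0)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)  # top word loadings per topic
```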
Examples of the word loadings within each topic derived from the LDA topic modeling (manually translated into English for illustration purposes)
Plotting the correlation between each topic and likeCount over time shows that the popularity of different topics differs significantly over time.
I conducted a tabular analysis with the non-text predictors, fitting several models to predict the number of likes on the posted articles. Based on mean squared error, the random forest model performed best.
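A sketch of this kind of model comparison with scikit-learn, assuming `df` holds the article-level data with the columns described above (hyperparameters are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Non-text predictors and the dependent variable.
X = df[["clicksCount", "originalFlag", "orderNum"]]
y = df["likeCount"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    model.fit(X_train, y_train)
    print(name, "MSE:", mean_squared_error(y_val, model.predict(X_val)))
```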
I then analyzed the likes count as a time series with day as the time unit, which gives 1,920 days of data.
I first adopted a naive approach, which assumes that the next expected point is equal to the last observed point.
Naive forecast MSE: 40500.711534
Naive forecast MAPE: 3891825798079374.500000
The simple average method forecasts the expected value as the average of all previously observed points.
Simple average MSE: 13333.904245
Simple average MAPE: 2825187969291605.500000
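Both baselines take only a few lines; a sketch, assuming `series` is a NumPy array of the 1,920 daily likes counts (the holdout horizon is illustrative):

```python
import numpy as np

# series: 1-D NumPy array of daily likes counts; horizon is an illustrative holdout.
horizon = 30
train, test = series[:-horizon], series[-horizon:]

naive_pred = np.full(len(test), train[-1])     # naive: repeat the last observed value
avg_pred   = np.full(len(test), train.mean())  # simple average of all past observations

for name, pred in [("naive", naive_pred), ("simple average", avg_pred)]:
    print(name, "MSE:", np.mean((test - pred) ** 2))
```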
Since the original data is not stationary, it needs differencing. The results below suggest that differencing once is enough to achieve stationarity (d = 1 in the ARIMA model). The ACF and PACF plots suggest p = 13 and q = 1.
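A sketch of the stationarity check and the ARIMA fit with statsmodels, assuming `series` is a pandas Series of the daily likes counts:

```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test: a large p-value indicates non-stationarity.
print("ADF p-value (raw):    ", adfuller(series)[1])
print("ADF p-value (1st diff):", adfuller(series.diff().dropna())[1])

# p = 13 and q = 1 from the ACF/PACF plots, d = 1 from the differencing result.
fit = ARIMA(series, order=(13, 1, 1)).fit()
forecast = fit.forecast(steps=30)  # illustrative forecast horizon
```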
ARIMA MSE: 6630.75
ARIMA MAPE: 101.35
Given the MSE and MAPE results, the ARIMA model clearly performed better than the naive model and the simple average model.
Because not much trend or seasonality was observed at the day level, I analyzed the data again at the month level, which gives 64 rows in the time series.
Again, the original time series is not stationary (top figure), and it became stationary after differencing (bottom figure).
ARIMA MSE: 111725.21
ARIMA MAPE: 94.77
Given the MAPE results, forecasting a future month's likes is actually easier than forecasting a future day's likes, possibly because more time-series structure can be captured in the monthly data.
To judge whether an article is an ad, I outsourced the manual ad labelling to an agency in China. I labelled about a hundred articles as examples and asked the agency to label 10,000 more. In the data, label A means the article contains no ad, B means the whole article is an ad, and C means the main body of the article is not an ad but it includes an ad near the end.
I used the tokenized words as the predictors in the models. Using only features extracted from the text of the articles, I adopted a vanilla LSTM, a bidirectional LSTM, and a stacked LSTM to predict the ad label of the articles. The results below show relatively good prediction accuracy (> 80%) on the testing set. The bidirectional LSTM model performed best but took the longest time to converge, and it showed signs of overfitting after the second epoch.
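As a sketch, a bidirectional LSTM classifier of this kind can be set up in Keras as follows (the vocabulary size, embedding dimension, and layer widths are illustrative, not the exact architecture used):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, n_classes = 50000, 3  # three ad labels: A, B, C; sizes are illustrative

model = Sequential([
    Embedding(vocab_size, 128),        # frequency-encoded tokens -> dense vectors
    Bidirectional(LSTM(64)),           # reads the sequence in both directions
    Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)
```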
As a comparison, I trained an SVM classification model with the non-text predictors. The SVM model showed an accuracy of 0.669835, clearly lower than the accuracy achieved by the LSTM models based on text features.
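The SVM baseline is short with scikit-learn (a sketch; the RBF kernel and feature scaling are assumptions, and `X_train`/`y_train` hold the non-text predictors and ad labels):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC())  # scale features, then fit an RBF SVM
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_val, y_val))
```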
Similar to the above analysis, I adopted the three LSTM models to predict the likes count with only features extracted from the text. The MSE results below show that the bidirectional LSTM model performed best, while the other two models performed almost identically.
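The regression variant only swaps the output layer and the loss (again a sketch with illustrative sizes):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size = 50000  # illustrative

model = Sequential([
    Embedding(vocab_size, 128),
    Bidirectional(LSTM(64)),
    Dense(1),                   # single linear output for the likes count
])
model.compile(optimizer="adam", loss="mse")
```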
As a comparison, I trained a linear regression model with the non-text features. The model achieved a validation MSE of 1010437.91; the bidirectional LSTM's MSE of 828027.25 is about 18% lower.
The above LSTM analyses used only tokenized words (i.e., frequency-based encoding from Keras) as input, so they did not capture the semantic relations among the words. This analysis adopted a pretrained BERT model with some fine-tuning at the output layer, aiming for a better prediction result.
I adopted the pretrained bert-base-chinese model from the transformers package. After preprocessing the text data for input to the BERT model, I constructed the model as below. I froze the BERT layers so that I did not need to train them myself; instead, I used the semantic information in these pretrained layers to extract more insight from my own text data.
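A sketch of this construction with a frozen bert-base-chinese encoder and a small regression head (the pooled-output choice and the linear head are assumptions, not the exact architecture used):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
for p in bert.parameters():
    p.requires_grad = False  # freeze the pretrained layers; only the head trains

class BertRegressor(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.head = nn.Linear(bert.config.hidden_size, 1)  # likes-count output

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output).squeeze(-1)    # pooled [CLS] representation

model = BertRegressor(bert)
enc = tokenizer(["示例文章正文"], padding=True, truncation=True,
                max_length=512, return_tensors="pt")
pred = model(enc["input_ids"], enc["attention_mask"])
```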
Given limited computing resources, I only ran three epochs; the results are shown below. Given the still-decreasing loss over the epochs, the model is likely underfit at three epochs. Even so, the results already show a clear improvement over the linear regression and LSTM models in the previous sections: the validation MSE was 810718.6875, more than 2% lower than that of the bidirectional LSTM model. With more computing resources, the BERT model would be expected to outperform the other models by an even larger margin.