This project performs multi-label text classification on the Reuters-21578 dataset.
To run with the default data path (data), type in the command line:
python3 main.py
To run with a self-defined data path, type in the command line:
python3 main.py data_path
The program writes the output files listed below; other information is printed to standard output.
-
TOPIC_lilsst.csv:
This file contains the list of TOPIC words. These words have been removed from the bag of terms (features).
Short view:
castorseed,potato,barley,soybean,gas,crude,nickel,coconut,nkr,platinum,citruspulp,yen,cotton,dfl,copper,fishmeal,dmk,hog,jobs,lead,rubber,interest,cornglutenfeed,cruzado,inventories,grain,sugar,oat,ship,palmkernel,alum,reserves,...
-
feature_vector_1.csv:
This file contains one derived version of the selected feature vector, with a cardinality of 125.
Short view (unordered, because it is stored in a dict):
profit,exploration,quarter,earnings,rules,revs,buffer,unemployment,shipping,note,ounce,mths,index,february,consumer,...
-
feature_vector_2.csv:
This file contains one derived version of the selected feature vector, with a cardinality of 270.
Short view (unordered, because it is stored in a dict):
loss,revs,profit,company,shares,note,year,dollar,tonnes,rate,bank,corp,unemployment,march,deficit,barrels,dlrs,rates,reuter,buffer,agriculture,growth,money,record,acquisition,japan,icco,dividend,beef,aluminium,soybeans,offer,account,stake,billion,...
-
KNN_predict_class_labels_125_feature_vector.txt:
This file stores the labels predicted by the knn classifier for the feature vector with cardinality 125.
Short view (True labels -> Predicted labels):
{'gas', 'gnp'} -> ['interest']
{'ship', 'acq'} -> ['ship']
{'crude'} -> ['crude']
-
KNN_predict_class_labels_270_feature_vector.txt:
This file stores the labels predicted by the knn classifier for the feature vector with cardinality 270.
Short view (True labels -> Predicted labels):
{'gas', 'gnp'} -> ['gnp']
{'ship', 'acq'} -> ['ship']
{'crude'} -> ['crude']
-
Naive_predict_class_labels_125_feature_vector.txt:
This file stores the labels predicted by the naive classifier for the feature vector with cardinality 125.
Short view (True labels -> Predicted labels):
{'gas', 'gnp'} -> ['interest']
{'ship', 'acq'} -> ['acq']
{'crude'} -> ['crude']
-
Naive_predict_class_labels_270_feature_vector.txt:
This file stores the labels predicted by the naive classifier for the feature vector with cardinality 270.
Short view (True labels -> Predicted labels):
{'gas', 'gnp'} -> ['earn']
{'ship', 'acq'} -> ['acq']
{'crude'} -> ['crude']
-
Termination_messages.txt:
The accuracy, offline efficiency, and online efficiency figures are listed here. They are also printed when the program finishes.
========== Termination message ==========
Mission completed. We select knn classifier and naive classifer.
For feature vector with 125 cardinality:
The accuracy of knn classifier is 0.7861500248385495.
The offline efficient cost of knn classifier is 2.3052525520324707 s.
The online efficient cost of knn classifier is 322.94284439086914 s.
The accuracy of naive classifier is 0.7159463487332339.
The offline efficient cost of naive classifier is 4.73245096206665 s.
The online efficient cost of naive classifier is 139.92550945281982 s.
For feature vector with 270 cardinality:
The accuracy of knn classifier is 0.8128763040238451.
The offline efficient cost of knn classifier is 326.36808919906616 s.
The online efficient cost of knn classifier is 470.3035988807678 s.
The accuracy of naive classifier is 0.7347242921013413.
The offline efficient cost of naive classifier is 7.906782388687134 s.
The online efficient cost of naive classifier is 274.57206416130066 s.
Printed messages show the rate of progress; see the details by running the program.
A termination message is printed at the end.
main.py
is the program controller.
data_structure.py
in /data_structure, defines document objects and the static statistics we use when building models and predicting class labels.
preprocess.py
in /data_preprocess, reads data from Reuters-21578, parses the text data, and translates it into a list of document objects, each holding its class labels and feature vector. It then tokenizes the words and constructs a list of class labels and a bag of terms.
metric.py
in /metric, defines the importance metric for feature selection.
classifier.py
in /classifier, defines two classifiers, the knn classifier and the Naive Bayes classifier, including fit (build) and predict methods.
mymethods.py
stores some self-defined helper methods.
I use regular expressions to extract the content of each article and construct each news article as an object.
I use a regular expression to find all strings that contain only the characters [a-z.-], converting the text into a list of words.
For now, the words are not lemmatized or stemmed.
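The tokenization step above can be sketched as follows. This is a minimal illustration; the exact regular expression in preprocess.py may differ.

```python
import re

# A minimal sketch of the tokenization step: lowercase the text, then keep
# maximal runs of the characters [a-z.-]. This pattern is an assumption;
# preprocess.py may use a different one.
def tokenize(text):
    return re.findall(r"[a-z.-]+", text.lower())

tokens = tokenize("Crude oil prices rose, Reuter said")
```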
In this procedure, I combine chi-square, term frequency, and information entropy to do feature selection.
The formula is
ichi[term][class] = chi_2[term][class] * tf_class[term][class] * entropy[term][class] * beta[term][class]
and I use
ichi[term] = max over all classes of ichi[term][class]
as the metric of term importance.
chi_2[term][class]
is the chi-square test statistic for the term in a class.
tf_class[term][class]
is the term frequency divided by the number of documents in a class.
entropy[term][class]
is the information entropy, computed with p = tf(t, dj) / tf(t, ci).
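The combined metric can be sketched as follows. This is a hedged illustration: beta is treated here as a plain weighting factor with default 1.0, since its exact definition lives in metric.py, and the entropy is computed over per-document term frequencies with p = tf(t, dj) / tf(t, ci).

```python
import math

# class_entropy: information entropy of a term over the documents of one
# class, with p = tf(t, dj) / tf(t, ci).
def class_entropy(tf_per_doc):
    total = sum(tf_per_doc)
    probs = [tf / total for tf in tf_per_doc if tf > 0]
    return -sum(p * math.log2(p) for p in probs)

# ichi[term][class] = chi_2 * tf_class * entropy * beta.
# beta is assumed to be a simple weighting factor here.
def ichi(chi2, tf_class, tf_per_doc, beta=1.0):
    return chi2 * tf_class * class_entropy(tf_per_doc) * beta

# ichi[term]: the maximum ichi value over all classes.
def importance(ichi_per_class):
    return max(ichi_per_class.values())
```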
Sort the words by ichi value to obtain a list of words in decreasing order of importance for classification.
Select the top K words as the feature vector to get the bag of features.
feature vector = [term 1, term 2, term 3, ... , term n]
Compute the term frequency in each document and use tf-idf to construct the document's feature vector.
[tf_idf_1, tf_idf_2, tf_idf_3, ... , tf_idf_n]
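The tf-idf construction can be sketched as follows. The idf formula log(N / df) is an assumption, as preprocess.py may use a different variant; the terms and document frequencies below are hypothetical, for illustration only.

```python
import math

# Build a document's feature vector over the selected terms using tf-idf.
def tfidf_vector(doc_tokens, feature_terms, doc_freq, n_docs):
    vec = []
    for term in feature_terms:
        tf = doc_tokens.count(term)        # raw term frequency in the document
        df = doc_freq.get(term, 0)         # number of documents containing term
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf * idf)
    return vec

# Hypothetical feature terms and document frequencies.
features = ["profit", "quarter", "loss"]
df = {"profit": 10, "quarter": 50, "loss": 25}
vec = tfidf_vector(["profit", "profit", "quarter"], features, df, 100)
```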
In this project, I build two derived versions of the selected feature vector: one with a cardinality of 125 and another with a cardinality of 270.
I select two classifiers in this lab: the knn classifier and the Naive Bayes classifier.
For the knn classifier, I use Euclidean distance to find the k (k = 5) nearest neighbors, which vote for the predicted labels; this generates one or multiple labels for each test document.
The Naive Bayes classifier generates one label for each test document.
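The kNN prediction step above can be sketched as follows. This is a minimal illustration assuming a simple majority vote over the neighbors' label sets; tie-breaking and vote thresholds in classifier.py may differ, and the training set below is hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=5):
    # train: list of (feature_vector, set_of_labels) pairs.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, labels in neighbors for label in labels)
    top = max(votes.values())
    # Every label with the maximum vote count is returned, so the result
    # may contain one or multiple labels.
    return sorted(label for label, count in votes.items() if count == top)

# Tiny illustrative training set.
train = [([0.0, 0.0], {"earn"}), ([0.0, 1.0], {"earn"}),
         ([1.0, 0.0], {"earn", "acq"}), ([5.0, 5.0], {"crude"})]
predicted = knn_predict(train, [0.0, 0.5], k=3)
```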
For the feature vector with cardinality 125:
- The accuracy of the knn classifier is 0.792.
- The accuracy of the Naive Bayes classifier is 0.716.
For the feature vector with cardinality 270:
- The accuracy of the knn classifier is 0.814.
- The accuracy of the Naive Bayes classifier is 0.735.
Use the following method to measure the accuracy of the two classifiers.
-
First, extract the topics of the test documents as a list named Y_original.
-
For the knn classifier, which generates one or multiple labels for each test document: compare each predicted label with the original class labels (topics); if a predicted label appears in the original class labels, we collect it as a true label. Use
accuracy_i = (number of true labels) / (number of predicted labels)
to measure the accuracy for a test document. Then the accuracy for the classifier is
accuracy = (sum of accuracy_i over all test documents) / N
-
For the Naive Bayes classifier, which generates one label for each test document, the same comparison applies, so
accuracy_i
is 0 or 1. The accuracy for the classifier is again
accuracy = (sum of accuracy_i over all test documents) / N
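The accuracy computation above can be sketched as follows. This is a minimal illustration of the metric, not the project's actual evaluation code.

```python
# accuracy_i is the fraction of predicted labels that appear in the original
# topics; the classifier accuracy is the mean over all test documents. For a
# single-label prediction, accuracy_i is therefore 0 or 1.
def accuracy(y_original, y_predicted):
    per_doc = []
    for true_labels, pred_labels in zip(y_original, y_predicted):
        hits = sum(1 for label in pred_labels if label in true_labels)
        per_doc.append(hits / len(pred_labels))
    return sum(per_doc) / len(per_doc)

acc = accuracy([{"gas", "gnp"}, {"crude"}], [["gnp"], ["crude"]])
```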
For the feature vector with cardinality 125:
- The offline efficiency cost of the knn classifier is 0.69 s.
- The offline efficiency cost of the naive classifier is 1.41 s.
For the feature vector with cardinality 270:
- The offline efficiency cost of the knn classifier is 1.085 s.
- The offline efficiency cost of the naive classifier is 2.374 s.
For the feature vector with cardinality 125:
- The online efficiency cost of the knn classifier is 69.97 s.
- The online efficiency cost of the naive classifier is 30.056 s.
For the feature vector with cardinality 270:
- The online efficiency cost of the knn classifier is 122.966 s.
- The online efficiency cost of the naive classifier is 60.379 s.
Well, knn is really lazy and slow...