This project search for top 10 relevant articles based on input queries among 7,034 English news collected from reuters.com. The retrieved results will be shown as NewsID \t Similarity Score
, and these results will be ranked from top relevant to least relevant.
TF Weighting + Cosine Similarity
News Score
--------- ---------
News123256 0.48507125007266594
News119356 0.48507125007266594
News108578 0.4472135954999579
News120265 0.4472135954999579
News103117 0.38575837490522974
News101763 0.3768891807222045
News111959 0.3768891807222045
News115859 0.3768891807222045
News119746 0.3768891807222045
News122919 0.37267799624996495
-------------------------
$ python3 main.py --query '<querylist>'
$ python3 main.py -d <doc_list_folder> -w <weight_type> -r <relevance_type> -q '<querylist>' [-f]
-d
,--doc
: Specify search target folder.-w
,--weight
: Specify weight type, for instance:tf
ortfidf
.-r
,--relevance
: Define how to measure similarity,cs
stands for Cosine Similarity,eu
stands for Euclidean Distance.-q
,--query
: Define searching query, words seperated by(space)
.-f
,--feedback
: add this argument if want to calculate pseudo feedback.
- Term Frequency (TF) Weighting + Cosine Similarity
$ python3 main.py -d EnglishNews -w tf -r cs -q 'trump biden taiwan china'
- Term Frequency (TF) Weighting + Euclidean Distance
$ python3 main.py -d EnglishNews -w tf -r eu -q 'trump biden taiwan china'
- TF-IDF Weighting + Cosine Similarity
$ python3 main.py -d EnglishNews -w tfidf -r cs -q 'trump biden taiwan china'
- TF-IDF Weighting + Euclidean Distance
$ python3 main.py -d EnglishNews -w tfidf -r eu -q 'trump biden taiwan china'
$ python3 main.py -d EnglishNews -w tfidf -r cs -q 'trump biden taiwan china' -f