Y-B-Class-Projects/NLP-TF-IDF

In this project we processed hundreds of Hebrew-language articles from the news site walla.co.il and used NLP to find the articles that best matched a query defined in advance. We used TF-IDF to represent the articles and the query, then measured the distance between each article and the query. The distances were calculated using three different distance functions:

  • Cosine distance
  • Euclidean distance
  • Jaccard distance
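
The approach described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names (`tf_idf_vectors`, `rank`, and the three distance helpers), the smoothed IDF formula, the assumption that texts are already tokenized, and the choice to compute Jaccard distance over the sets of terms with non-zero weight are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(token_lists):
    """Return (vocabulary, vectors): one TF-IDF vector per token list."""
    n_docs = len(token_lists)
    vocab = sorted({t for tokens in token_lists for t in tokens})
    # Document frequency: number of texts that contain each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Smoothed IDF (an assumption; the README does not state the weighting).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    # Assumption: compare the sets of terms that have non-zero weight.
    terms_a = {i for i, x in enumerate(a) if x}
    terms_b = {i for i, y in enumerate(b) if y}
    union = terms_a | terms_b
    if not union:
        return 0.0
    return 1.0 - len(terms_a & terms_b) / len(union)


def rank(query_tokens, docs, distance):
    """Sort documents by distance to the query, closest first.

    `docs` maps a (hypothetical) file name to its list of tokens.
    """
    names = list(docs)
    _, vectors = tf_idf_vectors([query_tokens] + [docs[n] for n in names])
    query_vec = vectors[0]
    scored = [(n, distance(query_vec, v)) for n, v in zip(names, vectors[1:])]
    return sorted(scored, key=lambda pair: pair[1])
```

Under these assumptions, calling `rank(query_tokens, docs, cosine_distance)` returns a list of `(file name, distance)` pairs in ascending order, and swapping in `euclidean_distance` or `jaccard_distance` produces the other two rankings.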

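The query used in the run is "קרונה COVID חיסונים מחלימים חיסון שלישי בוסטר מתחסנים הקורונה מחלה בדיקות סגר חולים" (roughly: "corona COVID vaccines recovered vaccine third booster vaccinated the-corona disease tests lockdown patients").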

Top 10 results for each of the three distance functions, smallest distance first:

│ file │ cosine distances │
│ data\2934404.txt │ 0.671822 │
│ data\2919869.txt │ 0.729147 │
│ data\2795571.txt │ 0.729724 │
│ data\2788711.txt │ 0.730278 │
│ data\2672932.txt │ 0.732511 │
│ data\2930070.txt │ 0.744248 │
│ data\2920501.txt │ 0.745138 │
│ data\2752677.txt │ 0.749246 │
│ data\3025567.txt │ 0.756526 │
│ data\2686327.txt │ 0.763186 │


│ file │ euclidean distances │
│ data\2612943.txt │ 60.7065 │
│ data\2617199.txt │ 60.7065 │
│ data\2617945.txt │ 60.7065 │
│ data\2624898.txt │ 60.7065 │
│ data\2674495.txt │ 60.7065 │
│ data\2674793.txt │ 60.7065 │
│ data\2676337.txt │ 60.7065 │
│ data\2677887.txt │ 60.7065 │
│ data\2682157.txt │ 60.7065 │
│ data\2682875.txt │ 60.7065 │


│ file │ jaccard distances │
│ data\2628358.txt │ 0.00047619 │
│ data\2695054.txt │ 0.0004914 │
│ data\2953739.txt │ 0.0005 │
│ data\2841717.txt │ 0.000530223 │
│ data\2743524.txt │ 0.000536769 │
│ data\3004443.txt │ 0.000547345 │
│ data\2834335.txt │ 0.000548246 │
│ data\2843278.txt │ 0.000548246 │
│ data\2731211.txt │ 0.000553403 │
│ data\2897353.txt │ 0.000560852 │

About

Information Retrieval EX02
