Skip to content

Latest commit

 

History

History
52 lines (43 loc) · 2.5 KB

README.md

File metadata and controls

52 lines (43 loc) · 2.5 KB

In this project we went through hundreds of articles in Hebrew from the news site walla.co.il and through NLP we searched for articles that matched the query we defined in advance. We used the TF-IDF method to find distances between the articles and the query. The distances between the Marmots were calculated using three different distance functions:

  • Cosine distance
  • Euclidean distance
  • Jaccard distance

The query used in the run is "קרונה COVID חיסונים מחלימים חיסון שלישי בוסטר מתחסנים הקורונה מחלה בדיקות סגר חולים"

Top results of the three distance functions:

│ file │cosine distances│ │ data\2934404.txt │ 0.671822 │
│ data\2919869.txt │ 0.729147 │
│ data\2795571.txt │ 0.729724 │
│ data\2788711.txt │ 0.730278 │
│ data\2672932.txt │ 0.732511 │
│ data\2930070.txt │ 0.744248 │
│ data\2920501.txt │ 0.745138 │
│ data\2752677.txt │ 0.749246 │
│ data\3025567.txt │ 0.756526 │
│ data\2686327.txt │ 0.763186 │


│ file │ euclidean distances │
│ data\2612943.txt │ 60.7065 │
│ data\2617199.txt │ 60.7065 │
│ data\2617945.txt │ 60.7065 │
│ data\2624898.txt │ 60.7065 │
│ data\2674495.txt │ 60.7065 │
│ data\2674793.txt │ 60.7065 │
│ data\2676337.txt │ 60.7065 │
│ data\2677887.txt │ 60.7065 │
│ data\2682157.txt │ 60.7065 │
│ data\2682875.txt │ 60.7065 │


│ file │ jaccard distances │
│ data\2628358.txt │ 0.00047619 │
│ data\2695054.txt │ 0.0004914 │
│ data\2953739.txt │ 0.0005 │
│ data\2841717.txt │ 0.000530223 │
│ data\2743524.txt │ 0.000536769 │
│ data\3004443.txt │ 0.000547345 │
│ data\2834335.txt │ 0.000548246 │
│ data\2843278.txt │ 0.000548246 │
│ data\2731211.txt │ 0.000553403 │
│ data\2897353.txt │ 0.000560852 │