X5GON/dupe_detect


Duplicate Detection

This repository contains the source code and instructions for the duplicate detection service deployed in X5GON.

  • This file contains the script that updates the X5GON DB: documents that contain exactly the same values are marked by setting a newly created boolean column, duplicate, to TRUE.

    8542 exact duplicate materials were found, sharing 3230 distinct values, which implies that 5312 documents (8542 - 3230) can be disregarded as duplicates.

  • This file was used to detect all the duplicate clusters in the X5GON DB, using TF (Term Frequency) and WIKI as metrics to determine whether a pair of documents is a duplicate or not.

    A document pair was considered a duplicate when TF > 0.85 and WIKI > 0.95.
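
As a rough illustration of the two steps above, the following Python sketch groups materials by the value being compared and applies the two thresholds to a scored pair. The function names, the `materials` mapping, and the toy data are hypothetical and are not taken from the repository's scripts. The TF and WIKI scores are assumed to be computed elsewhere and passed in as plain numbers.

```python
from collections import defaultdict

# Thresholds quoted in the text above.
TF_THRESHOLD = 0.85
WIKI_THRESHOLD = 0.95


def exact_duplicate_groups(materials):
    """Group material IDs that share exactly the same value.

    `materials` maps a material ID to the value being compared
    (a hypothetical layout). Only groups with two or more members
    are returned, since a value seen once has no duplicate.
    """
    ids_by_value = defaultdict(list)
    for material_id, value in materials.items():
        ids_by_value[value].append(material_id)
    return {v: ids for v, ids in ids_by_value.items() if len(ids) > 1}


def redundant_count(groups):
    """Documents that can be disregarded: group members minus one per group."""
    total_members = sum(len(ids) for ids in groups.values())
    return total_members - len(groups)


def is_duplicate_pair(tf_score, wiki_score):
    """A pair counts as a duplicate only if both scores exceed their thresholds."""
    return tf_score > TF_THRESHOLD and wiki_score > WIKI_THRESHOLD


# Toy data (made up for illustration only).
materials = {1: "a", 2: "b", 3: "a", 4: "a", 5: "b", 6: "c"}
groups = exact_duplicate_groups(materials)
print(groups, redundant_count(groups))
print(is_duplicate_pair(0.90, 0.96), is_duplicate_pair(0.90, 0.95))
```

Applied to the counts reported above, the same subtraction gives 8542 - 3230 = 5312 redundant documents. Whether the actual script flags every member of a group or all but one representative is not stated here, so the sketch only returns the groups and the count.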

This contains the results obtained with the method proposed above: for each document, its material ID together with the material IDs of its detected duplicates. These results were used to plot the following graph for visual analysis.

img_1.png

This interactive graph was used to evaluate the results produced by the method proposed above. Each dot represents a document, and each cluster represents a set of duplicate documents.
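
One way to think about these clusters is as connected components of a graph whose edges are the detected duplicate pairs. The sketch below works under that assumption and uses a small union-find structure; `duplicate_clusters` and the sample pairs are hypothetical, and the repository's own scripts may group documents differently.

```python
def duplicate_clusters(pairs):
    """Group material IDs into clusters of mutually linked duplicates.

    `pairs` is an iterable of (id_a, id_b) tuples, one per detected
    duplicate pair. Two IDs land in the same cluster if a chain of
    pairs connects them (connected components of the pair graph).
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)

    # Sorted output makes the result deterministic and easy to inspect.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Made-up pairs for illustration only.
print(duplicate_clusters([(1, 2), (2, 3), (10, 11)]))
```

Under this assumption, a chain such as 1-2 and 2-3 places documents 1 and 3 in the same cluster even if that pair was never scored directly, which is worth keeping in mind when evaluating large clusters by hand.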

This graph can be generated using the graph_draw.py file. You can also use the IPython notebook for more interactive analysis; it lets you click on a node to open the corresponding document.

Datasets

| Description | Link | Info |
| --- | --- | --- |
| Results Dataset | link | This dataset contains the results obtained using the above proposed method. This contains material IDs of all the documents with material IDs of their respective detected duplicates. |
| Manually Evaluated Dataset | link | This dataset contains manual evaluation done on the above obtained results. |

TODO

  • Write the script for the cron job to be run on the X5GON server to update duplicates of future OER materials.
  • Write the script to update the DB with a new table using the obtained results.
