5. identify duplicities #4

Open
24 tasks
BerkvensNick opened this issue May 23, 2024 · 2 comments

Comments

@BerkvensNick
Contributor

BerkvensNick commented May 23, 2024

The SWR will identify duplicates based on metadata. https://github.com/soilwise-he/Soilwise-userstories/issues/16

Origin: D1.3 Repository architecture

  • define strategy for identification, processing and storing of duplicates for iteration 1
  • processing in metadata store
  • identify duplicates based on metadata
  • adapt KG structure to support duplicities (link with knowledge graph)
  • visualisation of duplicates in user interface (link with UI component)

With more detailed tasks per requirement:

  • Define Strategy for Identification, Processing, and Storing of Duplicates for Iteration 1

      • Define a clear algorithm or methodology for identifying duplicates based on metadata attributes (e.g., title, author, publication date).

      • Ensure the strategy accounts for variations in metadata across different document types and repositories.

      • Outline the workflow for processing identified duplicates, including how to handle conflicting metadata and determining the primary document.

      • Specify whether duplicates should be merged, flagged, or excluded from search results.

      • Define the storage mechanism for duplicates, ensuring efficient retrieval and management within the central database.
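One possible shape for the "flag and pick a primary document" part of this strategy, as a minimal sketch only. The `Record` class, the `SOURCE_PRIORITY` ranking and the "most complete record wins" rule are all assumptions made for illustration; none of them has been decided in this issue.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical source ranking, used only to break ties; not agreed anywhere.
SOURCE_PRIORITY = {"zenodo": 0, "openaire": 1, "cordis": 2}


@dataclass
class Record:
    record_id: str
    source: str
    metadata: dict
    duplicate_of: Optional[str] = None  # set when the record is flagged


def completeness(record: Record) -> int:
    """Count metadata fields that carry a non-empty value."""
    return sum(
        1 for value in record.metadata.values() if value not in (None, "", [], {})
    )


def flag_duplicates(group: list[Record]) -> Record:
    """Pick the most complete record as primary and flag the others.

    Ties are broken by the assumed source ranking, then by record id,
    so the outcome is deterministic.
    """
    primary = min(
        group,
        key=lambda r: (
            -completeness(r),
            SOURCE_PRIORITY.get(r.source, 99),
            r.record_id,
        ),
    )
    for record in group:
        record.duplicate_of = None if record is primary else primary.record_id
    return primary
```

Flagging rather than merging keeps every harvested record intact, so the decision between merging, flagging or excluding from search results can still be changed later.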

  • Adapt KG Structure to Support Duplicities (Link with Knowledge Graph)

      • Modify the knowledge graph (KG) schema to accommodate duplicate relationships between documents.

      • Define how duplicate relationships will be represented within the KG (e.g., as edges linking duplicate nodes).

      • Ensure KG queries can retrieve duplicate-related information, allowing users to explore connections between duplicate documents.

      • Test KG queries to verify accurate retrieval of duplicate metadata and relationships.
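To make "edges linking duplicate nodes" concrete, here is a toy in-memory sketch. It is not the SWR knowledge graph and does not use its schema: the `TinyGraph` class, the `ex:duplicateOf` predicate name and the record identifiers are hypothetical. It only shows that, once duplicates are stored as edges, a query for "all duplicates of X" is a graph traversal.

```python
from collections import defaultdict, deque

DUPLICATE_OF = "ex:duplicateOf"  # hypothetical predicate name


class TinyGraph:
    """Toy triple store that treats duplicate edges as symmetric."""

    def __init__(self):
        self.triples = set()
        self._neighbours = defaultdict(set)

    def add_duplicate(self, a, b):
        """Store one duplicate edge and index it in both directions."""
        self.triples.add((a, DUPLICATE_OF, b))
        self._neighbours[a].add(b)
        self._neighbours[b].add(a)

    def duplicates_of(self, node):
        """Return every node reachable over duplicate edges (breadth-first)."""
        seen = {node}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for other in self._neighbours.get(current, ()):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        seen.discard(node)
        return seen


graph = TinyGraph()
graph.add_duplicate("zenodo:1", "openaire:7")
graph.add_duplicate("openaire:7", "cordis:3")
print(graph.duplicates_of("zenodo:1"))
```

The traversal follows edges transitively, so `zenodo:1` and `cordis:3` end up in the same duplicate set even though no edge links them directly. Whether duplicates should be treated as transitive is itself a design decision for the KG schema.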

  • Processing in Metadata Store

      • Ensure metadata extraction is accurate and robust across different document formats and languages.

      • Enrich metadata with additional attributes that facilitate duplicate identification (e.g., normalized titles, standardized author names).

      • Identify duplicates using the processing capabilities of the SWR metadata store.
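As an illustration of the "normalized titles, standardized author names" enrichment, a small sketch using only the Python standard library. The function names and the chosen normal forms (lowercase ASCII title, "family initial" for authors) are assumptions, not an agreed SWR convention.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose characters and drop combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_title(title):
    """Lowercase, remove accents and collapse punctuation/whitespace runs."""
    text = strip_accents(title).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def normalize_author(name):
    """Reduce 'Family, Given' or 'Given Family' to 'family g'."""
    name = strip_accents(name).strip()
    if not name:
        return ""
    if "," in name:
        family, given = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        family, given = parts[-1], " ".join(parts[:-1])
    initial = given[:1].lower()
    return f"{family.lower()} {initial}".strip()


print(normalize_title("  Soil Héalth: A Review! "))
print(normalize_author("Berkvens, Nick"))
```

This is deliberately naive: titles in non-Latin scripts lose all their characters, and multi-word family names written without a comma are split on the last word only, so real data would need more care.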

  • Identify Duplicates Based on Metadata

      • Implement a duplicate detection algorithm based on metadata similarity metrics (e.g., Jaccard similarity, Levenshtein distance).

      • Test the algorithm's performance on a diverse dataset to evaluate its accuracy and efficiency.

      • Define thresholds for similarity scores or metadata attributes to classify documents as duplicates.

      • Adjust thresholds based on the desired balance between precision and recall in duplicate identification.
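The two metrics named above, plus a helper for comparing thresholds, could look roughly like this. It is a plain-Python sketch for discussion: the word-level tokenisation for Jaccard and the `precision_recall` helper are assumptions, and the scores in the usage lines are made-up examples, not measurements.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' lowercase word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Scale the edit distance to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def precision_recall(scored_pairs, threshold):
    """scored_pairs: iterable of (similarity score, is_true_duplicate)."""
    tp = fp = fn = 0
    for score, is_duplicate in scored_pairs:
        predicted = score >= threshold
        if predicted and is_duplicate:
            tp += 1
        elif predicted:
            fp += 1
        elif is_duplicate:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up labelled pairs, only to show how the threshold shifts the balance.
pairs = [(0.9, True), (0.8, False), (0.4, True), (0.2, False)]
print(precision_recall(pairs, 0.5))
print(precision_recall(pairs, 0.85))
```

Raising the threshold in the usage lines removes the false positive but does not recover the missed duplicate, which is the precision/recall trade-off the last task refers to.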

  • Visualization of Duplicates in User Interface (Link with UI Component)

      • Design intuitive visualizations within the user interface (UI) to represent duplicate relationships.

      • Ensure visualizations are accessible and informative for users of varying expertise levels.

      • Implement interactive features that allow users to explore duplicate relationships (e.g., clicking on a document to view its duplicates).

@BerkvensNick BerkvensNick changed the title from "identify duplicities" to "4. identify duplicities" on May 23, 2024
@pvgenuchten
Contributor

This issue needs to be discussed. Duplicates will occur: a knowledge article may be available in Zenodo, OpenAIRE and CORDIS at the same time. However, each of these platforms captures extra information about the resource. That information should be merged into a single set of statements about the resource, and the knowledge graph will facilitate this process. Along the way we will run into multiple challenges, for example a resource that has different titles on different platforms. The typical behaviour in that case is to store both titles.
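A rough sketch of what "merge to a single set of statements, keeping both titles" could mean in practice. The `merge_statements` function, the record layout and the example values (including the DOI) are invented for illustration; the actual merge would happen in the knowledge graph.

```python
from collections import defaultdict


def merge_statements(records):
    """Merge (source, metadata) pairs into property -> {value: [sources]}.

    Conflicting values are not resolved: every distinct value is kept,
    together with the platforms that asserted it.
    """
    merged = defaultdict(dict)
    for source, metadata in records:
        for prop, value in metadata.items():
            values = value if isinstance(value, list) else [value]
            for item in values:
                if item in (None, ""):
                    continue  # skip empty statements
                merged[prop].setdefault(item, []).append(source)
    return {prop: dict(values) for prop, values in merged.items()}


# Made-up example records for the same resource on two platforms.
records = [
    ("zenodo", {"title": "Soil health indicators", "doi": "10.1234/abc"}),
    (
        "cordis",
        {
            "title": "Soil Health Indicators (final report)",
            "doi": "10.1234/abc",
            "funder": "EC",
        },
    ),
]
merged = merge_statements(records)
print(merged["title"])
print(merged["doi"])
```

Keeping the source next to each value means the UI or a later cleaning step can still decide which title to display, without losing what each platform said.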

@BerkvensNick
Contributor Author

Maybe in this first iteration we can identify/flag duplicates based on DOI, title similarity, author and date, and then discuss further with JRC how to tackle the duplicities, based on the actual "duplicate sources" we have found?
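Something along these lines could serve as a first-iteration flagging rule, as a sketch only. The field names (`doi`, `title`, `authors`, `date`), the `TITLE_THRESHOLD` value and the rule that a matching DOI is decisive are assumptions to be checked against real harvested records. `difflib.SequenceMatcher` from the standard library stands in for whichever title similarity measure is chosen.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.9  # assumed value; to be tuned on real data

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip common resolver prefixes."""
    doi = (doi or "").strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def is_probable_duplicate(a, b):
    """Flag two metadata dicts as probable duplicates.

    Assumed rule: an identical DOI is decisive; otherwise title similarity,
    a shared author string and the same year must all agree.
    """
    doi_a, doi_b = normalize_doi(a.get("doi")), normalize_doi(b.get("doi"))
    if doi_a and doi_a == doi_b:
        return True
    title_a = (a.get("title") or "").lower()
    title_b = (b.get("title") or "").lower()
    title_score = SequenceMatcher(None, title_a, title_b).ratio()
    year_a = (a.get("date") or "")[:4]
    year_b = (b.get("date") or "")[:4]
    same_year = year_a != "" and year_a == year_b
    authors_a = {name.lower() for name in a.get("authors") or []}
    authors_b = {name.lower() for name in b.get("authors") or []}
    shared_author = bool(authors_a & authors_b)
    return title_score >= TITLE_THRESHOLD and same_year and shared_author
```

Records with different DOIs still fall through to the title/author/date check here, because some platforms mint a separate DOI per version. Whether such versions should count as duplicates is exactly the kind of case to bring to the JRC discussion.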

@roblokers roblokers self-assigned this May 29, 2024
@BerkvensNick BerkvensNick changed the title from "4. identify duplicities" to "5. identify duplicities" on Aug 12, 2024

3 participants