
Removing duplicates from the output? #26

Open
dachafra opened this issue Mar 2, 2021 · 7 comments


dachafra commented Mar 2, 2021

Hi!
Is there any way to tell the engine not to generate duplicates during the construction of the KG?

ghsnd (Contributor) commented Mar 3, 2021

Hi,
No, right now this is not possible. I'll create an issue.
Best regards,
Gerald


andimou commented Mar 3, 2021

@dachafra I do not think that the R2RML spec clarifies what processors need to do with duplicate triples.

I think this might indeed become an issue for streaming data, but for static KGs generated with [R2]RML, I think it is a matter for library implementations and stores to deal with duplicates. E.g., some SPARQL endpoints deduplicate.

All I see in the R2RML spec is:

RDF graphs cannot contain duplicate RDF triples. Placing multiple equal triples into the same graph has the same effect as placing it into the graph only once.

but there is no clarification of whether processors need to deduplicate. On the contrary, the latter sentence gives the impression that it does not matter if there are duplicates.
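
For illustration, a minimal sketch (Python with RDFLib; the library is my choice here, not something the spec or these engines mandate) of the set semantics that spec text describes: adding an equal triple twice has the same effect as adding it once.

```python
from rdflib import Graph, URIRef

g = Graph()
t = (URIRef("http://example.org/s"),
     URIRef("http://example.org/p"),
     URIRef("http://example.org/o"))

g.add(t)
g.add(t)  # the same triple again

print(len(g))  # 1 -- the graph keeps the triple only once
```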

dachafra (Author) commented Mar 3, 2021

@andimou I partially agree with you. Although the R2RML specification does not explicitly say that RDF graphs with duplicates are invalid, it states that "placing multiple equal triples into the same graph has the same effect as placing it into the graph only once", which, as far as I understand it, means that it makes little sense to insert a triple that is already in the graph, especially when our aim is to create a high-quality graph.

On the other hand, I understand that this task may not be relevant for some of the parsers and can be delegated to SPARQL endpoints or other mechanisms that perform the deduplication.
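
A hedged sketch of such delegated, after-the-fact deduplication, assuming the engine writes N-Triples (one triple per line; the file names are hypothetical):

```python
# Deduplicate a static N-Triples dump after generation. This relies on
# equal triples serializing to byte-identical lines, which holds for
# canonical N-Triples output without blank nodes.
seen = set()
with open("output.nt") as src, open("output-dedup.nt", "w") as dst:
    for line in src:
        if line not in seen:
            seen.add(line)
            dst.write(line)
```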


andimou commented Mar 3, 2021

I interpret it as redundant but not harmful if it happens: it does not matter, because it has the same effect!

As long as the spec allows different interpretations, as it does here (and we are both equally knowledgeable about [R2]RML), I think processors may choose to implement this as they like, because the spec does not specify what processors should do, as it does in other cases. In this case, I like the vagueness :)

Forcing deduplication to be implemented may become more complicated in different scenarios. For instance, in a streaming scenario it might mean keeping in memory all triples ever generated so that you never feed the stream the same triple twice, eventually exhausting all your memory. I'd rather have duplicates then! They are not harmful after all ;)
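
To make that trade-off concrete, a sketch (a hypothetical helper, not RMLStreamer code) of a bounded-window deduplicator: memory stays constant, but duplicates farther apart than the window still pass through.

```python
from collections import OrderedDict

class WindowedDedup:
    """Suppress a triple if it was seen among the last `max_size` distinct triples."""
    def __init__(self, max_size=100_000):
        self.max_size = max_size
        self.seen = OrderedDict()

    def is_new(self, triple):
        if triple in self.seen:
            self.seen.move_to_end(triple)  # refresh recency
            return False
        self.seen[triple] = None
        if len(self.seen) > self.max_size:
            self.seen.popitem(last=False)  # evict the oldest entry
        return True

dedup = WindowedDedup(max_size=3)
stream = [("s", "p", "o"), ("s", "p", "o"), ("s2", "p", "o")]
print([t for t in stream if dedup.is_new(t)])  # the second copy is dropped
```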

Then again, what is considered a duplicate?

The actual triple might be the same, but in the case of the Streamer its timestamp might be different. Is that the same triple or a different one?! In the RMLMapper, PROV is generated automatically: two triples that are the same may be considered duplicates, but if you capture their provenance they are not, because their provenance might differ. In one case the processor may remove them; in the other it may not.
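
As a toy illustration of that identity question (plain tuples, not the actual Streamer/RMLMapper data structures):

```python
t1 = ("ex:s", "ex:p", "ex:o")
t2 = ("ex:s", "ex:p", "ex:o")
print(t1 == t2)   # True: equal as bare triples

# Once a timestamp (or provenance) is part of the statement's identity,
# the "same" triple emitted twice is no longer a duplicate.
a1 = (t1, "2021-03-03T10:00:00Z")
a2 = (t2, "2021-03-03T10:00:05Z")
print(a1 == a2)   # False
```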

I do not see, though, how having duplicates affects quality?

I don't think the existence of duplicate triples is considered in any of the Linked Data quality dimensions/metrics.

@arenas-guerrero-julian

Hi all!

For me, removing duplicates is not mandatory, but a very good feature to have. Regarding Linked Data quality, the article Quality Assessment for Linked Open Data: A Survey penalizes extensional conciseness if there are duplicate triples (if I understood correctly).

Best,
Julián

dachafra (Author) commented Mar 3, 2021

Indeed, I totally agree with the points about the Streamer and PROV+RMLMapper; I think the timestamp in streaming data will definitely play a relevant role in the elimination of duplicates.

My question was focused more on the toFile feature of the RMLStreamer, when the KG is generated from static data.


SteBiard commented Jun 1, 2022

Hello all,

I would be interested in this feature because I had a case with large files (~120k rows, not so extreme though; at that scale maybe only the Streamer is applicable) where the generated knowledge graph was 900 MB. After merging it with the TBox using RDFLib and the OWL API, which remove duplicates directly, it went down to 4 MB. Although the duplicates indeed do not affect the result, the mapping took almost 15 minutes and could have been much faster if duplicates were ignored.

Have a good day,

Stéphane
