
Removing duplicates from the output? #26

Open
dachafra opened this issue Mar 2, 2021 · 7 comments


dachafra commented Mar 2, 2021

Hi!
Is there any way to tell the engine not to generate duplicates during the construction of the KG?

ghsnd (Contributor) commented Mar 3, 2021

Hi,
No, right now this is not possible. I'll create an issue.
Best regards,
Gerald


andimou commented Mar 3, 2021

@dachafra I do not think that the R2RML spec clarifies what processors need to do with duplicate triples.

I think this might indeed become an issue for streaming data, but for static KGs generated with [R2]RML, I think it is a matter for library implementations and stores to deal with duplicates. E.g., some SPARQL endpoints deduplicate.

All I see in the R2RML spec is:

RDF graphs cannot contain duplicate RDF triples. Placing multiple equal triples into the same graph has the same effect as placing it into the graph only once.

but there is no clarification of whether processors need to deduplicate. On the contrary, the latter sentence gives the impression that it does not matter if there are duplicates.
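
For illustration, a minimal sketch (Python with RDFLib; the library is my choice here, not something the spec or these engines mandate) of the set semantics that spec text describes: adding an equal triple twice has the same effect as adding it once.

```python
from rdflib import Graph, URIRef

g = Graph()
t = (URIRef("http://example.org/s"),
     URIRef("http://example.org/p"),
     URIRef("http://example.org/o"))

g.add(t)
g.add(t)  # the same triple again

print(len(g))  # 1 -- the graph keeps the triple only once
```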

dachafra (Author) commented Mar 3, 2021

@andimou I partially agree with you. Although the R2RML specification does not explicitly say that RDF graphs with duplicates are invalid, it states that "placing multiple equal triples into the same graph has the same effect as placing it into the graph only once", which, as far as I understand it, means that it makes little sense to insert a triple that is already in the graph, especially when our aim is to create a high-quality graph.

On the other hand, I understand that this task may not be relevant for some of the parsers and can be delegated to SPARQL endpoints or other mechanisms that perform the deduplication.
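
A hedged sketch of such delegated, after-the-fact deduplication, assuming the engine writes N-Triples (one triple per line; the file names are hypothetical):

```python
# Deduplicate a static N-Triples dump after generation. This relies on
# equal triples serializing to byte-identical lines, which holds for
# canonical N-Triples output without blank nodes.
seen = set()
with open("output.nt") as src, open("output-dedup.nt", "w") as dst:
    for line in src:
        if line not in seen:
            seen.add(line)
            dst.write(line)
```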


andimou commented Mar 3, 2021

I interpret it as redundant but not harmful if it happens: it does not matter, because it has the same effect!

As long as the spec allows different interpretations, as it does here (and we are both equally knowledgeable about [R2]RML), I think processors may choose to implement this as they like, because the spec does not specify what processors should do, as it does in other cases. In this case, I like the vagueness :)

Forcing deduplication to be implemented may become more complicated in different scenarios. For instance, in a streaming scenario it might mean keeping in memory all triples ever generated so that you never feed the stream the same triple twice, eventually exhausting all your memory. I'd rather have duplicates then! They are not harmful after all ;)
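
To make that trade-off concrete, a sketch (a hypothetical helper, not RMLStreamer code) of a bounded-window deduplicator: memory stays constant, but duplicates farther apart than the window still pass through.

```python
from collections import OrderedDict

class WindowedDedup:
    """Suppress a triple if it was seen among the last `max_size` distinct triples."""
    def __init__(self, max_size=100_000):
        self.max_size = max_size
        self.seen = OrderedDict()

    def is_new(self, triple):
        if triple in self.seen:
            self.seen.move_to_end(triple)  # refresh recency
            return False
        self.seen[triple] = None
        if len(self.seen) > self.max_size:
            self.seen.popitem(last=False)  # evict the oldest entry
        return True

dedup = WindowedDedup(max_size=3)
stream = [("s", "p", "o"), ("s", "p", "o"), ("s2", "p", "o")]
print([t for t in stream if dedup.is_new(t)])  # the second copy is dropped
```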

Then again, what is considered a duplicate?

The actual triple might be the same, but in the case of the Streamer its timestamp might be different. Is that the same triple or a different one?! In the RMLMapper, PROV is generated automatically: two triples that are the same may be considered duplicates, but if you capture their provenance they are not, because their provenance might differ. In one case the processor may remove them; in the other it may not.
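
As a toy illustration of that identity question (plain tuples, not the actual Streamer/RMLMapper data structures):

```python
t1 = ("ex:s", "ex:p", "ex:o")
t2 = ("ex:s", "ex:p", "ex:o")
print(t1 == t2)   # True: equal as bare triples

# Once a timestamp (or provenance) is part of the statement's identity,
# the "same" triple emitted twice is no longer a duplicate.
a1 = (t1, "2021-03-03T10:00:00Z")
a2 = (t2, "2021-03-03T10:00:05Z")
print(a1 == a2)   # False
```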

I do not see, though, how having duplicates affects quality?

I don't think the existence of duplicate triples is considered in any of the Linked Data quality dimensions/metrics.

@arenas-guerrero-julian

Hi all!

For me, removing duplicates is not mandatory, but a very good feature to have. Regarding Linked Data quality, the article Quality Assessment for Linked Open Data: A Survey penalizes extensional conciseness if there are duplicate triples (if I understood correctly).

Best,
Julián

dachafra (Author) commented Mar 3, 2021

Indeed, I totally agree with the points about the Streamer and PROV+RMLMapper; I think the timestamp in streaming data will definitely play a relevant role in the elimination of duplicates.

My question was focused more on the toFile feature of the RMLStreamer, when the KG is generated from static data.


SteBiard commented Jun 1, 2022

Hello all,

I would be interested in this feature because I had a case with large files (~120k rows, not so extreme though; at that scale maybe only the Streamer is applicable) where the generated knowledge graph was 900 MB. After merging it with the TBox using RDFLib and the OWL API, which remove duplicates directly, it went down to 4 MB. Although the duplicates indeed do not affect the result, the mapping took almost 15 minutes and could have been much faster if duplicates were ignored.

Have a good day,

Stéphane
