Triple store queries: replace subqueries with LIMIT 1 by proper subqueries using GROUP BY #3330
Please check if the PR fulfills these requirements
Does this PR already have an issue describing the problem?
What kind of change does this PR introduce?
Bug fix
Does this PR introduce a new Powsybl Action that must be implemented in simulators or pypowsybl?
What is the current behavior?
Some SPARQL queries use subqueries with LIMIT 1 and expect the variables inside the subquery to be bound by the outer scope. One example is the query for the regions and subregions of a substation, where the subquery with LIMIT 1 was meant to return only the first region associated with a given substation.
This way of querying the triple store works in some situations, but the SPARQL language specification does not guarantee it. In theory it should never work: subqueries are evaluated first, and their results are then projected up to the outer query (https://www.w3.org/TR/sparql11-query/#subqueries).
As a minimal example, with our current version of RDF4J this query returns the intended results when looking up one of the past projects of each person:
According to the specification, it should return the same Project for all Persons: inside the subquery the ?Person variable is not bound from the outer scope, so only the first project of an arbitrary Person should be projected up. Instead, this version of the query returns a (proper) Project for every Person.
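A minimal sketch of the pattern described above, assuming a hypothetical `ex:` vocabulary (`ex:Person`, `ex:pastProject`); it illustrates the shape of the problem and is not the exact query from this PR:

```sparql
PREFIX ex: <http://example.org/>   # hypothetical namespace, for illustration only

SELECT ?Person ?Project
WHERE {
    ?Person a ex:Person .
    {
        # ?Person is not projected by the subquery, so this inner ?Person
        # is a separate variable: it is not bound by the outer pattern.
        SELECT ?Project
        WHERE {
            ?Person ex:pastProject ?Project .
        }
        # Yields a single ?Project overall, not one per outer ?Person.
        LIMIT 1
    }
}
```

Under the bottom-up evaluation described in the specification, the subquery produces one solution, which is then joined with every ?Person of the outer pattern, so a compliant engine is expected to pair all Persons with the same Project.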
A very slight alteration of the query is enough to obtain (as expected) bad results:
What is the new behavior (if this is a feature change)?
The proper way to obtain the intended data is for the subquery to also project the variable that we want to match in the outer scope. Inside the subquery, the additional data for that variable is obtained with a GROUP BY clause and an aggregate function. Performance-wise, SAMPLE could be used as the aggregate, but it could lead to nondeterministic results, so MIN is used.
This is the proposed way of rewriting the subqueries to remove the LIMIT 1, which relied on the (wrong) assumption that inner variables were already bound:
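A sketch of that kind of rewrite, assuming a hypothetical `ex:` vocabulary (`ex:Person`, `ex:pastProject`) and a hypothetical helper variable `?PastProject`; the actual queries changed by this PR use the CGMES vocabulary and may differ in detail:

```sparql
PREFIX ex: <http://example.org/>   # hypothetical namespace, for illustration only

SELECT ?Person ?Project
WHERE {
    ?Person a ex:Person .
    {
        # ?Person is projected, so the join with the outer pattern
        # matches each Person with its own aggregated Project.
        SELECT ?Person (MIN(?PastProject) AS ?Project)
        WHERE {
            ?Person ex:pastProject ?PastProject .
        }
        # One solution per Person; MIN picks a deterministic representative,
        # whereas SAMPLE would be free to return any value of the group.
        GROUP BY ?Person
    }
}
```

Because the subquery no longer depends on bindings from the outer scope, the result is the same regardless of the order in which the engine evaluates the two patterns.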
Does this PR introduce a breaking change or deprecate an API?
If yes, please check if the following requirements are fulfilled
What changes might users need to make in their application due to this PR? (migration steps)
Other information:
To evaluate the potential impact on performance, the change has been verified by importing a huge network (a 1.6 GB EQ XML file of a real-world network). The size of the network is summarised in the following table:
Before the change, the import of the network takes an average of 72 s on a reference machine.
With this change, the average import time is 73 s.
The figures have been obtained from 6 sample runs on an Apple MacBook Pro 2.3 GHz 8-core Intel Core i9 with 16 GB main memory.