Triple store queries: replace subqueries with LIMIT 1 by proper subqueries using GROUP BY #3330

zamarrenolm · 2025-02-20T17:55:34Z

Please check if the PR fulfills these requirements

The commit message follows our guidelines
Tests for the changes have been added (for bug fixes / features)
Docs have been added / updated (for bug fixes / features)

Does this PR already have an issue describing the problem?

What kind of change does this PR introduce?

Bug fix

Does this PR introduce a new Powsybl Action implying to be implemented in simulators or pypowsybl?

Yes, the corresponding issue is here
No

What is the current behavior?

Some SPARQL queries use subqueries with LIMIT 1 and expect the inner variables to be instantiated by the outer scope. One example is the query for the regions and subregions of a substation, where the subquery combined with LIMIT was used to obtain only the first region associated with a given substation.

This way of querying the triple store happens to work in some situations, but the SPARQL language specification does not guarantee it. In theory it should never work: subqueries are evaluated first, and their results are then projected up to the outer query (https://www.w3.org/TR/sparql11-query/#subqueries).

As a minimal example, with our current version of RDF4J the following query returns the intended results when looking up one of the past projects of each person:

prefix foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            SELECT ?Project
            WHERE {
                ?Person foaf:pastProject ?Project .
            }
            LIMIT 1
        }
        ?Project foaf:name ?projectName .
    }
}

According to the specification, it should return the same Project for all Persons: inside the subquery the ?Person variable is not instantiated from the outer scope, so only the first project of any Person should be projected up. Instead, this version of the query returns a (proper) Project for every Person.

Altering the original query only slightly, by moving the project name lookup inside the subquery, gives (as expected) wrong results:

prefix foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            SELECT ?Project ?projectName
            WHERE {
                ?Person foaf:pastProject ?Project .
                ?Project foaf:name ?projectName .
            }
            LIMIT 1
        }
    }
}
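
The difference between the two queries comes down to whether the engine evaluates the inner SELECT in isolation, as the specification describes. As a rough illustration (a sketch for this discussion, not a query taken from the PR), the first subquery run as a standalone query looks like this:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# The subquery on its own: ?Person is an ordinary unbound variable here,
# so the pattern matches the past projects of any person.
# LIMIT 1 then keeps a single, arbitrary row.
SELECT ?Project
WHERE {
    ?Person foaf:pastProject ?Project .
}
LIMIT 1
```

Because ?Person is not projected, that single row shares no variable with the outer pattern, so a specification-compliant engine would attach the same project to every person.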

What is the new behavior (if this is a feature change)?

The proper way to obtain the intended data is to have the subquery also project the variable that must be matched in the outer scope. Inside the subquery, the additional data for that variable is obtained with a GROUP BY clause and an aggregate function. Performance-wise, SAMPLE could be used as the aggregate, but it could lead to nondeterministic results, so MIN is used.

The proposed rewrite of the subqueries removes the LIMIT 1, which relied on the (wrong) assumption that inner variables were already instantiated:

prefix foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            SELECT ?Person (MIN(?Project) AS ?Project)
            WHERE {
                ?Person foaf:pastProject ?Project .
            }
            GROUP BY ?Person
        }
        ?Project foaf:name ?projectName .
    }
}
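
For comparison, the SAMPLE alternative mentioned above could be sketched as follows. This is an illustration only, not a query from this PR. The inner variable name ?candidate is hypothetical, introduced here to keep the alias ?Project distinct from the variables matched inside the subquery.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            # One row per person; SAMPLE picks an arbitrary project from each group,
            # so repeated runs are not guaranteed to return the same project.
            SELECT ?Person (SAMPLE(?candidate) AS ?Project)
            WHERE {
                ?Person foaf:pastProject ?candidate .
            }
            GROUP BY ?Person
        }
        ?Project foaf:name ?projectName .
    }
}
```

The structure is the same as in the MIN version: ?Person is projected so that the subquery rows join with the outer pattern, and only the choice of aggregate differs.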

Does this PR introduce a breaking change or deprecate an API?

Yes
No

If yes, please check if the following requirements are fulfilled

The Breaking Change or Deprecated label has been added
The migration steps are described in the following section

What changes might users need to make in their application due to this PR? (migration steps)

Other information:

To evaluate the potential impact on performance, the change has been verified by importing a huge network (a 1.6 GB EQ XML file of a real-world network). The size of the network is summarised in the following table:

| Class | # of objects |
|---|---|
| Substations | 5,000 |
| Voltage levels | 9,000 |
| Connectivity nodes | 60,000 |
| Busbars | 12,000 |
| Lines | 9,000 |
| Transformers | 2,500 |
| Switches | 70,000 |
| Terminals | 180,000 |

Before the change, the import of the network takes an average of 72 s on a reference machine.
With this change, the average import time is 73 s.
The figures have been obtained from 6 sample runs on an Apple MacBook Pro (2.3 GHz 8-core Intel Core i9, 16 GB main memory).

sonarqubecloud · 2025-03-18T08:14:54Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

zamarrenolm added 3 commits February 20, 2025 18:53

replace subquery LIMIT 1 for regions and subregions by subquery with …

c00caca

…GROUP BY Signed-off-by: Luma <[email protected]>

replace subquery LIMIT 1 for connectivity nodes by subquery with GROU…

06010f7

…P BY Signed-off-by: Luma <[email protected]>

copyright year

d68d4e3

Signed-off-by: Luma <[email protected]>

rcourtier self-requested a review March 7, 2025 08:19

Merge branch 'main' into fix_sparql_subqueries_limit_1

ab5add5

zamarrenolm marked this pull request as ready for review March 11, 2025 08:53

olperr1 added the bug label Mar 12, 2025

rcourtier approved these changes Mar 13, 2025

Merge branch 'main' into fix_sparql_subqueries_limit_1

5bf31f9

rcourtier assigned zamarrenolm Mar 20, 2025

rcourtier added the CGMES label Mar 20, 2025
