Triple store queries: replace subqueries with LIMIT 1 by proper subqueries using GROUP BY #3330

Open
wants to merge 5 commits into
base: main

Conversation

zamarrenolm
Member

@zamarrenolm zamarrenolm commented Feb 20, 2025

Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)

Does this PR already have an issue describing the problem?

What kind of change does this PR introduce?

Bug fix

Does this PR introduce a new Powsybl Action implying to be implemented in simulators or pypowsybl?

  • Yes, the corresponding issue is here
  • No

What is the current behavior?

Some SPARQL queries use subqueries with LIMIT 1, expecting inner variables to be instantiated by the outer scope. One example is the query for the regions and subregions of a substation: the subquery combined with LIMIT was used to obtain only the first region associated with a given substation.

This way of querying the triple store works in some situations, but the SPARQL language specification does not guarantee it. In theory it should never work: subqueries are evaluated first, and their results are then projected up to the outer query (https://www.w3.org/TR/sparql11-query/#subqueries).

As a minimal example, with our current version of RDF4J this query returns the intended results when looking up one of the past projects of each person:

prefix foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            SELECT ?Project
            WHERE {
                ?Person foaf:pastProject ?Project .
            }
            LIMIT 1
        }
        ?Project foaf:name ?projectName .
    }
}

According to the specification, it should return the same Project for all Persons: the ?Person variable in the subquery is not instantiated from the outer scope, so only the first project of any Person is projected up. Instead, with this version of the query RDF4J returns a (proper) Project for every Person.
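The bottom-up evaluation described by the specification can be mimicked outside any triple store. The following Python sketch is only an illustration under assumed data: the toy pairs in `PAST_PROJECT` and the helper names `subquery_limit_1` and `outer_query` are hypothetical, and the code does not reproduce how RDF4J actually evaluates the query.

```python
# Hypothetical toy data: (person, project) pairs standing in for
# foaf:pastProject triples. Not taken from any real network or store.
PAST_PROJECT = [
    ("alice", "projA"),
    ("bob", "projB"),
    ("carol", "projC"),
]
PERSONS = ["alice", "bob", "carol"]


def subquery_limit_1():
    """Evaluate the inner 'SELECT ?Project ... LIMIT 1' on its own.

    ?Person is neither bound from the outer scope nor projected, so the
    result is a single ?Project solution unrelated to any outer row.
    """
    solutions = [{"Project": project} for _person, project in PAST_PROJECT]
    return solutions[:1]


def outer_query():
    """Join every outer ?Person row with the already evaluated subquery."""
    inner = subquery_limit_1()
    rows = []
    for person in PERSONS:
        if not inner:
            # OPTIONAL: keep the person even when the subquery is empty.
            rows.append({"Person": person, "Project": None})
            continue
        # The outer row and the inner solutions share no variable, so the
        # join is a cross product: every person gets the same project.
        for solution in inner:
            rows.append({"Person": person, "Project": solution["Project"]})
    return rows


for row in outer_query():
    print(row["Person"], row["Project"])
```

In this sketch every person is paired with the same project. Which project comes first depends on the order in which the store enumerates the triples.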

A very slight alteration of the query gives, as expected, bad results:

prefix foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            SELECT ?Project ?projectName
            WHERE {
                ?Person foaf:pastProject ?Project .
                ?Project foaf:name ?projectName .
            }
            LIMIT 1
        }
    }
}

What is the new behavior (if this is a feature change)?
The proper way of obtaining the intended data is to also project, from the subquery, the variable that we want to match in the outer scope. Inside the subquery, the additional data for that variable is obtained using a GROUP BY clause and an aggregate function. Performance-wise, SAMPLE could be used as the aggregate, but that could lead to nondeterministic results, so MIN is used.

This is the proposed way of rewriting the subqueries, removing the LIMIT 1 that relied on the (wrong) assumption that inner variables were already instantiated:

prefix foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE {
    ?Person a foaf:Person ;
        foaf:name ?name .
    OPTIONAL {
        {
            SELECT ?Person (MIN(?Project) AS ?Project)
            WHERE {
                ?Person foaf:pastProject ?Project .
            }
            GROUP BY ?Person
        }
        ?Project foaf:name ?projectName .
    }
}
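The effect of the GROUP BY rewrite can be sketched in plain Python under the same bottom-up reading. The data and the names `PAST_PROJECT`, `subquery_group_by_min` and `outer_query` are hypothetical, and Python string comparison is used as a stand-in for the ordering that MIN applies in SPARQL.

```python
from collections import defaultdict

# Hypothetical toy data: (person, project) pairs standing in for
# foaf:pastProject triples. carol has no past project.
PAST_PROJECT = [
    ("alice", "projB"),
    ("alice", "projA"),
    ("bob", "projC"),
]
PERSONS = ["alice", "bob", "carol"]


def subquery_group_by_min():
    """Emulate 'SELECT ?Person (MIN(?Project) AS ?Project) ... GROUP BY ?Person'."""
    groups = defaultdict(list)
    for person, project in PAST_PROJECT:
        groups[person].append(project)
    # MIN always picks the same member of each group; SAMPLE would be
    # allowed to return any member, hence possibly different results per run.
    return [
        {"Person": person, "Project": min(projects)}
        for person, projects in groups.items()
    ]


def outer_query():
    """Join outer ?Person rows with the subquery on the shared ?Person variable."""
    inner = subquery_group_by_min()
    rows = []
    for person in PERSONS:
        matches = [s for s in inner if s["Person"] == person]
        if matches:
            rows.extend(matches)
        else:
            # OPTIONAL: keep the person and leave ?Project unbound.
            rows.append({"Person": person, "Project": None})
    return rows


for row in outer_query():
    print(row["Person"], row["Project"])
```

Because ?Person is projected by the subquery, the join no longer depends on the engine leaking outer bindings into the inner scope: each person is matched with one project of their own, or with none.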

Does this PR introduce a breaking change or deprecate an API?

  • Yes
  • No

If yes, please check if the following requirements are fulfilled

  • The Breaking Change or Deprecated label has been added
  • The migration steps are described in the following section

What changes might users need to make in their application due to this PR? (migration steps)

Other information:

To evaluate the potential impact on performance, the change has been verified by importing a huge network (a 1.6 GB EQ XML file of a real-world network). The size of the network is summarised in the following table:

| Class | # of objects |
| --- | --- |
| Substations | 5,000 |
| Voltage levels | 9,000 |
| Connectivity nodes | 60,000 |
| Busbars | 12,000 |
| Lines | 9,000 |
| Transformers | 2,500 |
| Switches | 70,000 |
| Terminals | 180,000 |

Before the change, the import of the network takes an average of 72 s on a reference machine.
With this change, the average import time is 73 s.
The figures were obtained from 6 sample runs on an Apple MacBook Pro with a 2.3 GHz 8-core Intel Core i9 and 16 GB of main memory.

@rcourtier rcourtier self-requested a review March 7, 2025 08:19
@zamarrenolm zamarrenolm marked this pull request as ready for review March 11, 2025 08:53
@olperr1 olperr1 added the bug label Mar 12, 2025