Poor performance loading a moderate number of relationships #2782

Closed
adrianiacobghiula opened this issue Aug 24, 2023 · 5 comments

adrianiacobghiula commented Aug 24, 2023

I am trying to load ~10,000 relationships for a single node, but I am seeing poor performance in spring-data-neo4j while it maps the database response to objects in memory.

Example:

CREATE (p:Person {id: 0, name: 'name0'})
WITH p
UNWIND range(1, 10000, 1) as id
CREATE (k:Person {id: id, name: 'name'+id})
CREATE (p)-[:KNOWS]->(k)

Loading the Person node with name name0 takes 30+ seconds:

findById - Read name0 with 10000 relations in 34048 msec
findByName - Read name0 with 10000 relations in 36735 msec

  @Node("Person")
  @AllArgsConstructor
  public static class Person {
    @Id String name;

    @Relationship("KNOWS")
    List<Person> persons;
  }

  public interface PersonSummary {

    String getName();
    List<PersonSimplified> getPersons();

    interface PersonSimplified {
      String getName();
    }
  }

public interface PersonRepository extends CrudRepository<Person, String> {

  @Query(
      """
    MATCH (p:Person) WHERE p.name = $name
    OPTIONAL MATCH (p)-[p_k:KNOWS]->(k)
    RETURN p, collect(p_k) as p_k, collect(k) as k
    """)
  PersonSummary findByName(String name);
}

  private static void testUsingFindById(PersonRepository repo) {
    long startTime = System.currentTimeMillis();
    repo.findById("name0")
        .ifPresent(
            a ->
                log.info(
                    "findById - Read {} with {} relations in {} msec",
                    a.getName(),
                    a.getPersons().size(),
                    (System.currentTimeMillis() - startTime)));
  }

  private static void testUsingFindByName(PersonRepository repo) {
    long startTime = System.currentTimeMillis();
    PersonSummary a = repo.findByName("name0");

    log.info(
        "findByName - Read {} with {} relations in {} msec",
        a.getName(),
        a.getPersons().size(),
        (System.currentTimeMillis() - startTime));
  }

Full project (LoadTestApplication): https://github.com/adrianiacobghiula/load-performance-spring-data-neo4j

Loading the Person node name0 with 15,000 relationships (instead of 10,000):
findById - Read name0 with 15000 relations in 93009 msec
findByName - Read name0 with 15000 relations in 102189 msec

DefaultNeo4jEntityConverter.createInstanceOfRelationships takes 97% of the time

spring-projects-issues added the status: waiting-for-triage label Aug 24, 2023
meistermeier (Collaborator) commented

Sorry for letting this issue slip through.
I expect some of the impact also comes from querying the database:
The one Person is related to every other person in the list, and therefore the query needs to fetch all of the other 10,000 entries.
In addition to this, the domain you are creating is cyclic: a Person is related to a Person. This could mean that Person(name0) is related to itself, or that Person(name0)->Person(nameX)->Person(name0).
In this case, Spring Data Neo4j creates cascading queries: it starts with the search for Person(name0) and then looks at each of those 10,000 nodes for further KNOWS relationships.

But this does not explain to me why it suddenly gets much better when I move the relationship target to an AnotherPerson class that only contains the name (of course also adapting the Cypher script to create correctly labeled nodes).

findById - Read name0 with 10000 relations in 253 msec
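
For reference, a minimal sketch of what this non-cyclic variant could look like; the AnotherPerson class shown here is an assumption based on the description above, not code taken from this issue:

  // Sketch only: the relationship target is a type without further relationships,
  // so the Person -> Person cycle in the domain model disappears.
  @Node("Person")
  @AllArgsConstructor
  public static class Person {
    @Id String name;

    @Relationship("KNOWS")
    List<AnotherPerson> persons;
  }

  @Node("AnotherPerson")
  @AllArgsConstructor
  public static class AnotherPerson {
    @Id String name;
  }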

There must be something related to the same-node-type pattern in the mapping code, as you have already pointed out. I am investigating this.

meistermeier self-assigned this Dec 12, 2023
meistermeier added the status: needs-investigation label and removed the status: waiting-for-triage label Dec 12, 2023
adrianiacobghiula (Author) commented

"therefor the query needs fetch all other 10.000 entries." <- fast if done with the driver
"the domain you are creating is cyclic" <- not true I create a node p and link it to 10000 others nodes

meistermeier (Collaborator) commented Dec 12, 2023

The thing is that a Person having Persons is, in a Spring Data domain model, a cycle. As described above, we can only assume the worst: even if this is not your intention, SDN will load it this way because that is what your domain model specifies.
But your projection with PersonSimplified already mitigates this by avoiding the second hop that would otherwise have to be checked.
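
As an aside, a derived finder returning the same projection could look like the sketch below; the method name findOneByName is an assumption, and whether it helps in this concrete case would need measuring:

public interface PersonRepository extends CrudRepository<Person, String> {

  // Interface projection on a derived query: SDN uses the projection to limit
  // what is fetched and mapped, so the persons of the related persons
  // (the second hop) are not requested.
  Optional<PersonSummary> findOneByName(String name);
}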

But this was just information on the behaviour of SDN. I think the problem lies somewhere around the lines (or deeper down the stack) that you have already pinpointed. I am currently benchmarking a few scenarios to see in which cases we experience this drastic mapping performance degradation.

Edit: Forgot to mention that I have a not-yet-fully-specified idea for how to tell SDN that the domain might look cyclic even though the data never will be.

meistermeier (Collaborator) commented

To keep you updated and to log my findings here: the problem is interesting because everything works as intended.
We get the result node n0 and a collection of nX nodes (for this explanation, we can ignore the relationships for a moment).
Every nX node, by its type, also declares a relationship to other nodes.
The mapping phase of SDN is completely unaware of the projections it should later map to; it is basically driven by the data it gets. Assuming it is currently mapping n0 and discovers the n1 node for the KNOWS relationship, it will go further down into all relationship fields of n1 and try to find a node that fits in there. To do this, it has the whole pool of the nX collection and needs to check every entry for the proper id of the defined relationship. And since I mention the relationships here: I am currently checking whether a simple flip in the logic might already help. Current results for your example:

findById - Read name0 with 10000 relations in 6984 msec

Keep in mind that, by the very nature of the result (the same shape as your custom findByName query), we have this big bag of nodes that SDN looks into even when there is nothing to find. The observation that everything becomes much faster when we break this cycle is only possible because there is then no logical cycle anywhere.
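
To illustrate the flip with a simplified sketch (this is not SDN's actual mapping code; the record types and method names are made up for the example):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MappingFlipSketch {

  record NodeRow(long id, String name) {}
  record RelRow(long startId, long endId) {}

  // Before: for every node in the big bucket, check whether some relationship
  // connects it to the node currently being mapped - 10k x 10k checks in this issue.
  static List<NodeRow> nodeBucketFirst(long currentId, List<NodeRow> nodes, List<RelRow> rels) {
    return nodes.stream()
        .filter(n -> rels.stream().anyMatch(r -> r.startId() == currentId && r.endId() == n.id()))
        .collect(Collectors.toList());
  }

  // After: check the relationship bucket first; only when there actually is a
  // relationship for the current node is a target node looked up (via one index
  // built once over the node bucket).
  static List<NodeRow> relationshipBucketFirst(long currentId, List<NodeRow> nodes, List<RelRow> rels) {
    Map<Long, NodeRow> byId = nodes.stream().collect(Collectors.toMap(NodeRow::id, n -> n));
    return rels.stream()
        .filter(r -> r.startId() == currentId)
        .map(r -> byId.get(r.endId()))
        .collect(Collectors.toList());
  }
}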

meistermeier added a commit that referenced this issue Dec 12, 2023
For a lot of relationships on a self-referencing type (10k+),
the result will be in the format `n, collect(rel), collect(relNode)`.
This brings the whole bucket of `relNodes` to the table as a mappable
option every time.
Prior to this change, SDN had to check this complete bucket for potentially
matching nodes, only to find out later that there is no relationship to map them for.
Checking the relationship bucket first, to see whether there is a relationship
at all for which a target node needs to be found, improves the performance drastically.

Closes #2782
meistermeier (Collaborator) commented

This was automatically closed by my commit that fixes the obvious problem in the code.
To keep track of the cyclic-domain-but-acyclic-graph idea, I created #2840
