Poor performance loading a moderate number of relationships #2782

Closed
adrianiacobghiula opened this issue Aug 24, 2023 · 5 comments

adrianiacobghiula commented Aug 24, 2023

I am trying to load ~10,000 relationships for a single node, but I am seeing poor performance in spring-data-neo4j while it maps the database response to objects in memory.

Example:

CREATE (p:Person {id: 0, name: 'name0'})
WITH p
UNWIND range(1, 10000, 1) as id
CREATE (k:Person {id: id, name: 'name'+id})
CREATE (p)-[:KNOWS]->(k)

Loading the Person node with name name0 takes 30+ seconds:

findById - Read name0 with 10000 relations in 34048 msec
findByName - Read name0 with 10000 relations in 36735 msec

  @Node("Person")
  @AllArgsConstructor
  public static class Person {
    @Id String name;

    @Relationship("KNOWS")
    List<Person> persons;
  }

  public interface PersonSummary {

    String getName();
    List<PersonSimplified> getPersons();

    interface PersonSimplified {
      String getName();
    }
  }

public interface PersonRepository extends CrudRepository<Person, String> {

  @Query(
      """
    MATCH (p:Person) WHERE p.name = $name
    OPTIONAL MATCH (p)-[p_k:KNOWS]->(k)
    RETURN p, collect(p_k) as p_k, collect(k) as k
    """)
  PersonSummary findByName(String name);
}

  private static void testUsingFindById(PersonRepository repo) {
    long startTime = System.currentTimeMillis();
    repo.findById("name0")
        .ifPresent(
            a ->
                log.info(
                    "findById - Read {} with {} relations in {} msec",
                    a.getName(),
                    a.getPersons().size(),
                    (System.currentTimeMillis() - startTime)));
  }

  private static void testUsingFindByName(PersonRepository repo) {
    long startTime = System.currentTimeMillis();
    PersonSummary a = repo.findByName("name0");

    log.info(
        "findByName - Read {} with {} relations in {} msec",
        a.getName(),
        a.getPersons().size(),
        (System.currentTimeMillis() - startTime));
  }

Full project (LoadTestApplication): https://github.com/adrianiacobghiula/load-performance-spring-data-neo4j

Loading the Person node name0 with 15,000 relationships (instead of 10,000):
findById - Read name0 with 15000 relations in 93009 msec
findByName - Read name0 with 15000 relations in 102189 msec

DefaultNeo4jEntityConverter.createInstanceOfRelationships takes 97% of the time

spring-projects-issues added the status: waiting-for-triage label Aug 24, 2023
meistermeier (Collaborator) commented

Sorry for letting this issue slip through.
I expect some of the impact also comes from querying the database:
The one Person is related to every other person in the list, and therefore the query needs to fetch all of the other 10,000 entries.
In addition to this, the domain you are creating is cyclic: a Person is related to a Person. This could mean that Person(name0) is related to itself, or that Person(name0)->Person(nameX)->Person(name0).
In this case, Spring Data Neo4j creates cascading queries: it starts with the search for Person(name0) and then looks at each of those 10,000 nodes for further KNOWS relationships.

But this does not explain to me why it suddenly gets much better when I move the relationship target to an AnotherPerson class that only contains the name (of course also adapting the Cypher script to create correctly labeled nodes).

findById - Read name0 with 10000 relations in 253 msec
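
For reference, a minimal sketch of what this non-cyclic variant could look like; the AnotherPerson class shown here is an assumption based on the description above, not code taken from this issue:

  // Sketch only: the relationship target is a type without further relationships,
  // so the Person -> Person cycle in the domain model disappears.
  @Node("Person")
  @AllArgsConstructor
  public static class Person {
    @Id String name;

    @Relationship("KNOWS")
    List<AnotherPerson> persons;
  }

  @Node("AnotherPerson")
  @AllArgsConstructor
  public static class AnotherPerson {
    @Id String name;
  }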

There must be something related to the same-node-type pattern in the mapping code, as you have already pointed out. I am investigating this.

meistermeier self-assigned this Dec 12, 2023
meistermeier added the status: needs-investigation label and removed the status: waiting-for-triage label Dec 12, 2023
adrianiacobghiula (Author) commented

"therefor the query needs fetch all other 10.000 entries." <- fast if done with the driver
"the domain you are creating is cyclic" <- not true I create a node p and link it to 10000 others nodes

meistermeier (Collaborator) commented Dec 12, 2023

The thing is that a Person having Persons is, in a Spring Data domain model, a cycle. As described above, we can only assume the worst: even if this is not your intention, SDN will load it this way because that is what your domain model specifies.
But your projection with PersonSimplified already mitigates this by avoiding the second hop that would otherwise have to be checked.
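
As an aside, a derived finder returning the same projection could look like the sketch below; the method name findOneByName is an assumption, and whether it helps in this concrete case would need measuring:

public interface PersonRepository extends CrudRepository<Person, String> {

  // Interface projection on a derived query: SDN uses the projection to limit
  // what is fetched and mapped, so the persons of the related persons
  // (the second hop) are not requested.
  Optional<PersonSummary> findOneByName(String name);
}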

But this was just information on the behaviour of SDN. I think the problem lies somewhere around the lines (or deeper down the stack) that you have already pinpointed. I am currently benchmarking a few scenarios to see in which cases we experience this drastic mapping performance degradation.

Edit: Forgot to mention that I have a not-yet-fully-specified idea for how to tell SDN that the domain might look cyclic even though the data never will be.

meistermeier (Collaborator) commented

To keep you updated and to log my findings here: the problem is interesting because everything works as intended.
We get the result node n0 and a collection of nX nodes (for this explanation, we can ignore the relationships for a moment).
Every nX node, by its type, also declares a relationship to other nodes.
The mapping phase of SDN is completely unaware of the projections it should later map to; it is basically driven by the data it gets. Assuming it is currently mapping n0 and discovers the n1 node for the KNOWS relationship, it will go further down into all relationship fields of n1 and try to find a node that fits in there. To do this, it has the whole pool of the nX collection and needs to check every entry for the proper id of the defined relationship. And since I mention the relationships here: I am currently checking whether a simple flip in the logic might already help. Current results for your example:

findById - Read name0 with 10000 relations in 6984 msec

Keep in mind that, by the very nature of the result (the same shape as your custom findByName query), we have this big bag of nodes that SDN looks into even when there is nothing to find. The observation that everything becomes much faster when we break this cycle is only possible because there is then no logical cycle anywhere.
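
To illustrate the flip with a simplified sketch (this is not SDN's actual mapping code; the record types and method names are made up for the example):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MappingFlipSketch {

  record NodeRow(long id, String name) {}
  record RelRow(long startId, long endId) {}

  // Before: for every node in the big bucket, check whether some relationship
  // connects it to the node currently being mapped - 10k x 10k checks in this issue.
  static List<NodeRow> nodeBucketFirst(long currentId, List<NodeRow> nodes, List<RelRow> rels) {
    return nodes.stream()
        .filter(n -> rels.stream().anyMatch(r -> r.startId() == currentId && r.endId() == n.id()))
        .collect(Collectors.toList());
  }

  // After: check the relationship bucket first; only when there actually is a
  // relationship for the current node is a target node looked up (via one index
  // built once over the node bucket).
  static List<NodeRow> relationshipBucketFirst(long currentId, List<NodeRow> nodes, List<RelRow> rels) {
    Map<Long, NodeRow> byId = nodes.stream().collect(Collectors.toMap(NodeRow::id, n -> n));
    return rels.stream()
        .filter(r -> r.startId() == currentId)
        .map(r -> byId.get(r.endId()))
        .collect(Collectors.toList());
  }
}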

meistermeier added a commit that referenced this issue Dec 12, 2023
For a lot of relationships on a self-referencing type (10k+),
the result will be in the format `n, collect(rel), collect(relNode)`.
This brings the whole bucket of `relNodes` to the table as a mappable
option every time.
Prior to this change, SDN had to check this complete bucket for potentially
matching nodes, only to find out later that there is no relationship to map them for.
Checking the relationship bucket first, to see whether there is a relationship
at all for which a target node needs to be found, improves the performance drastically.

Closes #2782
meistermeier (Collaborator) commented

This was automatically closed by my commit that fixes the obvious problem in the code.
To keep track of the cyclic-domain-but-acyclic-graph idea, I created #2840
