Fix the perf issue in building nodes from splits. #10766

Commits on Feb 15, 2024

  1. Fix the perf issue in building nodes from splits.

    Create the `relationships` object only once. Otherwise, the hash of the whole text is recomputed for every node, which is very inefficient for long texts.
    
    An alternative approach would be to cache the hash property. However, that wasn't straightforward, as `Document` isn't a cacheable type. I also don't know Python very well; maybe it would be enough to store a null and skip the recomputation when the value isn't null? The most important reason, though, is that I'm not sure about the side effects: the existing assumption is that the node is mutable and the hash always reflects its state at the time of the call (unless the object is modified from multiple threads). This change doesn't break any assumptions. If the document were modified while we were creating nodes extracted from it, something would already be very wrong.
    
    Benchmarks taken on a document attached to the bug:
    
    Before: Execution time for build_nodes_from_splits: 53.69 seconds
    
    After: Execution time for build_nodes_from_splits: 0.18 seconds
    preemoDez committed Feb 15, 2024
    d1fb461
  2. Fix the formatting

    preemoDez committed Feb 15, 2024
    7b406a2
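
The idea in the first commit message, building the `relationships` object once instead of once per node, can be illustrated with a minimal sketch. This is not the project's code: `Doc`, `as_related_info`, `build_nodes_slow`, and `build_nodes_fast` are hypothetical stand-ins, and the only assumption carried over from the message is that reading the document's hash rehashes the full text each time.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical document whose hash is recomputed on every access."""

    text: str
    hash_calls: int = 0  # counts how often the full text was hashed

    @property
    def hash(self) -> str:
        self.hash_calls += 1
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

    def as_related_info(self) -> dict:
        # Reading self.hash here triggers a full rehash of the text.
        return {"hash": self.hash}


def build_nodes_slow(splits: list[str], doc: Doc) -> list[dict]:
    # The relationships mapping is rebuilt for every split,
    # so the document is hashed len(splits) times.
    return [
        {"text": s, "relationships": {"source": doc.as_related_info()}}
        for s in splits
    ]


def build_nodes_fast(splits: list[str], doc: Doc) -> list[dict]:
    # The relationships mapping is built once, before the loop,
    # so the document is hashed exactly once.
    relationships = {"source": doc.as_related_info()}
    # Each node gets a shallow copy so nodes do not share one outer dict.
    return [{"text": s, "relationships": dict(relationships)} for s in splits]
```

Both functions return equal node lists as long as the document is not modified during the loop, which is the assumption the commit message relies on. In this sketch the slow variant hashes the text once per split, while the fast variant hashes it once in total.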