Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance when committing or creating a new card is slow on very large card webs #694

Open
4 tasks
jkomoros opened this issue Aug 3, 2024 · 4 comments
Open
4 tasks

Comments

@jkomoros
Copy link
Owner

jkomoros commented Aug 3, 2024

It can take up a few seconds to create a new card, or to commit a card that you've just edited.

Looking at a performance trace it looks like the time is dominated in fingerprintForCard. Every time the set of cards changes we throw out and recompute fingerprints for every card. We should memoize it so the work is proportional to the number of changed cards, which should be very small.

In the constructor of FingerprintGenerator, pop out a separate fingerprintForCard, which is memoized, and is passed a card to use (and make sure that it fetches the wordCounts for cards via the memoized wordcountFor Cards)

Part of the reason this is slow is because technically every time a card changes, it changes the tfidf of all of the cards. However, past, say, 1000 cards, it's fine to assume the baseline idfMap is fixed and don't recalculate it every time a card changes (or at least, be OK with vending stale fingerprints that use a technically-out-of-data idfMap to set the priors for each word)

  • Lazy compute card fingerprints (test to make sure this does make things faster)
  • The baseline wordcounts for the corpus should be behind a memoization that only updates it if the number of cards increases by 10% or more. This will deliberately use stale wordcounts if cards change (since the count won't change) or when a single card is created, but still update the tfidf when a large batch of cards is downloaded
  • Card fingerprint should be memoized based on the card obj and wordcount ob
  • Make selectActiveCollectionWordCloud much faster by cheating, for example only updating the first time the collection changes, but not caring if the cards in the collection change) (or make it so that wordcloud doesn't show up when the overflow menu is expanded, which makes any edit slow)
jkomoros added a commit that referenced this issue Aug 4, 2024
In the next few commits we'll want to render additional controls next to it. It's also weird that it was card-drawer's responsibiliy to render it.

Part of #694.
jkomoros added a commit that referenced this issue Aug 4, 2024
This means that unless the user hits the 'Regeneate Word Cloud' then it's not updated.

In pracitce this makes opening the overflow menu for very alrge collections many many seconds faster.

Part of #694.
jkomoros added a commit that referenced this issue Aug 4, 2024
jkomoros added a commit that referenced this issue Aug 4, 2024
…et to 0.

This should allow card-view to reset to null when the collection changes so it doesn't show an ou tof date cloud.

Part of #694.
@jkomoros
Copy link
Owner Author

jkomoros commented Aug 31, 2024

Simply make the word clouds in the sidebar be closed by default

You can open them and it stays open, and every time you edit it closes by default

Alternate approach: maybe the pipeline can do what we do for the live word cloud while editing: do a fingerprint on top of a snapshot of the base.

Do a thing like cardsForIDF, and that uses a snapshotting mechanic similar to the other one i used, that only reupdates when a card update that changes more than 5% of cards lands.

Although why is it so slow? The IDF pipeline is heavily cached with word counts, so the work should be something like "fetch almost entirely pre-cached objects and then sum up all word counts and divide"...

The reasons it's slow right now are 1) generating the recomputed fingerprint, which includes processing all of the other cards, and 2) when you save a working notes card the card finisher recomputes the entire fingerprint generator in a blocking way to generate the title.

Also, the "similar cards" pipeline requires computing fingerprints for every card to compare against, and presumably might be hit before the embedding-based similarity shows up

jkomoros added a commit that referenced this issue Sep 1, 2024
This slows down loading by a lot and creating a card by a little.

The next step is to memoize the wordcount creation so it's much faster.

Part of #694.
jkomoros added a commit that referenced this issue Sep 1, 2024
Memoization checks that the card count hasn't changed by 10% or more since the last time we ran.

This leaves a "good enough" idfMap in most cases.

Part of #694.
jkomoros added a commit that referenced this issue Sep 1, 2024
… generator.

This will make the cardObjFingerprinting more amenable to memoization.

Part of #694.
jkomoros added a commit that referenced this issue Sep 1, 2024
This will allow easier memoization.

Left behind a method of the same name on fingerprint generator for convenience so various selector methods don't have to worry about all of the parameters to pass.

Part of #694.
jkomoros added a commit that referenced this issue Sep 1, 2024
This will allow use of memoizeFirstArg.

Part of #694.
jkomoros added a commit that referenced this issue Sep 1, 2024
This SHOULD lead to much greater speedups for creating a card or committing a card, because the idfmap is so deeply cached.

Part of #694.
jkomoros added a commit that referenced this issue Sep 1, 2024
jkomoros added a commit that referenced this issue Sep 1, 2024
idfMapForCards is called by a 0 count and then another. Moved the early bail condition to in the memoizer, so it doesn't have to run it multiple times.

Before this, it was ~40 seconds to save, now it's more like 9 seconds.

Part of #694.
@jkomoros
Copy link
Owner Author

jkomoros commented Sep 1, 2024

An implementation (which is currently SLOWER) is in fingerprint-performance.

Random musing notes to self:

A generateFingerprint is a memorize first arg that takes card, idfMap, fingerprint size and field list. It’s a thin wrapper around new fingerprint

Fingerprint generator stays mostly the same except it gets an idfmap from the cards and that is memoized to return the same thing unless cards added more than 5% of cards since last time.

Make sure the part in card finisher also uses the same caching pipeline (note that the fingerprint generator will have a different field list which will throw out the other fingerprints and require recalculation (since we currently calc every one when fingerprint generator is created) even though only one will actually be necessary and different (and the idfmap will be the same)

@jkomoros
Copy link
Owner Author

jkomoros commented Oct 6, 2024

The reason the performance is slower on the branch to load (46 seconds vs 28) is that wordCountsForSemantics is recalculated for every card. Why?

jkomoros added a commit that referenced this issue Oct 6, 2024
Before this change, the branch took 42s vs 28s on master to load.

The problem was that wordCountsForSemantics was called twice instaed of once. The reason was that sometimes it was called with an _explicit_ undefined at the end of the arguments, and sometimes with a missing undefiend.

Changing arrayEqual to ignore trailing undefined when comparing for equality fixes this cache miss.

This now makes the branch also around 28s.

Part of #694.
jkomoros added a commit that referenced this issue Oct 6, 2024
This now snapshots the IDFMap and updates it less often. It doesn't slow down loading, but it does improve the time to save a card from ~11 seconds to ~5 seconds.

Before, the IDFMap would be update every time any card changed. This meant that when a card was saved, the IDFMap was updated (a possibly expensive operation) which then invalidated all other fingerprints. But the IDFMap, for large webs, rarely changes much.

Now we only update the IDFMap if the set of changed card is greater than 10% of the size of cards. This means that after saving a single card, we use the pre-existing IDFMap.

This makes the common case of saving cards significantly more snappy.

Part of #694.

Merge branch 'fingerprint-performance'

* fingerprint-performance:
  Bring the load performance in line with what is was previously.
  Fix a performance issue when saving a card.
  Change the order of cardTFIDF and fingerprintForTFIDF to allow memoizeFirstArg.
  Wrap fingerprintForCardObj in mwmoizeFirstArg.
  Change it so cardObj is first argument for fingerprintForCardObj.
  Pop out fingerprintForCardObj.
  Factor out fingerprintForTFIDF and cardTFIDF to not be on fingerprint generator.
  Get rid of an unnecessary generator argument to Fingerprint constructor.
  Put idfMap calculation behind a memoization layer.
  Refactor it so the idfmap is calculated outside fingerprint genrator.
@jkomoros
Copy link
Owner Author

jkomoros commented Nov 2, 2024

A bug (I think) in the new pipeline: for adding cards that use a new word that's not in the corpus yet, the term should be hyper important in the fingerprint, but instead is dropped becuase it doesn't exist anywhere in the corpus. The default for terms not in the tfidf should be as close to "very rare" as possible, currently I think it's the opposite (treated as very common)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant