Performance when committing or creating a new card is slow on very large card webs #694

jkomoros · 2024-08-03T21:44:38Z

It can take up a few seconds to create a new card, or to commit a card that you've just edited.

Looking at a performance trace it looks like the time is dominated in fingerprintForCard. Every time the set of cards changes we throw out and recompute fingerprints for every card. We should memoize it so the work is proportional to the number of changed cards, which should be very small.

In the constructor of FingerprintGenerator, pop out a separate fingerprintForCard, which is memoized, and is passed a card to use (and make sure that it fetches the wordCounts for cards via the memoized wordcountFor Cards)

Part of the reason this is slow is because technically every time a card changes, it changes the tfidf of all of the cards. However, past, say, 1000 cards, it's fine to assume the baseline idfMap is fixed and don't recalculate it every time a card changes (or at least, be OK with vending stale fingerprints that use a technically-out-of-data idfMap to set the priors for each word)

Lazy compute card fingerprints (test to make sure this does make things faster)
The baseline wordcounts for the corpus should be behind a memoization that only updates it if the number of cards increases by 10% or more. This will deliberately use stale wordcounts if cards change (since the count won't change) or when a single card is created, but still update the tfidf when a large batch of cards is downloaded
Card fingerprint should be memoized based on the card obj and wordcount ob
Make selectActiveCollectionWordCloud much faster by cheating, for example only updating the first time the collection changes, but not caring if the cards in the collection change) (or make it so that wordcloud doesn't show up when the overflow menu is expanded, which makes any edit slow)

In the next few commits we'll want to render additional controls next to it. It's also weird that it was card-drawer's responsibiliy to render it. Part of #694.

Part of #694.

This means that unless the user hits the 'Regeneate Word Cloud' then it's not updated. In pracitce this makes opening the overflow menu for very alrge collections many many seconds faster. Part of #694.

Part of #694.

…et to 0. This should allow card-view to reset to null when the collection changes so it doesn't show an ou tof date cloud. Part of #694.

Part of #694.

jkomoros · 2024-08-31T15:13:52Z

Simply make the word clouds in the sidebar be closed by default

You can open them and it stays open, and every time you edit it closes by default

Alternate approach: maybe the pipeline can do what we do for the live word cloud while editing: do a fingerprint on top of a snapshot of the base.

Do a thing like cardsForIDF, and that uses a snapshotting mechanic similar to the other one i used, that only reupdates when a card update that changes more than 5% of cards lands.

Although why is it so slow? The IDF pipeline is heavily cached with word counts, so the work should be something like "fetch almost entirely pre-cached objects and then sum up all word counts and divide"...

The reasons it's slow right now are 1) generating the recomputed fingerprint, which includes processing all of the other cards, and 2) when you save a working notes card the card finisher recomputes the entire fingerprint generator in a blocking way to generate the title.

Also, the "similar cards" pipeline requires computing fingerprints for every card to compare against, and presumably might be hit before the embedding-based similarity shows up

This slows down loading by a lot and creating a card by a little. The next step is to memoize the wordcount creation so it's much faster. Part of #694.

Memoization checks that the card count hasn't changed by 10% or more since the last time we ran. This leaves a "good enough" idfMap in most cases. Part of #694.

Part of #694.

… generator. This will make the cardObjFingerprinting more amenable to memoization. Part of #694.

This will allow easier memoization. Left behind a method of the same name on fingerprint generator for convenience so various selector methods don't have to worry about all of the parameters to pass. Part of #694.

This will allow use of memoizeFirstArg. Part of #694.

This SHOULD lead to much greater speedups for creating a card or committing a card, because the idfmap is so deeply cached. Part of #694.

…eFirstArg. Part of #694.

idfMapForCards is called by a 0 count and then another. Moved the early bail condition to in the memoizer, so it doesn't have to run it multiple times. Before this, it was ~40 seconds to save, now it's more like 9 seconds. Part of #694.

jkomoros · 2024-09-01T00:21:13Z

An implementation (which is currently SLOWER) is in fingerprint-performance.

Random musing notes to self:

A generateFingerprint is a memorize first arg that takes card, idfMap, fingerprint size and field list. It’s a thin wrapper around new fingerprint

Fingerprint generator stays mostly the same except it gets an idfmap from the cards and that is memoized to return the same thing unless cards added more than 5% of cards since last time.

Make sure the part in card finisher also uses the same caching pipeline (note that the fingerprint generator will have a different field list which will throw out the other fingerprints and require recalculation (since we currently calc every one when fingerprint generator is created) even though only one will actually be necessary and different (and the idfmap will be the same)

jkomoros · 2024-10-06T22:52:45Z

The reason the performance is slower on the branch to load (46 seconds vs 28) is that wordCountsForSemantics is recalculated for every card. Why?

Before this change, the branch took 42s vs 28s on master to load. The problem was that wordCountsForSemantics was called twice instaed of once. The reason was that sometimes it was called with an _explicit_ undefined at the end of the arguments, and sometimes with a missing undefiend. Changing arrayEqual to ignore trailing undefined when comparing for equality fixes this cache miss. This now makes the branch also around 28s. Part of #694.

This now snapshots the IDFMap and updates it less often. It doesn't slow down loading, but it does improve the time to save a card from ~11 seconds to ~5 seconds. Before, the IDFMap would be update every time any card changed. This meant that when a card was saved, the IDFMap was updated (a possibly expensive operation) which then invalidated all other fingerprints. But the IDFMap, for large webs, rarely changes much. Now we only update the IDFMap if the set of changed card is greater than 10% of the size of cards. This means that after saving a single card, we use the pre-existing IDFMap. This makes the common case of saving cards significantly more snappy. Part of #694. Merge branch 'fingerprint-performance' * fingerprint-performance: Bring the load performance in line with what is was previously. Fix a performance issue when saving a card. Change the order of cardTFIDF and fingerprintForTFIDF to allow memoizeFirstArg. Wrap fingerprintForCardObj in mwmoizeFirstArg. Change it so cardObj is first argument for fingerprintForCardObj. Pop out fingerprintForCardObj. Factor out fingerprintForTFIDF and cardTFIDF to not be on fingerprint generator. Get rid of an unnecessary generator argument to Fingerprint constructor. Put idfMap calculation behind a memoization layer. Refactor it so the idfmap is calculated outside fingerprint genrator.

jkomoros · 2024-11-02T14:52:53Z

A bug (I think) in the new pipeline: for adding cards that use a new word that's not in the corpus yet, the term should be hyper important in the fingerprint, but instead is dropped becuase it doesn't exist anywhere in the corpus. The default for terms not in the tfidf should be as close to "very rare" as possible, currently I think it's the opposite (treated as very common)

jkomoros added a commit that referenced this issue Aug 4, 2024

Move word-cloud out of card-drawer.

8e67a39

In the next few commits we'll want to render additional controls next to it. It's also weird that it was card-drawer's responsibiliy to render it. Part of #694.

jkomoros added a commit that referenced this issue Aug 4, 2024

Create a collection.collectionWordcloudVersion machinery and action.

3babb0a

Part of #694.

jkomoros added a commit that referenced this issue Aug 4, 2024

Add a future TODO around word cloud generation.

28316b1

Part of #694.

jkomoros added a commit that referenced this issue Aug 4, 2024

When the collection is updated, the collectionWordCloudVersion is res…

fd53f29

…et to 0. This should allow card-view to reset to null when the collection changes so it doesn't show an ou tof date cloud. Part of #694.

jkomoros added a commit that referenced this issue Aug 4, 2024

card-view doesn't show a stale word-cloud for an old collection.

5fee5f7

Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Refactor it so the idfmap is calculated outside fingerprint genrator.

3c6df55

This slows down loading by a lot and creating a card by a little. The next step is to memoize the wordcount creation so it's much faster. Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Put idfMap calculation behind a memoization layer.

4312d72

Memoization checks that the card count hasn't changed by 10% or more since the last time we ran. This leaves a "good enough" idfMap in most cases. Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Get rid of an unnecessary generator argument to Fingerprint constructor.

8eb8035

Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Factor out fingerprintForTFIDF and cardTFIDF to not be on fingerprint…

8a5942e

… generator. This will make the cardObjFingerprinting more amenable to memoization. Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Change it so cardObj is first argument for fingerprintForCardObj.

853c7a3

This will allow use of memoizeFirstArg. Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Wrap fingerprintForCardObj in mwmoizeFirstArg.

5c138d3

This SHOULD lead to much greater speedups for creating a card or committing a card, because the idfmap is so deeply cached. Part of #694.

jkomoros added a commit that referenced this issue Sep 1, 2024

Change the order of cardTFIDF and fingerprintForTFIDF to allow memoiz…

88a8538

…eFirstArg. Part of #694.

jkomoros mentioned this issue Feb 15, 2025

Editing a bunch of cards can sometimes be very very slow #701

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance when committing or creating a new card is slow on very large card webs #694

Performance when committing or creating a new card is slow on very large card webs #694

jkomoros commented Aug 3, 2024 •

edited

Loading

jkomoros commented Aug 31, 2024 •

edited

Loading

jkomoros commented Sep 1, 2024

jkomoros commented Oct 6, 2024

jkomoros commented Nov 2, 2024

Performance when committing or creating a new card is slow on very large card webs #694

Performance when committing or creating a new card is slow on very large card webs #694

Comments

jkomoros commented Aug 3, 2024 • edited Loading

jkomoros commented Aug 31, 2024 • edited Loading

jkomoros commented Sep 1, 2024

jkomoros commented Oct 6, 2024

jkomoros commented Nov 2, 2024

jkomoros commented Aug 3, 2024 •

edited

Loading

jkomoros commented Aug 31, 2024 •

edited

Loading