-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance when committing or creating a new card is slow on very large card webs #694
Comments
In the next few commits we'll want to render additional controls next to it. It's also weird that it was card-drawer's responsibiliy to render it. Part of #694.
This means that unless the user hits the 'Regeneate Word Cloud' then it's not updated. In pracitce this makes opening the overflow menu for very alrge collections many many seconds faster. Part of #694.
…et to 0. This should allow card-view to reset to null when the collection changes so it doesn't show an ou tof date cloud. Part of #694.
Simply make the word clouds in the sidebar be closed by default You can open them and it stays open, and every time you edit it closes by default Alternate approach: maybe the pipeline can do what we do for the live word cloud while editing: do a fingerprint on top of a snapshot of the base. Do a thing like cardsForIDF, and that uses a snapshotting mechanic similar to the other one i used, that only reupdates when a card update that changes more than 5% of cards lands. Although why is it so slow? The IDF pipeline is heavily cached with word counts, so the work should be something like "fetch almost entirely pre-cached objects and then sum up all word counts and divide"... The reasons it's slow right now are 1) generating the recomputed fingerprint, which includes processing all of the other cards, and 2) when you save a working notes card the card finisher recomputes the entire fingerprint generator in a blocking way to generate the title. Also, the "similar cards" pipeline requires computing fingerprints for every card to compare against, and presumably might be hit before the embedding-based similarity shows up |
This slows down loading by a lot and creating a card by a little. The next step is to memoize the wordcount creation so it's much faster. Part of #694.
Memoization checks that the card count hasn't changed by 10% or more since the last time we ran. This leaves a "good enough" idfMap in most cases. Part of #694.
… generator. This will make the cardObjFingerprinting more amenable to memoization. Part of #694.
This will allow easier memoization. Left behind a method of the same name on fingerprint generator for convenience so various selector methods don't have to worry about all of the parameters to pass. Part of #694.
This will allow use of memoizeFirstArg. Part of #694.
This SHOULD lead to much greater speedups for creating a card or committing a card, because the idfmap is so deeply cached. Part of #694.
idfMapForCards is called by a 0 count and then another. Moved the early bail condition to in the memoizer, so it doesn't have to run it multiple times. Before this, it was ~40 seconds to save, now it's more like 9 seconds. Part of #694.
An implementation (which is currently SLOWER) is in fingerprint-performance. Random musing notes to self: A generateFingerprint is a memorize first arg that takes card, idfMap, fingerprint size and field list. It’s a thin wrapper around new fingerprint Fingerprint generator stays mostly the same except it gets an idfmap from the cards and that is memoized to return the same thing unless cards added more than 5% of cards since last time. Make sure the part in card finisher also uses the same caching pipeline (note that the fingerprint generator will have a different field list which will throw out the other fingerprints and require recalculation (since we currently calc every one when fingerprint generator is created) even though only one will actually be necessary and different (and the idfmap will be the same) |
The reason the performance is slower on the branch to load (46 seconds vs 28) is that wordCountsForSemantics is recalculated for every card. Why? |
Before this change, the branch took 42s vs 28s on master to load. The problem was that wordCountsForSemantics was called twice instaed of once. The reason was that sometimes it was called with an _explicit_ undefined at the end of the arguments, and sometimes with a missing undefiend. Changing arrayEqual to ignore trailing undefined when comparing for equality fixes this cache miss. This now makes the branch also around 28s. Part of #694.
This now snapshots the IDFMap and updates it less often. It doesn't slow down loading, but it does improve the time to save a card from ~11 seconds to ~5 seconds. Before, the IDFMap would be update every time any card changed. This meant that when a card was saved, the IDFMap was updated (a possibly expensive operation) which then invalidated all other fingerprints. But the IDFMap, for large webs, rarely changes much. Now we only update the IDFMap if the set of changed card is greater than 10% of the size of cards. This means that after saving a single card, we use the pre-existing IDFMap. This makes the common case of saving cards significantly more snappy. Part of #694. Merge branch 'fingerprint-performance' * fingerprint-performance: Bring the load performance in line with what is was previously. Fix a performance issue when saving a card. Change the order of cardTFIDF and fingerprintForTFIDF to allow memoizeFirstArg. Wrap fingerprintForCardObj in mwmoizeFirstArg. Change it so cardObj is first argument for fingerprintForCardObj. Pop out fingerprintForCardObj. Factor out fingerprintForTFIDF and cardTFIDF to not be on fingerprint generator. Get rid of an unnecessary generator argument to Fingerprint constructor. Put idfMap calculation behind a memoization layer. Refactor it so the idfmap is calculated outside fingerprint genrator.
A bug (I think) in the new pipeline: for adding cards that use a new word that's not in the corpus yet, the term should be hyper important in the fingerprint, but instead is dropped becuase it doesn't exist anywhere in the corpus. The default for terms not in the tfidf should be as close to "very rare" as possible, currently I think it's the opposite (treated as very common) |
It can take up a few seconds to create a new card, or to commit a card that you've just edited.
Looking at a performance trace it looks like the time is dominated in fingerprintForCard. Every time the set of cards changes we throw out and recompute fingerprints for every card. We should memoize it so the work is proportional to the number of changed cards, which should be very small.
In the constructor of FingerprintGenerator, pop out a separate fingerprintForCard, which is memoized, and is passed a card to use (and make sure that it fetches the wordCounts for cards via the memoized wordcountFor Cards)
Part of the reason this is slow is because technically every time a card changes, it changes the tfidf of all of the cards. However, past, say, 1000 cards, it's fine to assume the baseline idfMap is fixed and don't recalculate it every time a card changes (or at least, be OK with vending stale fingerprints that use a technically-out-of-data idfMap to set the priors for each word)
The text was updated successfully, but these errors were encountered: