
Potential for significant perf improvements in large repos #53

Open
JacksonKearl opened this issue Jul 21, 2020 · 9 comments

Comments

JacksonKearl commented Jul 21, 2020

First, thanks for your work on this project 🙂

The current implementation is fairly slow with large repos, for instance vscode, which has around 5,000 files, or typescript, which has around 50,000. It takes about a minute to clone vscode with --singleBranch and --depth 1, and it doesn't manage to clone typescript in the ~15 minutes I waited.
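
For reference, the kind of clone I was benchmarking looks roughly like the following (a minimal sketch; the target directory and CORS proxy are illustrative, not part of the measurements):

import LightningFS from '@isomorphic-git/lightning-fs'
import git from 'isomorphic-git'
import http from 'isomorphic-git/http/web'

const fs = new LightningFS('fs')

// Shallow, single-branch clone into the in-browser filesystem.
await git.clone({
  fs,
  http,
  dir: '/vscode',                                // illustrative target directory
  corsProxy: 'https://cors.isomorphic-git.org',  // any CORS proxy works here
  url: 'https://github.com/microsoft/vscode',
  singleBranch: true,
  depth: 1,
})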

By adding batching to the IndexedDB writes (putting all writes into a single transaction rather than one transaction per file) and changing the autoinc in the CacheFS to increment a counter rather than search for the highest inode (the search means writing N files takes O(N²) time), I am able to see vscode clone in ~20 seconds and typescript clone in about 2 minutes. This is approximately 3x slower than native for vscode and 6x slower than native for typescript.

Batching:

diff --git a/idb-keyval.ts b/idb-keyval.ts
index 45a0d97..94920ef 100644
--- a/idb-keyval.ts
+++ b/idb-keyval.ts
@@ -2,10 +2,12 @@ export class Store {
   private _dbp: Promise<IDBDatabase> | undefined;
   readonly _dbName: string;
   readonly _storeName: string;
+  readonly id: string
 
   constructor(dbName = 'keyval-store', readonly storeName = 'keyval') {
     this._dbName = dbName;
     this._storeName = storeName;
+    this.id = `dbName:${dbName};;storeName:${storeName}`
     this._init();
   }
 
@@ -44,6 +46,31 @@ export class Store {
   }
 }
 
+class Batcher<T> {
+  private ongoing: Promise<void> | undefined
+  private items: { item: T, onProcessed: () => void }[] = []
+
+  constructor(private executor: (items: T[]) => Promise<void>) { }
+
+  private async process() {
+    const toProcess = this.items;
+    this.items = [];
+    await this.executor(toProcess.map(({ item }) => item))
+    toProcess.map(({ onProcessed }) => onProcessed())
+    if (this.items.length) {
+      this.ongoing = this.process()
+    } else {
+      this.ongoing = undefined
+    }
+  }
+
+  async queue(item: T): Promise<void> {
+    const result = new Promise<void>((resolve) => this.items.push({ item, onProcessed: resolve }))
+    if (!this.ongoing) this.ongoing = this.process()
+    return result
+  }
+}
+
 let store: Store;
 
 function getDefaultStore() {
@@ -58,10 +85,17 @@ export function get<Type>(key: IDBValidKey, store = getDefaultStore()): Promise<
   }).then(() => req.result);
 }
 
+const setBatchers: Record<string, Batcher<{ key: IDBValidKey, value: any }>> = {}
 export function set(key: IDBValidKey, value: any, store = getDefaultStore()): Promise<void> {
-  return store._withIDBStore('readwrite', store => {
-    store.put(value, key);
-  });
+  if (!setBatchers[store.id]) {
+    setBatchers[store.id] = new Batcher((items) =>
+      store._withIDBStore('readwrite', store => {
+        for (const item of items) {
+          store.put(item.value, item.key)
+        }
+      }))
+  }
+  return setBatchers[store.id].queue({ key, value })
 }
 
 export function update(key: IDBValidKey, updater: (val: any) => any, store = getDefaultStore()): Promise<void> {

Counter:

diff --git a/src/CacheFS.js b/src/CacheFS.js
index ed26c57..0dc6950 100755
--- a/src/CacheFS.js
+++ b/src/CacheFS.js
@@ -5,6 +5,7 @@ const STAT = 0;
 
 module.exports = class CacheFS {
   constructor() {
+    this._maxInode = 0
   }
   _makeRoot(root = new Map()) {
     root.set(STAT, { mode: 0o777, type: "dir", size: 0, ino: 0, mtimeMs: Date.now() });
@@ -38,16 +39,7 @@ module.exports = class CacheFS {
     return count;
   }
   autoinc () {
-    let val = this._maxInode(this._root.get("/")) + 1;
-    return val;
-  }
-  _maxInode(map) {
-    let max = map.get(STAT).ino;
-    for (let [key, val] of map) {
-      if (key === STAT) continue;
-      max = Math.max(max, this._maxInode(val));
-    }
-    return max;
+    return ++this._maxInode;
   }
   print(root = this._root.get("/")) {
     let str = "";

Please let me know if you'd consider incorporating these changes... the batching should be safe, I'm not super sure about the autoinc, but I don't see a reason why it would cause issues (the main difference is deleting a file would free up its inode value in the original implementation but doesn't here, but that shouldn't be a problem AFAIK)

@JacksonKearl (Author)

cc @wmhilton

@ncortines

I'm in the same boat as @JacksonKearl. isomorphic-git is very fast and has great potential. However, I see how the fs layer quickly becomes a bottleneck as things scale up. These are the resulting times from executing git.clone against repositories of different sizes using isomorphic-git + lightning-fs, on Chrome and Firefox, as well as from the terminal console:

| Number of files | Chrome | Firefox | Terminal |
| --------------- | ------ | ------- | -------- |
| 1000            | 19s    | 5s      | 1s       |
| 10000           | 185s   | 102s    | 4s       |

I can see there is a separate effort, #28, to support native fs in the future, but that feature seems to still be in quite an early phase of adoption.

I can understand that the concept of batching write operations does kind of break the fs contract, as it exposes the fact that the implementation is backed by a database. However, I can see how this issue could easily become a deal breaker for many.

@wmhilton , I'd love to hear your opinion and suggestions.

Thanks,
Juan

@JacksonKearl (Author)

I can understand that the concept of batching write operations does kind of break the fs contract, as it exposes the fact that the implementation is backed by a database.

I don't see how this violates the FS contract? The contract only says that once I've called writeFile, I get a promise, and when that promise is resolved I can read the file. It doesn't require anything about the backend.

@ncortines

@JacksonKearl I apologise if I haven't fully understood your proposal, but I got the impression the idea was to wrap several file write operations in a single IDB transaction, rather than having a separate transaction for each file. In a pure fs implementation you wouldn't have to wait for a batch of file write operations to complete in order to start reading from (let's say) the first file in the batch. I don't know how relevant this is to the isomorphic-git implementation, but perhaps it is a reason why it is the way it is today.

@JacksonKearl (Author)

I don't know what you mean by "pure" here. Any asynchronous filesystem (which is pretty much all of them, besides solely in-memory ones) doesn't guarantee being able to read until the write promise resolves. That's kinda the whole point really. That property is preserved here.

@ncortines

I was referring to reading from a file already saved in the context of the current transaction, rather than the file currently being written. By wrapping all file write operations in a single transaction, the overall time required to write all files is significantly reduced, but the time required before you can read from the first file in the batch is significantly increased. It's a trade-off. Operations like clone would surely benefit from this. Not sure about others.

@billiegoose (Member)

@JacksonKearl That sounds really promising for speeding up git checkout! Have you got a branch I could look at?

Off the top of my head, I don't see how you could implement writeFile to do write batching universally, without adding a debounce that would slow down writeFile in the naive case (for (const file of files) { await writeFile(file, 'asdf') }).

@JacksonKearl (Author)

@wmhilton No, unfortunately we opted for a different route (downloading via archive endpoints) and are no longer using this project.

The code sample I included in the original post (and below) has an implementation of batching with no extra cost for single operations. Basically, you send off operations immediately if there are none in progress, or batch them together if some are in progress, and send them all off when the in-progress batch finishes. It's the kind of thing you'd think the DB would do for you, but I guess not yet.

It's interesting to note that in my testing, when firing a sequence of N single put requests, the first one won't resolve until the last one is fired, so there is still some sort of batching going on, just much less efficient than grouping all changes into a single transaction.

The Batcher class:

class Batcher<T> {
  private ongoing: Promise<void> | undefined
  private items: { item: T, onProcessed: () => void }[] = []

  constructor(private executor: (items: T[]) => Promise<void>) { }

  private async process() {
    const toProcess = this.items;
    this.items = [];
    await this.executor(toProcess.map(({ item }) => item))
    toProcess.map(({ onProcessed }) => onProcessed())
    if (this.items.length) {
      this.ongoing = this.process()
    } else {
      this.ongoing = undefined
    }
  }

  async queue(item: T): Promise<void> {
    const result = new Promise<void>((resolve) => this.items.push({ item, onProcessed: resolve }))
    if (!this.ongoing) this.ongoing = this.process()
    return result
  }
}

(this is roughly inspired by the Throttler class used extensively internally in vscode)
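
For illustration, here is a hypothetical usage of the class above, showing how concurrent writes collapse into a single executor call (the numbers and logging are made up for the example):

// Hypothetical executor that just reports how many items it received.
const batcher = new Batcher<number>(async (items) => {
  console.log(`flushing ${items.length} item(s)`)
})

// The first queue() call starts a flush immediately with just [1];
// 2 and 3 arrive while that flush is in progress, so they are grouped
// together and flushed as [2, 3] once the first flush completes.
await Promise.all([batcher.queue(1), batcher.queue(2), batcher.queue(3)])
// logs: "flushing 1 item(s)" then "flushing 2 item(s)"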

@ncortines

Another thing to watch out for is memory footprint, especially when dealing with fast-emitting sources.

I think the File System Access API adoption (#28), plus support for writable streams, would bring an important leap in performance.

I would personally put my efforts on that front :)
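
For context, writing a file through the File System Access API's origin-private file system with a writable stream looks roughly like this (a minimal sketch assuming a browser that supports the API; the file name and contents are placeholders):

// Origin-private file system root (File System Access API).
const root = await navigator.storage.getDirectory()
const handle = await root.getFileHandle('hello.txt', { create: true })

// createWritable() returns a FileSystemWritableFileStream, so large
// contents can be streamed rather than buffered fully in memory.
const writable = await handle.createWritable()
await writable.write('hello world')
await writable.close()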
