@tus/server: add GCS locker #616
Thanks for putting in the time to contribute!
I'm not an expert on (distributed) locking, but conceptually I think GCS storage as a locker only makes sense if you're already deploying your server within GCS infrastructure (so it's faster) and you have a bucket in the region where the uploads happen. My assumption is if those conditions aren't met, things will be slow? AFAIK GCS has strong consistency within the same region but eventual consistency for multi-region.
Maybe you can elaborate on your use case?
Indeed, I hadn't even thought about using this locker with a store other than GCS. In my case, the storage bucket and the locker bucket are the same, and I think the only case where they should be separated is when the storage bucket is not in the standard storage class. Anyway, I'm not sure that, e.g., Firestore would greatly outperform GCS when a different storage is used. Regarding region latency, the user should be aware of that and choose a suitable region. Of course a Redis-based implementation would be much better, but this may be a viable alternative until that is implemented. Shall I move this locker to the gcs-store package to suggest the primary application?
This is interesting because such approaches would allow tus server to implement lockers directly on top of cloud storages instead of using external tools like Redis. However, I would like to see some evidence that this approach actually provides exclusive access to uploads. Is there some blog post that looked into the mechanisms at play here? Are all involved operations strongly consistent?
GCS is strongly consistent, but indeed exclusive access under concurrency was not ensured in my previous approach. I have reworked the code based on this article. Note that I had to upgrade @google-cloud/storage because the previous version was missing a type export. Also, this feature should be moved to a separate package or into gcs-store, as I'm importing from @google-cloud/storage.
Really nice article, thanks for sharing. It does also say this:
But here we are using it for individual uploads, not batches. Or even smaller with a resumed uploads (or where a client sets |
For the last 10 days it has been running in production without problems. We have about 5000 uploads per day. In e2e tests it was indeed slightly slower for 140 files compared to xhr, but I could easily compensate for this by increasing the number of parallel uploads. If I measure individual uploads, the time elapsed between lock and unlock is mostly 20-400 ms with the memory locker, and 300-400 ms with the GCS locker.
That's great to hear! I'm in favor of adding this into the package then.
Overall this looks very good! Also happy with the extensive code comments.
Some things needed:
- The build is currently failing
- We need to update the peerDependencies to not allow any version of @google-cloud/storage.
- Docs. We should also talk about when to (not) use this lock and the things to watch out for, such as what values to set for the ttl and watch interval.
- A test similar to this:
tus-node-server/test/e2e.test.ts
Lines 1045 to 1145 in a0f9da1
describe('File Store with Locking', () => {
  before(() => {
    server = new Server({
      path: STORE_PATH,
      datastore: new FileStore({directory: `./${STORE_PATH}`}),
      locker: new MemoryLocker(),
    })
    listener = server.listen()
    agent = request.agent(listener)
  })

  after((done) => {
    // Remove the files directory
    rimraf(FILES_DIRECTORY, (err) => {
      if (err) {
        return done(err)
      }

      // Clear the config
      // @ts-expect-error we can consider a generic to pass to
      // datastore to narrow down the store type
      const uploads = (server.datastore.configstore as Configstore).list?.() ?? []
      for (const upload in uploads) {
        // @ts-expect-error we can consider a generic to pass to
        // datastore to narrow down the store type
        await(server.datastore.configstore as Configstore).delete(upload)
      }
      listener.close()
      return done()
    })
  })

  it('will allow another request to acquire the lock by cancelling the previous request', async () => {
    const res = await agent
      .post(STORE_PATH)
      .set('Tus-Resumable', TUS_RESUMABLE)
      .set('Upload-Length', TEST_FILE_SIZE)
      .set('Upload-Metadata', TEST_METADATA)
      .set('Tus-Resumable', TUS_RESUMABLE)
      .expect(201)

    assert.equal('location' in res.headers, true)
    assert.equal(res.headers['tus-resumable'], TUS_RESUMABLE)
    // Save the id for subsequent tests
    const file_id = res.headers.location.split('/').pop()
    const file_size = parseInt(TEST_FILE_SIZE, 10)

    // Slow down writing
    const originalWrite = server.datastore.write.bind(server.datastore)
    sinon.stub(server.datastore, 'write').callsFake((stream, ...args) => {
      const throttleStream = new Throttle({bps: file_size / 4})
      return originalWrite(stream.pipe(throttleStream), ...args)
    })

    const data = Buffer.alloc(parseInt(TEST_FILE_SIZE, 10), 'a')
    const httpAgent = new Agent({
      maxSockets: 2,
      maxFreeSockets: 10,
      timeout: 10000,
      keepAlive: true,
    })

    const createPatchReq = (offset: number) => {
      return agent
        .patch(`${STORE_PATH}/${file_id}`)
        .agent(httpAgent)
        .set('Tus-Resumable', TUS_RESUMABLE)
        .set('Upload-Offset', offset.toString())
        .set('Content-Type', 'application/offset+octet-stream')
        .send(data.subarray(offset))
    }

    const req1 = createPatchReq(0).then((e) => e)
    await wait(100)

    const req2 = agent
      .head(`${STORE_PATH}/${file_id}`)
      .agent(httpAgent)
      .set('Tus-Resumable', TUS_RESUMABLE)
      .expect(200)
      .then((e) => e)

    const [res1, res2] = await Promise.allSettled([req1, req2])
    assert.equal(res1.status, 'fulfilled')
    assert.equal(res2.status, 'fulfilled')
    assert.equal(res1.value.statusCode, 400)
    assert.equal(res1.value.headers['upload-offset'] !== TEST_FILE_SIZE, true)
    assert.equal(res2.value.statusCode, 200)

    // Verify that we are able to resume even if the first request
    // was cancelled by the second request trying to acquire the lock
    const offset = parseInt(res2.value.headers['upload-offset'], 10)
    const finishedUpload = await createPatchReq(offset)

    assert.equal(finishedUpload.statusCode, 204)
    assert.equal(finishedUpload.headers['upload-offset'], TEST_FILE_SIZE)
  }).timeout(20000)
})
})
If you need help with any of these let me know.
Thank you for the article, I will have a look at it! I am wondering if S3 has similar capabilities and whether a locker can nowadays be implemented on top of it as well.
@netdown still interested in getting this over the finish line?
Yes, but I've been busy the last few weeks and I expect the same at least until July. Feel free to complete the PR if you have the time.
Apologies for my delayed review! I just read the accompanying blog post and wanted to leave some comments about it first. Some additional background information can be found at gcslock, which was a previous GCS-based lock. The only valuable comment online I was able to find about the proposed algorithm is on Lobsters by Aphyr, who is quite experienced in testing distributed systems and databases. However, his comment was more about a general issue with distributed locks and not about this GCS-based approach in particular. The same critique can also be applied to Redis-based locks and there is not much we can do on our end as far as I know.
The proposed algorithm on its own seems sound to me (although I am no expert). It relies on the storage offering strong consistency, which is the case with GCS. While there are many S3-compatible storages, I am not aware of any GCS-compatible storages. So we don't have to worry much about storages with a GCS-like interface that are not strongly consistent.
In addition, the proposed algorithm also provides "instant recovery from stale locks" if the lock was left stale by the same actor that now tries to acquire it. This functionality attaches an identity to each lock, which is dangerous for tus-node-server as we do not want two requests that are processed by the same tus-node-server instance to interfere with the same lock. This PR does not implement this feature, but this difference from the blog post should still be noted in the code somewhere.
The author also acknowledges that this algorithm does not offer low latency:
A locking operation's average speed is in the order of hundreds of milliseconds.
This is probably fine for large file uploads, which are I/O-bound, but still worth documenting somewhere.
Finally, while reading the article, I hoped that a similar approach might be possible for S3, but this does not seem possible at first glance as it does not offer conditional writes like GCS does.
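To make the role of conditional writes concrete, here is a toy in-memory model of the primitive the lock depends on. It is only a sketch: `FakeBucket`, `tryAcquire`, and `release` are hypothetical names, the real GCS calls are asynchronous network requests, and the assumption is that the lock object is created with a "generation must be 0" precondition so that only one writer can succeed.

```typescript
// Toy stand-in for a bucket that enforces generation preconditions.
// All names here are hypothetical; this is not the @google-cloud/storage API.
interface StoredObject {
  generation: number
  body: string
}

class FakeBucket {
  private objects = new Map<string, StoredObject>()
  private nextGeneration = 1

  // Mimics a write with an ifGenerationMatch precondition.
  // A value of 0 means "only succeed if the object does not exist yet".
  write(name: string, body: string, ifGenerationMatch: number): number {
    const current = this.objects.get(name)
    const currentGeneration = current ? current.generation : 0
    if (currentGeneration !== ifGenerationMatch) {
      // GCS reports a failed precondition as HTTP 412.
      throw Object.assign(new Error('Precondition Failed'), {code: 412})
    }
    const generation = this.nextGeneration++
    this.objects.set(name, {generation, body})
    return generation
  }

  // Deletes only if the stored generation is the one the caller expects.
  delete(name: string, ifGenerationMatch: number): void {
    const current = this.objects.get(name)
    if (!current || current.generation !== ifGenerationMatch) {
      throw Object.assign(new Error('Precondition Failed'), {code: 412})
    }
    this.objects.delete(name)
  }
}

// Returns the lock object's generation on success, or null if the lock is held.
function tryAcquire(bucket: FakeBucket, uploadId: string, expiresAt: number): number | null {
  try {
    return bucket.write(`${uploadId}.lock`, JSON.stringify({exp: expiresAt}), 0)
  } catch (err) {
    if ((err as {code?: number}).code === 412) return null
    throw err
  }
}

// Releases the lock, but only if it is still the one we created.
function release(bucket: FakeBucket, uploadId: string, generation: number): void {
  bucket.delete(`${uploadId}.lock`, generation)
}

const demoBucket = new FakeBucket()
const firstHolder = tryAcquire(demoBucket, 'upload-1', Date.now() + 30000)
const secondHolder = tryAcquire(demoBucket, 'upload-1', Date.now() + 30000)
console.log(firstHolder !== null, secondHolder === null) // prints: true true
```

Without such a precondition two writers could both believe they created the lock object, which is why a store lacking conditional writes cannot support this scheme.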
//On the first attempt, retry after current I/O operations are done, else use an exponential backoff
const waitFn = (then: () => void) =>
  attempt > 0
    ? setTimeout(then, (attempt * this.locker.watchInterval) / 3)
Would be nice if it also added random jitter.
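A minimal sketch of what jitter could look like, using the common "full jitter" variant. `backoffDelay` and `waitWithJitter` are hypothetical helpers, not code from this PR, and the base and cap values are placeholders that would presumably be derived from the watch interval.

```typescript
// Hypothetical helper: how long to wait before retry number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(
  attempt: number,
  baseMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  // Exponential growth, capped so that waits stay bounded.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt)
  // Full jitter: pick a uniformly random delay in [0, ceiling).
  return Math.floor(random() * ceiling)
}

// Hypothetical replacement for the wait function above; 100 and 5000 are placeholder values.
function waitWithJitter(attempt: number, then: () => void): void {
  setTimeout(then, backoffDelay(attempt, 100, 5000))
}
```

Randomizing the delay keeps several waiters that failed at the same moment from retrying in lockstep against the same lock object.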
* main: (59 commits) Replace demo folder with StackBlitz (tus#704) @tus/gcs-store: correctly pass content type (tus#702) @tus/s3-store: fix zero byte files (tus#700) Update package-lock.json [ci] release (tus#696) fix: handling consistent cancellation across stream and locks (tus#699) @tus/s3-store: Change private modifier into protected (tus#698) Create funding-manifest-urls Bump @aws-sdk/client-s3 from 3.703.0 to 3.717.0 (tus#695) Bump mocha from 10.4.0 to 11.0.1 (tus#693) Bump @biomejs/biome from 1.9.2 to 1.9.4 (tus#694) [ci] release (tus#690) Bump @aws-sdk/client-s3 from 3.701.0 to 3.703.0 (tus#685) @tus/s3-store: fix part number increment (tus#689) Revert "Bump rimraf from 3.0.2 to 6.0.1 (tus#681)" Bump @aws-sdk/client-s3 from 3.682.0 to 3.701.0 (tus#683) Bump @changesets/cli from 2.27.9 to 2.27.10 (tus#682) Bump rimraf from 3.0.2 to 6.0.1 (tus#681) Bump @types/node from 20.11.5 to 22.10.1 (tus#679) Ignore JSON for Biome formatting ...
Update:
- this.currentMetaGeneration = 0
+ this.currentMetaGeneration = (await this.getMeta()).metageneration
Note that in tus-node-server/packages/gcs-store/src/locker/GCSLocker.ts Lines 125 to 134 in b5e0bfb
but inside the tus-node-server/packages/gcs-store/src/locker/GCSLock.ts Lines 40 to 52 in b5e0bfb
From my understanding this wasn't needed and just caused repetitive calls.
Only a few minor changes, otherwise LGTM 👍
 * Check if the lock is healthy, delete if not.
 * Returns TRUE if the lock is healthy.
 */
protected async insureHealth() {
- protected async insureHealth() {
+ protected async ensureHealth() {
Could this be a naming mistake? "ensure" seems more appropriate than "insure".
public async create(exp: number) {
  const metadata = {
    metadata: {exp},
    // TODO: this does nothing?
What about this TODO comment?
if (!isHealthy) {
  log('lock not healthy. calling GCSLock.take() again')
  return await this.take(cancelHandler)
What happens when the lock is unhealthy and cannot be taken again? Is access to the upload resources on GCP then taken away? Since the locker cannot ensure exclusive access, saving uploaded data to GCS should be stopped.
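One possible shape for this, sketched with an `AbortController`: the component that notices the lock is lost aborts, and the writer checks the signal before persisting each chunk. `writeWhileLocked` and `writeChunk` are hypothetical stand-ins, not the PR's code, and a real implementation would be asynchronous and stream-based.

```typescript
// Hypothetical sketch: `writeChunk` stands in for whatever persists data to the store.
// Returns the number of chunks that were written before the lock was lost.
function writeWhileLocked(
  chunks: string[],
  signal: AbortSignal,
  writeChunk: (chunk: string) => void
): number {
  let count = 0
  for (const chunk of chunks) {
    // Stop before the next write once the lock is known to be lost.
    if (signal.aborted) break
    writeChunk(chunk)
    count++
  }
  return count
}

const lockLost = new AbortController()
const stored: string[] = []
const writtenCount = writeWhileLocked(['a', 'b', 'c'], lockLost.signal, (chunk) => {
  stored.push(chunk)
  // Simulate the health check failing right after the second chunk.
  if (stored.length === 2) lockLost.abort()
})
console.log(writtenCount, stored.length) // prints: 2 2
```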
  await this.deleteReleaseRequest()
  await this.lockFile.delete({ifGenerationMatch: this.currentMetaGeneration})
} catch (err) {
  //Probably already deleted, no need to report
Only errors about the object not existing should be ignored. All other errors should be thrown.
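A small sketch of how that could be expressed. It assumes the client library's errors expose the HTTP status as a numeric `code` property; `isNotFoundError` and `deleteIgnoringNotFound` are hypothetical helpers.

```typescript
// Assumption: errors thrown by the storage client carry the HTTP status in `code`.
function isNotFoundError(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as {code?: unknown}).code === 404
}

// Runs a delete operation and swallows only "object does not exist" errors.
// Every other error is rethrown to the caller.
async function deleteIgnoringNotFound(remove: () => Promise<unknown>): Promise<void> {
  try {
    await remove()
  } catch (err) {
    if (!isNotFoundError(err)) throw err
  }
}
```

It could then wrap the existing calls, for example `await deleteIgnoringNotFound(() => this.releaseFile.delete())`. A failed `ifGenerationMatch` precondition is reported as HTTP 412 rather than 404, so whether that case should also be tolerated is a separate decision.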
protected async deleteReleaseRequest() {
  try {
    await this.releaseFile.delete()
  } catch (err) {}
Only errors about the object not existing should be ignored. All other errors should be thrown.
This PR is not complete yet: it is missing unit tests (the code is tested), README updates, and a changeset. Despite all that, I would like to ask you to review my approach first so that I don't write needless tests. I have documented the process in detail, but feel free to ask questions.