Data deduplication / CAS storage #163
Comments
Thanks for your question. No, it doesn't implement de-dup/CAS. This package is meant to be a (relatively) simple package to expose the MongoDB gridFS implementation to Meteor, and gridFS does not implement these features. The use of MD5 is specified by the gridFS specification and implemented in the MongoDB drivers and DB server software. You are free to add another hash (or any other information) as file metadata to meet the needs of your application. It should be possible/straightforward to implement a CAS/dedup solution on top of gridFS, and this has probably already been done (or could be added easily enough by writing a MongoDB backend for one of the many such systems that are under active development, e.g. Dat, Noms, Restic, etc.). But fileCollection will not support any of these directly because it is outside the scope of this project to do so. |
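As a rough illustration of the "add another hash as file metadata" suggestion above, here is a minimal sketch using only Node's built-in `crypto` module. The helper names `sha256Hex` and `buildFileDoc` and the `metadata.sha256` field are assumptions made for this example; they are not part of fileCollection or gridFS.

```javascript
const crypto = require('crypto');

// Compute a hex-encoded SHA-256 digest of a Buffer or string.
function sha256Hex(data) {
  return crypto.createHash('sha256').update(data).digest('hex');
}

// Build a plain file document that carries the extra hash in its metadata.
// The field name `sha256` is a hypothetical choice; gridFS places no
// constraints on what the application stores under `metadata`.
function buildFileDoc(filename, data, contentType) {
  return {
    filename,
    contentType,
    metadata: { sha256: sha256Hex(data) },
  };
}
```

A document shaped like this could then be handed to whatever insert call the application uses, and the stored hash queried later like any other metadata field.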
Thanks for the answers! As for the CAS/dedup: I didn't read enough of the gridFS docs, but isn't dedup there more or less by default (though relying only on md5)? ... I can use md5 to request a file from gridFS via fileCollection, so I was assuming that fileCollection would, upon inserting the same data under a different filename, skip storing the data again and just add a new filename; basically deduplication upon insertion. |
Nope. There's no deduping or reference counting or anything in gridFS. Each file has its own chunks regardless of the MD5 sum. And the chunks themselves are not deduped or individually hashed in any way. It's very simple. So simple in fact that it is not inherently safe for concurrent writes (e.g. there is no locking of any kind). MongoDB has been "talking" about redesigning it for years, but I've seen no recent progress on that either. |
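To make the point about per-file chunks concrete, the following sketch mimics how data is split into a files document plus chunk documents that reference it. It is an in-memory illustration, not driver code: `toGridFsDocs` and the integer `_id` counter are stand-ins invented for this example, and only a subset of the real document fields is shown.

```javascript
const crypto = require('crypto');

let nextId = 1; // stand-in for ObjectId generation (assumption for this sketch)

// Split `data` into gridFS-style documents: one "files" document plus one
// "chunks" document per chunkSize bytes, each pointing back via files_id.
function toGridFsDocs(filename, data, chunkSize = 261120) {
  const buf = Buffer.from(data);
  const fileId = nextId++;
  const chunks = [];
  for (let n = 0; n * chunkSize < buf.length; n++) {
    chunks.push({
      files_id: fileId, // every chunk belongs to exactly one file
      n,                // position of the chunk within that file
      data: buf.subarray(n * chunkSize, (n + 1) * chunkSize),
    });
  }
  const file = {
    _id: fileId,
    filename,
    length: buf.length,
    chunkSize,
    md5: crypto.createHash('md5').update(buf).digest('hex'),
  };
  return { file, chunks };
}
```

Calling this twice with identical data yields two files documents with the same `md5` but different `_id` values, each with its own full set of chunks, which is why nothing is shared between copies.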
To clarify: not safe for concurrent writes/reads to any given file. FileCollection does implement locking on top of MongoDB to make such operations safe, although more recently MongoDB has actually been de-featured in this respect (rather than fixing it) because of the mythical replacement technology that has yet to materialize. |
Aah, that's a bit of a letdown on Mongo's side. Please correct me if I'm wrong: could a "poor man's" dedup-on-write be implemented in just a few lines by first querying FileCollection for the md5 of the data to be written, and then deciding whether to write it or only update references? |
Sure, if all you want is file-level dedup, that could work (probably a bit more than a "few liner" though). You'd need to implement reference counting and ensure that the inc/dec logic is safe for concurrent operations, and come up with a scheme to implement "per copy" metadata, etc. In general, gridFS is very simple, but it is also pretty flexible in terms of making it possible for lots of higher level functionality to be built on top at the application level, precisely because it specifies so little. |
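The moving parts mentioned above (reference counting, per-copy metadata) can be sketched with an in-memory stand-in. Everything here is hypothetical: `DedupStore` is not part of fileCollection or gridFS, the content key is SHA-256 rather than the MD5 that gridFS records, and the maps stand in for collections. Because this runs synchronously in a single JavaScript thread, it sidesteps the concurrency problem entirely; against a real database the increments and decrements would need atomic updates or locking.

```javascript
const crypto = require('crypto');

class DedupStore {
  constructor() {
    this.blobs = new Map();  // content hash -> { data, refs }
    this.copies = new Map(); // filename -> { hash, metadata } ("per copy" info)
  }

  // Store `data` under `filename`; the bytes are kept only once per hash.
  put(filename, data, metadata = {}) {
    if (this.copies.has(filename)) this.remove(filename); // replace old copy
    const hash = crypto.createHash('sha256').update(data).digest('hex');
    let blob = this.blobs.get(hash);
    const stored = blob === undefined;
    if (stored) {
      blob = { data: Buffer.from(data), refs: 0 };
      this.blobs.set(hash, blob);
    }
    blob.refs += 1; // one reference per filename pointing at this content
    this.copies.set(filename, { hash, metadata });
    return { hash, stored };
  }

  // Drop one copy; the bytes are deleted when the last reference goes away.
  remove(filename) {
    const copy = this.copies.get(filename);
    if (!copy) return false;
    this.copies.delete(filename);
    const blob = this.blobs.get(copy.hash);
    blob.refs -= 1;
    if (blob.refs === 0) this.blobs.delete(copy.hash);
    return true;
  }

  // Return the stored bytes for `filename`, or undefined if unknown.
  get(filename) {
    const copy = this.copies.get(filename);
    return copy ? this.blobs.get(copy.hash).data : undefined;
  }
}
```

Even in this toy form the store is noticeably more than a few lines, and it still omits the hard part: making the reference-count updates safe when several writers touch the same content at once.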
Oh, I believed/hoped that I could rely on Mongo for concurrency safety. Thanks for the info, I'll have to read a bit more on gridFS to figure out the limitations. |
You should check out my gridFS locking package (and the sister gridFS streaming package). Lots of good info there, and file-collection is built on top of it. |
Lovely! Thanks a lot for all the info :) |
Hi, a quick question: does meteor-file-collection have data deduplication based on hash comparisons built in, or in other words, is it a content-addressable storage? Did you consider choosing a different hash (e.g. sha256) or adding another one to protect against md5 collisions (quite theoretical for a file-based storage, I admit)? Thanks in advance, P.