-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added use of subdirs for cache items. #11
base: master
Are you sure you want to change the base?
Conversation
It will split things up in groups of 10,000
Needs testing, but is a great idea. Segmenting huge amounts of files like this is very common and makes sense to add it here too. Edit: Well, there's one issue at the moment. It really needs to be addressed: This code needs to be backwards-compatible so that it can still use "v1" cache directories for everyone who has existing data. That could be achieved by adding a "cache-meta.json" to the root of the cache dir, and setting the cache version to "2" in it. If that's missing, use the v1 cache storage technique. So we don't disrupt anybody's current data. |
Maybe the best strategy would be to migrate existing caches. Adding a full versioning scheme doesn't make much sense. Caches aren't really supposed to be compatible for multiple versions. And a migration function could simply check if 0.pt exists in the base directory. |
I should add that, longer term, I'd also like to see a format that is more friendly for adding entries, or management in general. When I summarized things for chatgpt, and specifically mentioned scaling to 1million+ files, it suggested primarily either sqlite format, or LMDB format instead for the index. Thoughts? |
If you mean the aggregate.pt file, that's not an index. It contains part of the cache that needs to be loaded at all times because it's frequently and randomly accessed. |
err... thats an index. |
no. It doesn't map anything. As I said, it just saves some part of the cache. The file name is one of those parts. But it also contains things like the original resolution, cropped resolution, and other things depending on your configuration. |
Maybe we are getting unneccessarily caught up with definitions like "map" and "index", when that isnt the most important thing here. So let me rephrase my earlier question: It sort of seems just like "well i'm using pytorch for this other stuff so I'm going to just go ahead and do the same thing over here", even though it isnt a good fit. I'm asking if there is a specific reason for it, because otherwise, at some point in the future, I may submit a PR that changes the internal format to one of the other two more efficient formats that I mentioned earlier |
Again, because it's not an index. It's part of the cache just like the .pt files. It can contain the same data types. The only reason it even exists in the first place is because some of the cache content is frequently and randomly accessed, so it made sense to combine that part of the data into a separate file. Changing the format would make it far more complicated to store other data types like tuples, lists or tensors in that file. |
ah.. so now it almost makes sense why its called "aggregate.pt". |
It will split things up in groups of 10,000, so that we dont have 1,000,000 items in a single directory.
Needs testing though.