Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added use of subdirs for cache items. #11

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ppbrown
Copy link

@ppbrown ppbrown commented Nov 24, 2024

It will split things up in groups of 10,000, so that we dont have 1,000,000 items in a single directory.

Needs testing though.

It will split things up in groups of 10,000
@Arcitec
Copy link

Arcitec commented Nov 24, 2024

Needs testing, but is a great idea. Segmenting huge amounts of files like this is very common and makes sense to add it here too.

Edit: Well, there's one issue at the moment. It really needs to be addressed: This code needs to be backwards-compatible so that it can still use "v1" cache directories for everyone who has existing data. That could be achieved by adding a "cache-meta.json" to the root of the cache dir, and setting the cache version to "2" in it. If that's missing, use the v1 cache storage technique. So we don't disrupt anybody's current data.

@Nerogar
Copy link
Owner

Nerogar commented Nov 26, 2024

It really needs to be addressed: This code needs to be backwards-compatible so that it can still use "v1" cache directories for everyone who has existing data

Maybe the best strategy would be to migrate existing caches. Adding a full versioning scheme doesn't make much sense. Caches aren't really supposed to be compatible for multiple versions. And a migration function could simply check if 0.pt exists in the base directory.

@ppbrown
Copy link
Author

ppbrown commented Nov 26, 2024

I should add that, longer term, I'd also like to see a format that is more friendly for adding entries, or management in general.
I'm not sure that the current format, of having the "index" be a pytorch file, is the best choice for that?

When I summarized things for chatgpt, and specifically mentioned scaling to 1million+ files, it suggested primarily either sqlite format, or LMDB format instead for the index.

Thoughts?

@Nerogar
Copy link
Owner

Nerogar commented Nov 26, 2024

If you mean the aggregate.pt file, that's not an index. It contains part of the cache that needs to be loaded at all times because it's frequently and randomly accessed.

@ppbrown
Copy link
Author

ppbrown commented Nov 26, 2024

err...
you say it isnt an index.
maybe it has additional functionality, but... it ALSO functions as an index.
It maps (original location of source image) to (entrynum.pt file)

thats an index.

@Nerogar
Copy link
Owner

Nerogar commented Nov 27, 2024

no. It doesn't map anything. As I said, it just saves some part of the cache. The file name is one of those parts. But it also contains things like the original resolution, cropped resolution, and other things depending on your configuration.
The actual mapping is always done using the input of previous modules. For OneTrainer, that's (currently) a sorted list of file names from the original concept directory. This is also the reason why the original files are always needed.

@ppbrown
Copy link
Author

ppbrown commented Nov 27, 2024

Maybe we are getting unneccessarily caught up with definitions like "map" and "index", when that isnt the most important thing here.

So let me rephrase my earlier question:
Why are you using a pytorch object for this, when it seems like an actual database object for all the stuff would seem to fit the usage much better?

It sort of seems just like "well i'm using pytorch for this other stuff so I'm going to just go ahead and do the same thing over here", even though it isnt a good fit.

I'm asking if there is a specific reason for it, because otherwise, at some point in the future, I may submit a PR that changes the internal format to one of the other two more efficient formats that I mentioned earlier
(sqlite or LMDB)

@Nerogar
Copy link
Owner

Nerogar commented Nov 27, 2024

Again, because it's not an index. It's part of the cache just like the .pt files. It can contain the same data types. The only reason it even exists in the first place is because some of the cache content is frequently and randomly accessed, so it made sense to combine that part of the data into a separate file. Changing the format would make it far more complicated to store other data types like tuples, lists or tensors in that file.

@ppbrown
Copy link
Author

ppbrown commented Nov 27, 2024

ah.. so now it almost makes sense why its called "aggregate.pt".
Its just a "bunch o stuff".
Thanks for the info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants