Added use of subdirs for cache items. #11

ppbrown · 2024-11-24T06:02:52Z

It will split things up in groups of 10,000, so that we dont have 1,000,000 items in a single directory.

Needs testing though.

It will split things up in groups of 10,000

Arcitec · 2024-11-24T16:09:19Z

Needs testing, but is a great idea. Segmenting huge amounts of files like this is very common and makes sense to add it here too.

Edit: Well, there's one issue at the moment. It really needs to be addressed: This code needs to be backwards-compatible so that it can still use "v1" cache directories for everyone who has existing data. That could be achieved by adding a "cache-meta.json" to the root of the cache dir, and setting the cache version to "2" in it. If that's missing, use the v1 cache storage technique. So we don't disrupt anybody's current data.

Nerogar · 2024-11-26T19:40:18Z

It really needs to be addressed: This code needs to be backwards-compatible so that it can still use "v1" cache directories for everyone who has existing data

Maybe the best strategy would be to migrate existing caches. Adding a full versioning scheme doesn't make much sense. Caches aren't really supposed to be compatible for multiple versions. And a migration function could simply check if 0.pt exists in the base directory.

ppbrown · 2024-11-26T21:38:05Z

I should add that, longer term, I'd also like to see a format that is more friendly for adding entries, or management in general.
I'm not sure that the current format, of having the "index" be a pytorch file, is the best choice for that?

When I summarized things for chatgpt, and specifically mentioned scaling to 1million+ files, it suggested primarily either sqlite format, or LMDB format instead for the index.

Thoughts?

Nerogar · 2024-11-26T22:16:16Z

If you mean the aggregate.pt file, that's not an index. It contains part of the cache that needs to be loaded at all times because it's frequently and randomly accessed.

ppbrown · 2024-11-26T23:03:46Z

err...
you say it isnt an index.
maybe it has additional functionality, but... it ALSO functions as an index.
It maps (original location of source image) to (entrynum.pt file)

thats an index.

Nerogar · 2024-11-27T18:12:32Z

no. It doesn't map anything. As I said, it just saves some part of the cache. The file name is one of those parts. But it also contains things like the original resolution, cropped resolution, and other things depending on your configuration.
The actual mapping is always done using the input of previous modules. For OneTrainer, that's (currently) a sorted list of file names from the original concept directory. This is also the reason why the original files are always needed.

ppbrown · 2024-11-27T19:04:51Z

Maybe we are getting unneccessarily caught up with definitions like "map" and "index", when that isnt the most important thing here.

So let me rephrase my earlier question:
Why are you using a pytorch object for this, when it seems like an actual database object for all the stuff would seem to fit the usage much better?

It sort of seems just like "well i'm using pytorch for this other stuff so I'm going to just go ahead and do the same thing over here", even though it isnt a good fit.

I'm asking if there is a specific reason for it, because otherwise, at some point in the future, I may submit a PR that changes the internal format to one of the other two more efficient formats that I mentioned earlier
(sqlite or LMDB)

Nerogar · 2024-11-27T19:25:41Z

Again, because it's not an index. It's part of the cache just like the .pt files. It can contain the same data types. The only reason it even exists in the first place is because some of the cache content is frequently and randomly accessed, so it made sense to combine that part of the data into a separate file. Changing the format would make it far more complicated to store other data types like tuples, lists or tensors in that file.

ppbrown · 2024-11-27T19:47:36Z

ah.. so now it almost makes sense why its called "aggregate.pt".
Its just a "bunch o stuff".
Thanks for the info.

Added use of subdirs for cache items.

04b0d8f

It will split things up in groups of 10,000

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added use of subdirs for cache items. #11

Added use of subdirs for cache items. #11

ppbrown commented Nov 24, 2024

Arcitec commented Nov 24, 2024 •

edited

Loading

Nerogar commented Nov 26, 2024

ppbrown commented Nov 26, 2024 •

edited

Loading

Nerogar commented Nov 26, 2024

ppbrown commented Nov 26, 2024

Nerogar commented Nov 27, 2024

ppbrown commented Nov 27, 2024

Nerogar commented Nov 27, 2024

ppbrown commented Nov 27, 2024

Added use of subdirs for cache items. #11

Are you sure you want to change the base?

Added use of subdirs for cache items. #11

Conversation

ppbrown commented Nov 24, 2024

Arcitec commented Nov 24, 2024 • edited Loading

Nerogar commented Nov 26, 2024

ppbrown commented Nov 26, 2024 • edited Loading

Nerogar commented Nov 26, 2024

ppbrown commented Nov 26, 2024

Nerogar commented Nov 27, 2024

ppbrown commented Nov 27, 2024

Nerogar commented Nov 27, 2024

ppbrown commented Nov 27, 2024

Arcitec commented Nov 24, 2024 •

edited

Loading

ppbrown commented Nov 26, 2024 •

edited

Loading