Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading large HDFDatasets inside MetaDataset is slow #1669

Open
dorian-K opened this issue Dec 13, 2024 · 5 comments
Open

Loading large HDFDatasets inside MetaDataset is slow #1669

dorian-K opened this issue Dec 13, 2024 · 5 comments
Assignees

Comments

@dorian-K
Copy link
Contributor

dorian-K commented Dec 13, 2024

I am loading two HDFDatasets (each ~8Gb, 40310479 seqs) inside a MetaDataset. This process takes about 30 minutes and uses up ~36Gb of RAM, which I find excessive.
Profiling the program with py-spy suggests that _update_tag_idx in returnn/datasets/cached.py:148 is the bottleneck here.

(cc @patrick-wilken @NeoLegends @JackTemaki)

@patrick-wilken
Copy link
Contributor

For CombinedDataset there used to be a similar issue, that's why I changed its implementation so that the sequence order is passed to the sub-datasets via indices instead of sequence tags (already years ago).
MetaDataset does it via tags, and as you notice this will trigger the creation of the CachedDataset._tag_idx mapping which in your case contains 2 * 40M strings. Also, it accesses the HDF file 2 * 40M times to get the tags one by one which is slow. Not only because it's a file access, but also because the h5py library is not optimized for this and has some overhead in each call.
Also, the attributes MetaDataset.seq_list_original and MetaDataset.tag_idx probably just get large in your case.
I never had to use MetaDataset myself for so many sequences, so I don't have an existing solution, but for CombinedDataset the important points to get a setup that scales pretty much to arbitrary dataset sizes were:

  • don't use sequence tags at all for initialization / sequence order
  • (try to) remove all variables* of size O(total_num_sequences)
  • minimize number of HDF accesses
  • use sub-epochs and apply (non-default) sequence ordering only on sub-epoch level (especially length-based methods)

*We even use a custom variant of the HDFDataset that avoids file_seq_start by storing those indices in the HDF instead of the sequence lengths.

@albertz
Copy link
Member

albertz commented Dec 13, 2024

don't use sequence tags at all for initialization / sequence order

This doesn't really work for MetaDataset because we sometimes (even often?) have datasets where the underlying seq order (and thus the seq indices) is different.

(try to) remove all variables of size O(total_num_sequences)

I thought you still have the seq indices as order?

apply (non-default) sequence ordering only on sub-epoch level (especially length-based methods)

How do you do that?

We even use a custom variant of the HDFDataset that avoids file_seq_start by storing those indices in the HDF instead of the sequence lengths.

But this is not in master? Why did you not make a PR for this?

@patrick-wilken
Copy link
Contributor

I thought you still have the seq indices as order?

get_seq_order_for_epoch() was changed to return a range when possible because of this. Only works for "default" and "reverse" sequence order of course, but importantly also with partition_epoch.

What we use for large MT trainings is: HDF files which are already shuffled themselves, put into HDFDatasets with default sequence order, combined via CombinedDataset with sampling_sizes set. So, a fixed amount of sequences is taken from each HDFDataset in each epoch in order (next to zero time and memory complexity for computing sequence order), and for the CombinedDataset we do use a more complex sequence ordering ("laplace"), but it operates only on sum of sampling sizes many sequences (e.g. 1M), independent of the total number of sequences.
If you don't want sampling_sizes, I think setting partition_epoch in the HDFDatasets should be equally efficient.

True, for MetaDataset it is trickier in general where you have to assume the sub-datasets contain different / differently ordered sequences. But if you need such a sequence mapping at runtime there is not much you can do. Maybe ask the sub-datasets to translate tag to index before storing all in memory, but that might again be slow.
But I would rather try to have parallel datasets in the first place, if that is at all practical in your case.
Still there are optimizations possible, I believe. For example, MetaDataset.tag_idx: is it used in your particular case at all?

@dorian-K
Copy link
Contributor Author

dorian-K commented Jan 8, 2025

I ended up splitting the dataset up into smaller chunks and loading them with DistributeFilesDataset

@dorian-K dorian-K closed this as completed Jan 8, 2025
@albertz albertz reopened this Jan 8, 2025
@albertz
Copy link
Member

albertz commented Jan 8, 2025

I ended up splitting the dataset up into smaller chunks and loading them with DistributeFilesDataset

This is just a workaround, not really a solution to the problem itself, i.e. the issue itself still persists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants