Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feats/bucket lord #23

Merged
merged 13 commits into from
Oct 23, 2023
Merged

Feats/bucket lord #23

merged 13 commits into from
Oct 23, 2023

Conversation

TimotheeMickus
Copy link
Collaborator

closes #12

@TimotheeMickus TimotheeMickus self-assigned this Sep 27, 2023
@TimotheeMickus
Copy link
Collaborator Author

it's implemented, it runs, and it seems to be about as efficient as the previous solution—but I don't know if it's an actual improvement in terms of padding efficiency.

The buckets are likely very sparsely populated (with n_buckets² buckets total, that's a lot), so maybe we want fewer buckets and binning in comparable groups, rather than exact length matches (maybe we want n_buckets ** ½ or something).

I'll see if I can come up with stats in terms of how much padding we observe with this solution.

@TimotheeMickus TimotheeMickus marked this pull request as ready for review October 1, 2023 11:05
Copy link
Collaborator

@Waino Waino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check all uses of self._lens.

mammoth/inputters/dataloader.py Outdated Show resolved Hide resolved
mammoth/inputters/dataloader.py Outdated Show resolved Hide resolved
@Waino
Copy link
Collaborator

Waino commented Oct 2, 2023

The buckets are likely very sparsely populated (with n_buckets² buckets total, that's a lot), so maybe we want fewer buckets and binning in comparable groups, rather than exact length matches (maybe we want n_buckets ** ½ or something).

Indeed. I think that the original 1-d buckets were probably already unnecessarily sparse due to being based on exact rather than quantized lengths. There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch. Being extremely aggressive with minimizing padding may not be optimal for training, if it results in a very unshuffled data order (and large memory use).

I think that a variable-granularity binning with finegrained bins near the mode of the lengths and rough outside would be good:

  1. the amount of extra padding is minimized for the very common batches near the mode of the distribution. These have a lot of data to choose from, so waiting for the batch to fill up will not lead to it always containing the exact same sentences.
  2. the outliers are rare enough to not matter, so no sense binning them very exactly.

@TimotheeMickus
Copy link
Collaborator Author

TimotheeMickus commented Oct 2, 2023

The buckets are likely very sparsely populated (with n_buckets² buckets total, that's a lot), so maybe we want fewer buckets and binning in comparable groups, rather than exact length matches (maybe we want n_buckets ** ½ or something).

Indeed. I think that the original 1-d buckets were probably already unnecessarily sparse due to being based on exact rather than quantized lengths. There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch. Being extremely aggressive with minimizing padding may not be optimal for training, if it results in a very unshuffled data order (and large memory use).

Memory use isn't a concern (there's still only pool_size items considered at any point). I think the system at least gives reasonable guarantees for shuffling near the mode of the length distribution. So the problem is likely more specifically the tail of the distribution.

I think that a variable-granularity binning with finegrained bins near the mode of the lengths and rough outside would be good:

1. the amount of extra padding is minimized for the very common batches near the mode of the distribution. These have a lot of data to choose from, so waiting for the batch to fill up will not lead to it always containing the exact same sentences.

2. the outliers are rare enough to not matter, so no sense binning them very exactly.

That's not easy to pre-compute, and requires different parameters per language or per task. It's not impossible, but it's going to be very involved. My concrete approach to that would be to try to automate that in the config config while reading the corpora length, e.g. with a process similar to the following:

  1. fit some random variable to the observed length distribution on a subset of the data
  2. compute equiprobable bins through the cdf
  3. use bins from 2. to define lists of bucket sizes,
  4. store in a new task key

We could obviously skip 1 and 2 and rely on hard observed data if we compute the full stats of each corpus. Do remark that in any case, this does involve configuring bucket sizes in the task.

If on the other we only care about ensuring that tail items are lumped in a single bucket, then I would consider just defining an outlier bucket by cutting off anything below an expected max length (say 128 or less), which can already be done by setting a lower n_bucket value.

In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.

There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch.

If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for LR langs in my opinion), then we can

  1. remove the infinite cycling from the dataset,
  2. use a flag to mark when the example stream has been exhausted
  3. restart the iteration over the example stream when we've hit self.is_empty()

@Waino
Copy link
Collaborator

Waino commented Oct 2, 2023

Memory use isn't a concern (there's still only pool_size items considered at any point). I think the system at least gives reasonable guarantees for shuffling near the mode of the length distribution. So the problem is likely more specifically the tail of the distribution.

In order for the rare buckets out there in the tail of the distribution to ever see enough data to be filled, pool_size needs to be increased to compensate. This would have a memory impact.

If pool_size is not increased, then these outlier buckets will just gather dust and never be used?

That's not easy to pre-compute

It doesn't need to be exact. A warmup step that reads the first 1k--10k lines of the file and estimates the mode from that would probably be good enough.

and requires different parameters per language or per task.

Yes, this would need to happen per task. Aggregating over languages would be tricky and not worth it.

try to automate that in the config config while reading the corpora length

We could compute this during config-config instead of in the beginning of training.
However, this makes config-config kind of mandatory to use, unless you want to estimate the parameter(s) yourself or hazard a guess. If there is only one parameter per task (the mode of the length), then this is still doable. For a full list of bucket sizes it is a lot to ask.

  1. fit some random variable to the observed length distribution on a subset of the data
  2. compute equiprobable bins through the cdf
  3. use bins from 2. to define lists of bucket sizes,
  4. store in a new task key

Yes, we could do it like this. This would be theoretically very elegant: the bins would be approximately equiprobable.

I'd just go with a rough approach where there are a few small (maybe size 1) bins around the mode and everything else gets chucked in a collector bin (or a few of those, if we want a little bit more granularity). This would not be very principled, but can be defined based on just one parameter.

We could obviously skip 1 and 2 and rely on hard observed data if we compute the full stats of each corpus.

It would be overkill to use the whole corpus to compute this.

If on the other we only care about ensuring that tail items are lumped in a single bucket, then I would consider just defining an outlier bucket by cutting off anything below an expected max length (say 128 or less), which can already be done by setting a lower n_bucket value.

I guess this also does the trick. It is the longer side that has the long tail, on the shorter side there is a minimum value that is quickly reached.

In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.

Not sure I follow.

There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch.

If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for LR langs in my opinion), then we can

  1. remove the infinite cycling from the dataset,
  2. use a flag to mark when the example stream has been exhausted
  3. restart the iteration over the example stream when we've hit self.is_empty()

Wouldn't this just throw away the outliers entirely, instead of collecting them to (rare) outlier batches with (potentially several copies of) just the outliers?

@TimotheeMickus
Copy link
Collaborator Author

In order for the rare buckets out there in the tail of the distribution to ever see enough data to be filled, pool_size needs to be increased to compensate. This would have a memory impact.

If pool_size is not increased, then these outlier buckets will just gather dust and never be used?

In the 1D impl currently on main, anything with a length above n_buckets gets chucked into the last bucket. so self._buckets[-1] is already an outlier bucket

It doesn't need to be exact. A warmup step that reads the first 1k--10k lines of the file and estimates the mode from that would probably be good enough.

[...] this makes config-config kind of mandatory to use, unless you want to estimate the parameter(s) yourself or hazard a guess.

Reading 10k more lines would slow down the initialization. For simplicity, I'd consider using pool_size items for the warmup, and offloading bucket definition in the init function of the LAB. If we also offload bucket definition to the LAB itself then we don't need to overcomplicate the config either.

If so, I would suggest implementing something like:

  1. starting by computing an exact length-quantized bucket arrays
  2. define a merge operation of buckets b1, b2 so that they share a ref to the same list
  3. find the two most underused adjacent buckets and merge them
  4. repeat until we get to the target number of buckets

cons:

  • It's a perf hit on startup
  • It's a mess to save the structure when resuming training (you also need the merge info)
  • kind of a mess to do weighted sampling (we need to drop duplicate refs first, or sample from the merge info)
  • pretty opaque for outsider devs

pros:

  • makes the config config optional again
  • simplifies the config definition to something humanly readable
  • easy to adapt the current behavior (the shared bucket ref does the rerouting)

If on the other we only care about ensuring that tail items are lumped in a single bucket, then I would consider just defining an outlier bucket by cutting off anything below an expected max length (say 128 or less), which can already be done by setting a lower n_bucket value.

I guess this also does the trick. It is the longer side that has the long tail, on the shorter side there is a minimum value that is quickly reached.

This is the simpler option, yes...

In the present 2D case, we could also rely on length ratios to define how many of the offset diagonals we tolerate before sending an example to the outlier bucket.

Not sure I follow.

In the current version, we define buckets for (src_len, tgt_len) tuples. What happens if the target has an outlier length but source is in the mode?

There is a tradeoff between minimizing the amount of padding and minimizing the amount of correlation between examples in a batch.

If we're stressed out about wrapping around the full corpus and constructing batches with very similar examples (which is justifiable for LR langs in my opinion), then we can

  1. remove the infinite cycling from the dataset,
  2. use a flag to mark when the example stream has been exhausted
  3. restart the iteration over the example stream when we've hit self.is_empty()

Wouldn't this just throw away the outliers entirely, instead of collecting them to (rare) outlier batches with (potentially several copies of) just the outliers?

no, you'd keep creating batches until self._is_empty(), yield those final batches, and then you'd restart the dataset iteration. Nothing lost, except the last batch might be smaller than expected.

@Waino
Copy link
Collaborator

Waino commented Oct 2, 2023

can already be done by setting a lower n_bucket value

Ok. After clearing up some of my misunderstandings, I guess we could just lower the n_bucket value, so that the "high granularity" area contains a reasonable number of buckets when we go to 2D. Then everything else would either go into a single outlier bucket, or we could have e.g. 3: {source|target|both} exceeded the max length.

no, you'd keep creating batches until self._is_empty(), yield those final batches, and then you'd restart the dataset iteration. Nothing lost, except the last batch might be smaller than expected.

This also sounds great.

Pick whichever solution you prefer.

@TimotheeMickus
Copy link
Collaborator Author

Ok. After clearing up some of my misunderstandings, I guess we could just lower the n_bucket value, so that the "high granularity" area contains a reasonable number of buckets when we go to 2D.

That was the idea, yes

Then everything else would either go into a single outlier bucket, or we could have e.g. 3: {source|target|both} exceeded the max length.

I guess having everything type of outlier into a single bucket would make sense, but i don't really know how to make this work with the spiralling bucket search pattern. Let me think on it.

Pick whichever solution you prefer.

They're not mutually exclusive, we can implement both.

And we're dropping the over-engineered solution for equiprobable bins, correct?
Do we want that as a rc2 feature, along with issues #8, #17, #24 and #25 ?

@TimotheeMickus TimotheeMickus mentioned this pull request Oct 2, 2023
@TimotheeMickus TimotheeMickus marked this pull request as draft October 2, 2023 18:46
@TimotheeMickus
Copy link
Collaborator Author

Some further comments:

  • there was a bug in the spiralling pattern. This should be fixed, but I definitely need to write unittests as well
  • there is a possibility that the re-initialization of the example stream and the refurbishing of the buckets end up taking quite some time. Hopefully the comm will be resilient enough if the data pipe ends up too slow, otherwise we might want to fiddle with the batch queue prefetching
  • I ended up removing self._lens and just calling len(bucket) as appropriate instead, to avoid any discrepancy

I'm also strongly considering using this PR to also give a somewhat less nondescript name to the DynamicDatasetIterator, which does not define the dynamic iteration process nor directly iterates over datasets (a multiplexer of streams of batches from tasks? TaskBatchMux makes for a very fun tongue-twister at the very least).

@TimotheeMickus TimotheeMickus marked this pull request as ready for review October 16, 2023 09:02
@TimotheeMickus TimotheeMickus requested a review from Waino October 16, 2023 09:08
@TimotheeMickus
Copy link
Collaborator Author

re-smoke tested post merge w/ main, so should be good to merge

@TimotheeMickus TimotheeMickus merged commit ed7261c into main Oct 23, 2023
@TimotheeMickus TimotheeMickus deleted the feats/bucket-lord branch October 23, 2023 06:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

update buckets in LookAheadBucketting
2 participants