Fix shard computation in `NoShuffleBeamWriter` with overlapping split names #11070

lgeiger · 2025-06-16T15:14:55Z

Currently `NoShuffleBeamWriter` computes the wrong number of shards when split names overlap, i.e. when one split name is a prefix of another.

For example, given a dataset with two splits, `foo` and `foo_bar`, the generated tfrecord files would look something like this:

builder-foo.tfrecord-00000-of-00003
builder-foo.tfrecord-00001-of-00003
builder-foo.tfrecord-00002-of-00003
builder-foo_bar.tfrecord-00000-of-00002
builder-foo_bar.tfrecord-00001-of-00002

The current file pattern for the `foo` split is `builder-foo*`, which also matches the files of the `foo_bar` split. As a result, the shard count for `foo` also includes the shards of the other split. This PR fixes it by changing the pattern to `builder-foo.*`, which requires the `.` that separates the split name from the file suffix.
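To illustrate the overcounting, here is a minimal sketch that assumes the pattern is matched with glob-style semantics (`*` matches any run of characters, `.` is a literal dot). It uses Python's standard `fnmatch` module as a stand-in; the helper `count_shards` and the variable names are hypothetical and are not taken from the `NoShuffleBeamWriter` implementation.

```python
import fnmatch

# File names from the example in the description above.
files = [
    "builder-foo.tfrecord-00000-of-00003",
    "builder-foo.tfrecord-00001-of-00003",
    "builder-foo.tfrecord-00002-of-00003",
    "builder-foo_bar.tfrecord-00000-of-00002",
    "builder-foo_bar.tfrecord-00001-of-00002",
]


def count_shards(prefix: str, pattern_suffix: str) -> int:
    """Counts the files matching `prefix + pattern_suffix` (glob-style).

    Hypothetical helper for illustration only.
    """
    return len(fnmatch.filter(files, prefix + pattern_suffix))


# Old pattern: "builder-foo*" also matches the "builder-foo_bar" files.
old_count = count_shards("builder-foo", "*")

# New pattern: "builder-foo.*" requires a literal "." right after the split name.
new_count = count_shards("builder-foo", ".*")

print(old_count, new_count)
```

Under these assumptions the old pattern counts all five files for `foo`, while the new pattern counts only the three files that belong to it. The `foo_bar` split is unaffected either way, since no other split name starts with `foo_bar`.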


lgeiger · 2025-06-17T10:01:09Z

@tomvdw Do you mind having a look at this?

Fix shard computation in NoShuffleBeamWriter with overlapping split names

132b530
