fix: Deal with situation where num_samples is too large
saattrupdan committed Apr 23, 2024
1 parent 3116d6a commit 1551651
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion src/foqa/dataset.py
@@ -28,8 +28,11 @@ def build_dataset(config: DictConfig) -> None:
         )
         .shuffle(seed=config.seed)
         .filter(lambda x: len(x["text"]) > config.min_article_length)
-        .select(range(config.num_samples))
     )
+    assert isinstance(dataset, Dataset)
+
+    num_samples = min(config.num_samples, len(dataset))
+    dataset = dataset.select(range(num_samples))

     records_path = Path(config.dirs.data) / config.dirs.raw / "records.jsonl"
     if records_path.exists():
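
The effect of the change can be illustrated with a minimal sketch that uses a plain Python list as a stand-in for the filtered dataset. The helpers `select` and `take_up_to` are hypothetical names, not part of this repository or of the `datasets` library, and the stand-in only assumes that selecting indices beyond the available length fails.

```python
def select(records: list[str], indices: range) -> list[str]:
    """Hypothetical stand-in for a dataset's select: fails on out-of-range indices."""
    # List indexing raises IndexError as soon as an index is >= len(records).
    return [records[i] for i in indices]


def take_up_to(records: list[str], num_samples: int) -> list[str]:
    """Clamp the requested sample count to what is available before selecting."""
    num_samples = min(num_samples, len(records))
    return select(records, range(num_samples))


# Assume three articles survive the length filter but five samples are requested.
filtered = ["article a", "article b", "article c"]

try:
    select(filtered, range(5))
    unclamped_failed = False
except IndexError:
    unclamped_failed = True

# With clamping, the request degrades to "everything that is available".
clamped = take_up_to(filtered, 5)
```

Clamping silently returns fewer samples than requested when the filter is strict, so a caller that needs an exact count would still have to check the length of the result.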
