Commit

Merge pull request huggingface#232 from QasidSaleem/optimize_num_sentences_computation

checks if min_num_sentences is disabled or not before computing the n…
hynky1999 authored Jun 28, 2024
2 parents 1cece66 + 142279c commit 1e27cc8
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion src/datatrove/pipeline/filters/c4_filters.py
@@ -125,7 +125,8 @@ def filter(self, doc: Document) -> bool | tuple[bool, str]:
             if self.filter_policy and any(p in line_l for p in POLICY_SUBSTRINGS):
                 self.stat_update("line-filter-policy")
                 continue
-            num_sentences += len(self.tokenizer.sent_tokenize(line)) if self.split_paragraph else 1
+            if self.min_num_sentences != -1:
+                num_sentences += len(self.tokenizer.sent_tokenize(line)) if self.split_paragraph else 1
             kept_lines.append(line)
             self.stat_update("line-kept")
         if num_sentences < self.min_num_sentences:
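
The effect of the new guard can be illustrated with a minimal standalone sketch. It is not the datatrove implementation: `CountingTokenizer` and `count_sentences` are hypothetical names introduced here, and the regex-based sentence splitter is only a stand-in for whatever tokenizer the filter is configured with. The sketch assumes, as the diff suggests, that `min_num_sentences == -1` means the sentence-count check is disabled.

```python
import re


class CountingTokenizer:
    """Hypothetical stand-in for the real tokenizer; records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def sent_tokenize(self, text: str) -> list[str]:
        self.calls += 1
        # Naive split after ., ! or ? followed by whitespace (illustration only).
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_sentences(lines, tokenizer, min_num_sentences=5, split_paragraph=True):
    """Accumulate the sentence count the same way the patched loop does."""
    num_sentences = 0
    for line in lines:
        # Guard from the diff: skip tokenization entirely when the check is disabled.
        if min_num_sentences != -1:
            num_sentences += len(tokenizer.sent_tokenize(line)) if split_paragraph else 1
    return num_sentences


lines = ["One. Two.", "Three."]

disabled = CountingTokenizer()
n_disabled = count_sentences(lines, disabled, min_num_sentences=-1)

enabled = CountingTokenizer()
n_enabled = count_sentences(lines, enabled, min_num_sentences=5)

print(n_disabled, disabled.calls)
print(n_enabled, enabled.calls)
```

With the check disabled, the tokenizer is never called and the count stays at 0. The final comparison `num_sentences < self.min_num_sentences` then evaluates `0 < -1`, which is false, so skipping the count does not change the filter's decision.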