mixing langs
soldni committed Jan 2, 2025
1 parent adca577 commit 79747b1
Showing 2 changed files with 4,919 additions and 1 deletion.
6 changes: 5 additions & 1 deletion configs/cc-news/dedupe_by_lang.sh
@@ -12,6 +12,9 @@ for lang in "${langs[@]}"; do
size=$(expr $size + $(stat -c %s "$file"))
done < <(find "${base_dir}/${lang}" -type f \( -name "*.zst" -o -name "*.gz" -o -name "*.gzip" -o -name "*.json" -o -name "*.jsonl" \) -print0)

+# sort documents by name
+documents=($(echo "${documents[@]}" | tr ' ' '\n' | sort))
+
# run deduplication
echo "Running fuzzy dedupe for ${lang} with ${size} bytes Bloom filter (files: ${#documents[@]})"
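The sort added in this hunk splits the array on spaces, so a path containing a space or a glob character would be broken apart or expanded. A minimal sketch of a NUL-delimited alternative follows. It assumes bash and a `sort` that supports `-z` (GNU coreutils does); the sample paths and the `sorted` array name are hypothetical and not part of the commit.

```shell
#!/usr/bin/env bash
# Hypothetical sample paths; in the script these come from the find loop.
documents=(
    "data/en/part 2.jsonl"
    "data/en/part-10.json.gz"
    "data/en/part-1.json.zst"
)

# Emit one NUL-terminated record per path, sort the records bytewise,
# and read them back so spaces and glob characters stay intact.
sorted=()
while IFS= read -r -d '' doc; do
    sorted+=("$doc")
done < <(printf '%s\0' "${documents[@]}" | LC_ALL=C sort -z)

printf '%s\n' "${sorted[@]}"
```

`LC_ALL=C` pins the order to byte values so it does not vary with the machine's locale. If `documents` can be empty, guard the loop, because `printf '%s\0'` with no arguments still emits one empty record.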

@@ -60,7 +63,8 @@ EOF

set -ex
# Run dolma with the temporary config file
-dolma -c "$temp_config_file" dedupe --processes $(expr $(nproc) - 4)
+dolma -c "$temp_config_file" dedupe --processes "${processes}"
+# cat "$temp_config_file"
set +ex
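The new invocation reads `${processes}`, which is set outside the hunks shown here, so how the script computes it is not visible. One possible way to derive it is sketched below. The value `reserved=4` mirrors the `- 4` in the removed line; the clamp to a minimum of one worker and the `reserved` and `total` names are assumptions for illustration only.

```shell
# Leave a few cores free for the rest of the system, but never drop below one worker.
reserved=4            # mirrors the "- 4" in the removed line
total=$(nproc)        # number of available processing units (GNU coreutils)

# The removed expression, $(expr $(nproc) - 4), yields 0 or a negative
# number on machines with four or fewer cores; this clamps the result to 1.
processes=$(( total > reserved ? total - reserved : 1 ))

echo "Using ${processes} processes"
```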

