Warc Backoff #160

Open
wants to merge 184 commits into
base: main
3231c83
better listing
soldni May 7, 2024
7da8e1f
better resolvers
soldni May 7, 2024
6859981
better list
soldni May 8, 2024
6e8f31d
parsing from stdin
soldni May 8, 2024
6395416
fixed behavior with old omegaconf
soldni May 10, 2024
f6db05e
math parsers
soldni May 10, 2024
8f282cf
math parsers
soldni May 10, 2024
acf4ce5
science
soldni May 10, 2024
cbc448d
pipeline
soldni May 10, 2024
db2491c
errors
soldni May 10, 2024
149fea8
added backoff
soldni May 10, 2024
3bde00f
added backoff
soldni May 10, 2024
3903c7b
logging
soldni May 10, 2024
875b8cb
unused
soldni May 10, 2024
302e0a5
url fixes
soldni May 11, 2024
289d973
speedup
soldni May 11, 2024
38ca143
speedup
soldni May 11, 2024
0f848d0
data
soldni May 12, 2024
6351595
fix
soldni May 13, 2024
802a4de
extensions
soldni May 13, 2024
967fcbd
debug
soldni May 13, 2024
0e86ef9
backoff
soldni May 14, 2024
71159cc
missed a continue
soldni May 14, 2024
31cfc7d
more
soldni May 15, 2024
57587e3
more
soldni May 15, 2024
673d3f5
logging backoff
soldni May 15, 2024
478b387
removed unused deps
soldni May 15, 2024
01f796f
match
soldni May 15, 2024
256c3fb
removed profiler
soldni May 15, 2024
ccc2aa2
config
soldni May 15, 2024
78ad625
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni May 15, 2024
cae4084
fixed api
soldni May 15, 2024
a6fffe7
fixed test
soldni May 15, 2024
4988f62
batching
soldni May 17, 2024
7306fcd
added support for batching
soldni May 17, 2024
8e57c2f
saving owm for later
soldni May 17, 2024
456f831
loosening reqs
soldni May 17, 2024
8ec9880
old python compatible syntax
soldni May 17, 2024
33a725c
html spans
soldni May 17, 2024
5f4ffc1
fixed
soldni May 17, 2024
b5e2b0d
Merge branch 'main' into soldni/backoff
soldni May 17, 2024
28be895
wip
soldni May 20, 2024
d3efee4
Merge branch 'main' into soldni/backoff
soldni May 21, 2024
aee8ba6
adding support for batching
soldni May 22, 2024
ff1e496
better eval
soldni May 22, 2024
ed28a7a
fixed minor failure
soldni May 22, 2024
0953e80
merge
soldni May 22, 2024
3c4f17d
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni May 22, 2024
2cc4084
small fix in math for processors
soldni May 22, 2024
eda41c3
using native types when possible
soldni May 22, 2024
d93a54f
indent
soldni May 22, 2024
29dca70
copyright
soldni May 22, 2024
dc2fa98
better string
soldni May 22, 2024
97b3bd2
comment
soldni May 22, 2024
845072e
progressbar
soldni May 23, 2024
35719fc
added support for old-style retries_on_error
soldni May 23, 2024
67b3bda
added support for retries_on_error
soldni May 23, 2024
155319c
data
soldni May 23, 2024
d8cb681
deps
soldni May 23, 2024
e6270dc
get_annotations not available
soldni May 23, 2024
75a5b0d
fixes
soldni May 23, 2024
86371d6
quoting type aliases
soldni May 23, 2024
73aad08
3.8 compatibility
soldni May 23, 2024
b9ec3eb
more style
soldni May 23, 2024
e42f9fc
pyi
soldni May 23, 2024
d9cbac0
Merge branch 'soldni/pbar2' into soldni/backoff
soldni May 23, 2024
09aa96a
progress
soldni May 24, 2024
be6c984
viz pbar
soldni May 24, 2024
5c90e9f
fixes
soldni May 24, 2024
f5c696c
fixing small regression in tests
soldni May 24, 2024
e941f05
order from user
soldni May 24, 2024
99264b0
same order
soldni May 24, 2024
8f86b62
tests
soldni May 24, 2024
88a9e55
Merge branch 'soldni/pbar2' into soldni/backoff
soldni May 24, 2024
18a3f71
older
soldni May 24, 2024
f0e8af4
note for common runtime
soldni May 24, 2024
36c18d2
removing attempts
soldni May 24, 2024
708affc
progressbar
soldni May 25, 2024
824e11e
progressbar
soldni May 25, 2024
bb054ca
adding linearizers
soldni May 28, 2024
c2315e7
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni May 28, 2024
3cec13a
license script
soldni May 28, 2024
81c473a
script
soldni May 28, 2024
42e1514
science
soldni May 28, 2024
4336a46
added better tests
soldni May 28, 2024
e01b408
added tests
soldni May 28, 2024
46135ab
types
soldni May 28, 2024
a8803e8
sorting
soldni May 29, 2024
002611f
name
soldni May 29, 2024
b090d25
send
soldni May 29, 2024
486ef25
typo
soldni May 29, 2024
275eb95
spacing
soldni May 29, 2024
26b791b
skipping big tests
soldni May 29, 2024
c9f5888
optional tests w large download
soldni May 29, 2024
628dd14
corner case failure
soldni May 29, 2024
bc61e05
quantized
soldni May 29, 2024
36b4275
s3 destination
soldni May 29, 2024
d7d629f
commit
soldni May 29, 2024
47b00dc
style
soldni May 30, 2024
af2f820
owm
soldni May 30, 2024
19c13cf
science v2
soldni May 30, 2024
2678191
science v2
soldni May 30, 2024
851e767
science v1
soldni May 30, 2024
5401ae5
science v1
soldni May 30, 2024
7dc4ad9
edit
soldni May 31, 2024
5763cc3
config
soldni May 31, 2024
493a18f
fixes total
soldni May 31, 2024
c436c0d
minor fix
soldni Jun 1, 2024
537b7a7
linersizer
soldni Jun 1, 2024
7221a0b
fallback
soldni Jun 1, 2024
5d09315
added compression
soldni Jun 2, 2024
937ccc0
adding test data
soldni Jun 2, 2024
e93cc48
wip
soldni Jun 4, 2024
b7c5c59
tests
soldni Jun 4, 2024
f6bbf23
tests
soldni Jun 4, 2024
33da8dc
fixed tests
soldni Jun 4, 2024
fd041c1
fixes
soldni Jun 5, 2024
68d6b35
added flags in config
soldni Jun 5, 2024
84370ca
Merge branch 'soldni/zst' into soldni/backoff
soldni Jun 5, 2024
928b85a
better error handling
soldni Jun 7, 2024
f5767c2
Merge branch 'main' into soldni/backoff
soldni Jun 7, 2024
592b7a8
Merge branch 'soldni/mixer-fix' into soldni/backoff
soldni Jun 7, 2024
a3ddbd2
added files (to be removed)
soldni Jun 7, 2024
1c6a62a
new stuff
soldni Jun 7, 2024
16b8dc9
other names
soldni Jun 7, 2024
f182d0d
update
soldni Jun 7, 2024
3227a11
configs
soldni Jun 7, 2024
75d2938
small fix gopher tagger
soldni Jun 8, 2024
f8b771b
addding configs
soldni Jun 8, 2024
cea2d1e
wip
soldni Jun 8, 2024
426ef1a
optional
soldni Jun 8, 2024
ed9cf90
para
soldni Jun 9, 2024
8a4dbfc
new resolver
soldni Jun 9, 2024
5f5feeb
random delay
soldni Jun 9, 2024
2b20cd4
delay
soldni Jun 9, 2024
6e9fad2
jitter log
soldni Jun 9, 2024
76c4f6b
feix
soldni Jun 9, 2024
9a86e09
dedup
soldni Jun 9, 2024
a1fb0e2
test
soldni Jun 9, 2024
cd3f38d
new steps
soldni Jun 9, 2024
1ee4bed
fixes
soldni Jun 9, 2024
d818395
all
soldni Jun 9, 2024
ffd7ccd
all
soldni Jun 9, 2024
06ecd9c
indent
soldni Jun 9, 2024
6102ef8
keyword
soldni Jun 9, 2024
e42b7c3
wip
soldni Jun 9, 2024
5107c34
discarding fields
soldni Jun 9, 2024
eaca238
w
soldni Jun 9, 2024
ca29511
sizes
soldni Jun 10, 2024
965c053
lciense
soldni Jun 10, 2024
4af5f21
fixing paths
soldni Jun 10, 2024
1c27e33
scripts to get labels
soldni Jun 11, 2024
def2027
exp
soldni Jun 9, 2024
f1c877a
test, stats
soldni Jun 13, 2024
3e07e51
optional id
soldni Jun 21, 2024
38a3122
reverted
soldni Jun 21, 2024
b487104
ext
soldni Jun 21, 2024
7462aa8
missed configs
soldni Jun 24, 2024
c8e4d7c
added function to count top k tokens
soldni Jun 24, 2024
ba66c91
missing
soldni Jun 24, 2024
d7558db
count
soldni Jul 6, 2024
a6e74a7
added option for tokenizer to split on special tokens
soldni Jul 13, 2024
a576020
added configs
soldni Jul 13, 2024
f31662a
Merge branch 'soldni/tiktoken' into soldni/backoff
soldni Jul 13, 2024
c31fab3
encoding special tokens
soldni Jul 13, 2024
f405d47
more paths
soldni Jul 15, 2024
3498632
configs
soldni Jul 15, 2024
e560e99
Merge branch 'main' into soldni/backoff
soldni Aug 8, 2024
58a84d4
cc-news-new
soldni Aug 21, 2024
ab586a1
Merge branch 'main' into soldni/backoff
soldni Aug 21, 2024
d7998ff
version
soldni Aug 23, 2024
5dd7611
adding new lengths
soldni Aug 24, 2024
bd46c36
script
soldni Aug 24, 2024
04277c4
partitions
soldni Aug 25, 2024
1768ff0
small
soldni Aug 27, 2024
a50fcaa
100 chars
soldni Aug 27, 2024
de42c1a
datasets
soldni Aug 27, 2024
d34012e
reformatted
soldni Aug 27, 2024
15c3ca6
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni Aug 27, 2024
af13c63
.
soldni Oct 1, 2024
15608e1
science
soldni Oct 8, 2024
7ea0862
added option to provide compression during tagging
soldni Oct 8, 2024
4c3cab6
Merge branch 'soldni/compression2' into soldni/backoff
soldni Oct 8, 2024
da4957c
configs
soldni Nov 12, 2024
2 changes: 1 addition & 1 deletion .devcontainer/postInstall.sh
@@ -2,4 +2,4 @@

 PATH=/home/vscode/.cargo/bin:$PATH
 cd dolma
-source /home/vscode/miniforge3/bin/activate && pip install cmake "maturin[patchelf]>=1.1,<2.0"
+source /home/vscode/miniforge3/bin/activate && pip install cmake "maturin>=1.5,<2.0"
1 change: 1 addition & 0 deletions .github/workflows/CI.yml
@@ -19,6 +19,7 @@ permissions:
 env:
   DOLMA_TESTS_SKIP_AWS: ${{ secrets.AWS_ACCESS_KEY_ID == '' && 'true' || 'false' }}
   DOLMA_TEST_S3_PREFIX: s3://dolma-tests
+  DOLMA_TEST_SKIP_LARGE_MODELS: "true"
   RUST_CHANNEL: stable

 jobs:
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

26 changes: 25 additions & 1 deletion Makefile
@@ -1,5 +1,29 @@
+UNAME := $(shell uname)
+
+ifeq ($(UNAME), Darwin)
+    OS_MESSAGE := "MacOS detected"
+    CMAKE_SETUP := "which cmake || brew install cmake"
+    PROTOBUF_SETUP := "which protoc || brew install protobuf"
+    OPENSSL_SETUP := "which openssl || brew install openssl"
+else ifeq ($(UNAME), Linux)
+    OS_MESSAGE := "Linux detected"
+    CMAKE_SETUP := "which cmake || sudo apt-get install --yes build-essential cmake"
+    PROTOBUF_SETUP := "which protoc || sudo apt-get install --yes protobuf-compiler"
+    OPENSSL_SETUP := "which openssl || sudo apt-get install --yes libssl-dev"
+else
+    OS_MESSAGE := "Unsupported OS; please install rust, cmake, protobuf, and openssl manually"
+    CMAKE_SETUP := ""
+    PROTOBUF_SETUP := ""
+    OPENSSL_SETUP := ""
+endif
+
 setup:
-	@./setup.sh
+	@echo "${OS_MESSAGE}: installing..."
+	$(shell "${CMAKE_SETUP}")
+	$(shell "${PROTOBUF_SETUP}")
+	$(shell "${OPENSSL_SETUP}")
+	which cargo || curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+	which maturin || pip install 'maturin>=1.5,<2.0'

 publish:
 	maturin publish
73 changes: 73 additions & 0 deletions configs/cc-news/dedupe-month.sh
@@ -0,0 +1,73 @@
#! /usr/bin/env bash

# documents:
#   - s3://ai2-llm/pretraining-data/sources/c4/v0/documents/train/*.gz

# dedupe:
#   name: dedupe_para_ngrams_13_1
#   paragraphs:
#     attribute_name: dedupe_para_ngrams_13_1
#     by_ngram:
#       ngram_length: 13
#       stride: 1
#       overlap_threshold: 0.5
#   skip_empty: true

# bloom_filter:
#   file: ${oc.env:HOME}/c4_dedupe_para_ngrams_13_1.bin
#   read_only: false
#   # estimated doc count is obtained by counting number of words in paragraphs
#   # then dividing by 13 (ngram_length) and multiplying by 2 (for each ngram)
#   estimated_doc_count: 359_916_731_334
#   desired_false_positive_rate: 0.1

# processes: 188
# work_dir:
#   input: /tmp/c4_dedupe_para_ngrams_13_1/input
#   output: /tmp/c4_dedupe_para_ngrams_13_1/output

# run years between 2016 and 2024
for year in {2016..2024}; do
    # run months between 1 and 12
    for month in {1..12}; do
        # skip months after 7 if year is 2024
        if [ "$year" -eq 2024 ] && [ "$month" -gt 7 ]; then
            continue
        fi

        # skip months before 8 if year is 2016
        if [ "$year" -eq 2016 ] && [ "$month" -lt 8 ]; then
            continue
        fi

        # rename month to 2 digits
        month=$(printf "%02d" "$month")

        documents="s3://ai2-llm/pretraining-data/sources/cc-news/v0-resiliparse/documents/${year}-${month}/*.zst"

        size=$(aws s3api list-objects --bucket ai2-llm --prefix "pretraining-data/sources/cc-news/v0-resiliparse/documents/${year}-${month}/" --output json --query "[sum(Contents[].Size)]" | jq '.[0]' -rc)

        # run deduplication
        echo "Running fuzzy dedupe for ${year}-${month} with ${size} bytes Bloom filter"

        set -ex

        # the glob is quoted so that dolma, not the shell, expands it
        dolma dedupe \
            --documents "${documents}" \
            --dedupe.name dedupe_ngrams_13_1 \
            --dedupe.paragraphs.attribute_name dedupe_ngrams_13_1 \
            --dedupe.paragraphs.by_ngram.ngram_length 13 \
            --dedupe.paragraphs.by_ngram.stride 1 \
            --dedupe.paragraphs.by_ngram.overlap_threshold 0.5 \
            --dedupe.skip_empty \
            --bloom_filter.file "${HOME}/cc-news/dedupe_ngrams_13_1-${year}-${month}.bin" \
            --no-bloom_filter.read_only \
            --bloom_filter.estimated_doc_count "${size}" \
            --bloom_filter.desired_false_positive_rate 0.01 \
            --processes "$(expr $(nproc) - 4)" \
            --work_dir.input "/tmp/cc-news/dedupe_ngrams_13_1/${year}-${month}/input" \
            --work_dir.output "/tmp/cc-news/dedupe_ngrams_13_1/${year}-${month}/output"

        set +ex
    done
done
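Review note: the script passes the total object size in bytes as `--bloom_filter.estimated_doc_count`, whereas the commented-out c4 config derives that value from an n-gram count. To get a feel for what such a value implies for memory, here is a minimal sketch of the textbook Bloom filter sizing formulas. It is not taken from dolma, whose implementation may size its filter differently; `bloom_filter_size` is a hypothetical helper introduced only for illustration.

```python
import math
from typing import Tuple


def bloom_filter_size(expected_items: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for a classic Bloom filter.

    Textbook formulas (an assumption here, not dolma's code):
        bits   = -n * ln(p) / (ln 2)^2
        hashes = (bits / n) * ln 2
    """
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be strictly between 0 and 1")

    bits = math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


if __name__ == "__main__":
    # Compare the two false positive rates that appear in the cc-news scripts.
    for rate in (0.01, 0.1):
        bits, hashes = bloom_filter_size(1_000_000, rate)
        print(f"p={rate}: {bits / 8 / 1e6:.2f} MB, {hashes} hash functions")
```

Under these formulas a 1% false positive rate costs roughly 9.6 bits per expected item and a 10% rate roughly 4.8 bits, so an inflated item estimate (such as a byte count) translates directly into a proportionally larger filter.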
114 changes: 114 additions & 0 deletions configs/cc-news/dedupe-year.sh
@@ -0,0 +1,114 @@
#! /usr/bin/env bash

# documents:
#   - s3://ai2-llm/pretraining-data/sources/c4/v0/documents/train/*.gz

# dedupe:
#   name: dedupe_para_ngrams_13_1
#   paragraphs:
#     attribute_name: dedupe_para_ngrams_13_1
#     by_ngram:
#       ngram_length: 13
#       stride: 1
#       overlap_threshold: 0.5
#   skip_empty: true

# bloom_filter:
#   file: ${oc.env:HOME}/c4_dedupe_para_ngrams_13_1.bin
#   read_only: false
#   # estimated doc count is obtained by counting number of words in paragraphs
#   # then dividing by 13 (ngram_length) and multiplying by 2 (for each ngram)
#   estimated_doc_count: 359_916_731_334
#   desired_false_positive_rate: 0.1

# processes: 188
# work_dir:
#   input: /tmp/c4_dedupe_para_ngrams_13_1/input
#   output: /tmp/c4_dedupe_para_ngrams_13_1/output

# run years between 2016 and 2024
for year in {2016..2024}; do
    # Initialize an empty array to store document paths and a variable for total size
    documents=()
    size=0

    # Collect all month document paths into the array and accumulate size
    for month in {1..12}; do
        # Skip months after 7 if year is 2024
        if [ "$year" -eq 2024 ] && [ "$month" -gt 7 ]; then
            continue
        fi

        # Skip months before 8 if year is 2016
        if [ "$year" -eq 2016 ] && [ "$month" -lt 8 ]; then
            continue
        fi

        # Format month as 2 digits
        month=$(printf "%02d" "$month")

        # Add the document path for this month to the array
        documents+=("s3://ai2-llm/pretraining-data/sources/cc-news/v0-resiliparse/documents/${year}-${month}/*.zst")

        # Get the size for this month and add it to the total size
        month_size=$(aws s3api list-objects --bucket ai2-llm --prefix "pretraining-data/sources/cc-news/v0-resiliparse/documents/${year}-${month}/" --output json --query "[sum(Contents[].Size)]" | jq '.[0]' -rc)
        size=$((size + month_size))
    done

    # run deduplication
    echo "Running fuzzy dedupe for ${year} with ${size} bytes Bloom filter"

    # Build the YAML "documents:" list, one entry per month
    document_linearized="documents:"
    for doc in "${documents[@]}"; do
        document_linearized+=$'\n'"  - ${doc}"
    done

    config_yaml=$(cat <<EOF
${document_linearized}
dedupe:
  name: dedupe_by_year
  paragraphs:
    attribute_name: dedupe_ngrams_13_1
    by_ngram:
      ngram_length: 13
      stride: 1
      overlap_threshold: 0.5
      skip_short_paragraphs: true
  skip_empty: true

bloom_filter:
  file: /tmp/cc_news_${year}_dedupe_ngram.bin
  read_only: false
  estimated_doc_count: ${size}
  desired_false_positive_rate: 0.1

work_dir:
  input: /tmp/cc_news_${year}_dedupe_para_ngrams_13_1/input
  output: /tmp/cc_news_${year}_dedupe_para_ngrams_13_1/output
EOF
)

    # Create a temporary file for the YAML config
    temp_config_file=$(mktemp)

    # Write the YAML config to the temporary file; '%s' keeps the config
    # from being interpreted as a printf format string
    printf '%s\n' "$config_yaml" > "$temp_config_file"

    set -ex
    # Run dolma with the temporary config file
    dolma -c "$temp_config_file" dedupe --processes "$(expr $(nproc) - 4)"
    set +ex

    # Remove the temporary file
    rm "$temp_config_file"
done
43 changes: 43 additions & 0 deletions configs/cc-news/extract.sh
@@ -0,0 +1,43 @@
#! /usr/bin/env bash

# run years between 2016 and 2024
for year in {2016..2024}; do
    # run months between 1 and 12
    for month in {1..12}; do
        # skip months after 7 if year is 2024
        if [ "$year" -eq 2024 ] && [ "$month" -gt 7 ]; then
            continue
        fi

        # skip months before 8 if year is 2016
        if [ "$year" -eq 2016 ] && [ "$month" -lt 8 ]; then
            continue
        fi

        # rename month to 2 digits
        month=$(printf "%02d" "$month")

        documents="s3://ai2-russella/crawl-data/CC-NEWS/${year}/${month}/*.warc.gz"

        # run the extraction
        echo "Running extraction for ${year}-${month}"

        set -ex

        # the glob is quoted so that dolma, not the shell, expands it
        dolma warc \
            --documents "${documents}" \
            --destination "s3://ai2-llm/pretraining-data/sources/cc-news/v0-resiliparse/documents/${year}-${month}" \
            --processes "$(expr $(nproc) - 4)" \
            --source_name "cc-news_${year}-${month}" \
            --linearizer resiliparse \
            --pre.taggers cc_re \
            --no-pre.skip \
            --no-store.html \
            --store.attr_spans 500 \
            --skip_duplicate_urls \
            --work_dir.input "/tmp/cc-news/${year}-${month}/input" \
            --work_dir.output "/tmp/cc-news/${year}-${month}/output"

        set +ex
    done
done
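Review note: the August 2016 through July 2024 window is expressed as two skip conditions that are repeated in each of the cc-news shell scripts. As a possible alternative, the sketch below generates the same window from a single pair of bounds. `cc_news_months` and `prefixes` are hypothetical names used only for illustration; nothing here is part of the PR.

```python
from typing import Iterator, Tuple

YearMonth = Tuple[int, int]


def cc_news_months(first: YearMonth = (2016, 8), last: YearMonth = (2024, 7)) -> Iterator[YearMonth]:
    """Yield (year, month) pairs from `first` to `last`, both inclusive."""
    year, month = first
    # Tuples compare element by element, so this orders by year, then month.
    while (year, month) <= last:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Zero-padded "YYYY-MM" prefixes, matching the directory names used by the scripts.
prefixes = [f"{year:04d}-{month:02d}" for year, month in cc_news_months()]

if __name__ == "__main__":
    print(len(prefixes), prefixes[0], prefixes[-1])
```

Keeping the bounds in one place means that extending the crawl window would require changing only the `last` default rather than every script.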
59 changes: 59 additions & 0 deletions configs/cc-news/make_lang_partition.py
@@ -0,0 +1,59 @@
import json
from typing import List

import smart_open


SRC_BASE = "s3://ai2-llm/pretraining-data/sources/cc-news"
SRC_PRFX = "v1-resiliparse"
LANG_THR = 100_000
DST_BASE = "s3://ai2-llm/pretraining-data/sources/cc-news"
DST_PRFX = f"v2-resiliparse-l{LANG_THR // 1000}k"


def base_stream_config(lang: str, year: int, months: List[int]):
    return {
        "name": f"cc-news_{year:04d}_{lang}",
        "documents": [f"{SRC_BASE}/{SRC_PRFX}/documents/{year:04d}-{month:02d}/*.zst" for month in months],
        "compression": {"input": "zst", "output": "zst"},
        "output": {
            "path": f"{DST_BASE}/{DST_PRFX}/documents/{lang}/{year:04d}",
            "max_size_in_bytes": 10_000_000_000,
        },
        "attributes": ["ft_lang_id_1e2", "dolma_v2_tokenizer"],
        "filter": {
            "include": [],
            "exclude": [
                # drop documents with 100 tokens or fewer
                ".attributes.dolma_v2_tokenizer__dolma_v2_tokenizer__length[0][-1] <= 100",
                # no language detected or low confidence
                f"(.attributes.ft_lang_id_1e2__ft_lang_id_1e2__{lang} == null) or (.attributes.ft_lang_id_1e2__ft_lang_id_1e2__{lang}[0][-1] < 0.5)",

            ],
            "syntax": "jq",
        },
    }


def main():
    with smart_open.open("s3://ai2-llm/stats/cc-news/v1-resiliparse/attributes/ft_lang_id_1e2_summary.json") as f:
        lang_counts = json.load(f)

    # keep only languages with at least LANG_THR documents
    languages = {k: v for k, v in lang_counts.items() if v >= LANG_THR}

    streams = []
    for year in range(2016, 2025):
        if year == 2016:
            months = list(range(8, 13))
        elif year == 2024:
            months = list(range(1, 8))
        else:
            months = list(range(1, 13))

        streams.extend([base_stream_config(lang, year, months) for lang in languages])

    with smart_open.open("configs/cc-news/mix_v2.json", "wt") as f:
        json.dump({"processes": 1, "streams": streams}, f, indent=2)


if __name__ == "__main__":
    main()
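Review note: to make the two `exclude` expressions easier to reason about, the sketch below restates them in plain Python. It assumes each attribute is a list of `[start, end, value]` spans, which is what the `[0][-1]` indexing suggests, and it assumes jq's ordering rules, under which `null` compares below every number (so a missing length attribute also matches `<= 100`). `should_exclude` and `LENGTH_KEY` are hypothetical names; the mixer's actual evaluation may differ.

```python
from typing import Dict, List, Optional

Span = List[float]  # assumed layout: [start, end, value]

LENGTH_KEY = "dolma_v2_tokenizer__dolma_v2_tokenizer__length"


def should_exclude(
    attributes: Dict[str, Optional[List[Span]]],
    lang: str,
    max_short_length: int = 100,
    min_lang_score: float = 0.5,
) -> bool:
    """Return True if a document would match either exclude rule."""
    # Rule 1: token count (last element of the first span) is 100 or fewer.
    # A missing or empty attribute is treated as a match, mirroring jq's
    # ordering where null sorts below any number.
    length_spans = attributes.get(LENGTH_KEY)
    if not length_spans or length_spans[0][-1] <= max_short_length:
        return True

    # Rule 2: the language was not detected, or its score is below 0.5.
    lang_spans = attributes.get(f"ft_lang_id_1e2__ft_lang_id_1e2__{lang}")
    if not lang_spans:
        return True
    return lang_spans[0][-1] < min_lang_score


if __name__ == "__main__":
    doc = {
        LENGTH_KEY: [[0, 2048, 350]],
        "ft_lang_id_1e2__ft_lang_id_1e2__en": [[0, 2048, 0.97]],
    }
    print(should_exclude(doc, "en"), should_exclude(doc, "fr"))
```

Because every stream is built per language, a document is kept by a given stream only when that language's score is at least 0.5 and the document has more than 100 tokens.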