giashard: Sharding for Web-Scale Parallel Text Mining

giashard is a tool for batching web-crawled data for later processing. It is designed as part of a corpus-creation pipeline in projects such as Paracrawl and HPLT.

Installation

giashard is written in Go. To install it, clone the repository and build the application:

git clone https://github.com/paracrawl/giashard.git
cd giashard/cmd/giashard
go build
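
The companion tool giashardid (described below) should build the same way; assuming it lives under cmd/giashardid alongside cmd/giashard (this path is an assumption based on the layout above):

cd ../giashardid
go build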

Running giashard

giashard can accept three input formats:

  1. A directory (or list of directories) in bitextor/Paracrawl column storage format: each directory contains three files named url.gz, mime.gz and plaintext.gz (by default). A different number of files, or different file names, can be specified with the -f flag.
  2. A zstd-compressed file (or list of files) in JSONL format, where each record contains at minimum a field named u holding a URL and a field named text holding the extracted plain text (see the sketch after this list).
  3. An uncompressed stream on stdin in the above JSONL format, indicated by - as the input file: e.g. cat myfile.jsonl | giashard -o myoutput -
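
As an illustration of the JSONL input, here is a minimal record (the u and text field names come from the description above; the URL and text values are invented) and a matching stdin invocation:

{"u": "https://example.com/page.html", "text": "Some extracted plain text."}

echo '{"u": "https://example.com/page.html", "text": "Some extracted plain text."}' | giashard -o myoutput -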

giashard uses the following flags:

  • -o: Output directory location (default: current directory)
  • -l: Input file containing a list of files/directories to shard (default: "")
  • -f: Comma-separated list of files to shard for bitextor/Paracrawl column storage format input (default: "url,mime,plaintext")
  • -n: Exponent to calculate number of shards (2^n) (default: 8)
  • -b: Batch size in MB (default: 100)
  • -d: Additional public suffix entries (default: "")
  • -jsonl: Boolean indicating the input data is in JSONL format (default: false)

Example command:

ls -1d output_wide15_filtered/*/is | xargs giashard/cmd/giashard/giashard -n 8 -o output_wide15_sharded -f text,url -b 1024

This runs giashard on all Icelandic data in the output_wide15_filtered directory (in bitextor/Paracrawl column storage format), where each input directory contains two files: text.gz and url.gz. It sorts the data into 2^8 = 256 numbered shards, with shard membership assigned based on a hash of the URL. The data in each shard is split into numbered batches of approximately 1024 MB. Output text is base64 encoded.
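
A comparable sketch for JSONL input, assuming a file jsonl_files.txt that lists zstd-compressed JSONL files one per line (both file names are invented for illustration):

giashard -jsonl -l jsonl_files.txt -n 8 -o output_jsonl_sharded -b 1024

This would sort the records from the listed files into 2^8 shards by URL hash, again in batches of roughly 1024 MB.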

giashardid

giashardid is a companion tool that takes a URL, either on the command line or on stdin, and prints the shard id that the URL will be sorted into. With the -s flag it instead prints the slug derived from the hostname in the URL.

For example, we can find out which shard Google lives in:

$ giashardid google.com
48

And, if we are curious, we can find out which other domains containing Dutch text live in that shard:

$ find wide00006-shards/nl/48 -name url.gz | xargs cat | gzip -dc | \
    giashardid -s | sort | uniq -c | sort -nr | head -10

   6483 google
    855 paginamarkt
    604 vikingdirect
    592 ajax1
    392 jijislief
    277 ixina
    209 punkyfish
    182 bongo
    154 ooyyo
    150 ledlampendirect

giashardid should also be easy to install with:

go get github.com/paracrawl/giashard/cmd/...