docs: add/revise the instructions for creating Doc2Vec models
tos-kamiya committed Nov 26, 2021
1 parent bc88873 commit 8633831
Showing 5 changed files with 142 additions and 9 deletions.
@@ -0,0 +1,65 @@
(reference: https://github.com/Kyubyong/wordvectors/blob/master/build_corpus.py)

(1) Preparation: Install dependencies:

```
pip3 install wikiextractor
pip3 install jieba
```

(2) Download the Chinese Wikipedia data from https://dumps.wikimedia.org/zhwiki/latest/

```
zhwiki-latest-pages-articles.xml.bz2 20-Nov-2021 20:50 2296371453
```

```
curl -O https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
```

(3) Cleaning (remove XML tags, etc.)

Prepare a helper script zh_tokenize.py:

```python
import sys
import jieba

input_file = sys.argv[1]
output_file = sys.argv[2]

# Tokenize each line with jieba and write the tokens out space-separated.
with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for L in inp:
            L = L.rstrip()
            try:
                print(' '.join(jieba.cut(L, cut_all=False)), file=outp)
            except UnicodeDecodeError:
                pass
```

```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 100m -o wc zhwiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P11 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P11 -n1 -I "{}" python3 zh_tokenize.py "{}" "{}".tokenized
cat wc/**/*.tokenized > wiki_tokenized
```

The option `-b 100m` of wikiextractor sets the size of the data chunks, and the option `-P11` of xargs sets the number of worker processes. Change these values depending on the data and your environment.
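
The script `../remove_doc_and_file_tags.py` referenced in the pipeline above is not included in this diff. As a rough orientation, a minimal hypothetical sketch is shown below, assuming the script only needs to drop the `<doc ...>`/`</doc>` wrapper lines that wikiextractor emits and strip any leftover `[[File:...]]` markup; the actual script in this repository may do more.

```python
# Hypothetical stand-in for ../remove_doc_and_file_tags.py (the real script
# may differ): drop wikiextractor's <doc ...> / </doc> wrapper lines and
# remove residual [[File:...]] links.
import re
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

doc_tag = re.compile(r'^</?doc\b.*>\s*$')
file_link = re.compile(r'\[\[File:[^\]]*\]\]')

with open(input_file) as inp, open(output_file, 'w') as outp:
    for line in inp:
        line = line.rstrip('\n')
        if doc_tag.match(line):
            continue  # skip the per-article wrapper tags
        print(file_link.sub('', line), file=outp)
```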

(4) Build Doc2Vec model

Vocabulary size: 100K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 100000 wiki_tokenized_w100k
python3 ../train.py wiki_tokenized_w100k zhwiki-w100k-d100.model
```

Vocabulary size: 50K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 50000 wiki_tokenized_w50k
python3 ../train.py wiki_tokenized_w50k zhwiki-w50k-d100.model
```
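
The commands above use `../trim_vocab_to_size.py` and `../train.py`, neither of which appears in this diff. For orientation only, here is a minimal sketch of what the training step might look like, assuming gensim's Doc2Vec and 100-dimensional vectors (as the `d100` suffix in the model names suggests); the hyperparameters and I/O of the actual `train.py` may differ.

```python
# Hypothetical sketch of a train.py-like script using gensim's Doc2Vec.
# Assumptions: one whitespace-tokenized document per line in the input file,
# and vector_size=100 (suggested by the "d100" model names); other
# hyperparameters are placeholders.
import sys
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus_file = sys.argv[1]
model_file = sys.argv[2]

class TaggedCorpus:
    """Restartable iterator over the corpus, one TaggedDocument per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

corpus = TaggedCorpus(corpus_file)
model = Doc2Vec(vector_size=100, window=8, min_count=1, workers=4, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save(model_file)
```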
@@ -32,13 +32,13 @@ The option `xargs -P5` is the number of worker processes. Change the value depending on the data and your environment.
Vocabulary size: 100K words

```
-python3 remove_words_w_occurrence_less_than.py wiki 100000 wiki_w100k
-python3 ./train.py wiki_w100k enwiki-w100k-d100.model
+python3 ../trim_vocab_to_size.py wiki 100000 wiki_w100k
+python3 ../train.py wiki_w100k enwiki-w100k-d100.model
```

Vocabulary size: 50K words

```
-python3 remove_words_w_occurrence_less_than.py wiki 50000 wiki_w50k
-python3 ./train.py wiki_w50k enwiki-w50k-d100.model
+python3 ../trim_vocab_to_size.py wiki 50000 wiki_w50k
+python3 ../train.py wiki_w50k enwiki-w50k-d100.model
```
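
The renamed helper `../trim_vocab_to_size.py` (formerly `remove_words_w_occurrence_less_than.py`) is also not shown in this diff. A hypothetical minimal version, assuming it simply keeps the N most frequent tokens and drops everything else, could look like this; the real script may behave differently (for example, by choosing a frequency threshold rather than an exact vocabulary size).

```python
# Hypothetical stand-in for ../trim_vocab_to_size.py (the real script is not
# shown here): keep only the N most frequent tokens so the resulting corpus
# has a vocabulary of at most N words.
import sys
from collections import Counter

input_file = sys.argv[1]
vocab_size = int(sys.argv[2])
output_file = sys.argv[3]

# First pass: count token frequencies.
counts = Counter()
with open(input_file) as inp:
    for line in inp:
        counts.update(line.split())
kept = {w for w, _ in counts.most_common(vocab_size)}

# Second pass: drop tokens outside the kept vocabulary.
with open(input_file) as inp, open(output_file, 'w') as outp:
    for line in inp:
        print(' '.join(w for w in line.split() if w in kept), file=outp)
```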
@@ -21,7 +21,7 @@ curl -O https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 500m -o wc jawiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P5 -n1 -I "{}" python3 ./remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/* | xargs -P5 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P5 -n1 -I "{}" mecab -O wakati -o "{}".wakati "{}"
cat wc/**/*.wakati > wiki_wakati
```
@@ -33,13 +33,13 @@ cat wc/**/*.wakati > wiki_wakati
Vocabulary size: 100K words

```
-python3 remove_words_w_occurrence_less_than.py wiki_wakati 100000 wiki_wakati_w100k
-python3 ./train.py wiki_wakati_w100k jawiki-w100k-d100.model
+python3 ../trim_vocab_to_size.py wiki_wakati 100000 wiki_wakati_w100k
+python3 ../train.py wiki_wakati_w100k jawiki-w100k-d100.model
```

Vocabulary size: 50K words

```
-python3 remove_words_w_occurrence_less_than.py wiki_wakati 50000 wiki_wakati_w50k
-python3 ./train.py wiki_wakati_w50k jawiki-w50k-d100.model
+python3 ../trim_vocab_to_size.py wiki_wakati 50000 wiki_wakati_w50k
+python3 ../train.py wiki_wakati_w50k jawiki-w50k-d100.model
```
@@ -0,0 +1,68 @@
(reference: https://github.com/Kyubyong/wordvectors/blob/master/build_corpus.py)

(1) Preparation: Install dependencies:

```
pip3 install wikiextractor
pip3 install tweepy==3.7 # the latest version (4) is incompatible with konlpy
pip3 install konlpy
```

(2) Download the Korean Wikipedia data from https://dumps.wikimedia.org/kowiki/latest/

```
kowiki-latest-pages-articles.xml.bz2 20-Nov-2021 19:38 797283917
```

```
curl -O https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2
```

(3) Cleaning (remove XML tags, etc.)

Prepare a helper script ko_tokenize.py:

```python
import sys
from konlpy.tag import Kkma

input_file = sys.argv[1]
output_file = sys.argv[2]

k = Kkma()

# Tokenize each line with Kkma and write the morphemes out space-separated.
with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for L in inp:
            L = L.rstrip()
            try:
                print(' '.join(w for w, _ in k.pos(L)), file=outp)
            except UnicodeDecodeError:
                pass
```

```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 100m -o wc kowiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P11 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P11 -n1 -I "{}" python3 ko_tokenize.py "{}" "{}".tokenized
cat wc/**/*.tokenized > wiki_tokenized
```

The option `-b 100m` of wikiextractor sets the size of the data chunks, and the option `-P11` of xargs sets the number of worker processes. Change these values depending on the data and your environment.

(4) Build Doc2Vec model

Vocabulary size: 100K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 100000 wiki_tokenized_w100k
python3 ../train.py wiki_tokenized_w100k kowiki-w100k-d100.model
```

Vocabulary size: 50K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 50000 wiki_tokenized_w50k
python3 ../train.py wiki_tokenized_w50k kowiki-w50k-d100.model
```
