docs: add/revise the instructions for creating Doc2Vec models
tos-kamiya committed Nov 26, 2021
1 parent bc88873 commit 8633831
Showing 5 changed files with 142 additions and 9 deletions.
@@ -0,0 +1,65 @@
(reference: https://github.com/Kyubyong/wordvectors/blob/master/build_corpus.py)

(1) Preparation: Install dependencies:

```
pip3 install wikiextractor
pip3 install jieba
```

(2) Download the Chinese Wikipedia data from https://dumps.wikimedia.org/zhwiki/latest/

```
zhwiki-latest-pages-articles.xml.bz2 20-Nov-2021 20:50 2296371453
```

```
curl -O https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
```

(3) Cleaning (remove XML tags, etc.)

Prepare a helper script zh_tokenize.py:

```python
import sys
import jieba

input_file = sys.argv[1]
output_file = sys.argv[2]

# Tokenize each line with jieba and write the tokens out space-separated.
with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for L in inp:
            L = L.rstrip()
            try:
                print(' '.join(jieba.cut(L, cut_all=False)), file=outp)
            except UnicodeDecodeError:
                pass
```

```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 100m -o wc zhwiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P11 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P11 -n1 -I "{}" python3 zh_tokenize.py "{}" "{}".tokenized
cat wc/**/*.tokenized > wiki_tokenized
```

The option `-b 100m` of wikiextractor sets the size of the data chunks, and the option `-P11` of xargs sets the number of worker processes. Change these values depending on the data and your environment.
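
The script `../remove_doc_and_file_tags.py` referenced in the pipeline above is not included in this diff. As a rough orientation, a minimal hypothetical sketch is shown below, assuming the script only needs to drop the `<doc ...>`/`</doc>` wrapper lines that wikiextractor emits and strip any leftover `[[File:...]]` markup; the actual script in this repository may do more.

```python
# Hypothetical stand-in for ../remove_doc_and_file_tags.py (the real script
# may differ): drop wikiextractor's <doc ...> / </doc> wrapper lines and
# remove residual [[File:...]] links.
import re
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

doc_tag = re.compile(r'^</?doc\b.*>\s*$')
file_link = re.compile(r'\[\[File:[^\]]*\]\]')

with open(input_file) as inp, open(output_file, 'w') as outp:
    for line in inp:
        line = line.rstrip('\n')
        if doc_tag.match(line):
            continue  # skip the per-article wrapper tags
        print(file_link.sub('', line), file=outp)
```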

(4) Build Doc2Vec model

Vocabulary size: 100K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 100000 wiki_tokenized_w100k
python3 ../train.py wiki_tokenized_w100k zhwiki-w100k-d100.model
```

Vocabulary size: 50K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 50000 wiki_tokenized_w50k
python3 ../train.py wiki_tokenized_w50k zhwiki-w50k-d100.model
```
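
The commands above use `../trim_vocab_to_size.py` and `../train.py`, neither of which appears in this diff. For orientation only, here is a minimal sketch of what the training step might look like, assuming gensim's Doc2Vec and 100-dimensional vectors (as the `d100` suffix in the model names suggests); the hyperparameters and I/O of the actual `train.py` may differ.

```python
# Hypothetical sketch of a train.py-like script using gensim's Doc2Vec.
# Assumptions: one whitespace-tokenized document per line in the input file,
# and vector_size=100 (suggested by the "d100" model names); other
# hyperparameters are placeholders.
import sys
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus_file = sys.argv[1]
model_file = sys.argv[2]

class TaggedCorpus:
    """Restartable iterator over the corpus, one TaggedDocument per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

corpus = TaggedCorpus(corpus_file)
model = Doc2Vec(vector_size=100, window=8, min_count=1, workers=4, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save(model_file)
```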
@@ -32,13 +32,13 @@ The option `xargs -P5` is the number of worker processes. Change the value depending on the data and your environment.
Vocabulary size: 100K words

```
-python3 remove_words_w_occurrence_less_than.py wiki 100000 wiki_w100k
-python3 ./train.py wiki_w100k enwiki-w100k-d100.model
+python3 ../trim_vocab_to_size.py wiki 100000 wiki_w100k
+python3 ../train.py wiki_w100k enwiki-w100k-d100.model
```

Vocabulary size: 50K words

```
-python3 remove_words_w_occurrence_less_than.py wiki 50000 wiki_w50k
-python3 ./train.py wiki_w50k enwiki-w50k-d100.model
+python3 ../trim_vocab_to_size.py wiki 50000 wiki_w50k
+python3 ../train.py wiki_w50k enwiki-w50k-d100.model
```
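
The renamed helper `../trim_vocab_to_size.py` (formerly `remove_words_w_occurrence_less_than.py`) is also not shown in this diff. A hypothetical minimal version, assuming it simply keeps the N most frequent tokens and drops everything else, could look like this; the real script may behave differently (for example, by choosing a frequency threshold rather than an exact vocabulary size).

```python
# Hypothetical stand-in for ../trim_vocab_to_size.py (the real script is not
# shown here): keep only the N most frequent tokens so the resulting corpus
# has a vocabulary of at most N words.
import sys
from collections import Counter

input_file = sys.argv[1]
vocab_size = int(sys.argv[2])
output_file = sys.argv[3]

# First pass: count token frequencies.
counts = Counter()
with open(input_file) as inp:
    for line in inp:
        counts.update(line.split())
kept = {w for w, _ in counts.most_common(vocab_size)}

# Second pass: drop tokens outside the kept vocabulary.
with open(input_file) as inp, open(output_file, 'w') as outp:
    for line in inp:
        print(' '.join(w for w in line.split() if w in kept), file=outp)
```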
@@ -21,7 +21,7 @@ curl -O https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 500m -o wc jawiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P5 -n1 -I "{}" python3 ./remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/* | xargs -P5 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P5 -n1 -I "{}" mecab -O wakati -o "{}".wakati "{}"
cat wc/**/*.wakati > wiki_wakati
```
@@ -33,13 +33,13 @@ cat wc/**/*.wakati > wiki_wakati
Vocabulary size: 100K words

```
-python3 remove_words_w_occurrence_less_than.py wiki_wakati 100000 wiki_wakati_w100k
-python3 ./train.py wiki_wakati_w100k jawiki-w100k-d100.model
+python3 ../trim_vocab_to_size.py wiki_wakati 100000 wiki_wakati_w100k
+python3 ../train.py wiki_wakati_w100k jawiki-w100k-d100.model
```

Vocabulary size: 50K words

```
-python3 remove_words_w_occurrence_less_than.py wiki_wakati 50000 wiki_wakati_w50k
-python3 ./train.py wiki_wakati_w50k jawiki-w50k-d100.model
+python3 ../trim_vocab_to_size.py wiki_wakati 50000 wiki_wakati_w50k
+python3 ../train.py wiki_wakati_w50k jawiki-w50k-d100.model
```
@@ -0,0 +1,68 @@
(reference: https://github.com/Kyubyong/wordvectors/blob/master/build_corpus.py)

(1) Preparation: Install dependencies:

```
pip3 install wikiextractor
pip3 install tweepy==3.7 # the latest version (4) is incompatible with konlpy
pip3 install konlpy
```

(2) Download the Korean Wikipedia data from https://dumps.wikimedia.org/kowiki/latest/

```
kowiki-latest-pages-articles.xml.bz2 20-Nov-2021 19:38 797283917
```

```
curl -O https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2
```

(3) Cleaning (remove XML tags, etc.)

Prepare a helper script ko_tokenize.py:

```python
import sys
from konlpy.tag import Kkma

input_file = sys.argv[1]
output_file = sys.argv[2]

k = Kkma()

# Tokenize each line with Kkma and write the morphemes out space-separated.
with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for L in inp:
            L = L.rstrip()
            try:
                print(' '.join(w for w, _ in k.pos(L)), file=outp)
            except UnicodeDecodeError:
                pass
```

```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 100m -o wc kowiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P11 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P11 -n1 -I "{}" python3 ko_tokenize.py "{}" "{}".tokenized
cat wc/**/*.tokenized > wiki_tokenized
```

The option `-b 100m` of wikiextractor sets the size of the data chunks, and the option `-P11` of xargs sets the number of worker processes. Change these values depending on the data and your environment.

(4) Build Doc2Vec model

Vocabulary size: 100K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 100000 wiki_tokenized_w100k
python3 ../train.py wiki_tokenized_w100k kowiki-w100k-d100.model
```

Vocabulary size: 50K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 50000 wiki_tokenized_w50k
python3 ../train.py wiki_tokenized_w50k kowiki-w50k-d100.model
```
