docs: add/revise the instructions for creating Doc2Vec models
1 parent bc88873 · commit 8633831
Showing 5 changed files with 142 additions and 9 deletions.
65 changes: 65 additions & 0 deletions
making_doc2vec_model/how_to_make_chinese_doc2vec_model_from_zhwiki_data.md
(reference: https://github.com/Kyubyong/wordvectors/blob/master/build_corpus.py)

(1) Preparation: Install dependencies:

```
pip3 install wikiextractor
pip3 install jieba
```

(2) Download the Chinese Wikipedia data from https://dumps.wikimedia.org/zhwiki/latest/

```
zhwiki-latest-pages-articles.xml.bz2 20-Nov-2021 20:50 2296371453
```

```
curl -O https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
```

(3) Cleaning (remove XML tags, etc.)

Prepare a helper script zh_tokenize.py:

```python
import sys
import jieba

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for line in inp:
            line = line.rstrip()
            try:
                # Segment the line into space-separated words with jieba
                # (precise mode) and write it to the output file.
                print(' '.join(jieba.cut(line, cut_all=False)), file=outp)
            except UnicodeDecodeError:
                pass
```
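
A quick way to sanity-check the segmentation before running it over the whole dump (the sample sentence is just an illustration):

```python
import jieba

# Precise mode segments the sentence into words, roughly:
# 我 / 来到 / 北京 / 清华大学
print(' '.join(jieba.cut('我来到北京清华大学', cut_all=False)))
```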

```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 100m -o wc zhwiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P11 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P11 -n1 -I "{}" python3 zh_tokenize.py "{}" "{}".tokenized
cat wc/**/*.tokenized > wiki_tokenized
```

The `-b 100m` option of wikiextractor sets the size of the output data chunks, and the `-P11` option of xargs sets the number of worker processes. Adjust these values to your data and environment.
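
The pipeline above also calls `../remove_doc_and_file_tags.py`, which lives one directory up and is not part of this commit. A minimal sketch of what such a helper might look like, assuming it only strips the `<doc ...>`/`</doc>` wrappers that wikiextractor puts around each article (the actual script may do more):

```python
# Hypothetical sketch of remove_doc_and_file_tags.py; the real helper in the
# repository is not shown in this commit and may differ.
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for line in inp:
            stripped = line.rstrip()
            # Drop the <doc ...> / </doc> wrapper lines emitted by
            # wikiextractor, keeping only the article text itself.
            if stripped.startswith('<doc') or stripped.startswith('</doc>'):
                continue
            print(stripped, file=outp)
```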

(4) Build Doc2Vec model

Vocabulary size: 100K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 100000 wiki_tokenized_w100k
python3 ../train.py wiki_tokenized_w100k zhwiki-w100k-d100.model
```

Vocabulary size: 50K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 50000 wiki_tokenized_w50k
python3 ../train.py wiki_tokenized_w50k zhwiki-w50k-d100.model
```
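
`../train.py` is also outside this commit. A minimal sketch of what a training script of this shape might do, assuming gensim's Doc2Vec is used and that `d100` in the model names means 100-dimensional vectors (the actual script and its hyperparameters may differ):

```python
# Hypothetical sketch of train.py; the real script is not shown in this commit.
import sys
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus_file = sys.argv[1]   # one tokenized document per line
model_file = sys.argv[2]

# Turn each line of the corpus into a TaggedDocument, tagged with its line number.
documents = []
with open(corpus_file) as inp:
    for i, line in enumerate(inp):
        documents.append(TaggedDocument(words=line.split(), tags=[i]))

# vector_size=100 matches the "d100" suffix of the model file names (assumption).
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)
model.save(model_file)
```

A saved model can later be loaded with gensim's `Doc2Vec.load()` and queried with `infer_vector()` on a tokenized document.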
68 changes: 68 additions & 0 deletions
making_doc2vec_model/how_to_make_korean_doc2vec_model_from_kowiki_data.md
(reference: https://github.com/Kyubyong/wordvectors/blob/master/build_corpus.py)

(1) Preparation: Install dependencies:

```
pip3 install wikiextractor
pip3 install tweepy==3.7  # the latest version (4.x) is incompatible with konlpy
pip3 install konlpy
```

(2) Download the Korean Wikipedia data from https://dumps.wikimedia.org/kowiki/latest/

```
kowiki-latest-pages-articles.xml.bz2 20-Nov-2021 19:38 797283917
```

```
curl -O https://dumps.wikimedia.org/kowiki/latest/kowiki-latest-pages-articles.xml.bz2
```

(3) Cleaning (remove XML tags, etc.)

Prepare a helper script ko_tokenize.py:

```python
import sys
from konlpy.tag import Kkma

input_file = sys.argv[1]
output_file = sys.argv[2]

k = Kkma()

with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for line in inp:
            line = line.rstrip()
            try:
                # POS-tag the line with Kkma and keep only the morphemes,
                # joined with spaces, as the tokenized output.
                print(' '.join(w for w, _ in k.pos(line)), file=outp)
            except UnicodeDecodeError:
                pass
```
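
Similarly, a quick way to sanity-check the Korean tokenizer on one sentence before processing the whole dump (the sample sentence is just an illustration):

```python
from konlpy.tag import Kkma

k = Kkma()
# k.pos(...) returns (morpheme, POS-tag) pairs; only the morphemes are kept here.
print(' '.join(w for w, _ in k.pos('한국어 위키백과 문서입니다')))
```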

```
mkdir wc
python3 -m wikiextractor.WikiExtractor -b 100m -o wc kowiki-latest-pages-articles.xml.bz2
ls wc/**/* | xargs -P11 -n1 -I "{}" python3 ../remove_doc_and_file_tags.py "{}" "{}".rdft
ls wc/**/*.rdft | xargs -P11 -n1 -I "{}" python3 ko_tokenize.py "{}" "{}".tokenized
cat wc/**/*.tokenized > wiki_tokenized
```

The `-b 100m` option of wikiextractor sets the size of the output data chunks, and the `-P11` option of xargs sets the number of worker processes. Adjust these values to your data and environment.

(4) Build Doc2Vec model

Vocabulary size: 100K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 100000 wiki_tokenized_w100k
python3 ../train.py wiki_tokenized_w100k kowiki-w100k-d100.model
```

Vocabulary size: 50K words

```
python3 ../trim_vocab_to_size.py wiki_tokenized 50000 wiki_tokenized_w50k
python3 ../train.py wiki_tokenized_w50k kowiki-w50k-d100.model
```
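
`../trim_vocab_to_size.py`, called in both guides, is likewise outside this commit. A minimal sketch of what such a helper might do, assuming it keeps only the N most frequent tokens and drops everything else from the corpus (the actual script may handle out-of-vocabulary tokens differently):

```python
# Hypothetical sketch of trim_vocab_to_size.py; the real helper is not shown
# in this commit and may differ.
import sys
from collections import Counter

input_file = sys.argv[1]
vocab_size = int(sys.argv[2])
output_file = sys.argv[3]

# First pass: count token frequencies over the whole corpus.
counts = Counter()
with open(input_file) as inp:
    for line in inp:
        counts.update(line.split())

# Keep the vocab_size most frequent tokens.
vocab = {w for w, _ in counts.most_common(vocab_size)}

# Second pass: rewrite the corpus, dropping out-of-vocabulary tokens.
with open(output_file, 'w') as outp:
    with open(input_file) as inp:
        for line in inp:
            print(' '.join(w for w in line.split() if w in vocab), file=outp)
```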
File renamed without changes.