play0137/Traditional_Chinese_word_embedding

Overview

Use PTT (a bulletin board system popular in Taiwan) and Chinese Wikipedia corpora to build count-based and prediction-based word embeddings.
On word similarity/relatedness evaluation tasks, these embeddings outperform other publicly available pre-trained Chinese word embeddings.
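Similarity/relatedness evaluations of this kind typically score each word pair by the cosine similarity of its vectors and then correlate the scores with human ratings (commonly Spearman's rho). A minimal sketch with toy vectors and made-up ratings (not the actual benchmark data or these embeddings):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d vectors and hypothetical human ratings, for illustration only.
vecs = {
    "貓": np.array([1.0, 0.2]),
    "狗": np.array([0.9, 0.3]),
    "車": np.array([-0.5, 1.0]),
}
pairs = [("貓", "狗"), ("貓", "車")]
human_ratings = [8.5, 1.2]  # higher = more related (toy values)

model_scores = [cosine(vecs[a], vecs[b]) for a, b in pairs]
rho, _ = spearmanr(model_scores, human_ratings)  # rank correlation
```

A higher rho means the embedding's similarity ranking agrees more closely with human judgments.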

Chinese Word Embeddings

Two sets of pre-trained embeddings are available for download; their hyperparameters are listed below.

Chinese_word_embedding_count_based

| Hyperparameter | Setting |
| --- | --- |
| Frequency weighting | SPPMI_k10 |
| Window size | 3 |
| Dimensions | 700 |
| Remove first k dimensions | 6 |
| Weighting exponent | 0.5 |
| Discover new words | no |
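The count-based settings above correspond to the well-known SPPMI–SVD recipe (Levy & Goldberg): shift the PMI matrix by log k, clip at zero, factorize with SVD, weight the singular values by an exponent, and drop the leading dimensions. A minimal sketch, assuming a dense co-occurrence count matrix (a real vocabulary would require sparse matrices and truncated SVD):

```python
import numpy as np

def sppmi_svd(cooc, k_shift=10, dims=700, drop_first=6, exponent=0.5):
    """Count-based embeddings from a (vocab x vocab) co-occurrence matrix.

    Defaults mirror the table above: SPPMI with shift k=10, 700 dimensions,
    first 6 dimensions removed, singular values weighted by 0.5.
    """
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0                    # zero counts -> 0
    sppmi = np.maximum(pmi - np.log(k_shift), 0.0)  # shifted positive PMI
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    vecs = u * (s ** exponent)       # weighting exponent on singular values
    return vecs[:, drop_first:dims]  # remove the first k dimensions
```

Removing the leading dimensions and using a 0.5 exponent are common tweaks that downweight dominant frequency directions, which tends to help on similarity benchmarks.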

Chinese_word_embedding_CBOW

| Hyperparameter | Setting |
| --- | --- |
| Window size | 2 |
| Dimensions | 500 |
| Model | CBOW |
| Learning rate | 0.025 |
| Sampling rate | 0.00001 |
| Negative samples | 2 |
| Discover new words | no |

Reference

If you use these Chinese word embeddings in your work, please cite this thesis:

Ying-Ren Chen (2021). Generate coherent text using semantic embedding, common sense templates and Monte-Carlo tree search methods (Master's thesis, National Tsing Hua University, Hsinchu, Taiwan).

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
CC BY-SA 4.0
