-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathREADME.Rmd
92 lines (57 loc) · 2.28 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# KoSpacing
<!-- badges: start -->
[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)
<!-- badges: end -->
R package for automatic Korean word spacing.
Python verson can be found [here](https://github.com/haven-jeon/PyKoSpacing).
#### Introduction
Word spacing is one of the important parts of the preprocessing of Korean text analysis. Accurate spacing greatly affects the accuracy of subsequent text analysis. `KoSpacing` has fairly accurate automatic word spacing performance, especially good for online text originated from SNS or SMS.
For example.
"아버지가방에들어가신다." can be spaced both of below.
1. "아버지가 방에 들어가신다." means "My father enters the room."
1. "아버지 가방에 들어가신다." means "My father goes into the bag."
Common sense, the first is the right answer.
`KoSpacing` is based on Deep Learning model trained from large corpus(more than 100 million NEWS articles from [Chan-Yub Park](https://github.com/mrchypark)).
#### Performance
| Test Set | Accuracy |
|---|---|
| Sejong(colloquial style) Corpus(1M) | 97.1% |
| OOOO(literary style) Corpus(3M) | 94.3% |
- Accuracy = # correctly spaced characters/# characters in the test data.
- Might be increased performance if normalize compound words.
#### Install
To install from GitHub, use
```
install.packages('remotes')
remotes::install_github('haven-jeon/KoSpacing')
library(KoSpacing)
set_env()
```
#### Example
```{r}
library(KoSpacing)
spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
```
#### Model Architecture
![](arch.png)
#### Citation
```markdowns
@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}
```