
Text Cleaning

Yimin Jing edited this page Feb 18, 2022 · 7 revisions

delete_escape_character

import takin

zh_text = "中国是一个美丽的地方\n请告诉我你在哪儿。\n我一定会去找你\t在我的怀里\t在你的眼里"
en_text = "Today is sunday\nwe are happy\nwe are fun."
print(takin.delete_escape_character(zh_text, lang="zh", add_punc=False))
print(takin.delete_escape_character(zh_text, lang="zh", add_punc=True))
print(takin.delete_escape_character(en_text, lang="en", add_punc=False))
print(takin.delete_escape_character(en_text, lang="en", add_punc=True))

>>> 中国是一个美丽的地方请告诉我你在哪儿我一定会去找你在我的怀里在你的眼里
>>> 中国是一个美丽的地方请告诉我你在哪儿我一定会去找你在我的怀里在你的眼里
>>> Today is sundaywe are happywe are fun.
>>> Today is sunday. we are happy. we are fun.
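The English outputs above suggest that escape characters are dropped and, with add_punc=True, replaced by sentence punctuation. The following is a minimal sketch of that idea in plain Python, not Takin's actual implementation. The name strip_escapes, the sep parameter, and the set of characters treated as escapes or as closing punctuation are all assumptions made for illustration.

```python
import re

# Assumption: only newline, carriage return and tab count as escape characters.
_ESCAPES = re.compile(r"[\n\r\t]+")
# Assumption: a segment ending in one of these needs no extra punctuation.
_END_PUNC = ".!?,;:。！？，；："


def strip_escapes(text, add_punc=False, sep=". "):
    """Hypothetical helper: remove escape characters from text.

    With add_punc=False the segments are simply concatenated.
    With add_punc=True each removed escape run is replaced by `sep`,
    or by a single space if the preceding segment already ends in punctuation.
    """
    # Split on runs of escape characters and drop empty segments.
    parts = [p for p in _ESCAPES.split(text) if p]
    if not add_punc:
        return "".join(parts)

    out = []
    for i, part in enumerate(parts):
        out.append(part)
        if i < len(parts) - 1:  # no separator after the last segment
            out.append(" " if part[-1] in _END_PUNC else sep)
    return "".join(out)


en_text = "Today is sunday\nwe are happy\nwe are fun."
print(strip_escapes(en_text))
print(strip_escapes(en_text, add_punc=True))
```

For Chinese text a caller would probably pass a different separator such as sep="，", but how Takin handles that case is not shown here.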

delete_extra_whitespace

import takin

# Sample inputs with irregular spacing between characters, words and punctuation.
zh_text = "我 们  都非   常快 乐   。 "
en_text = "Takin  ,    is very   useful  .    "
print(takin.delete_extra_whitespace(zh_text, lang="zh"))
print(takin.delete_extra_whitespace(en_text, lang="en"))

>>> 我们都非常快乐
>>> Takin, is very useful.
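For readers curious how this kind of normalization can be done, here is a rough sketch using only the standard re module. It is not Takin's code. The function name squeeze_whitespace and the exact rules (remove all whitespace for "zh", collapse runs and drop spaces before punctuation for other languages) are assumptions inferred from the outputs above, and details such as the handling of trailing punctuation may differ from Takin.

```python
import re


def squeeze_whitespace(text, lang="en"):
    """Hypothetical helper: normalize whitespace in text.

    lang="zh": remove every whitespace character, since Chinese
    does not separate words with spaces.
    otherwise: collapse whitespace runs to one space, trim the ends,
    and remove any space that sits directly before punctuation.
    """
    if lang == "zh":
        return re.sub(r"\s+", "", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop the space left in front of common punctuation marks.
    return re.sub(r"\s+([,.!?;:])", r"\1", text)


print(squeeze_whitespace("Takin  ,    is very   useful  .    ", lang="en"))
print(squeeze_whitespace("我 们  都非   常快 乐 ", lang="zh"))
```

Unlike the Takin output shown above, this sketch leaves punctuation itself untouched and only changes the spacing around it.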