Fix double quotation mark(") in each column for Pandas csv parser compatibility. #8

dalgarak · 2023-01-18T01:00:15Z

This commit fixes mismatched double quotation mark (") pairs in each column without changing the meaning of the original sentences. It also fixes issues #2, #4, and #5.

pandas.read_csv() is often used to parse TSV files in many projects, including Hugging Face datasets (e.g. https://huggingface.co/datasets/kor_nlu). However, it treats a double quotation mark (") at the beginning of a column as the start of a quoted field in which special characters are escaped (the same as the CSV convention). As a result, a TAB character inside such a field is not treated as a column separator.
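The effect can be reproduced with a minimal sketch. The rows below are hypothetical and not taken from KorNLI; the sketch only assumes that pandas is installed and uses the documented `sep` and `quoting` parameters of `pandas.read_csv()`.

```python
import csv
import io

import pandas as pd

# Hypothetical TSV content (not from KorNLI): in the first data row a quote
# opens sentence1 and the next quote only appears at the end of sentence2.
raw = (
    'sentence1\tsentence2\tgold_label\n'
    '"Go home\tHe said go home"\tentailment\n'
    'She left.\tShe stayed.\tcontradiction\n'
)

# Default quoting: '"' opens a quoted field, so the TAB inside it is kept as data
# and the row ends up with two fields instead of three.
default_df = pd.read_csv(io.StringIO(raw), sep="\t")

# QUOTE_NONE: quotes are ordinary characters, so every TAB separates columns.
literal_df = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)

print(repr(default_df.loc[0, "sentence1"]))
print(repr(literal_df.loc[0, "sentence1"]))
```

With default quoting, the first row's two sentences are merged into one column and the label shifts left. Passing `quoting=csv.QUOTE_NONE` is a reader-side workaround; this pull request instead repairs the quotes in the data so that the default settings work.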

The "multinli" dataset (KorNLI/multinli.train.ko.tsv) is not used in Hugging Face datasets because of a related error (see https://huggingface.co/datasets/kor_nlu/blob/main/kor_nlu.py#L28). For the same reason, the number of examples in the test set differs from the published one.
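One way to locate such rows is to look for columns containing an odd number of double quotation marks. The helper below is a hypothetical sketch for illustration, not the procedure used in this commit, and the sample rows are invented.

```python
def find_unbalanced_quotes(lines, sep="\t"):
    """Return (line_number, column_index, cell) for cells with an odd count of '"'."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        # Split on the raw separator so quotes are never interpreted.
        for col, cell in enumerate(line.rstrip("\n").split(sep)):
            if cell.count('"') % 2 == 1:
                hits.append((line_no, col, cell))
    return hits


# Hypothetical sample rows (not from KorNLI).
sample = [
    'sentence1\tsentence2\tgold_label\n',
    '"Go home\tHe said go home"\tentailment\n',
    'She said "no".\tShe refused.\tentailment\n',
]

hits = find_unbalanced_quotes(sample)
for line_no, col, cell in hits:
    print(f"line {line_no}, column {col}: {cell!r}")
```

To scan a real file, the same function could be called on an open file handle, for example `find_unbalanced_quotes(open(path, encoding="utf-8"))`, where `path` is whichever TSV file is being checked.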

Fix double quotation mark(") in each column for pandas csv parser compatibility.

85762e6

ys7yoo mentioned this pull request Mar 26, 2023

parsing error with pandas ys7yoo/kor-nlu-datasets#1

Open

ys7yoo pushed a commit to ys7yoo/kor-nlu-datasets that referenced this pull request Mar 26, 2023

fixed double quotaion mark error by dalgarak: kakaobrain#8

41be6b0

fix doublequote mismath to parse correctly

469aced
