Bash one-liners for cleaning long text columns in data science
Get your data into a TSV file, or into a separate file for each column if you want to increase speed.
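If you do want one file per column, a minimal sketch (assuming a tab-separated file named data.tsv; the name is just a placeholder):
ncols=$(head -n 1 data.tsv | awk -F'\t' '{print NF}')    # number of columns in the header
for n in $(seq 1 "$ncols"); do cut -f"$n" data.tsv > "column_$n.tsv"; done    # one file per column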
I'll present filtering and aggregating one-liners that help you explore the data and find special cases.
First, get the column you need, for example the fifth one (data.tsv stands for your input file):
cut -f5 data.tsv > column.tsv
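If you're not sure which column index you need, a quick check (again treating data.tsv as a placeholder name):
head -n 1 data.tsv | tr '\t' '\n' | cat -n    # list the header fields with their indices
awk -F'\t' '{print NF}' data.tsv | sort | uniq -c    # confirm every row has the same number of fields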
Note: to convert a CSV file to TSV, you can use tr ',' '\t' (this naive swap breaks if any field contains a quoted comma).
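For completeness, a sketch of the full conversion with a cheap sanity check first (file names are placeholders):
fgrep -c '"' data.csv    # if this prints more than 0, the file likely has quoted fields and needs a real CSV parser
tr ',' '\t' < data.csv > data.tsv    # naive conversion: every comma becomes a tab, quoting is ignored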
To see the frequency of unique entries:
cat column.tsv | sort | uniq -c | sort -n
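If the column has too many distinct values to scroll through, reverse the final sort and keep the top of the list (the 20 is arbitrary):
sort column.tsv | uniq -c | sort -rn | head -n 20    # 20 most frequent entries, most common first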
To see all unique characters:
cat column.tsv | grep -o . | sort | uniq -c | sort -n
To display all the unique characters on one screen in multiple columns (like ls does), append:
| sed 's/ *[0-9]* //' | pr -65 -w130 -t
where -65 is the number of columns and -w130 is the width of your terminal (the sed strips the counts produced by uniq -c).
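An alternative sketch that skips the count-stripping sed, using sort -u and the column utility from util-linux (130 is again your terminal width):
grep -o . column.tsv | sort -u | column -c 130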
Once you've found characters that shouldn't be there, just use grep to see the lines containing them (note: fgrep is way faster, see https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html):
| fgrep '"'
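If you just want every line containing anything outside printable ASCII, one option (the -P variant assumes GNU grep built with PCRE support):
grep -nP '[^\x00-\x7F]' column.tsv    # lines with non-ASCII bytes, with line numbers
LC_ALL=C grep -n '[^ -~]' column.tsv    # no PCRE needed; also catches control characters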
Or you can loop through a bunch of symbols:
for i in ≥ ─ ± ² Å Ā æ č ğ ī Ü Ε ε Ζ ζ Η η Λ λ Μ Ξ ξ Ρ Σ σ Υ υ Φ φ Ψ ψ Ω ω ´ ≠ ≤ µ _ ¿ ɑ; do echo "$i"; sed '/^#.*/d' cedict_1_0_ts_utf-8_mdbg.txt | fgrep "$i"; done
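Once you've decided which characters to drop or replace, a possible cleanup step (the character sets and file names here are only examples):
sed 's/[±²Å]//g' column.tsv > column_clean.tsv    # delete unwanted characters (GNU sed handles multibyte characters in a UTF-8 locale)
sed 's/≥/>=/g; s/≤/<=/g' column.tsv > column_clean.tsv    # or replace them with ASCII equivalents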