
Bash one-liners for cleaning long text columns in data science


Get your data into a TSV file, or into a separate file for each column if you want to increase the speed.
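
for example, a rough sketch of splitting a TSV into one file per column (data.tsv and the column_N.tsv names are just placeholders, substitute your own):

for c in $(seq 1 $(head -1 data.tsv | awk -F'\t' '{print NF}')); do cut -f"$c" data.tsv > "column_$c.tsv"; done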

I'll present filtering and accumulating one-liners that help you explore the data and find special cases.

first, get the column you need from your input file (here called data.tsv); for example, the fifth one:

cut -f5 data.tsv > column.tsv

note: to convert a CSV file to TSV, you can use tr ',' '\t' (a plain character swap, so it only works when fields don't contain quoted commas).
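
for example (assuming a simple CSV called data.csv with no quoted commas):

tr ',' '\t' < data.csv > data.tsv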

to see the frequency of each unique entry:

cat column.tsv | sort | uniq -c | sort -n

to see all unique characters:

cat column.tsv | grep -o . | sort | uniq -c | sort -n

to display all the unique characters on one screen in multiple columns (like ls does), append this to the previous pipeline:

| sed 's/ *[0-9]* //' | pr -65 -w130 -t

where -65 is the number of output columns, -w130 is the width of your terminal, and the sed strips the counts added by uniq -c
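
putting it together with the character-frequency pipeline above, the full command looks like:

cat column.tsv | grep -o . | sort | uniq -c | sort -n | sed 's/ *[0-9]* //' | pr -65 -w130 -t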

once you've found characters that shouldn't be there, just use grep to see the lines containing them (note: fgrep is much faster than grep for fixed strings, see https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html):

| fgrep '"'

or you can loop through a bunch of suspicious symbols (the sed in the loop below drops comment lines starting with # before searching; the filename is the example dataset used here, substitute your own):

for i in ≥ ─ ± ² Å Ā æ č ğ ī Ü Ε ε Ζ ζ Η η Λ λ Μ Ξ ξ Ρ Σ σ Υ υ Φ φ Ψ ψ Ω ω ´ ≠ ≤ µ _ ¿ ɑ; do echo "$i"; cat cedict_1_0_ts_utf-8_mdbg.txt | sed '/^#.*/d' | fgrep "$i"; done