albert-decatur/dotfiles

dotfiles

Bash (or zsh!) functions for data science.
Check dpkg.md for a list of apt packages that are awesome. On a Debian-based system (like Ubuntu!) you can install them all with:

sudo apt-get update && sudo apt-get install $( cat dpkg.md | sed 's:*.*\|#.*::g' | grep -vE "^\s" | tr '\n' ' ' )
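
Before installing, it can help to preview what that pipeline extracts. The sketch below wraps the same sed and grep filters in a hypothetical helper, `dpkg_md_packages` (not part of this repo), that prints one package name per line. It assumes GNU sed and GNU grep, and that dpkg.md lists one package per line with `#` comments and `*` notes; the real file may be laid out differently.

```shell
# Hypothetical helper (not in the repo): list the package names that the
# install one-liner would pass to apt-get, without installing anything.
dpkg_md_packages() {
  # $1: path to the package list; defaults to dpkg.md in the current directory
  sed 's:*.*\|#.*::g' "${1:-dpkg.md}" |  # drop "* ..." notes and "# ..." comments
    grep -vE "^\s" |                     # drop lines that start with whitespace
    tr '\n' ' ' |                        # same flattening step as the one-liner
    tr -s ' ' '\n' |                     # back to one name per line, squeezing gaps
    grep -v '^$'                         # drop the empty leading line, if any
}

# Usage: dpkg_md_packages dpkg.md | less
```

If the list looks right, the original one-liner can then be run unchanged.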
| function | purpose | example | prerequisite |
| --- | --- | --- | --- |
| latest | print the name of the most recently modified file in the current directory | `latest` | |
| latest-exec | print the path to the most recently modified n executable files under your $PATH. defaults to an n of 1 | `latest-exec 3` | |
| listold | list the oldest files over n MB in the current directory | `listold 100` | parallel, mawk |
| maybedups | print a TSV of possible duplicates among the largest n files under the current directory. NB: the files might not be true duplicates, but the check is fast | `maybedups 10 \| csvlook -t \| vim -` | tawk |
| clipboard | pipe text to clipboard | `cat foo \| clipboard` | xclip |
| pdf_subset | take a page range from a PDF | `pdf_subset in.pdf 23-41 out.pdf` | pdftk |
| ngrams | get ngrams of length n from a column, treating records as documents | `cat foo.tsv \| cut -f3 \| ngrams 2` | |
| plotbars | use ggplot to make a PNG bar graph of a TSV. high res is an option! | `cat foo.tsv \| tawk '{print $2}' \| sortfreq \| sortkh "-k2 n" \| plotbars year count "title" 20 \| feh -` | ggplot2, Rio |
| dumbplot | use gnuplot to graph one or two numeric fields in the terminal. removes the header if found. plots points by default but can also graph lines. inspired by jeroenjanssens | `cat foo.tsv \| cut -f3,4 \| dumbplot` OR `cat foo.tsv \| cut -f4 \| dumbplot lines` | gnuplot |
| table2tsv | convert any Gnumeric compatible table to TSV | `cat foo.csv \| table2tsv` | Gnumeric |
| table2csv | convert any csvkit compatible table to CSV | `cat foo.tsv \| table2csv` | csvkit |
| tsv2githubmd | print a TSV as a GitHub flavored markdown table | `cat foo.tsv \| tsv2githubmd >> README.md` | |
| tsv2redis | get redis hashes from each record of a TSV | `cat foo.tsv \| tsv2redis && echo "hgetall 1"` | redis-server, redis-tools, GNU moreutils, tawk, trim, mawk |
| joinmany_csv | join an arbitrary number of TSVs on a given (identically named) field | `joinmany_csv "a b.csv 1.csv d e f" "project id" "full outer join"` | txt2pgsql.pl, PostgreSQL |
| joinmany_csv | join an arbitrary number of CSVs on a given (identically named) field. Note that this cannot use OUTER or RIGHT joins because it relies on SQLite | `joinmany_csv "/tmp/a /tmp/b /tmp/c /tmp/d /tmp/e /tmp/f" project_id inner tabs` | csv2sqlite.py, SQLite3 |
| joinmany_psql | join an arbitrary number of postgres tables on a given (identically named) field | `joinmany_psql "a b c d e f" project_id "full outer join" db_name` | PostgreSQL |
| psql_listcols | for a PostgreSQL DB, print a TSV of all table names and their corresponding field names | `psql_listcols my_db` | parallel, mawk |
| samplekh | get n random records and keep your header line on top | `cat foo.tsv \| samplekh 3000` | |
| sortkh | sort a TSV using UNIX sort options, keeping the header in place | `cat foo.tsv \| sortkh "-k2 -n"` | |
| sortfreq | print counts of unique values, descending. keeps the header in place | `cat foo.tsv \| tawk '{ print $4 }' \| sortfreq` | |
| col_sort | use UNIX sort flags (eg -n or -d) to reorder TSV fields | `col_sort -n foo.tsv \| sponge foo.tsv` | mawk, csvkit, table2tsv |
| col_extra | print records that have content beyond the expected number of fields for delimited text | `cat foo.tsv \| col_extra 19` | mawk |
| col_swap | switch the position of two columns in delimited text | `cat foo.tsv \| col_swap 3 4 \| sponge foo.tsv` | mawk |
| funky_chars | return the count for each unique non-alpha non-digit character in the input | `cat foo.tsv \| tawk '{ print $4 }' \| funky_chars` | |
| trim | remove leading and trailing whitespace | `cat foo \| trim` | |
| round | round a numeric field to n digits | `cat foo \| round 2` | |
| sumawk | sum a single numeric field | `cat foo.tsv \| tawk '{ print $2 }' \| sumawk` | |
| uniqvals | given a TSV, return a TSV with the frequency of all unique values shown for each field | `cat foo.tsv \| uniqvals \| csvlook -t \| vim -` | mawk |
| unique | given a single column, return the first appearance of each unique value | `cat foo.tsv \| c 1 \| unique` | mawk |
| mkid | given a TSV, return the input with an integer ID field at the front | `cat foo.tsv \| mkid` | |
| tawk | make awk take in TSV and output TSV | `cat foo.tsv \| tawk '{ print $4,$5 }'` | mawk |
| pawk | make awk take in pipe separated text and output pipe separated text | `cat foo.txt \| pawk '{ print $4,$5 }'` | mawk |
| cawk | make awk take in CSV and output CSV. NB: you usually want to use csvkit's csvcut for CSV. delimiter collision is the norm | `cat foo.csv \| cawk '{ print $4,$5 }'` | mawk |
| theader | print numbered TSV header | `cat foo.tsv \| theader` | |
| pheader | print numbered pipe delimited txt header | `cat foo.txt \| pheader` | |
| cheader | print numbered CSV header | `cat foo.csv \| cheader` | |
| awkcols | format a sequence of numbers as awk columns | `cols=$(seq 15 1560 \| awkcols ); cat foo.tsv \| tawk "{ print $cols}"` | |
| find_ext | find all files under the current directory with a given extension | `find_ext csv` | |
| url_encode | URL encode text | `url_encode 'A&P'` | URI::Escape |
| url_decode | decode URL encoded text | `url_decode 'A%26P'` | URI::Escape |
| html_encode | HTML encode text | `echo '&' \| html_encode` | HTML::Entities |
| html_decode | decode HTML encoded text | `echo '&amp;' \| html_decode` | HTML::Entities |
| libretsv | force LibreOffice to open TSV as a table | `libretsv foo.tsv` | LibreOffice |
| parallel | make parallel behave like GNU parallel every time | `cat foo \| parallel 'echo {}'` | parallel |
| netpiglets | show processes using ports - like nethogs but smaller! | `netpiglets \| xargs -I '{}' pkill {}` | |
| c | quick cut for TSV fields | `cat foo.tsv \| c 8,9` | |
| k | quick open KeePass on the command line, hard codes filepath | `k` | kpcli |
| s | quick screen lock | `s` | slock |
| p | quick git add, commit, push | `p` | expects push over SSH |
| sr | quick open text web browser, uses surfraw elvi, otherwise assumes search with duckduckgo.com | `sr george washington` | surfraw |
| v | quick open graphical web browser vimb. assumes duckduckgo.com if not given a website | `v george washington` | vimb, surfraw |
| lsoctal | list file permissions in octal format | `lsoctal foo.txt` | |
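
Several of the helpers above follow two common shell patterns: fixing awk's field separators to tabs, and holding the header line aside while the body is processed. The sketch below illustrates both under hypothetical names, `tsv_awk` and `sort_keep_header`, which are not the repo's functions. The repo's own tawk and sortkh may be written differently: the table lists mawk as a prerequisite for tawk, and sortkh takes its flags as one quoted string, whereas this sketch takes them as separate arguments.

```shell
# Hypothetical names; a sketch of the pattern, not the repo's implementation.

# awk with tab as both input and output field separator
tsv_awk() {
  awk -F '\t' -v OFS='\t' "$@"
}

# print the first line untouched, then sort the remaining lines
# using whatever sort flags were passed in
sort_keep_header() {
  IFS= read -r header || return 0   # empty input: nothing to do
  printf '%s\n' "$header"
  sort -t "$(printf '\t')" "$@"     # split sort keys on tabs
}

# Usage:
#   cat foo.tsv | tsv_awk '{ print $4, $5 }'
#   cat foo.tsv | sort_keep_header -k2,2n
```

Reading the header with `read` and leaving the rest of the stream to `sort` avoids a temporary file, which keeps the function usable in the middle of a pipeline.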
  • ~/.zshrc uses oh-my-zsh.
  • add a file called ~/.i3/foo.png to get a fullscreen i3wm background using feh, if you change ~/.i3/config
  • to use surfraw's "sr" alias with this ~/.zshrc you will need to run:
sudo rm /usr/bin/sr

TODO

  • join arbitrary number of tables on same combination of columns, not just a single column
  • use uniq -d
  • use literate-programming
  • consider using LC_ALL=C or LANG=C (eg sort, grep)
  • use sort -s
  • prefer uconv to iconv
  • sgrep can be much faster than grep
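
As a rough illustration of the `LC_ALL=C` and `sort -s` items, the sketch below sorts a TSV on one numeric field with byte-wise collation and a stable sort, so rows with equal keys keep their input order. The function name `stable_sort_field` is hypothetical and not part of this repo.

```shell
# Hypothetical helper: stable, locale-independent numeric sort on one TSV field.
stable_sort_field() {
  field=$1
  # LC_ALL=C forces byte-wise collation; -s disables the last-resort
  # whole-line comparison, so ties stay in their original order.
  LC_ALL=C sort -s -t "$(printf '\t')" -k"${field},${field}n"
}

# Usage: rows "b" and "a" tie on field 2, so "b" stays ahead of "a".
#   printf 'b\t1\na\t1\nc\t0\n' | stable_sort_field 2
```

Without `-s`, sort would break the tie by comparing whole lines and put the "a" row first.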
