This repository contains helpful command-line utilities that I have written over the past years or obtained from open-source projects. They automate simple, repetitive data-handling tasks using Python and shell scripts, such as appending files with different schemas, hash-based sampling of huge files, and computing column frequencies.
I find these scripts very helpful in my corporate projects and Kaggle competitions. I recommend adding these utilities to your system path so they can be called from any directory.
Concatenates multiple files with the same or different schemas. Rows are output in the order of the input arguments. The output header follows the order in which columns appear in the input files' headers.
python append_files.py [-h] [--ifile IFILE] [--ofile OFILE] [--d D]
-h, --help show this help message and exit
--ifile IFILE Input files split by |
--ofile OFILE Output file
--d D Delimiters of input file split by ~. Default: Comma
python append_files.py --ifile 'fileA|fileB|fileC' --ofile outfile.csv --d ',~,~,'
python append_files.py --ifile 'fileA|fileB|fileC' --ofile outfile.csv --d ','
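The concatenation described above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not the code in append_files.py: the function name `append_files` and its arguments are hypothetical, the output is always comma-delimited, and columns missing from a file are written as empty strings.

```python
import csv


def append_files(in_paths, out_path, delimiters=None):
    """Concatenate delimited files whose headers may differ.

    Hypothetical helper for illustration; not the code in append_files.py.
    """
    delimiters = delimiters or [","] * len(in_paths)

    # First pass: union of all headers, in order of first appearance.
    header = []
    for path, delim in zip(in_paths, delimiters):
        with open(path, newline="") as fh:
            for name in next(csv.reader(fh, delimiter=delim), []):
                if name not in header:
                    header.append(name)

    # Second pass: write rows in input-file order.
    # Assumption: columns absent from a file are left blank.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=header, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for path, delim in zip(in_paths, delimiters):
            with open(path, newline="") as fh:
                for row in csv.DictReader(fh, delimiter=delim):
                    writer.writerow(row)
```

Reading each file twice keeps memory use flat, which matters for the large files these utilities are meant for.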
Removes non-printable characters from the file.
python clean_file.py infile outfile ','
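As a rough sketch of what removing non-printable characters can look like in Python: the names `clean_line` and `clean_file` are hypothetical, and "printable" is taken to mean whatever `str.isprintable()` accepts, plus tabs. The sketch ignores the delimiter argument shown in the command above, since its role is not described here.

```python
def clean_line(line):
    """Drop characters that str.isprintable() rejects, but keep tabs."""
    return "".join(ch for ch in line if ch.isprintable() or ch == "\t")


def clean_file(in_path, out_path):
    """Write a cleaned copy of in_path to out_path, one line at a time."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Strip the line ending first so it is not treated as junk.
            dst.write(clean_line(line.rstrip("\r\n")) + "\n")
```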
Makes a frequency file for each column present in the input file.
python column_freq.py file1 ','
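The per-column counting can be sketched with `collections.Counter`. This is a minimal illustration that assumes the first row is a header. The function name `column_frequencies` is hypothetical, and it returns the counts in memory rather than writing one frequency file per column as column_freq.py does.

```python
import csv
from collections import Counter


def column_frequencies(path, delimiter=","):
    """Return {column name: Counter of values} for a file with a header row."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter=delimiter)
        header = next(reader, [])
        counts = {name: Counter() for name in header}
        for row in reader:
            # zip() silently skips fields beyond the header width.
            for name, value in zip(header, row):
                counts[name][value] += 1
    return counts
```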
Splits a huge file into multiple partitions. By default, the split is random. Specifying key columns makes the split depend on those columns, so that all rows sharing the same key values are written to the same partition.
E.g. if Customer ID is the key column, all entries for customer A end up in the same partition.
python split_file.py [-h] [--ifile IFILE] [--ofile OFILE] [--d D] [--chunks CHUNKS] [--samplingCols SAMPLINGCOLS]
-h, --help show this help message and exit
--ifile IFILE Input file
--ofile OFILE Output file. Default: Input name with suffix.
--d D Delimiter. Default: Comma
--chunks CHUNKS Number of Output Files. Default: 10
--samplingCols SAMPLINGCOLS Sampling Columns separated by |. Default: None
python split_file.py --ifile file1
python split_file.py --ifile file1 --ofile outfile.csv
python split_file.py --ifile file1 --ofile /dir1/dir2/outfile.csv --chunks 10
python split_file.py --ifile file1 --samplingCols 'CUSTID1|CUSTID2' --d '|'
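One common way to keep all rows for a key in the same partition is to hash the key and take it modulo the number of chunks. The sketch below illustrates that idea. It is not necessarily how split_file.py works: the names `partition_index` and `split_file`, the use of `zlib.crc32`, and the `<prefix>_<i>.csv` output naming are all assumptions made for the example.

```python
import csv
import random
import zlib


def partition_index(row, key_indexes, chunks):
    """Pick a partition: hashed on the key columns, or random without keys."""
    if not key_indexes:
        return random.randrange(chunks)
    key = "\x1f".join(row[i] for i in key_indexes)
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % chunks


def split_file(in_path, out_prefix, chunks=10, delimiter=",", key_cols=None):
    """Split in_path into `chunks` files, repeating the header in each."""
    with open(in_path, newline="") as src:
        reader = csv.reader(src, delimiter=delimiter)
        header = next(reader)
        key_indexes = [header.index(c) for c in (key_cols or [])]
        # Hypothetical naming scheme: <prefix>_0.csv, <prefix>_1.csv, ...
        outs = [
            open(f"{out_prefix}_{i}.csv", "w", newline="")
            for i in range(chunks)
        ]
        try:
            writers = [csv.writer(o, delimiter=delimiter) for o in outs]
            for writer in writers:
                writer.writerow(header)
            for row in reader:
                idx = partition_index(row, key_indexes, chunks)
                writers[idx].writerow(row)
        finally:
            for o in outs:
                o.close()
```

Because the partition depends only on the key values, every row for the same customer lands in the same output file, at the cost of partitions that may be uneven when a few keys dominate.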
Counts the number of columns in every row of a file. For a file to be usable with other UNIX commands, all rows should have the same number of columns. If a cell value contains the delimiter, clean the file using the cleantext tool.
Pipe the stream into check_col.sh, followed by the delimiter.
cat filename | check_col.sh ','
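The same check can be sketched in Python by counting how many lines have each field count. The function name `column_count_histogram` is hypothetical, and the exact output format of check_col.sh is not assumed here. The split is deliberately naive, so a delimiter inside a cell value shows up as an extra column, which is the situation the note above warns about.

```python
from collections import Counter


def column_count_histogram(lines, delimiter=","):
    """Map 'number of fields' to 'number of lines with that many fields'."""
    return Counter(
        len(line.rstrip("\r\n").split(delimiter)) for line in lines
    )
```

A consistent file yields a single entry; more than one entry means some rows have a different number of columns. The function accepts any iterable of lines, so it can be fed `sys.stdin` when used in a pipe.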
Cuts one or more columns, selected by name, from a UNIX input stream.
cut_by_name.sh -t delimiter -n 'columns separated by ,'
cat filename | cut_by_name.sh -t "|" -n "COL1,COL2" | histogram.pl
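Selecting columns by name comes down to looking up each name's position in the header and then picking those positions from every row. A minimal Python sketch of that idea follows. The function name `cut_by_name` is reused here only for illustration, and it is an assumption that the header line is echoed in the output.

```python
def cut_by_name(lines, names, delimiter=","):
    """Yield only the named columns; the first line is treated as the header."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # empty input: nothing to yield
    header = header_line.rstrip("\r\n").split(delimiter)
    # Raises ValueError if a requested column is not in the header.
    indexes = [header.index(name) for name in names]
    yield delimiter.join(names)
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        yield delimiter.join(fields[i] for i in indexes)
```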
Get frequency count of an input stream in UNIX pipe operations
Pipe the stream into histogram.pl
cat filename | cut -d'|' -f1 | histogram.pl
cat filename | cut_by_name.sh -t "|" -n "COL1,COL2" | histogram.pl
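For comparison, a frequency count over a stream of lines takes only a few lines of Python with `collections.Counter`. This is an illustrative equivalent, not a port of histogram.pl: the function name `histogram` is hypothetical, and the most-frequent-first ordering is an assumption.

```python
from collections import Counter


def histogram(lines):
    """Return (value, count) pairs, most frequent first."""
    counts = Counter(line.rstrip("\r\n") for line in lines)
    return counts.most_common()
```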