diff --git a/README.md b/README.md
index 7ec937c..94b1f9d 100644
--- a/README.md
+++ b/README.md
@@ -101,12 +101,10 @@ fold -w 60 file
 ```bash
 awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
 ```
 
-#### Reproducible subsampling of a FASTQ file. srand() is the seed for the random number generator - keeps the subsampling the same when the script is run multiple times. 0.01 is the % of reads to output.
+#### Subsampling first 100 lines of a FASTQ file using head to subset large data sets.
 ```bash
-cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
-# If the FASTQ file is gzipped and you want to produce a gizzped output.
-zcat file.fq.gz | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq | gzip out.fq
+head --lines=100 file.fq > out.fq
 ```
 
 #### or look at the Hengli's Seqtk