New AsyncWriteSortingCollection #2

Open
wants to merge 4 commits into
base: master
@SilinPavel SilinPavel commented Jun 14, 2017

Consider how the standard implementation of SortingCollection works:

image

We present a multithreaded implementation of SortingCollection.
The main idea of this improvement is to sort and spill the data stored in the sorting collection asynchronously, by submitting a task to an executor service.
Compared with the standard version, it looks like this:

image

As shown above, our version does not block work with the collection while a spill is being written to a temporary file.
The doneAdding() method is blocking: it waits until all spill tasks have completed.
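
To make the idea concrete, here is a minimal sketch of the pattern described above. It is not the code from this PR: the class name `AsyncSpillSketch`, its constructor parameters, and the use of plain text temp files are assumptions made for illustration only. A full buffer is handed to an executor, which sorts it and writes it to a temporary file, while the caller keeps adding to a fresh buffer; `doneAdding()` blocks until every spill task has finished.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not the class added by this PR.
public class AsyncSpillSketch {
    private final int maxRecordsInRam;
    private final ExecutorService executor;
    private final List<Future<Path>> pendingSpills = new ArrayList<>();
    private List<String> buffer = new ArrayList<>();

    public AsyncSpillSketch(int maxRecordsInRam, int spillThreads) {
        this.maxRecordsInRam = maxRecordsInRam;
        this.executor = Executors.newFixedThreadPool(spillThreads);
    }

    // Adds a record; when the buffer is full, the spill is scheduled
    // in the background and the caller continues with a new buffer.
    public void add(String record) {
        buffer.add(record);
        if (buffer.size() >= maxRecordsInRam) {
            spillAsync();
        }
    }

    private void spillAsync() {
        final List<String> full = buffer;
        buffer = new ArrayList<>();
        Callable<Path> task = () -> {
            Collections.sort(full);                       // sort off the caller's thread
            Path tmp = Files.createTempFile("spill", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, full);                       // one record per line
            return tmp;
        };
        pendingSpills.add(executor.submit(task));
    }

    // Blocking: spills the remaining records and waits for all spill tasks.
    public List<Path> doneAdding() throws InterruptedException, ExecutionException {
        if (!buffer.isEmpty()) {
            spillAsync();
        }
        List<Path> files = new ArrayList<>();
        for (Future<Path> spill : pendingSpills) {
            files.add(spill.get());
        }
        executor.shutdown();
        return files;
    }
}
```

Note that this sketch does not limit how many full buffers can be queued at once, so a producer that outpaces the spill threads would use more memory than a single buffer; a real implementation would need to bound that.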

Please take a look at our benchmarks.

The first is a benchmark of SortingCollection itself. It uses randomly generated data: we create a random int, convert it to a string, and add it to the sorting collection.
The results are shown below.

image

Benchmark                                               Number of strings  Mode  Cnt      Score      Error  Units
SortingCollectionBenchmark.sortingCollectionBenchmark          50_000_000  avgt    5  42821.407 ± 2095.994  ms/op
SortingCollectionBenchmark.sortingCollectionBenchmark2th       50_000_000  avgt    5  19241.607 ±  353.917  ms/op
SortingCollectionBenchmark.sortingCollectionBenchmark4th       50_000_000  avgt    5  10750.524 ±   72.900  ms/op

Below is the log output of SortSam runs that used the different types of SortingCollection.

AsyncSpillSortingCollection with 2 spilling threads

INFO 2017-06-14 15:58:47 SortSam Read 10,000,000 records. Elapsed time: 00:00:22s. Time for last 10,000,000: 22s. Last read position: 1:148,346,706
INFO 2017-06-14 15:59:14 SortSam Read 20,000,000 records. Elapsed time: 00:00:49s. Time for last 10,000,000: 26s. Last read position: X:149,840,001
INFO 2017-06-14 15:59:41 SortSam Read 30,000,000 records. Elapsed time: 00:01:16s. Time for last 10,000,000: 27s. Last read position: 12:31,820,558
INFO 2017-06-14 16:00:09 SortSam Read 40,000,000 records. Elapsed time: 00:01:44s. Time for last 10,000,000: 27s. Last read position: 11:3,746,487
INFO 2017-06-14 16:00:36 SortSam Read 50,000,000 records. Elapsed time: 00:02:11s. Time for last 10,000,000: 27s. Last read position: 17:27,904,169
INFO 2017-06-14 16:00:47 SortSam Finished reading inputs, merging and writing to output now.

Standard SortingCollection

INFO 2017-06-14 16:05:36 SortSam Read 10,000,000 records. Elapsed time: 00:00:40s. Time for last 10,000,000: 40s. Last read position: 1:148,346,706
INFO 2017-06-14 16:06:20 SortSam Read 20,000,000 records. Elapsed time: 00:01:25s. Time for last 10,000,000: 44s. Last read position: X:149,840,001
INFO 2017-06-14 16:07:05 SortSam Read 30,000,000 records. Elapsed time: 00:02:10s. Time for last 10,000,000: 45s. Last read position: 12:31,820,558
INFO 2017-06-14 16:07:51 SortSam Read 40,000,000 records. Elapsed time: 00:02:55s. Time for last 10,000,000: 45s. Last read position: 11:3,746,487
INFO 2017-06-14 16:08:40 SortSam Read 50,000,000 records. Elapsed time: 00:03:44s. Time for last 10,000,000: 48s. Last read position: 17:27,904,169
INFO 2017-06-14 16:08:58 SortSam Finished reading inputs, merging and writing to output now.

Commands used for the runs:
standard: java -Xmx4g -jar picard-unspecified-SNAPSHOT-all.jar SortSam I=/path/to/bam/NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam O=sort.bam SORT_ORDER=queryname VALIDATION_STRINGENCY=SILENT
EPAM version: java -Dsamjdk.sort_col_threads=2 -Xmx4g -jar picard-unspecified-SNAPSHOT-all.jar SortSam I=/path/to/bam/NA12878.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam O=sort.bam SORT_ORDER=queryname VALIDATION_STRINGENCY=SILENT

This improvement is enabled by setting the JVM option -Dsamjdk.sort_col_threads=number_of_threads. If number_of_threads > 0, all utilities that use SortingCollection will use its multithreaded version.
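
As a rough illustration of how such a switch can be read, the sketch below checks the `samjdk.sort_col_threads` system property named above. The class `SortThreadsOption` and its methods are hypothetical and are not taken from this PR; the treatment of missing, non-numeric, or negative values as 0 is also an assumption.

```java
// Hypothetical helper; only the property name comes from the PR description.
public class SortThreadsOption {
    static final String PROPERTY = "samjdk.sort_col_threads";

    // Returns the configured number of spill threads.
    // Assumption: missing, non-numeric, or negative values count as 0.
    static int configuredThreads() {
        return Math.max(0, Integer.getInteger(PROPERTY, 0));
    }

    // The multithreaded version is selected only when the value is > 0.
    static boolean useAsyncVersion() {
        return configuredThreads() > 0;
    }
}
```

With `-Dsamjdk.sort_col_threads=2` on the command line, `useAsyncVersion()` in this sketch returns true; without the option it returns false, so the standard SortingCollection would be used.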

@SilinPavel SilinPavel force-pushed the epam-ls_AsyncWriteSortingCollection branch 2 times, most recently from cba5a09 to 16a5bbe Compare June 22, 2017 15:47
@SilinPavel SilinPavel force-pushed the epam-ls_AsyncWriteSortingCollection branch from 16a5bbe to bf2a78d Compare June 23, 2017 10:40
@SilinPavel SilinPavel force-pushed the epam-ls_AsyncWriteSortingCollection branch from 5eb9793 to 21eb42c Compare June 28, 2017 21:31