added per-group processing for inStrain compare #39

nick-youngblut · 2021-02-07T07:27:43Z

I quickly tried changing the code so that inStrain compare can process a single group at a time. Since group processing time is very uneven, as in this run:

Running group 1 of 148
Comparing scaffolds: 100%|██████████| 646/646 [16:30:21<00:00, 91.98s/it]
Running group 2 of 148
Comparing scaffolds: 100%|██████████| 944/944 [1:14:30<00:00,  4.74s/it]
Running group 3 of 148
Comparing scaffolds: 100%|██████████| 781/781 [20:18<00:00,  1.56s/it]
Running group 4 of 148
Comparing scaffolds: 100%|██████████| 760/760 [19:07<00:00,  1.51s/it]
Running group 5 of 148
Comparing scaffolds: 100%|██████████| 754/754 [25:51<00:00,  2.06s/it]
Running group 6 of 148
Comparing scaffolds: 100%|██████████| 891/891 [23:32<00:00,  1.59s/it]
Running group 7 of 148
Comparing scaffolds: 100%|██████████| 781/781 [35:10<00:00,  2.70s/it]
Running group 8 of 148
Comparing scaffolds: 100%|██████████| 819/819 [22:09<00:00,  1.62s/it]
Running group 9 of 148
Comparing scaffolds: 100%|██████████| 805/805 [22:26<00:00,  1.67s/it]
Running group 10 of 148
Comparing scaffolds: 100%|██████████| 904/904 [04:52<00:00,  3.09it/s]
Running group 11 of 148
Comparing scaffolds: 100%|██████████| 901/901 [04:47<00:00,  3.13it/s]
Running group 12 of 148
Comparing scaffolds: 100%|██████████| 960/960 [04:49<00:00,  3.32it/s]
Running group 13 of 148
Comparing scaffolds: 100%|██████████| 982/982 [07:27<00:00,  2.19it/s]
Running group 14 of 148
Comparing scaffolds: 100%|██████████| 1023/1023 [05:45<00:00,  2.96it/s]
Running group 15 of 148
Comparing scaffolds: 100%|██████████| 932/932 [05:26<00:00,  2.85it/s]
Running group 16 of 148
Comparing scaffolds: 100%|██████████| 929/929 [09:30<00:00,  1.63it/s]
Running group 17 of 148
Comparing scaffolds: 100%|██████████| 872/872 [03:08<00:00,  4.63it/s]
Running group 18 of 148
Comparing scaffolds: 100%|██████████| 887/887 [02:48<00:00,  5.26it/s]
Running group 19 of 148
Comparing scaffolds: 100%|██████████| 1333/1333 [06:01<00:00,  3.69it/s]
Running group 20 of 148
Comparing scaffolds:  99%|█████████▉| 1174/1180 [09:32<08:50, 88.43s/it]
[...]

...parallel processing of each group separately can save some time (many hours in this example).
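
As a rough check on that estimate, here is a small sketch that sums the elapsed times of the 19 completed groups in the log above and compares the total with the longest single group. It assumes, hypothetically, that every group could run on its own worker with the same thread count, so the longest group sets the lower bound on wall time; the helper names are made up for illustration.

```python
# Elapsed times copied from the tqdm lines above (groups 1-19, all completed).
ELAPSED = [
    "16:30:21", "1:14:30", "20:18", "19:07", "25:51", "23:32", "35:10",
    "22:09", "22:26", "04:52", "04:47", "04:49", "07:27", "05:45",
    "05:26", "09:30", "03:08", "02:48", "06:01",
]


def to_seconds(stamp):
    """Convert a tqdm-style 'H:MM:SS' or 'MM:SS' stamp to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def summarize(elapsed):
    """Return (serial total, longest single group) in seconds."""
    secs = [to_seconds(s) for s in elapsed]
    return sum(secs), max(secs)


serial_total, longest_group = summarize(ELAPSED)
print(f"serial: {serial_total / 3600:.1f} h, "
      f"longest group: {longest_group / 3600:.1f} h")
```

With fewer workers than groups the saving would be smaller, and group 1 alone dominates either way.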

I couldn't really find any good testing datasets in ./tests/, so I used my own, but here is the general workflow for parallel processing of groups:

INDIRS="2_T3_scaffold_info.tsv 93_T3_scaffold_info.tsv"   # my data files
OUTDIR=tmp/compare_step

# listing groups and creating a pickle file of the group object
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--list-groups

# processing group 1 (output saved to $OUTDIR/groups/1.pkl)
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--group-pkl $OUTDIR/groups.pkl --group 1

# processing group 2 (output saved to $OUTDIR/groups/2.pkl)
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--group-pkl  $OUTDIR/groups.pkl --group 2

# merging the results (the user could also use  --comparisons-list for large sets of groups)
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--comparisons  $OUTDIR/groups/1.pkl  $OUTDIR/groups/2.pkl

...while the basic functionality remains intact:

# standard serial processing of groups
OUTDIR=tmp/compare
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS
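
For anyone scripting the step-wise workflow, the following is a minimal sketch that only builds the command lines (it does not run them). The flags `--list-groups`, `--group-pkl`, `--group` and `--comparisons` are the ones proposed in this PR and may not exist in a released inStrain; `build_group_commands` and its arguments are hypothetical names, and the output paths follow the `$OUTDIR/groups/<N>.pkl` layout described above.

```python
import shlex


def build_group_commands(stb, outdir, inputs, n_groups,
                         base_args=("--min_cov", "5", "--min_freq", "0.05")):
    """Return (list_cmd, group_cmds, merge_cmd) as argv lists; nothing is run."""
    common = ["inStrain", "compare", *base_args,
              "-s", stb, "-o", outdir, "-i", *inputs]
    # Step 1: list the groups and pickle the group object (flag from this PR).
    list_cmd = common + ["--list-groups"]
    # Step 2: one independent command per group; groups are numbered from 1.
    group_cmds = [
        common + ["--group-pkl", f"{outdir}/groups.pkl", "--group", str(g)]
        for g in range(1, n_groups + 1)
    ]
    # Step 3: merge the per-group pickles back into one result.
    pickles = [f"{outdir}/groups/{g}.pkl" for g in range(1, n_groups + 1)]
    merge_cmd = common + ["--comparisons", *pickles]
    return list_cmd, group_cmds, merge_cmd


if __name__ == "__main__":
    list_cmd, group_cmds, merge_cmd = build_group_commands(
        "genomes.stb", "tmp/compare_step",
        ["2_T3_scaffold_info.tsv", "93_T3_scaffold_info.tsv"], n_groups=2)
    for cmd in [list_cmd, *group_cmds, merge_cmd]:
        print(shlex.join(cmd))
```

The per-group commands are independent of each other, so they could be handed to a scheduler or to `subprocess.run` in a worker pool; only the listing step has to finish first and the merge step has to run last.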

The output for both approaches is the same, although it appears that your code allows for variable ordering of the output table columns, probably due to using a dict. Using an ordered dict will stabilize the column order.
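
One way to pin the column order regardless of how each row dict was built is to write with an explicit column list. This is only a sketch of the idea, not inStrain's actual writer: the column names are illustrative and `write_table` is a hypothetical helper. Note that plain dicts already keep insertion order in Python 3.7+, so the variation would have to come from rows being filled in different orders.

```python
import csv
import io

# Illustrative column names only; the real output table has more columns.
COLUMNS = ["scaffold", "name1", "name2", "popANI", "conANI"]


def write_table(rows, columns=COLUMNS):
    """Serialize row dicts as TSV with a fixed, explicit column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # key order inside each dict no longer matters
    return buf.getvalue()


if __name__ == "__main__":
    rows = [
        {"name2": "b", "scaffold": "s1", "name1": "a",
         "conANI": 1.0, "popANI": 1.0},
        {"scaffold": "s2", "popANI": 0.99, "name1": "a",
         "name2": "b", "conANI": 0.98},
    ]
    print(write_table(rows), end="")
```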

Sorry for not keeping the code style consistent with yours. Please just consider this an example/guide on how this could be done.

MrOlm · 2021-02-08T18:06:36Z

Hi Nick,

Thanks a ton for doing this. I appreciate the need to run compare over distributed systems, and as soon as I have time I'll dive into the implementation details you've provided here and see what's what.

A way that I've done this in my own work is by using the --genome flag for inStrain. If I have 100 genomes to compare, I essentially launch 100 separate jobs, each with its own --genome flag. Having just gone through all this yourself, what do you see as the major advantages of the system you've implemented here over that approach?

Thanks again for doing this and following up with the code you've written, much appreciated.

-Matt

nick-youngblut · 2021-02-08T19:29:17Z

The --genome flag could definitely work too. I want to parallelize with snakemake, and genomes could be hard-coded (using wildcards), whereas running parallel jobs on a per-group basis would require snakemake checkpointing. I should have considered --genome before playing around with the inStrain code. Well, at least now I know more about how the code works.
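
For reference, a minimal sketch of the per-genome approach: read the genome names from the .stb file (tab-separated scaffold and genome columns) and build one `inStrain compare --genome` command per genome. The function names and the one-output-directory-per-genome layout are assumptions for illustration, and the commands are only built, not run.

```python
def genomes_from_stb(lines):
    """Return unique genome names from .stb lines, in first-seen order."""
    genomes, seen = [], set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        genome = parts[1]
        if genome not in seen:
            seen.add(genome)
            genomes.append(genome)
    return genomes


def per_genome_commands(stb_path, stb_lines, outdir, inputs):
    """Build one argv list per genome; nothing is executed here."""
    return [
        ["inStrain", "compare", "-i", *inputs, "-s", stb_path,
         "-o", f"{outdir}/{genome}",  # assumed layout: one output dir per genome
         "--genome", genome]
        for genome in genomes_from_stb(stb_lines)
    ]


if __name__ == "__main__":
    demo = ["scaffold_1\tgenomeA.fa\n",
            "scaffold_2\tgenomeA.fa\n",
            "scaffold_3\tgenomeB.fa\n"]
    for cmd in per_genome_commands("genomes.stb", demo, "tmp/compare",
                                   ["sample1.IS", "sample2.IS"]):
        print(" ".join(cmd))
```

Because the genome list is known before any job runs, each command maps onto a fixed wildcard value, which is what avoids the checkpointing mentioned above.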

Lyylsys · 2024-06-13T08:10:41Z

I am interested in the --list-groups feature. I also noticed that this step takes a long time. If I need to compare 300 IS files, can I run them separately and then combine the results? Can the --list-groups step be run with inStrain compare?

added step-wise group processing for inStrain compare

3f7f984

MrOlm mentioned this pull request Feb 8, 2021

inStrain compare not scaling #14

Closed
