added per-group processing for inStrain compare #39

Open: wants to merge 1 commit into master

nick-youngblut

I quickly tried changing the code to allow single-group processing by inStrain compare. While the per-group processing time is very uneven, as in this run:

Running group 1 of 148
Comparing scaffolds: 100%|██████████| 646/646 [16:30:21<00:00, 91.98s/it]
Running group 2 of 148
Comparing scaffolds: 100%|██████████| 944/944 [1:14:30<00:00,  4.74s/it]
Running group 3 of 148
Comparing scaffolds: 100%|██████████| 781/781 [20:18<00:00,  1.56s/it]
Running group 4 of 148
Comparing scaffolds: 100%|██████████| 760/760 [19:07<00:00,  1.51s/it]
Running group 5 of 148
Comparing scaffolds: 100%|██████████| 754/754 [25:51<00:00,  2.06s/it]
Running group 6 of 148
Comparing scaffolds: 100%|██████████| 891/891 [23:32<00:00,  1.59s/it]
Running group 7 of 148
Comparing scaffolds: 100%|██████████| 781/781 [35:10<00:00,  2.70s/it]
Running group 8 of 148
Comparing scaffolds: 100%|██████████| 819/819 [22:09<00:00,  1.62s/it]
Running group 9 of 148
Comparing scaffolds: 100%|██████████| 805/805 [22:26<00:00,  1.67s/it]
Running group 10 of 148
Comparing scaffolds: 100%|██████████| 904/904 [04:52<00:00,  3.09it/s]
Running group 11 of 148
Comparing scaffolds: 100%|██████████| 901/901 [04:47<00:00,  3.13it/s]
Running group 12 of 148
Comparing scaffolds: 100%|██████████| 960/960 [04:49<00:00,  3.32it/s]
Running group 13 of 148
Comparing scaffolds: 100%|██████████| 982/982 [07:27<00:00,  2.19it/s]
Running group 14 of 148
Comparing scaffolds: 100%|██████████| 1023/1023 [05:45<00:00,  2.96it/s]
Running group 15 of 148
Comparing scaffolds: 100%|██████████| 932/932 [05:26<00:00,  2.85it/s]
Running group 16 of 148
Comparing scaffolds: 100%|██████████| 929/929 [09:30<00:00,  1.63it/s]
Running group 17 of 148
Comparing scaffolds: 100%|██████████| 872/872 [03:08<00:00,  4.63it/s]
Running group 18 of 148
Comparing scaffolds: 100%|██████████| 887/887 [02:48<00:00,  5.26it/s]
Running group 19 of 148
Comparing scaffolds: 100%|██████████| 1333/1333 [06:01<00:00,  3.69it/s]
Running group 20 of 148
Comparing scaffolds:  99%|█████████▉| 1174/1180 [09:32<08:50, 88.43s/it]
[...]

...parallel processing of each group separately can save some time (many hours in this example).
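As a rough way to quantify this, the elapsed times can be pulled out of the progress lines and the serial total compared with the slowest single group, which is the floor on wall-clock time when every group gets its own worker. This is only a sketch: it uses the first three progress lines from the log above, and the helper names are made up for illustration.

```python
import re

# Three progress lines copied from the log above (groups 1-3).
LOG = """\
Comparing scaffolds: 100%|██████████| 646/646 [16:30:21<00:00, 91.98s/it]
Comparing scaffolds: 100%|██████████| 944/944 [1:14:30<00:00,  4.74s/it]
Comparing scaffolds: 100%|██████████| 781/781 [20:18<00:00,  1.56s/it]
"""


def parse_elapsed(stamp):
    """Convert a tqdm elapsed stamp ('MM:SS' or 'H:MM:SS') to seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def group_times(log_text):
    """Return the elapsed seconds of every progress line in the log."""
    return [parse_elapsed(m) for m in re.findall(r"\[([\d:]+)<", log_text)]


times = group_times(LOG)
serial = sum(times)          # groups run one after another
parallel_floor = max(times)  # lower bound with one worker per group
print(f"serial: {serial} s, best-case parallel: {parallel_floor} s")
```

The slowest group sets the lower bound, so the gain depends on how many other groups would otherwise queue behind it.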

I couldn't really find any good testing datasets in ./tests/, so I used my own, but here is the general workflow for parallel processing of groups:

INDIRS="2_T3_scaffold_info.tsv 93_T3_scaffold_info.tsv"   # my data files
OUTDIR=tmp/compare_step

# listing groups and creating a pickle file of the group object
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--list-groups

# processing group 1 (output saved to $OUTDIR/groups/1.pkl)
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--group-pkl $OUTDIR/groups.pkl --group 1

# processing group 2 (output saved to $OUTDIR/groups/2.pkl)
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--group-pkl  $OUTDIR/groups.pkl --group 2

# merging the results (the user could also use  --comparisons-list for large sets of groups)
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS \
--comparisons  $OUTDIR/groups/1.pkl  $OUTDIR/groups/2.pkl
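The same steps could be driven from a small Python wrapper, sketched below. The `--group-pkl`, `--group`, and `--comparisons` options are the ones proposed in this pull request and are not assumed to exist in a released inStrain. The function names, the `genomes.stb` placeholder path, and the worker count are hypothetical, and the `--list-groups` step is assumed to have been run already so that `groups.pkl` exists.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Options shared by every call in the workflow above.
BASE = ["inStrain", "compare", "--min_cov", "5", "--min_freq", "0.05",
        "--ani_threshold", "0.99999", "-p", "8"]


def group_cmd(stb, outdir, inputs, group):
    """Build the command for one group (options as proposed in this PR)."""
    return BASE + ["-s", stb, "-o", outdir, "-i", *inputs,
                   "--group-pkl", f"{outdir}/groups.pkl", "--group", str(group)]


def merge_cmd(stb, outdir, inputs, groups):
    """Build the command that merges the per-group pickle files."""
    pkls = [f"{outdir}/groups/{g}.pkl" for g in groups]
    return BASE + ["-s", stb, "-o", outdir, "-i", *inputs,
                   "--comparisons", *pkls]


def run_groups(stb, outdir, inputs, groups, workers=2):
    """Run the per-group jobs concurrently, then merge the results.

    Raises subprocess.CalledProcessError if any job exits non-zero.
    """
    cmds = [group_cmd(stb, outdir, inputs, g) for g in groups]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda c: subprocess.run(c, check=True), cmds))
    subprocess.run(merge_cmd(stb, outdir, inputs, groups), check=True)


# Example call (not executed here):
# run_groups("genomes.stb", "tmp/compare_step",
#            ["2_T3_scaffold_info.tsv", "93_T3_scaffold_info.tsv"], [1, 2])
```

Each job still uses `-p 8`, so the total process count is roughly `workers` times 8. On a cluster the per-group commands would more likely be submitted as separate jobs than run from one thread pool.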

...while the basic functionality remains intact:

# standard serial processing of groups
OUTDIR=tmp/compare
inStrain compare --min_cov 5 --min_freq 0.05 --ani_threshold 0.99999 -p 8 \
-s /ebio/abt3_scratch/nyoungblut/LLMGPS_55977766656/inStrain/genomes.stb  -o $OUTDIR -i $INDIRS

The output for both approaches is the same, although it appears that your code allows the column order of the output table to vary, probably because the columns come from a dict. Using an ordered dict would stabilize the column order.
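One way to pin the column order regardless of how each row dict was built is to write the table against an explicit column list. The sketch below uses only the standard library. The column names are hypothetical and this is not how inStrain currently writes its tables.

```python
import csv
import io

# Hypothetical column names, for illustration only.
COLUMNS = ["scaffold", "name1", "name2", "popANI"]


def write_table(rows, columns=COLUMNS):
    """Write dict rows as TSV text with a fixed, explicit column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# The same record with its keys inserted in two different orders.
row_a = {"scaffold": "s1", "name1": "A", "name2": "B", "popANI": 0.999}
row_b = {"popANI": 0.999, "name2": "B", "scaffold": "s1", "name1": "A"}

# Both produce identical text because the column list decides the order.
same = write_table([row_a]) == write_table([row_b])
```

If the tables are pandas DataFrames, selecting the columns with an explicit list before writing would have the same effect.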

Sorry for not keeping the code style consistent with yours. Please just consider this an example/guide on how this could be done.

@MrOlm
Owner

MrOlm commented Feb 8, 2021

Hi Nick,

Thanks a ton for doing this. I appreciate the need to run compare over distributed systems, and as soon as I have time I'll dive into the implementation details you've provided here and see what's what.

One way I've done this in my own work is by using the --genome flag for inStrain. If I have 100 genomes to compare, I essentially launch 100 separate jobs, each with its own --genome flag. Having just gone through all this yourself, what do you see as the major advantages of the system you've implemented here over that approach?

Thanks again for doing this and following up with the code you've written, much appreciated.

-Matt

@MrOlm MrOlm mentioned this pull request Feb 8, 2021
@nick-youngblut
Author

The --genome flag could definitely work too. I want to parallelize with snakemake, where genomes could be hard-coded (using wildcards), whereas running parallel jobs on a per-group basis would require snakemake checkpointing. I should have considered --genome before playing around with the inStrain code. Well, at least now I know more about how the code works.
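The per-genome approach discussed above could be sketched roughly as follows: list the genomes in the .stb file, then build one `inStrain compare --genome` command per genome for the scheduler or workflow manager to launch. This assumes the .stb file is tab-separated with the scaffold in the first column and the genome in the second. The function names and the per-genome output directory layout are hypothetical.

```python
def genomes_in_stb(lines):
    """Return the genome names in a scaffold-to-bin file, in first-seen order.

    Assumes each line is '<scaffold>\\t<genome>'; other lines are skipped.
    """
    seen = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            seen.setdefault(fields[1], None)
    return list(seen)


def genome_cmd(stb, outdir, inputs, genome):
    """Build one per-genome command (output layout is an assumption)."""
    return ["inStrain", "compare", "-s", stb, "-i", *inputs,
            "-o", f"{outdir}/{genome}", "--genome", genome]


# Made-up .stb content, for illustration only.
example_stb = ["scaf_1\tgenomeA.fa\n", "scaf_2\tgenomeA.fa\n",
               "scaf_3\tgenomeB.fa\n"]
genomes = genomes_in_stb(example_stb)
commands = [genome_cmd("genomes.stb", "tmp/compare", ["s1.IS", "s2.IS"], g)
            for g in genomes]
```

With a real file, `genomes_in_stb(open(path))` would give the list of names to feed into snakemake wildcards or a job array.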

@Lyylsys

Lyylsys commented Jun 13, 2024

I am interested in the --list-groups feature. I also noticed that this step takes a long time. If I need to compare 300 IS files, can I run them separately and then combine them together? Can the --list-groups step be run with instrain compare?

3 participants