Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regarding nthreads, pvalue option, alignment parser #1

Open
koriege opened this issue Feb 19, 2024 · 3 comments
Open

Regarding nthreads, pvalue option, alignment parser #1

koriege opened this issue Feb 19, 2024 · 3 comments

Comments

@koriege
Copy link

koriege commented Feb 19, 2024

Hi,

I really appreciate your re-implementation, since the original version caused up to ~200Gb memory usage in my case. I know, that summit is really new, but I want to report some issues I stumbled across when trying it out. Regards.

summit.py -> Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)

Solution via OMP_NUM_THREADS variable

OMP_NUM_THREADS=28 summit.py -p 0.05 -> summit.py error: argument -p/--pval: invalid int value: '0.05'

Solution: argparse type set to float

OMP_NUM_THREADS=28 summit.py -c <bam> -b <bam> -o <out> -w 100 -m 500 -t 100 -l 50

Traceback (most recent call last):
  File "/misc/paras/data/programs/bashbone/conda/envs/gopeaks/bin/summit.py", line 485, in <module>
    ttracks, nreads = count_tracks(bam, bins, nreads = True)
  File "/misc/paras/data/programs/bashbone/conda/envs/gopeaks/bin/summit.py", line 178, in count_tracks
    if r1.is_forward & r2.is_reverse:
AttributeError: 'pysam.libcalignedsegment.AlignedSegment' object has no attribute 'is_forward'

Do I need to sort the alignments by name?

@gartician
Copy link
Owner

Hi @koriege,

I appreciate the enthusiasm for the increased efficiency of Summit! To get back to your suggestions...

  1. Since Summit was developed without parallel processing, does setting OMP_NUM_THREADS really utilize your cores?
  2. This will be fixed, thank you for the attention to detail!
  3. This error means that a read was not designated as forward or reverse. Perhaps consider removing unmapped or reads without a mate from the BAM file before using Summit? I typically use base pair-sorted BAM files in my analysis rather than name-sorted, but I'm not sure it matters since that code is pulling and assessing read mates.

@koriege
Copy link
Author

koriege commented Feb 22, 2024

Thanks for the quick reply.

  1. Since Summit was developed without parallel processing, does setting OMP_NUM_THREADS really utilize your cores?

No, it does not. But one of your libraries, that uses nthreads infers number of cores, which is 128 in my case. Thus setting OMP_NUM_THREADS to anything <=64 avoids conflicts.

  1. This error means that a read was not designated as forward or reverse. Perhaps consider removing unmapped or reads without a mate from the BAM file before using Summit?

I did. I always work with unique alignments. The only flag which was removed from the BAM in some cases, due to region blacklisting, is the "proper_pair" flag, although both mates exist (one reverse, the other mate_reverse flagged). I will re-add the proper_pair flag, try again and report back. Thanks for pointing me into this direction.

@koriege
Copy link
Author

koriege commented Feb 26, 2024

  1. This error means that a read was not designated as forward or reverse. Perhaps consider removing unmapped or reads without a mate from the BAM file before using Summit?

After having extensively tested, it seems that your implementation does not work. The pysam object simply does not have an attribute is_forward. According to the pysam docs: is_forward - true if read is mapped to forward strand (implemented in terms of is_reverse)

see this tiny example

NB501242:295:HV7G2BGXV:4:11610:20759:10780      99      chr1    10074   9       4S27=1X4=1X3=   =       10392   345     CCCTAACCCTAACCCTAACCCTAACCCTAACACTAAGCCA        AAAAA/EEAEEAEAEEAEE<EEE/EEEE//E/EEAE//<<        MD:Z:27C4C3     NH:i:1H
I:i:0   NM:i:2  UQ:i:28 YZ:Z:0  RG:Z:A1 MQ:i:6  MC:Z:4=1D22=5S
NB501242:295:HV7G2BGXV:4:11610:20759:10780      147     chr1    10392   6       4=1D22=5S       =       10074   -345    CCTACCCCTAACCCTAACCCTAACCCTAACC E//E/EAEEAEEEEAEEEEE6EEEEEAAAAA MD:Z:4^A22      NH:i:1  HI:i:0  NM:i:1  UQ:i:0  YZ:Z:0R
G:Z:A1  MQ:i:9  MC:Z:4S27=1X4=1X3=

here r1 attributes are

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vta
ble__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'aend', 'alen', 'aligned_pairs', 'bin', 'blocks', 'cigar', 'cigarstring', 'cigartuples', 'compare', 'flag', 'from_dict', 'fromstring', 'get_aligned_pairs', 'get_blocks', 'get_cigar_stats', 'get_forward_qualities', 'get_forward_sequence', 'get_overlap', 'get_reference_positions', 'get_reference_sequence', 'get_tag', 'get_tags', 'has_tag', 'header', '
infer_query_length', 'infer_read_length', 'inferred_length', 'is_duplicate', 'is_paired', 'is_proper_pair', 'is_qcfail', 'is_read1', 'is_read2', 'is_reverse', 'is_secondary', 'is_supplementary', 'is_unmapped', 'isize', 'mapping_quality', '
mapq', 'mate_is_reverse', 'mate_is_unmapped', 'mpos', 'mrnm', 'next_reference_id', 'next_reference_name', 'next_reference_start', 'opt', 'overlap', 'pnext', 'pos', 'positions', 'qend', 'qlen', 'qname', 'qqual', 'qstart', 'qual', 'query', '
query_alignment_end', 'query_alignment_length', 'query_alignment_qualities', 'query_alignment_sequence', 'query_alignment_start', 'query_length', 'query_name', 'query_qualities', 'query_sequence', 'reference_end', 'reference_id', 'reference_length', 'reference_name', 'reference_start', 'rlen', 'rname', 'rnext', 'seq', 'setTag', 'set_tag', 'set_tags', 'tags', 'template_length', 'tid', 'tlen', 'to_dict', 'to_string', 'tostring']

So the solution is for line 178 and following

        if not r1.is_reverse and r2.is_reverse:
            r_start, r_end = r1.reference_start, r2.reference_end
        elif r1.is_reverse and not r2.is_reverse:
            r_start, r_end = r2.reference_start, r1.reference_end
        else:
            exit(f"Read pair R1 {r1} and R2 {r2} are not complementary! Either both are forward or reverse reads!")

Btw. is_reverse attribute is of type bool, whereas the & operator is made for bitwise comparision. You may change it to and

Plase also consider to add a proper shebang line to your script.

#!/usr/bin/env python3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants