specifying bloom filter size/overflowerror #10

Open
eocampbe opened this issue Jan 9, 2020 · 7 comments

@eocampbe

eocampbe commented Jan 9, 2020

Hi there,

I am relatively new to Python and am trying to run discoverY.py in female+male mode using the male_contigs.fasta, kmers_from_male_reads, and female reference assembly (female.fasta) files. I am running Python 3.7.4, and all the dependencies are installed properly. I created the kmers_from_male_reads file using DSK as per the readme file, and the command I used to run discoverY.py is:

python discoverY.py --mode female+male --kmer_size 25

When I run this, I get this output:

Started DiscoverY
Mode female+male
Using default of k=25 and input folder='data'
Please set bloom filter size before running this program
Shortlisting Y-contigs
Need to make Bloom Filter of k-mers from female
Traceback (most recent call last):
  File "./discoverY.py", line 59, in <module>
    main()
  File "./discoverY.py", line 54, in main
    classify_ctgs.classify_ctgs(k_size, bloom_filt, female_kmers, mode)
  File "/lustre04/scratch/eocampbe/DiscoverY/scripts/classify_ctgs.py", line 142, in classify_ctgs
    female_kmers_bf = getbloomFilter(bf, fem_kmers, kmer_size)
  File "/lustre04/scratch/eocampbe/DiscoverY/scripts/classify_ctgs.py", line 20, in getbloomFilter
    female_kmers_bf = BloomFilter(bf_size, .001, bf_filename)
  File "src/pybloomfilter.pyx", line 87, in pybloomfilter.BloomFilter.__cinit__
OverflowError: value too large to convert to int

I'm finding it difficult to determine how I might fix this issue. For instance, is the line "Please set bloom filter size before running this program" the source of this error? I can't figure out how I would specify bloom filter size, as there appears to be no option to do so and I can't find any documentation about this in the readme file. Or, is this primarily a memory issue, indicated by the OverflowError? Any help you could give me would be much appreciated!

@md5sam
Contributor

md5sam commented Jan 9, 2020

Hi @eocampbe ,

This is likely an issue with bloom filter size. I have just now merged a Pull Request submitted by @rsharris which lets you specify the bloom filter size, and this might be useful to you.

To do so, please first perform a git pull to get the latest version of DiscoverY. Then see lines 18-20 of discoverY.py, which indicate how to specify the bloom filter size using the command line argument "--female_bloom_capacity".

@rsharris
Contributor

rsharris commented Jan 9, 2020

@eocampbe IIRC, you'll want to specify a bloom filter size that is about the expected length of your genome, minus repeats, i.e. roughly the number of distinct k-mers you expect in your input data. The only downside of setting it too high is that it will use more memory.

I think the default value was about 3G, which relates to the human genome size (but doesn't adjust downward for repeat content). And the corresponding bloom filter data structure was something like 5G bytes.
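
As a rough cross-check of those numbers, the sketch below applies the standard textbook formula for an optimally sized Bloom filter, m = -n ln(p) / (ln 2)^2 bits, using the 0.001 error rate that appears in the BloomFilter(...) call in the traceback above. The function name `bloom_filter_bytes` is made up for illustration, and pybloomfilter's actual allocation may differ from this estimate.

```python
import math

def bloom_filter_bytes(capacity, error_rate):
    """Approximate size in bytes of an optimally sized Bloom filter.

    Textbook estimate only; not taken from pybloomfilter's source.
    """
    bits = -capacity * math.log(error_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

ERROR_RATE = 0.001  # value passed to BloomFilter(...) in the traceback

# Capacities are approximations of the figures mentioned in this thread.
for label, capacity in [("~3G default", 3_000_000_000),
                        ("~214 Mb genome", 214_000_000)]:
    size = bloom_filter_bytes(capacity, ERROR_RATE)
    print(f"{label}: about {size / 1e9:.2f} GB")
```

Under these assumptions a capacity of 3 billion comes out at roughly 5.4 GB, in line with the "something like 5G bytes" recalled above, while a capacity of 214 million needs under 0.4 GB.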

@eocampbe
Author

eocampbe commented Jan 9, 2020

Thank you @md5sam and @rsharris, this is very helpful!

The female genome size I'm working with is ~214 Mb, so I set that value using the --female_bloom_capacity argument, and it seems to be running now.
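
One plausible reading of why the smaller value works where the default did not: "value too large to convert to int" is the message Cython raises when a Python integer does not fit in a C int. The sketch below assumes that the capacity ends up in a 32-bit signed C int and that the old default was around 3 billion, as recalled earlier in this thread. Neither assumption is confirmed against the pybloomfilter or DiscoverY source, and `fits_in_c_int` is a hypothetical helper.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2147483647

def fits_in_c_int(value):
    """Return True if value fits in a 32-bit signed C int (assumed width)."""
    return INT32_MIN <= value <= INT32_MAX

old_default = 3_000_000_000   # "about 3G", an approximation
genome_sized = 214_000_000    # ~214 Mb genome

print("old default fits:", fits_in_c_int(old_default))
print("214 million fits:", fits_in_c_int(genome_sized))
```

If this reading is right, the OverflowError is an integer-range problem rather than a sign of running out of memory, and any capacity up to 2147483647 would avoid it.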

@eocampbe
Author

Hi again @md5sam and @rsharris,

I am now getting another issue when I try to run discoverY.py. When I use the basic command using either a female bloom filter I created OR the example data provided, like this:
python3 ./discoverY.py --mode female+male --female_bloom

I get the following error:
  File "./discoverY.py", line 69, in <module>
    main()
  File "./discoverY.py", line 64, in main
    classify_ctgs.classify_ctgs(k_size, bloom_filt, bf_capacity, female_kmers, mode)
UnboundLocalError: local variable 'bf_capacity' referenced before assignment

Any ideas as to what might be causing this?

@rsharris
Contributor

I'm sorry, that was my mistake.

I'll make a correction to my fork and issue a pull request.

I'm not the owner of this repo, though. So, if you want to get up and running right away, the change will be to add "bf_capacity = None" after line 43 in discoverY.py, so that it looks like this:

    if not args['kmer_size']:
        k_size = 25
        bf_capacity = None 
    else:

You'd need to be sure to use 8 spaces in front of "bf_capacity", not tab characters.
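
For anyone curious why the error appears at all, here is a minimal, self-contained sketch of the same failure pattern. It is a hypothetical simplification, not the actual code in discoverY.py: the function names, the `args` dictionary, and the contents of the `else` branch are all assumptions made for illustration.

```python
def parse_broken(args):
    if not args.get("kmer_size"):
        k_size = 25  # bf_capacity is never assigned on this path
    else:
        k_size = int(args["kmer_size"])
        bf_capacity = args.get("female_bloom_capacity")
    return k_size, bf_capacity  # fails when the default path was taken

def parse_fixed(args):
    if not args.get("kmer_size"):
        k_size = 25
        bf_capacity = None  # the one-line fix: assign on this path too
    else:
        k_size = int(args["kmer_size"])
        bf_capacity = args.get("female_bloom_capacity")
    return k_size, bf_capacity

try:
    parse_broken({})
except UnboundLocalError as err:
    print("broken:", err)

print("fixed:", parse_fixed({}))
```

Python treats `bf_capacity` as local to the function because it is assigned somewhere inside it, so reading it on a path where no assignment has run raises UnboundLocalError. That matches the command above failing when --kmer_size is left off.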

@eocampbe
Author

Great, thanks! I've added that line and it seems to be working now.

@md5sam
Contributor

md5sam commented Feb 19, 2020

Thanks @rsharris, I've now merged your PR.
