-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization: add ability to ignore sequences when loading a GFA file #25
Comments
Hi @fedarko I implemented a new option 'ignore_sequences' in the 'from_file' method. See the branch 'opt_ignore_seqs' of this repository. I don't know if this really solves the problem with such a huge file. Let me know! |
Hi @ggonnella, Thank you for the quick response! That's great news. I downloaded the |
Exactly... Sorry.
Now it's there.
|
Thank you! I downloaded the fixes and tried running If I can continue asking for things :), I have a suspicion that a reason for this still taking a while is that these long sequences are still being read into memory and passed along—they are only being explicitly overwritten after some time, and in the meantime there are still references to It's a bit of a hack, but I wonder if adjusting In any case, thank you for implementing this as is—I really appreciate it! |
I think it can be done, although it's a bit of a "hack", but justified by the goal. I'll try that.
|
I implemented a version doing that. Unfortunately I have to do something else now, but in preliminary tests it seems to work and could be faster then the previous version if you want to try it grab the newest commit in the opt_ignore_seqs branch... |
Thank you for your work on Gfapy!
I have a very large GFA file (over 3 GB), including ~80k segments that I'd like to load using Gfapy. Although this graph contains sequence information for each segment, I don't really care about the segments' sequences—I just want to know the topology of the graph, and the lengths of each segment.
Gfapy takes a very long time to load this file (I'm not sure if it will ever load before running out of memory: I killed it after a few minutes on our cluster). It ended up being easier for me to write a simple [albeit much less robust than Gfapy :) -- i.e. without GFA2 support] line-by-line parser which completely ignores the segments' sequences.
This line-by-line parser takes only a few seconds to run on the aforementioned large GFA file, and I am fairly sure that the reason for this speed-up is that my parser completely ignores segments' sequences (and instead treats each segment line as if it were written as
S \t segname \t * \t LN:i:100
or something). I suspect this is also the reason why Gfapy's parser took a long amount of time in issue #23 in this repository; from going through the script that generated the GFA file discussed there, it seems like this script includes sequences in the GFA file, and thus causes Gfapy to spend a lot of time loading these (usually relatively large) sequences into memory.I am writing some code now to work with arbitrary GFA files, so I'd really like to be able to use Gfapy; however, the long amount of time it takes to load files is a blocker for me. I would therefore like to propose that Gfapy's
Gfa.from_file()
function add an optional flag parameter to ignore sequences in segment lines. I'm not sure how easy this would be to implement in Gfapy's code, but I think this could really increase the sizes of graphs that Gfapy could work with.Thanks for your consideration, and thanks again for your work on this project.
The text was updated successfully, but these errors were encountered: