Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: add ability to ignore sequences when loading a GFA file #25

Open
fedarko opened this issue May 13, 2022 · 6 comments
Open
Assignees

Comments

@fedarko
Copy link

fedarko commented May 13, 2022

Thank you for your work on Gfapy!

I have a very large GFA file (over 3 GB), including ~80k segments that I'd like to load using Gfapy. Although this graph contains sequence information for each segment, I don't really care about the segments' sequences—I just want to know the topology of the graph, and the lengths of each segment.

Gfapy takes a very long time to load this file (I'm not sure if it will ever load before running out of memory: I killed it after a few minutes on our cluster). It ended up being easier for me to write a simple [albeit much less robust than Gfapy :) -- i.e. without GFA2 support] line-by-line parser which completely ignores the segments' sequences.

This line-by-line parser takes only a few seconds to run on the aforementioned large GFA file, and I am fairly sure that the reason for this speed-up is that my parser completely ignores segments' sequences (and instead treats each segment line as if it were written as S \t segname \t * \t LN:i:100 or something). I suspect this is also the reason why Gfapy's parser took a long amount of time in issue #23 in this repository; from going through the script that generated the GFA file discussed there, it seems like this script includes sequences in the GFA file, and thus causes Gfapy to spend a lot of time loading these (usually relatively large) sequences into memory.

I am writing some code now to work with arbitrary GFA files, so I'd really like to be able to use Gfapy; however, the long amount of time it takes to load files is a blocker for me. I would therefore like to propose that Gfapy's Gfa.from_file() function add an optional flag parameter to ignore sequences in segment lines. I'm not sure how easy this would be to implement in Gfapy's code, but I think this could really increase the sizes of graphs that Gfapy could work with.

Thanks for your consideration, and thanks again for your work on this project.

@ggonnella
Copy link
Owner

Hi @fedarko

I implemented a new option 'ignore_sequences' in the 'from_file' method. See the branch 'opt_ignore_seqs' of this repository.

I don't know if this really solves the problem with such a huge file. Let me know!

@ggonnella ggonnella self-assigned this May 13, 2022
@fedarko
Copy link
Author

fedarko commented May 13, 2022

Hi @ggonnella,

Thank you for the quick response! That's great news.

I downloaded the opt_ignore_seqs branch, but it doesn't look like the ignore_sequences option is available -- as far as I can tell, the opt_ignore_seqs branch is equal to the master branch. Maybe some code is un-pushed?

@ggonnella
Copy link
Owner

ggonnella commented May 13, 2022 via email

@fedarko
Copy link
Author

fedarko commented May 13, 2022

Thank you! I downloaded the fixes and tried running Gfa.from_file() on the big > 3 GB GFA file (using ignore_sequences=True). After a few minutes, the from_file() operation is unfortunately still running.

If I can continue asking for things :), I have a suspicion that a reason for this still taking a while is that these long sequences are still being read into memory and passed along—they are only being explicitly overwritten after some time, and in the meantime there are still references to line, gfa_line, data, etc. being kept around (and these strings necessarily include the sequence as a part of them).

It's a bit of a hack, but I wonder if adjusting Gfa.read_file() to detect if line is a segment and, if ignore_sequences is True, "modifying" the line to not include this segment, and then keeping everything downstream the same, might be more efficient?

In any case, thank you for implementing this as is—I really appreciate it!

@ggonnella
Copy link
Owner

ggonnella commented May 13, 2022 via email

@ggonnella
Copy link
Owner

I implemented a version doing that. Unfortunately I have to do something else now, but in preliminary tests it seems to work and could be faster then the previous version

if you want to try it grab the newest commit in the opt_ignore_seqs branch...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants