Investigate alternatives to gffutils in-memory data #264

Open
jsstevenson opened this issue Sep 28, 2023 · 3 comments
Labels
  • performance: Improvements to performance
  • priority:medium: Medium priority
  • technical debt: A feature/requirement implemented in a sub-optimal way & must be re-written. Contrast to "cleanup"

Comments

@jsstevenson
Member

I've been seeing severe slowdowns when reading the NCBI annotations into memory with gffutils. If this persists, there are at least two possible alternatives:

  • Keep gffutils, but store the data in a SQLite DB (see below). We could save the DB in the data folder and check whether it already exists, so that we don't have to re-create it on every run. I think this will be pretty fast once the DB is created.
    db = gffutils.create_db("gene/data/ncbi/ncbi_GRCh38.p14.gff", "tmp.db", force=True, merge_strategy="create_unique", keep_order=True)
  • Investigate alternatives. gffpandas came up in a quick Google search, and it seems like a good fit. I bet we could roll our own as well (gffpandas is, like, < 100 lines of actual code).
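The "create once, reuse afterwards" idea from the first option could look roughly like the sketch below. `load_annotation_db` and its `gffutils_module` parameter are hypothetical names made up for illustration, not part of this project or of gffutils. The only gffutils calls used are `create_db` (same arguments as above, minus `force=True`, since the DB is only built when the file is missing) and `FeatureDB` to reopen an existing file.

```python
from pathlib import Path


def load_annotation_db(gff_path, db_path, gffutils_module=None):
    """Open a cached gffutils SQLite DB, building it on first use.

    `gffutils_module` is a hypothetical hook for swapping in a stand-in
    during tests; by default the real gffutils package is imported lazily.
    """
    if gffutils_module is None:
        import gffutils as gffutils_module

    db_path = Path(db_path)
    if db_path.exists():
        # Cache hit: reopen the existing SQLite file, no GFF parsing needed.
        return gffutils_module.FeatureDB(str(db_path))

    # Cache miss: parse the GFF once and persist the result to db_path.
    return gffutils_module.create_db(
        str(gff_path),
        str(db_path),
        merge_strategy="create_unique",
        keep_order=True,
    )
```

This assumes a DB file on disk is always complete. An interrupted build could leave a partial file behind, and a refreshed GFF would need the cached file deleted; neither case is handled here.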
@jsstevenson added the priority:medium, technical debt, and performance labels on Sep 28, 2023
@jsstevenson
Member Author

Hearing reports of installation issues with gffutils on Intel Macs. Investigate (maybe pin an older version?)

@jsstevenson
Member Author

Mentioned this on #299, but there are a few reasons I'd like to use something else:

  • gffutils obscures some of the iterability, so you can't nest it in a progress bar
  • it's weirdly slow. It shouldn't be this slow; the file is literally just a TSV
  • it probably provides a lot of tooling that we truly do not need

A few options:

  • gffpandas -- brings in a pandas dependency which is unfortunate
  • use polars -- but then we have a polars dependency which is less unfortunate but still not ideal
  • manually operate on the TSV -- most likely going to take a shot at this
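For the last option, a minimal stdlib-only sketch is below. It is not the implementation that ended up in the project; `GffRecord`, `parse_attributes`, and `iter_gff` are hypothetical names. It assumes standard GFF3 layout: nine tab-separated columns, `#` comment/directive lines, an optional `##FASTA` section ending the feature rows, and a percent-encoded `key=value;key=value` attributes column. Multi-valued attributes (comma-separated) are left as raw strings.

```python
from typing import Dict, Iterable, Iterator, NamedTuple
from urllib.parse import unquote


class GffRecord(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_attributes(field: str) -> Dict[str, str]:
    """Split the ninth column into a dict, decoding percent-escapes."""
    if field == ".":
        return {}
    attrs = {}
    for pair in field.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attrs[unquote(key)] = unquote(value)
    return attrs


def iter_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Lazily yield one GffRecord per feature line."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # everything after this is sequence data, not features
        if not line or line.startswith("#"):
            continue  # blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"line {lineno}: expected 9 columns, got {len(cols)}")
        seqid, source, type_, start, end, score, strand, phase, attrs = cols
        yield GffRecord(
            seqid, source, type_, int(start), int(end),
            score, strand, phase, parse_attributes(attrs),
        )
```

Because `iter_gff` is a plain generator over any iterable of lines, an open file handle can be passed in directly and the whole thing can be wrapped in a progress bar, which addresses the iterability complaint above.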


This issue is stale because it has been open 135 days with no activity. This issue will be closed if no further activity occurs in 14 days.
