Best practices to go from raw data to a clean seqdataset #10

Open
MiqG opened this issue Nov 13, 2024 · 1 comment

MiqG commented Nov 13, 2024

Hi,

thanks for developing such an efficient and needed tool!

I have been looking through this and other ML4GLand repositories for examples of best practices for reading a genome FASTA and a BAM or BED file to produce one-hot encoded sequences and the corresponding coverage arrays. However, most of the examples I found refer to an already existing zarr object. Is an example of building such a dataset from raw files already available?
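To make the target concrete, here is a minimal sketch in plain NumPy of the two arrays in question: a one-hot encoded sequence and a per-base coverage array computed from BED-style intervals. This is not the SeqData API. The helper names `one_hot_encode` and `interval_coverage`, the `ACGT` column order, and the choice to encode `N` as an all-zero row are assumptions made for illustration only.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order, not taken from SeqData


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) uint8 array; bases outside ACGT (e.g. N) become all-zero rows."""
    # Lookup table from ASCII byte to column index, -1 for anything unknown.
    lookup = np.full(256, -1, dtype=np.int64)
    for i, base in enumerate(ALPHABET):
        lookup[ord(base)] = i
    codes = lookup[np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)]
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.uint8)
    valid = codes >= 0
    encoded[np.nonzero(valid)[0], codes[valid]] = 1
    return encoded


def interval_coverage(intervals, region_start: int, region_end: int) -> np.ndarray:
    """Count, per base, the intervals overlapping [region_start, region_end).

    Intervals are (start, end) pairs, 0-based and half-open as in BED.
    """
    length = region_end - region_start
    diff = np.zeros(length + 1, dtype=np.int64)
    for start, end in intervals:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:  # skip intervals that fall outside the region
            diff[lo] += 1
            diff[hi] -= 1
    # A running sum of the +1/-1 markers gives the depth at each position.
    return np.cumsum(diff[:-1])


onehot = one_hot_encode("ACGTN")
coverage = interval_coverage([(0, 3), (2, 5)], region_start=0, region_end=5)
print(onehot.shape, coverage.tolist())
```

For whole genomes this per-region approach would need to be batched and written to disk, which is the part SeqData is meant to handle.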

I saw the API documentation reference and can guess how to do it, but I am unsure whether I would end up doing it in the most efficient way. I hope I did not miss something...

Thanks very much in advance, best,

Miquel

adamklie (Collaborator) commented

Hi Miquel,

Thanks for working with the package!

We are working on releasing some documentation for this, but in the meantime here is a draft tutorial for reading from bigWig and BAM files: https://github.com/ML4GLand/SeqData/blob/docs/docs/tutorials/2_Reading_Tracks.ipynb

What kind of sequencing data are you working with? My current recommendation is to first generate bigWig files from your BAMs and then load those into SeqData, but whether that makes sense for you may depend on your data type.
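As a rough sketch of the first step, one common way to turn a BAM into a bigWig is deepTools' `bamCoverage`. The thread does not name a tool, so deepTools is an assumption here, as are the file names and the `build_bamcoverage_command` helper. The snippet only assembles and prints the command; actually running it requires deepTools to be installed and the BAM to be sorted and indexed.

```python
import shlex


def build_bamcoverage_command(bam_path: str, bigwig_path: str, bin_size: int = 1) -> list:
    """Assemble a deepTools bamCoverage call (hypothetical helper, assumed tool choice)."""
    return [
        "bamCoverage",
        "--bam", bam_path,             # input alignments (sorted and indexed)
        "--outFileName", bigwig_path,  # output coverage track
        "--outFileFormat", "bigwig",
        "--binSize", str(bin_size),    # 1 gives base-pair resolution
    ]


cmd = build_bamcoverage_command("sample.bam", "sample.bw")
print(shlex.join(cmd))

# To execute for real, uncomment:
# import subprocess
# subprocess.run(cmd, check=True)
```

Loading the resulting bigWig into SeqData is covered by the draft tutorial linked above, so it is not reproduced here.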

Adam
