Best practices to go from raw data to a clean seqdataset #10

MiqG · 2024-11-13T13:10:57Z

Hi,

thanks for developing such an efficient and needed tool!

I have been looking around this and other ML4GLand repositories for examples of best practices for reading a genome FASTA plus a BAM or BED file to produce one-hot encoded sequences and the corresponding coverage arrays. However, in most cases I only see references to an already existing zarr object. Is an example of building such a dataset already available?

I saw the API documentation reference and can guess how to do it, but I am unsure whether I would end up doing it in the most efficient way. I hope I did not miss something...

Thanks very much in advance, best,

Miquel

adamklie · 2024-11-13T16:45:10Z

Hi Miquel,

Thanks for working with the package!

We are working on releasing some documentation for this, but in the meantime here is a draft tutorial for reading from bigwig and BAM files: https://github.com/ML4GLand/SeqData/blob/docs/docs/tutorials/2_Reading_Tracks.ipynb

What kind of sequencing data are you working with? My current recommendation is to first generate bigwig files from your BAMs and then load those into SeqData, though whether that makes sense for you may depend on your data type.

Adam
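
For anyone landing here before the docs are out, below is a minimal NumPy sketch of the two arrays the question is about: a one-hot encoded sequence and a per-base coverage track. It is not SeqData's API and not necessarily how SeqData does it internally; the function names `one_hot_encode` and `coverage_from_intervals` are hypothetical, and the intervals are assumed to be 0-based, half-open (BED-style) alignments on a single chromosome. It only shows the target shapes, so for real data the linked tutorial is the place to start.

```python
import numpy as np

ALPHABET = "ACGT"  # column order of the one-hot encoding (an assumption)


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) uint8 array; bases outside ACGT (e.g. N) become all-zero rows."""
    # Lookup table from ASCII code to column index, -1 for anything unknown.
    lookup = np.full(256, -1, dtype=np.int8)
    for i, base in enumerate(ALPHABET):
        lookup[ord(base)] = i
        lookup[ord(base.lower())] = i
    codes = lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]
    out = np.zeros((len(seq), len(ALPHABET)), dtype=np.uint8)
    valid = codes >= 0
    out[np.nonzero(valid)[0], codes[valid]] = 1
    return out


def coverage_from_intervals(intervals, start: int, end: int) -> np.ndarray:
    """Per-base depth over [start, end) from half-open (s, e) intervals."""
    # Difference array: +1 where an interval starts, -1 where it ends.
    diff = np.zeros(end - start + 1, dtype=np.int32)
    for s, e in intervals:
        s_clip, e_clip = max(s, start), min(e, end)
        if s_clip < e_clip:
            diff[s_clip - start] += 1
            diff[e_clip - start] -= 1
    # The running sum turns the start/end marks into depth at each position.
    return np.cumsum(diff[:-1])


# Toy inputs standing in for a FASTA window and its overlapping alignments.
ohe = one_hot_encode("ACGTN")
cov = coverage_from_intervals([(2, 5), (4, 8)], start=0, end=10)
```

Here `ohe` has one row per base and `cov` has one depth value per position in the window, which is the sequence/track pairing the question describes.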
