ML newbie trying to run Baskerville #27

jesspeers · 2024-04-30T09:35:50Z

Hello,

I'm very new to machine learning and wanting to run Baskerville (as suggested by @davek44 on the Basenji page).

I was wondering if you had any beginner-friendly guidance on how to use Baskerville (e.g. how to pre-process the data and split it into train/test/validation sets, how to train the model, etc.)? I was relying quite heavily on the Basenji ipynb tutorials and I'm a little confused about how to use Baskerville.

I'm hoping to supply ATAC-seq training data to the model and use the output to investigate deleterious variants in regulatory elements of model & non-model species, so Baskerville seems like the ideal tool to use, but I'm unfortunately a bit of a beginner!

Many thanks,
Jess

davek44 · 2024-05-03T00:18:26Z

Hi Jess, we haven't completely ported the data preprocessing code into this new repository. I can prioritize that for you. In the case of your non-model species, you'll need to start from scratch. But for the model species, assuming it's human or mouse, you can consider transfer learning from our pretrained model. We're working on scripts for that now.

Although I obviously like the tools we develop, they don't necessarily surpass simpler methods for peak data like ATAC-seq, where distal interactions aren't as important. You might also consider ChromBPNet from Anshul Kundaje's group, which is able to model the Tn5 cutting bias and nucleotide-precision cut sites from ATAC. https://github.com/kundajelab/chrombpnet

jesspeers · 2024-05-07T13:28:08Z

Hi Dave,

Thanks so much for your response! I'll have a look at ChromBPNet and come back to Baskerville once the preprocessing code is ported over.

Many thanks,
Jess

jesspeers · 2024-05-09T09:14:53Z

Hi Dave,

Thanks again for all your help. I'd really like to apply Baskerville if possible, so do you have a rough estimate of when the preprocessing code might be ported over?

For my application, I think transfer learning from your pretrained model should work, so do you know roughly how long it might take for those scripts to become available?

Many thanks,
Jess

davek44 · 2024-05-17T18:38:04Z

Hi Jess,

I just ported the data preprocessing code and merged it into the main branch. You'll basically need to make a targets table similar to the one we used for Borzoi here: https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt
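
For readers who want a starting point, a targets table of that shape could be generated along these lines. This is only a sketch: the column names are an assumption modeled on the linked Borzoi file and should be checked against it, and every identifier, path, and value in the example rows is a placeholder.

```python
import csv
import io

# Assumed column layout, modeled on the linked Borzoi targets file.
# Verify against that file before relying on it.
COLUMNS = ["identifier", "file", "clip", "clip_soft", "scale",
           "sum_stat", "strand_pair", "description"]


def format_targets_table(rows):
    """Return a tab-separated targets table with an unnamed leading index column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + COLUMNS)
    for index, row in enumerate(rows):
        writer.writerow([index] + [row[name] for name in COLUMNS])
    return buffer.getvalue()


# Hypothetical ATAC-seq tracks. strand_pair is assumed to point at the
# track's own index for unstranded data.
example_rows = [
    {"identifier": "ATAC_rep1", "file": "/path/to/atac_rep1.bw", "clip": 384,
     "clip_soft": 320, "scale": 1.0, "sum_stat": "sum", "strand_pair": 0,
     "description": "ATAC:sample replicate 1"},
    {"identifier": "ATAC_rep2", "file": "/path/to/atac_rep2.bw", "clip": 384,
     "clip_soft": 320, "scale": 1.0, "sum_stat": "sum", "strand_pair": 1,
     "description": "ATAC:sample replicate 2"},
]

print(format_targets_table(example_rows), end="")
```

Redirecting the printed output to a file (for example `targets.txt`) gives a table that can then be compared column by column with the linked one.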

Then here's an example of how we ran the scripts for the recent Borzoi dataset: https://github.com/calico/borzoi/tree/main/src/scripts/data/training_data
Just substitute "hound" for "basenji" in the script names.
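
If you copy a Makefile or shell script from that directory, the rename can be applied mechanically. A minimal sketch, assuming the script names follow the `basenji_<name>.py` pattern; the sample line below is hypothetical and not taken from the actual Makefile.

```python
import re


def basenji_to_hound(text):
    """Rename basenji_<name>.py script references to hound_<name>.py."""
    return re.sub(r"\bbasenji_(\w+\.py)\b", r"hound_\1", text)


# Hypothetical Makefile line, for illustration only.
line = "basenji_data.py -l $(LENGTH) $(FASTA) targets.txt"
print(basenji_to_hound(line))
```

Only names of the form `basenji_<name>.py` are rewritten, so a directory that happens to be called `basenji` is left alone.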

Reach out if you have additional questions. We'll aim to bring in the new transfer learning script next.

Best,
David

GMFranceschini · 2024-05-21T09:37:44Z

Hi @davek44, I hope it's ok to follow up on this thread as I am also new to ML on sequences.

I aim to obtain a representative feature vector of each genomic bin (say 50kb), possibly incorporating other epigenetic data like accessibility and histone mark tracks. This will ultimately be used for a classification task that would benefit from this well-built sequence representation, or at least that is my intuition.

I am working with hg19; would it be straightforward to start from a pre-trained model and get "embeddings" for those genomic bins? I am asking if this makes sense and if I am looking at the correct repo.
Thank you,

Gian

davek44 · 2024-05-28T04:35:11Z

Hi Gian, this is a different enough question that I'd recommend you open a separate issue. But yes, moving to hg19 should be fine.

DavidvanBruggen · 2024-08-14T12:55:45Z

Hi Dave,

Thanks for making this great work available!

Just a question about dropping the alignment between human and mouse in the makefile approach you specified above.
I want to train a Borzoi model on mouse only, dropping the alignment steps. Can you tell me how to run hound_data.py properly? At the moment it is not obvious to me.

Thanks!

davek44 · 2024-08-17T00:05:15Z

Hi, for a single genome, simply skip the hound_data_align.py command and run hound_data.py without the --restart option, adding the -l $(LENGTH), --stride $(TSTRIDE), and --umap_t 0.5 options (which were previously handled at the align stage for multiple genomes).
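
As a rough illustration of the single-genome call described above, the command line could be assembled like this. The three flags come from the comment; the positional FASTA and targets arguments, the file names, and the numeric values are assumptions for illustration, and any other options from the example Makefile would still need to be carried over.

```python
import shlex


def single_genome_command(fasta, targets, length, stride, umap_t=0.5):
    """Build a hound_data.py argument list for one genome (no --restart)."""
    return [
        "hound_data.py",
        "-l", str(length),
        "--stride", str(stride),
        "--umap_t", str(umap_t),
        fasta,    # assumed positional argument: genome FASTA
        targets,  # assumed positional argument: targets table
    ]


# Hypothetical file names and values; substitute your own LENGTH and TSTRIDE.
command = single_genome_command(
    "mm10.fa", "targets_mouse.txt", length=524288, stride=43691
)
print(shlex.join(command))
```

Building the call as a list keeps each flag next to its value and makes it easy to confirm that `--restart` is absent before passing the list to `subprocess.run`.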
