ML newbie trying to run Baskerville #27

Open
jesspeers opened this issue Apr 30, 2024 · 8 comments


@jesspeers

Hello,

I'm very new to machine learning and want to run Baskerville (as suggested by @davek44 on the Basenji page).

I was wondering if you had any beginner-friendly guidance for how to use Baskerville (e.g. how to pre-process the data and split it into train/test/validation sets, how to train the model, etc.)? I was relying quite heavily on the Basenji ipynb tutorials and I'm a little confused about how to use Baskerville.

I'm hoping to supply ATAC-seq training data to the model and use the output to investigate deleterious variants in regulatory elements of model & non-model species, so Baskerville seems like the ideal tool to use, but I'm unfortunately a bit of a beginner!

Many thanks,
Jess

@davek44
Collaborator

davek44 commented May 3, 2024

Hi Jess, we haven't completely ported the data preprocessing code into this new repository. I can prioritize that for you. For your non-model species, you'll need to train from scratch. But for the model species, assuming it's human or mouse, you can consider transfer learning from our pretrained model. We're working on scripts for that now.

Although I obviously like the tools we develop, they don't necessarily surpass simpler methods for peak data like ATAC-seq, where distal interactions aren't as important. You might also consider ChromBPNet from Anshul Kundaje's group, which can model the Tn5 cutting bias and nucleotide-precision cut sites from ATAC. https://github.com/kundajelab/chrombpnet

@jesspeers
Author

Hi Dave,

Thanks so much for your response! I'll have a look at ChromBPNet and come back to Baskerville once the preprocessing code is ported over.

Many thanks,
Jess

@jesspeers
Author

Hi Dave,

Thanks again for all your help. I'd really like to apply Baskerville if possible, so do you have a rough estimate of when the preprocessing code might be ported over?

For my application, I think transfer learning from your pretrained model should work, so do you know roughly how long it might take for those scripts to become available?

Many thanks,
Jess

@davek44
Collaborator

davek44 commented May 17, 2024

Hi Jess,

I just ported the data preprocessing code and merged it into the main branch. You'll basically need to make a targets table similar to the one we used for Borzoi here: https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt
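A minimal sketch of assembling such a targets table for two hypothetical ATAC-seq coverage tracks. The column names are an assumption modeled on the linked Borzoi `targets_human.txt`, and the identifiers, file paths, and clip/scale values are placeholders rather than recommended settings, so check them against the header of the linked file before use.

```python
import csv
import io

# Column names are an ASSUMPTION modeled on the linked Borzoi targets file;
# verify against its header before use.
COLUMNS = ["identifier", "file", "clip", "clip_soft", "scale",
           "sum_stat", "strand_pair", "description"]

# Hypothetical ATAC-seq tracks: identifiers, paths, and descriptions
# are placeholders.
tracks = [
    ("ATAC_rep1", "/path/to/atac_rep1.bw", "ATAC:sample replicate 1"),
    ("ATAC_rep2", "/path/to/atac_rep2.bw", "ATAC:sample replicate 2"),
]


def build_targets_table(tracks, clip=384, clip_soft=320, scale=1.0,
                        sum_stat="sum"):
    """Return a tab-separated targets table as a string.

    The numeric defaults are placeholders, not tuned values.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Leading empty header cell: the first column holds a row index.
    writer.writerow([""] + COLUMNS)
    for index, (identifier, path, description) in enumerate(tracks):
        # ASSUMPTION: an unstranded track lists its own index as strand_pair.
        writer.writerow([index, identifier, path, clip, clip_soft,
                         scale, sum_stat, index, description])
    return buf.getvalue()


targets_tsv = build_targets_table(tracks)
print(targets_tsv)
```

Writing the string to a file such as `targets_atac.txt` (a hypothetical name) would give a table to pass to the preprocessing scripts.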

Then here's an example of how we ran the scripts for the recent Borzoi dataset: https://github.com/calico/borzoi/tree/main/src/scripts/data/training_data
Just replace "basenji" with "hound" in the script names (for example, basenji_data.py becomes hound_data.py).
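The renaming rule can be sketched as a small helper. The function name `to_hound_name` is hypothetical, and the sketch only rewrites the prefix; it does not confirm that every renamed script exists in Baskerville.

```python
def to_hound_name(script_name: str) -> str:
    """Map a Basenji-style script name to its 'hound' counterpart."""
    prefix = "basenji"
    if script_name.startswith(prefix):
        return "hound" + script_name[len(prefix):]
    # Names without the "basenji" prefix are left untouched.
    return script_name


for name in ["basenji_data.py", "basenji_data_align.py"]:
    print(name, "->", to_hound_name(name))
```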

Reach out if you have additional questions. We'll aim to bring in the new transfer learning script next.

Best,
David

@GMFranceschini

Hi @davek44, I hope it's ok to follow up on this thread as I am also new to ML on sequences.

I aim to obtain a representative feature vector of each genomic bin (say 50kb), possibly incorporating other epigenetic data like accessibility and histone mark tracks. This will ultimately be used for a classification task that would benefit from this well-built sequence representation, or at least that is my intuition.

I am working with hg19; would it be straightforward to start from a pre-trained model and get "embeddings" for those genomic bins? I am asking if this makes sense and if I am looking at the correct repo.
Thank you,

Gian

@davek44
Collaborator

davek44 commented May 28, 2024

Hi Gian, this is a different enough question that I'd recommend you open a separate issue. But yes, moving to hg19 should be fine.

@DavidvanBruggen

Hi Dave,

Thanks for making this great work available!

I have a question about dropping the alignment between human and mouse in the Makefile approach you described above.
I want to train a Borzoi model on mouse only and drop the alignment steps. Can you tell me how to run hound_data.py properly? At the moment it is not obvious to me.

Thanks!

@davek44
Collaborator

davek44 commented Aug 17, 2024

Hi, for a single genome, simply skip the hound_data_align.py command and run hound_data.py without the --restart option, adding the -l $(LENGTH), --stride $(TSTRIDE), and --umap_t 0.5 options (which were previously handled at the align stage for multiple genomes).
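A minimal sketch of the single-genome invocation described above, written as a helper that assembles the argument list. The options `-l`, `--stride`, and `--umap_t` come from the reply; the positional FASTA and targets arguments, the file names, and the numeric length and stride are assumptions for illustration, so take the real values from `$(LENGTH)` and `$(TSTRIDE)` in the Makefile and confirm the argument order with the script's help output.

```python
import shlex


def single_genome_command(fasta, targets, length, stride, umap_t=0.5,
                          extra_options=()):
    """Build an argv list for a single-genome hound_data.py run."""
    cmd = [
        "hound_data.py",
        "-l", str(length),        # $(LENGTH) in the Makefile
        "--stride", str(stride),  # $(TSTRIDE) in the Makefile
        "--umap_t", str(umap_t),
        # No --restart: per the reply above, it is dropped for one genome.
    ]
    # Any other options already used in the Makefile carry over unchanged.
    cmd.extend(extra_options)
    # ASSUMPTION: the script takes the FASTA and targets table positionally.
    cmd.extend([fasta, targets])
    return cmd


# Placeholder file names and sizes, not recommended settings.
cmd = single_genome_command("mm10.fa", "targets_mouse.txt",
                            length=131072, stride=65536)
print(shlex.join(cmd))
```

Building the list without executing it makes it easy to inspect the command before passing it to `subprocess.run`.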


4 participants