ML newbie trying to run Baskerville #27
Comments
Hi Jess, we haven't completely ported the data preprocessing code into this new repository. I can prioritize that for you. In the case of your non-model species, you'll need to start from scratch. But for the model species, assuming it's human or mouse, you can consider transfer learning from our pretrained model. We're working on scripts for that now. Although I obviously like the tools we develop, they don't necessarily surpass simpler methods for peak data like ATAC-seq, where distal interactions aren't as important. You might also consider ChromBPNet from Anshul Kundaje's group, which is able to model the Tn5 cutting bias and nucleotide-precision cut sites from ATAC. https://github.com/kundajelab/chrombpnet
Hi Dave, Thanks so much for your response! I'll have a look at ChromBPNet and come back to Baskerville once the preprocessing code is ported over. Many thanks,
Hi Dave, Thanks again for all your help. I'd really like to apply Baskerville if possible, so do you have a rough estimate of when the preprocessing code might be ported over? For my application, I think transfer learning from your pretrained model should work, so do you know roughly how long it might take for those scripts to become available? Many thanks,
Hi Jess, I just ported the data preprocessing code and pulled it into the main branch. You'll basically need to make a targets table similar to the one we used for Borzoi here: https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt Then here's an example of how we ran the scripts for the recent Borzoi dataset: https://github.com/calico/borzoi/tree/main/src/scripts/data/training_data Reach out if you have additional questions. We'll aim to bring in the new transfer learning script next. Best,
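For readers following along, here is a minimal sketch of writing such a targets table as tab-separated text. It is not taken from the Baskerville or Borzoi code. The column names are an assumption modeled on the linked targets_human.txt and should be checked against that file; the function name `build_targets_table`, the file paths, and all numeric values are hypothetical placeholders.

```python
import csv
import io

# ASSUMPTION: column names modeled on the linked Borzoi targets_human.txt;
# verify against that file before relying on them.
COLUMNS = ["identifier", "file", "clip", "clip_soft", "scale",
           "sum_stat", "strand_pair", "description"]


def build_targets_table(tracks):
    """Return a tab-separated targets table as a string.

    tracks: list of dicts with 'identifier', 'file', and 'description' keys.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # The first header cell is left empty for the row-index column.
    writer.writerow([""] + COLUMNS)
    for i, track in enumerate(tracks):
        writer.writerow([
            i,
            track["identifier"],
            track["file"],
            32,     # clip: placeholder value, not a recommendation
            16,     # clip_soft: placeholder value
            1.0,    # scale: placeholder value
            "sum",  # sum_stat: placeholder value
            i,      # strand_pair: ASSUMPTION, own index for unstranded data
            track["description"],
        ])
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical ATAC-seq coverage tracks.
    example = [
        {"identifier": "ATAC_1", "file": "/path/to/sample1.bw",
         "description": "ATAC:sample1"},
        {"identifier": "ATAC_2", "file": "/path/to/sample2.bw",
         "description": "ATAC:sample2"},
    ]
    print(build_targets_table(example), end="")
```

Redirect the printed output to a file (for example `targets.txt`) and compare its header with the Borzoi example before passing it to the preprocessing scripts.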
Hi @davek44, I hope it's ok to follow up on this thread as I am also new to ML on sequences. I aim to obtain a representative feature vector of each genomic bin (say 50kb), possibly incorporating other epigenetic data like accessibility and histone mark tracks. This will ultimately be used for a classification task that would benefit from this well-built sequence representation, or at least that is my intuition. I am working with Gian |
Hi Gian, this is a different enough question that I'd recommend you open a separate issue. But yes, moving to hg19 should be fine.
Hi Dave, Thanks for making this great work available! I just have a question about dropping the alignment between human and mouse in the makefile approach you specified above. Thanks!
Hi, for a single genome, you'll simply skip the hound_data_align.py command and run hound_data.py without the --restart option and adding the |
Hello,
I'm very new to machine learning and want to run Baskerville (as suggested by @davek44 on the Basenji page).
I was wondering if you had any beginner-friendly guidance for how to implement Baskerville (e.g. how to pre-process the data and split into train/test/validation, how to train the model, etc)? I was relying quite heavily on the Basenji ipynb tutorials and I'm a little confused how to use Baskerville.
I'm hoping to supply ATAC-seq training data to the model and use the output to investigate deleterious variants in regulatory elements of model & non-model species, so Baskerville seems like the ideal tool to use, but I'm unfortunately a bit of a beginner!
Many thanks,
Jess