The goal of this project is to make a researcher's job easier by integrating cutting-edge AI tools related to -omics data (currently with a heavy focus on proteomics) and to assist in hypothesis generation as well as data curation from numerous sources.
There are certain aspects of doing research that are time consuming but not rewarding. Some of these include cumbersome integration of different modalities, figuring out how to reliably combine pipelines whose dependencies are not compatible, and literature search, and re-search, and re-search.
As mentioned above, the aim is to be able to write a script that can access numerous online resources, curate data and papers in a way that is compatible with different tools, manage orchestration between packages, etc.
- Why are you not writing a pipeline using Snakemake or WDL or whatever?
The aim is not to write a pipeline but to build a framework that can adapt itself (not in an agentic way) as it gathers new data. This is not about getting from point A to point B; a pipeline works great for that, and if you can containerize your pipeline this framework can call it (in the very near future; the code is not tested yet, see more on that below).
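To make that concrete, here is a minimal sketch of what calling a containerized pipeline might look like. The image name, bind path, and command are placeholders, not the framework's actual interface.

```python
import subprocess

def run_containerized_pipeline(image, command, bind=None):
    """Run a command inside an Apptainer/Singularity image and return stdout.

    `image`, `command`, and `bind` are placeholders; the framework's real
    interface may look different.
    """
    cmd = ["apptainer", "exec"]
    if bind:
        cmd += ["--bind", bind]  # mount host data into the container
    cmd += [image] + list(command)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# e.g. a hypothetical Snakemake workflow packaged as my_pipeline.sif
# run_containerized_pipeline("my_pipeline.sif", ["snakemake", "--cores", "4"], bind="/data:/data")
```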
- What's wrong with agents?
Nothing, I just don't want to pay Google or OpenAI or whoever has the best benchmarks, and I am not sure these models are there yet in terms of answering hypothesis-generating questions. You can use this to create agents too if you want.
- Can you implement tool X?
Maybe, probably. Can you put it in a Singularity/Apptainer container?
This is subject to change and can definitely use some better organization. The focus is on building an MVP that illustrates the basic capabilities and excites people to contribute or hire us. Currently there are 8 main modules, and I will go over them below.
Since this is a (currently) proteomics-based solution, the starting point in my mind is this module, though creating DNA and RNA modules and integrating them with a Genome module is perfectly feasible and I would love to do that.
The Protein module starts with a UniProt ID; it then makes an API call and gathers and sorts all the information available for that protein you like so much.
After all the JSON parsing, you can pass the information stored in that JSON to other modules. While that was the initial idea behind this, there is no reason you cannot use the other modules on their own.
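For flavor, here is a minimal sketch of what the UniProt lookup might look like. The endpoint is UniProt's public REST API; the function name and the fields pulled out are illustrative, not the module's actual interface.

```python
import requests

def fetch_uniprot_entry(accession):
    """Fetch a UniProtKB entry as JSON from the public REST API."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

entry = fetch_uniprot_entry("P69905")  # example: human hemoglobin subunit alpha
sequence = entry["sequence"]["value"]  # amino acid sequence
name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
```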
- In addition to UniProt, it would be nice to get some sort of an interaction network using STRING; they have a pretty powerful API and I think this module can benefit from that (a rough sketch of such a call follows this list).
- Having the option to collect all the homologs as well as the protein itself. That is actually possible now with mmseqs2 (see the Sequence module below), just with extra steps, and you need to know about MSAs.
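The STRING idea is not implemented yet; this is just a minimal sketch against STRING's public API to show what the call might look like. The identifier and species are placeholders.

```python
import requests

def fetch_string_network(identifier, species=9606):
    """Query the STRING API for the interaction network around one protein.

    `identifier` can be a gene name or STRING ID; `species` is an NCBI taxon
    (9606 = human). Both values here are just examples.
    """
    url = "https://string-db.org/api/json/network"
    params = {"identifiers": identifier, "species": species}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# edges = fetch_string_network("HBA1")
# each edge entry lists the two interactors and a combined confidence score
```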
This does some basic protein sequence manipulation and calculates embeddings and MSAs. Currently I have one model for embeddings, but I am going to add a couple more. This also needs support for multiple sequences. Using the MSA feature you can probably get the homologs of your protein and run this iteratively.
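Roughly, computing a sequence embedding looks like the sketch below. I am assuming the fair-esm package and ESM-2 here, which may not be the exact model this module uses; the sequence is a toy example.

```python
import torch
import esm  # fair-esm package; the module's actual embedding model may differ

# load a pretrained ESM-2 model and its batch converter
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# toy input; in the framework this would come from the Protein module
data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33]                   # (1, L+2, 1280)
embedding = per_residue[0, 1:len(data[0][1]) + 1].mean(0)  # mean-pool residues
```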
There might be some value in using generative models here, and if possible maybe in applying RAG methods.
Similar to the Sequence module: basic processing and I/O using already existing packages; there is nothing new there. There is an MSA-like method for aligning structures, and some simple calculations such as embeddings using ESM3 and structure prediction using AF3. There is a skeleton class that will subclass Structure and hopefully represent a protein complex, again with features like structure prediction and other basic properties.
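A minimal sketch of how that subclassing might be shaped; the class and method names are placeholders, not the repo's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Structure:
    """Illustrative single-structure container; names are placeholders."""
    uniprot_id: str
    coordinates: list  # e.g. per-atom coordinates loaded via an existing parser

    def embed(self):
        """Would call a structure-aware model (e.g. ESM3) on this structure."""
        raise NotImplementedError

@dataclass
class Complex(Structure):
    """Skeleton for a protein complex built from several Structure objects."""
    chains: list = field(default_factory=list)

    def predict(self):
        """Would hand all chains to a structure predictor (e.g. AF3)."""
        raise NotImplementedError
```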
Again, methods for generative models that could take sequence and structure and generate something new with a new function, or the ability to call different models for prediction tasks, would be great. Better structure comparison methods, or integration of structure with the other modules, especially Literature and Knowledgebase, would also help.
Currently, this is a simple container wrapper around RFdiffusion and Pocket2Mol. It also has support for searching large libraries using basic fingerprints (generated by RDKit). It will also have a "verify" section where we re-predict the structure to independently re-create the design and increase our chances of finding an actual positive hit.
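As an illustration of the fingerprint search, here is a minimal sketch using RDKit Morgan fingerprints and Tanimoto similarity. The query, library, and similarity cutoff are placeholders; the module's actual fingerprint settings may differ.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

# hypothetical query and tiny library (SMILES strings)
query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
library = ["c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCO"]

query_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)

hits = []
for smi in library:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip unparsable entries
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(query_fp, fp)
    if sim >= 0.3:  # cutoff is an arbitrary example value
        hits.append((smi, sim))

hits.sort(key=lambda pair: pair[1], reverse=True)
```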
More is more, so more methods and comparison of the outcomes of those methods, etc. Prediction of molecular properties using other models, not just RDKit, would also be helpful.