The goal of this project is to make a researcher's job easier by integrating cutting-edge AI tools related to -omics data (currently with a heavy focus on proteomics) and to assist in hypothesis generation as well as data curation from numerous sources.
There are certain aspects of doing research that are time consuming but not rewarding. Some of these include cumbersome integration of different modalities, figuring out how to reliably combine pipelines whose dependencies are not compatible, and literature search, and re-search, and re-search.
As mentioned above, the aim is to be able to write a script that can access numerous online resources, curate data and papers in a way that is compatible with different tools, manage orchestration between packages, etc.
- Why are you not writing a pipeline using Snakemake or WDL or whatever?
The aim is not to write a pipeline but to build a framework that can adapt itself (not in an agentic way) as it gathers new data. This is not about getting from point A to point B; a pipeline works great for that, and if you can containerize your pipeline this framework can call it (in the very near future; the code is not tested yet, see more on that below).
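To make that concrete, here is a minimal sketch of what calling a containerized pipeline might look like. The image name, bind path, and command are placeholders, not the framework's actual interface.

```python
import subprocess

def run_containerized_pipeline(image, command, bind=None):
    """Run a command inside an Apptainer/Singularity image and return stdout.

    `image`, `command`, and `bind` are placeholders; the framework's real
    interface may look different.
    """
    cmd = ["apptainer", "exec"]
    if bind:
        cmd += ["--bind", bind]  # mount host data into the container
    cmd += [image] + list(command)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# e.g. a hypothetical Snakemake workflow packaged as my_pipeline.sif
# run_containerized_pipeline("my_pipeline.sif", ["snakemake", "--cores", "4"], bind="/data:/data")
```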
- What's wrong with agents?
Nothing, I just don't want to pay Google or OpenAI or whoever has the best benchmarks, and I am not sure these models are there yet in terms of answering hypothesis-generating questions. You can use this to create agents too if you want.
- Can you implement tool X?
Maybe, probably. Can you put it in a Singularity/Apptainer container?
This is subject to change and can definitely use some better organization. The focus is on building an MVP that illustrates the basic capabilities and excites people to contribute or hire us. Currently there are 8 main modules, and I will go over them below.
Since this is a (currently) proteomics-based solution, the starting point in my mind is this module, though creating DNA and RNA modules and integrating them with a Genome module is perfectly feasible and I would love to do that.
The Protein module starts with a UniProt ID; it then makes an API call and gathers and sorts all the information available for that protein you like so much.
After all the JSON parsing, you can pass the information stored in that JSON to other modules. While that was the initial idea behind this, there is no reason you cannot use the other modules on their own.
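For flavor, here is a minimal sketch of what the UniProt lookup might look like. The endpoint is UniProt's public REST API; the function name and the fields pulled out are illustrative, not the module's actual interface.

```python
import requests

def fetch_uniprot_entry(accession):
    """Fetch a UniProtKB entry as JSON from the public REST API."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

entry = fetch_uniprot_entry("P69905")  # example: human hemoglobin subunit alpha
sequence = entry["sequence"]["value"]  # amino acid sequence
name = entry["proteinDescription"]["recommendedName"]["fullName"]["value"]
```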
- In addition to UniProt, it would be nice to get some sort of an interaction network using STRING; they have a pretty powerful API and I think this module can benefit from that (a rough sketch of such a call follows this list).
- Having the option to collect all the homologs as well as the protein itself. That is actually possible now with mmseqs2 (see the Sequence module below), just with extra steps, and you need to know about MSAs.
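The STRING idea is not implemented yet; this is just a minimal sketch against STRING's public API to show what the call might look like. The identifier and species are placeholders.

```python
import requests

def fetch_string_network(identifier, species=9606):
    """Query the STRING API for the interaction network around one protein.

    `identifier` can be a gene name or STRING ID; `species` is an NCBI taxon
    (9606 = human). Both values here are just examples.
    """
    url = "https://string-db.org/api/json/network"
    params = {"identifiers": identifier, "species": species}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# edges = fetch_string_network("HBA1")
# each edge entry lists the two interactors and a combined confidence score
```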
This does some basic protein sequence manipulation and calculates embeddings and MSAs. Currently I have one model for embeddings, but I am going to add a couple more. This also needs support for multiple sequences. Using the MSA feature you can probably get the homologs of your protein and run this iteratively.
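Roughly, computing a sequence embedding looks like the sketch below. I am assuming the fair-esm package and ESM-2 here, which may not be the exact model this module uses; the sequence is a toy example.

```python
import torch
import esm  # fair-esm package; the module's actual embedding model may differ

# load a pretrained ESM-2 model and its batch converter
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# toy input; in the framework this would come from the Protein module
data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33]                   # (1, L+2, 1280)
embedding = per_residue[0, 1:len(data[0][1]) + 1].mean(0)  # mean-pool residues
```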
There might be some value in using generative models here, and if possible maybe in applying RAG methods.
Similar to the Sequence module: basic processing and I/O using already existing packages; there is nothing new there. There is an MSA-like method for aligning structures, and some simple calculations such as embeddings using ESM3 and structure prediction using AF3. There is a skeleton class that will subclass Structure and hopefully represent a protein complex, again with features like structure prediction and other basic properties.
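A minimal sketch of how that subclassing might be shaped; the class and method names are placeholders, not the repo's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Structure:
    """Illustrative single-structure container; names are placeholders."""
    uniprot_id: str
    coordinates: list  # e.g. per-atom coordinates loaded via an existing parser

    def embed(self):
        """Would call a structure-aware model (e.g. ESM3) on this structure."""
        raise NotImplementedError

@dataclass
class Complex(Structure):
    """Skeleton for a protein complex built from several Structure objects."""
    chains: list = field(default_factory=list)

    def predict(self):
        """Would hand all chains to a structure predictor (e.g. AF3)."""
        raise NotImplementedError
```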
Again, methods for generative models that could take sequence and structure and generate something new with a new function, or the ability to call different models for prediction tasks, would be great. Better structure comparison methods, or integration of structure with the other modules, especially Literature and Knowledgebase, would also help.
Currently, this is a simple container wrapper around RFdiffusion and Pocket2Mol. It also has support for searching large libraries using basic fingerprints (generated by RDKit). It will also have a "verify" section where we re-predict the structure to independently re-create the design and increase our chances of finding an actual positive hit.
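As an illustration of the fingerprint search, here is a minimal sketch using RDKit Morgan fingerprints and Tanimoto similarity. The query, library, and similarity cutoff are placeholders; the module's actual fingerprint settings may differ.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

# hypothetical query and tiny library (SMILES strings)
query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
library = ["c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCO"]

query_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)

hits = []
for smi in library:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip unparsable entries
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(query_fp, fp)
    if sim >= 0.3:  # cutoff is an arbitrary example value
        hits.append((smi, sim))

hits.sort(key=lambda pair: pair[1], reverse=True)
```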
More is more, so more methods and comparison of the outcomes of those methods, etc. Prediction of molecular properties using other models, not just RDKit, would also be helpful.