AI/LLM Generated gene alteration and expression based subtyping for each tumor type #114

Open
inodb opened this issue Mar 20, 2024 · 8 comments

Comments

@inodb
Member

inodb commented Mar 20, 2024

Background:

  • Cancer Classification: Cancer manifests in various forms across different tissues and organs of the body. The classification of cancer plays a pivotal role in understanding its behavior, prognosis, and treatment strategies. Over the years, advancements in medical research and technology have led to a deeper understanding of the molecular and cellular mechanisms underlying cancer development, thereby refining the classification systems used by oncologists and researchers. At its core, cancer classification categorizes malignancies based on a multitude of factors, including their tissue of origin, histological characteristics, genetic alterations, and clinical behavior.
  • OncoTree: OncoTree is a dynamic and flexible community-driven cancer classification platform encompassing rare and common cancers that provides clinically relevant and appropriately granular cancer classification for clinical decision support systems and oncology research.
  • cBioPortal: cBioPortal is an open-source platform for cancer genomics data analysis and visualization. It provides a centralized resource for exploring and analyzing large-scale cancer genomic data sets, including genomic alterations, gene expression, and clinical information. The platform integrates data from multiple sources, including The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), and makes it available through a web interface for researchers, clinicians, and the general public. All samples in cBioPortal are assigned a particular cancer type based on OncoTree.
  • The Challenge: In cBioPortal there are many pages where it would be useful to list a set of default genes when we know which cancer type the user is looking at. For example, when exploring a breast cancer dataset, it probably makes sense to look at BRCA1, BRCA2 and EGFR alterations. Similarly, for Glioblastoma you'll want to look at IDH1 and IDH2. We can use an LLM (or another method) to generate these recommended genes for each OncoTree code, e.g. by constructing a prompt like "Which genes are relevant for subtype x?"

Goal:

  • Generate a list of recommended default genes for each OncoTree code that are often used for molecular classification of that subtype

Approach:

  • Try different prompts on any LLM of choice and script a way to do this semi-automatically
    An example prompt:
    Which genes and pathways are relevant for classifying Breast Cancer? Could you give your answer in a JSON structure like:
    {
        "Breast Cancer": {
            "Mutation-Based": ["Gene1", "Gene2", "Gene3"],
            "Expression-Based": ["GeneX", "GeneY", "GeneZ"],
            "Pathways": ["PathwayA", "PathwayB"]
        }
    }
    
  • The LLM of choice can be vanilla ChatGPT, Gemini, something you train yourself, etc.
  • Start with just the main OncoTree Types, e.g. "Breast Cancer", "Lung Cancer", etc
  • Explore ways to validate the proposed genes. One way would be to leverage the cBioPortal API to see if samples with this OncoTree code have any alterations in those genes
  • For the 350h project we can try to do the same for the more detailed subtypes, such as 'Breast Lobular Carcinoma In Situ'.
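
The approach above could be scripted roughly as follows. This is only a minimal sketch, not a proposed implementation: the function names (`build_prompt`, `parse_gene_response`, `altered_fraction`, `filter_by_frequency`) and the 1% default threshold are hypothetical, the LLM call itself is left out, and the per-sample alteration data is assumed to have already been fetched from the cBioPortal API by some other code (no real API calls are shown here).

```python
import json

# Category names taken from the example prompt in this issue.
CATEGORIES = ("Mutation-Based", "Expression-Based", "Pathways")


def build_prompt(cancer_type: str) -> str:
    """Build the example prompt for one OncoTree cancer type (hypothetical helper)."""
    skeleton = {cancer_type: {category: ["..."] for category in CATEGORIES}}
    return (
        f"Which genes and pathways are relevant for classifying {cancer_type}? "
        "Could you give your answer in a JSON structure like:\n"
        + json.dumps(skeleton, indent=4)
    )


def parse_gene_response(reply: str, cancer_type: str) -> dict:
    """Parse the JSON object embedded in an LLM reply.

    Takes the span from the first '{' to the last '}' so that chatty text
    around the JSON is ignored. Missing categories become empty lists.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])
    entry = data.get(cancer_type, {})
    return {category: list(entry.get(category, [])) for category in CATEGORIES}


def altered_fraction(gene: str, altered_genes_by_sample: dict) -> float:
    """Fraction of samples in which `gene` is altered.

    `altered_genes_by_sample` maps sample id -> set of altered gene symbols.
    Assumption: this mapping was built beforehand from cBioPortal data.
    """
    if not altered_genes_by_sample:
        return 0.0
    hits = sum(1 for genes in altered_genes_by_sample.values() if gene in genes)
    return hits / len(altered_genes_by_sample)


def filter_by_frequency(genes, altered_genes_by_sample, min_fraction=0.01):
    """Keep only proposed genes altered in at least `min_fraction` of samples."""
    return [
        gene for gene in genes
        if altered_fraction(gene, altered_genes_by_sample) >= min_fraction
    ]
```

For instance, `filter_by_frequency(parsed["Mutation-Based"], samples)` would drop genes the LLM proposed that are never (or almost never) altered in samples with that OncoTree code. The threshold is an arbitrary illustration and would need tuning per cancer type, and this kind of check only applies to the mutation-based list, not to expression-based genes or pathways.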

Needed skills:
Prompt Engineering, Python or similar scripting language

Possible mentors:
@inodb

@SelahattinAksoy

This comment was marked as resolved.

@Steveolas

Super interesting. I don't know much about genetics, so I may be off here, but I think it may be worth your while to check out the Mamba model. Although it is a smaller model, it can be fine-tuned, and its architecture differs from the transformers used by most other LLMs in a way that makes it better suited to large-context problems. Genomic problems, from my understanding, can generally have large contexts.

Ilan

@sohamchatterjee50

Hey @inodb

I am currently pursuing my Master's in Artificial Intelligence at the University of Amsterdam. Prior to this, I worked at Amazon as an Applied Scientist, where I was responsible for fine-tuning models for reranking purposes. In both my professional and research activities, I have delved into prompt engineering for LLMs. Through the research ongoing at UvA, I am exposed to a lot of cutting-edge research in AI for Science. I would love to contribute to an area that involves both AI and biology, especially bioinformatics.

Mail: [email protected]
LinkedIn: https://www.linkedin.com/in/soham-chatterjee-3410abb8/
It would be great if we could connect on a call sometime.

@SumitdevelopAI

Hello @inodb

My name is Sumit Sharma. I am currently pursuing my B.Tech in AI at Parul University. Could you please explain how to apply for this project?

@wuyuqing0327

Hi @inodb

My name is Yuqing Wu, and I go by Chelsea. I'm currently a master's student in Data Science at the University of Chicago. Before this program, I worked as a machine learning engineer for over 5 years. I currently have a part-time job as a Research Assistant at the Institute of Population and Precision and Health, where I apply machine learning algorithms to identify the impact of the microbiome and Bacteroides on blood pressure, with the goal of preventing disease.

I'm really interested in this project. I can leverage LLMs to process text-based information and identify genes that are relevant for the molecular classification of cancer subtypes.

Linkedin: https://www.linkedin.com/in/chelsea-uchi0327
E-mail: [email protected]/[email protected]

@RainieFu

Hi @inodb,

I'm Rainie, a third-year undergraduate student at the University of British Columbia, majoring in Computer Science and Statistics. Currently, I am doing a full-time internship as a Bioinformatics Research Assistant at Vancouver Prostate Cancer, where my focus lies in genomic and statistical analysis within the realm of Prostate Cancer. Here, I'm deeply engaged in identifying potential biomarkers to introduce new treatment arms in clinical trials, leveraging a diverse set of technologies encompassing both biological methodologies and data-driven analytics.

My academic journey has equipped me with a solid foundation in data mining and machine learning, as well as proficiency in scripting languages like Python and R, enabling me to grasp intricate machine learning algorithms and well-known libraries with ease. I am genuinely passionate about contributing my skills and knowledge to projects that make a tangible impact. This project is the perfect intersection of my academic interests and practical skills. I'm eager to join forces, brainstorm solutions, and ultimately contribute to advancements in cancer research.

Looking forward to the possibility of working together on this exciting endeavor, and I am happy to discuss my application further. Feel free to connect with me via the following:

Linkedin: https://www.linkedin.com/in/rainie-fu/
Email: [email protected]

@inodb
Member Author

inodb commented Mar 28, 2024

Hi @SelahattinAksoy @Steveolas @sohamchatterjee50 @SumitdevelopAI @wuyuqing0327 @RainieFu!

Thanks so much for your interest! Unfortunately, I'm not able to meet with everyone, but want to encourage you all to try and submit a proposal for this project if you're interested. Make sure to submit it thru the https://summerofcode.withgoogle.com/ website before 4/2!

If you're able to share a proposal in a Google Doc before then as well, I'll do my best to provide some feedback. You can send it via a DM on https://slack.cbioportal.org

Thanks so much all!

@manheraa

manheraa commented Mar 28, 2024

Hi @inodb, I am Sachetan Heralagi, a student currently pursuing my undergraduate degree at KLE Institute of Technology.
In my academic journey, I have had the opportunity to explore various neural network architectures, from simple ANNs to more complex models like VGG19. One of the projects I take pride in is leading a team to develop an automated irrigation system using ANN technology for weather prediction and integration with Internet of Things (IoT) devices. Our project even made it to the finals of IEEE YESIST12, which was a significant achievement for us.
I have also recently worked on ovarian carcinoma subtype classification, in which we used transfer learning (with VGG19).

I am particularly drawn to this project because I have worked with natural language processing and deep learning, and I am enthusiastic about the prospect of collaborating with like-minded individuals. I am also proficient in different ML and DL libraries such as TensorFlow, Keras, and PyTorch.

LinkedIn: www.linkedin.com/in/sachetan-heralagi
Mail: [email protected]
