
Python project for classification of normal vs tumoral samples using TCGA gene expression data for Scientific Programming Course 2021 (MSc Bioinformatics for Computational Genomics)


mariachiaragrieco/TCGAclassification


classification-project

Python project for the Scientific Programming 2021 course (MSc in Bioinformatics for Computational Genomics), held by Prof. Piro Rosario and Prof. Pinoli Pietro.

The notebook can be viewed on nbviewer.

Aim

This project analyzes TCGA GRCh38 breast cancer gene expression data, retrieved through the GenoSurf interface, in order to classify normal and tumor samples. The following machine learning classifiers are compared:

  • Logistic Regression
  • Support Vector Machines
  • Linear Discriminant Analysis
  • Random Forest
  • Decision Tree
  • K-Nearest Neighbors
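
A comparison of the six classifiers listed above could be sketched as follows. This is a minimal illustration, assuming scikit-learn and using a synthetic dataset in place of the TCGA expression matrix; the default hyperparameters, the scaling steps, and the accuracy metric are assumptions, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a (samples x genes) expression matrix; NOT TCGA data.
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with a scaler (an assumption).
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machines": make_pipeline(StandardScaler(), SVC()),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean 5-fold cross-validated accuracy for each classifier.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {score:.3f}")
```

On real data the normal and tumor classes are likely to be imbalanced, so a metric such as F1 or balanced accuracy may be more informative than plain accuracy.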

Outline

First, the data are downloaded from the GenoSurf website using a concurrent programming strategy, and data manipulation is done with pandas. For the classification task, feature selection is performed first; several algorithms are then trained and compared using multiple evaluation metrics. Hyperparameter tuning and k-fold cross-validation are also carried out.
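
The classification steps in this outline (feature selection, hyperparameter tuning, k-fold cross-validation, evaluation) could be combined as in the sketch below. It assumes scikit-learn and synthetic data; the choice of `SelectKBest` with an ANOVA F-test, the logistic regression model, and the parameter grid are illustrative assumptions rather than the choices made in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the expression matrix; NOT TCGA data.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=15,
    weights=[0.2, 0.8], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection lives inside the pipeline, so it is refit on each
# training fold and never sees the corresponding validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: number of selected features and regularization strength.
param_grid = {"select__k": [10, 50], "clf__C": [0.1, 1.0, 10.0]}

# Stratified k-fold keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping the selector inside the pipeline is a deliberate design choice: selecting features on the full dataset before cross-validation would leak information from the validation folds into training.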

Data

The GenoSurf interface is used to download the data, with the following options set:

  • project_name: ['tcga-brca']
  • assembly: ['grch38']
  • data_type: ['gene expression quantification']
  • is_healthy:
    • ['false'] for tumoral data
    • ['true'] for normal data
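
The options above can be written as a filter dictionary, and the concurrent download mentioned in the outline could follow a thread-pool pattern like the one sketched here. The way GenoSurf actually receives these filters is not shown in this README, so `fetch_sample`, `build_filters`, and `download_all` are hypothetical helpers, and `fetch_sample` is a placeholder that performs no network request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Filter options as listed above; how they are submitted to GenoSurf is not
# covered by this sketch.
BASE_FILTERS = {
    "project_name": ["tcga-brca"],
    "assembly": ["grch38"],
    "data_type": ["gene expression quantification"],
}


def build_filters(is_healthy):
    """Return the filter set for normal (True) or tumoral (False) samples."""
    return {**BASE_FILTERS, "is_healthy": ["true" if is_healthy else "false"]}


def fetch_sample(sample_id):
    """Hypothetical placeholder: a real version would download one sample."""
    return f"expression table for {sample_id}"


def download_all(sample_ids, fetch=fetch_sample, max_workers=8):
    """Run `fetch` on every id in a thread pool; return {sample_id: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, sid): sid for sid in sample_ids}
        # Collect results as each download finishes, in completion order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(build_filters(is_healthy=False))
    print(download_all(["sample_a", "sample_b"]))
```

Threads suit this kind of task because downloads are I/O-bound: while one request waits on the network, the others can proceed.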
