Update README.md
jjjk123 authored Sep 1, 2023
1 parent b64e408 commit d5ddff1
Showing 1 changed file with 69 additions and 53 deletions.
122 changes: 69 additions & 53 deletions README.md
@@ -4,33 +4,19 @@

![ISOCOMP](https://i.ibb.co/vHLhrZq/Isocomp-logo1.png)

## Contributors
1. Yutong Qiu (Carnegie Mellon)
2. Chia Sin Liew (University of Nebraska-Lincoln)
3. Chase Mateusiak (Washington University)
4. Rupesh Kesharwani (Baylor College of Medicine)
5. Bida Gu (University of Southern California)
6. Muhammad Sohail Raza (Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation)
7. Evan Biederstedt (HMS)
8. Umran Yaman (UK Dementia Research Institute, University College London)
9. Abdullah Al Nahid (Shahjalal University of Science and Technology)
10. Trinh Tat (Houston Methodist Research Institute)
11. Sejal Modha (Theolytics Limited)
12. Jędrzej Kubica (University of Warsaw)

## GitHub Codespace for Development

To use Codespaces for development purposes, do the following:

1. Fork the repo.
2. Switch to the 'develop' branch.
   - **NOTE**: if you plan to code/add a feature, create a branch from the 'develop' branch, switch to it, and then continue with the steps below.
3. Click the green 'code' button. **But**, rather than copying the HTTPS or SSH link, click the tab that says "Codespace".
4. Click the button that says "create codespace on develop". Go make some tea -- it takes ~5 minutes or so to set up the environment. Once it is set up, you will have a fully functioning VS Code environment with all the dependencies installed. Start running the tests, set some breakpoints, take a look around!

## Detailed project overview
https://github.com/collaborativebioinformatics/isocomp/blob/main/FinalPresentation_BCM_Hackathon_12Oct2022.pdf
## Contributors:
- Yutong Qiu (Carnegie Mellon)
- Chia Sin Liew (University of Nebraska-Lincoln)
- Chase Mateusiak (Washington University)
- Rupesh Kesharwani (Baylor College of Medicine)
- Bida Gu (University of Southern California)
- Muhammad Sohail Raza (Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation)
- Evan Biederstedt (HMS)
- Umran Yaman (UK Dementia Research Institute, University College London)
- Abdullah Al Nahid (Shahjalal University of Science and Technology)
- Trinh Tat (Houston Methodist Research Institute)
- Sejal Modha (Theolytics Limited)
- Jędrzej Kubica (University of Warsaw)

## Introduction
NGS-targeted sequencing and WES have become routine for diagnosing Mendelian disease (Xue et al., 2015). Family sequencing (or "trio sequencing") involves sequencing a patient and parents (trio) or other relatives. This improves the diagnostic potential via the interpretation of germline mutations and enables the detection of de novo mutations which underlie most Mendelian disorders.
@@ -47,10 +33,53 @@ This project aims to expand the applicability of long-read RNAseq for investigat

-And what about gene fusions? We detect these in the stupidest possible way with short-read sequencing, and we think they're cancer-specific. What about the germline?

## Aim

The aim of this project is to algorithmically characterize the "unique" (differing) isoforms between any number of samples using high-quality assembled isoforms.

## Workflow
![](docs/images/isocomp_workflow.png)

## Running the pipeline

### Installation

`pip install isocomp==0.3.0`

For usage guidelines, run:

`isocomp --help`

### Step 1. Create windows

`isocomp create_windows -i sample1.gtf sample2.gtf sample3.gtf -f transcript -o clustered_file.gtf`

### Step 2. Find unique isoforms across multiple samples

`isocomp find_unique_isoforms -a clustered_file.gtf -f fasta_map.csv`

File fasta_map.csv:

```
source,fasta
NA24385.filtered,BCM-data-HG002-All2Samples-hg38-Results/NA24385_HG002/MMSQANTI3Filter/NA24385.filtered.fasta
NA26105.filtered,BCM-data-HG002-All2Samples-hg38-Results/NA26105_GM26105/MMSQANTI3Filter/NA26105.filtered.fasta
```
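
The two steps above can also be prepared from Python. The following is a minimal sketch, not part of isocomp itself: `write_fasta_map` and `build_commands` are hypothetical helper names, and the only things taken from this README are the `source,fasta` header and the command-line flags shown in Steps 1 and 2.

```python
import csv
from pathlib import Path


def write_fasta_map(samples, out_path):
    """Write a two-column CSV (source,fasta) mapping sample labels to FASTA paths.

    `samples` is a dict of {source_label: fasta_path}; this helper is
    illustrative and is not provided by isocomp.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "fasta"])  # header as shown in the README
        for source, fasta in samples.items():
            writer.writerow([source, fasta])
    return Path(out_path)


def build_commands(gtfs, clustered_gtf, fasta_map):
    """Return argv lists mirroring the Step 1 and Step 2 commands in the README."""
    create_windows = [
        "isocomp", "create_windows",
        "-i", *gtfs,
        "-f", "transcript",
        "-o", clustered_gtf,
    ]
    find_unique = [
        "isocomp", "find_unique_isoforms",
        "-a", clustered_gtf,
        "-f", fasta_map,
    ]
    return create_windows, find_unique
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once isocomp is installed; building the commands as lists avoids shell-quoting problems with paths that contain spaces.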

### Example output

For each isoform that is unique to at least one sample, we provide information about the read and the similarity between that isoform and the most similar isoform within the same window.

The last column describes the normalized edit distance and the CIGAR string.

## Goals
```
win_chr win_start win_end total_isoform isoform_name sample_from sample_compared_to mapped_start isoform_sequence selected_alignments
NC_060925.1 255178 288416 4 PB.6.2 HG004 HG002 255173 GGATTATCCGGAGCCAAGGTCCGCTCGGGTGAGTGCCCTCCGCTTTTT 0.02_HG002_PB.6.2_3=6I1=3I1286=11I
NC_060925.1 255178 288416 4 PB.6.2 HG004 HG005 255173 GGATTATCCGGAGCCAAGGTCCGCTCGGGTGAGTGCCCTCCGCTTTTTG 0.02_HG002_PB.6.2_3=6I1=3I1286=11
```
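
To work with the last column programmatically, one option is to split it into its parts. This sketch assumes the field layout seen in the example rows, `<distance>_<sample>_<isoform>_<CIGAR>`, and that sample names contain no underscores; neither assumption is documented by isocomp, and `parse_selected_alignment` and `op_totals` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Tuple

# One CIGAR operation: a run length followed by a standard SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")


class Alignment(NamedTuple):
    distance: float               # normalized edit distance as reported
    sample: str                   # sample the isoform was compared against
    isoform: str                  # name of the most similar isoform
    cigar: List[Tuple[int, str]]  # (length, operation) pairs


def parse_selected_alignment(field: str) -> Alignment:
    """Split a selected_alignments value into distance, sample, isoform, CIGAR."""
    distance, rest = field.split("_", 1)
    rest, cigar = rest.rsplit("_", 1)
    sample, isoform = rest.split("_", 1)
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return Alignment(float(distance), sample, isoform, ops)


def op_totals(ops):
    """Sum run lengths per CIGAR operation, e.g. total matched or inserted bases."""
    totals = {}
    for length, op in ops:
        totals[op] = totals.get(op, 0) + length
    return totals
```

A value whose CIGAR string ends without an operation code is rejected with `ValueError` rather than silently truncated.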

The goal of this project is to algorithmically characterize the "unique" (differing) isoforms between any number of samples using high-quality assembled isoforms.
## Detailed project overview
https://github.com/collaborativebioinformatics/isocomp/blob/main/FinalPresentation_BCM_Hackathon_12Oct2022.pdf

## Methods

@@ -96,30 +125,6 @@ Isoseq3 (v3.2.2) generated HQ (Full-length high quality) transcripts [Table 1] w

Differences between isoforms are categorized into [TODO] SNPs (<5bp), large-scale variants (>5bp), gene fusion, different exon usage, and completely novel sequences. These categories build upon those used by SQANTI to annotate disparities between sample isoforms and the reference transcriptome. Note that we extend the categories provided by SQANTI by adding SNPs and large-scale variants.
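
The two size-based categories can be expressed as a small function. This is only a sketch of the thresholds stated above, not isocomp's implementation (the text marks this step as TODO); `size_category` and its labels are hypothetical, and because the text does not say how a difference of exactly 5 bp is classified, that case is returned as `"unspecified"`. Gene fusions, exon usage, and novel sequences cannot be decided from length alone and are not covered.

```python
def size_category(length_bp: int) -> str:
    """Classify a sequence difference by its length in base pairs.

    Thresholds follow the README text: <5 bp and >5 bp. Exactly 5 bp is
    not defined there, so it is reported as "unspecified" (an assumption).
    """
    if length_bp < 1:
        raise ValueError("length_bp must be a positive number of base pairs")
    if length_bp < 5:
        return "SNP"
    if length_bp > 5:
        return "large-scale variant"
    return "unspecified"
```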

## Description

## Flowchart
![](images/workflow.png)
### To extract sets of unique isoforms
![](images/workflow_part1.png)
### To annotate the unique isoforms
![](images/workflow_part2.png)

## Example Output

For each isoform that is unique to at least one sample, we provide information about the read and the similarity between that isoform and the most similar isoform within the same window.

The last column describes the normalized edit distance and the CIGAR string.

```
win_chr win_start win_end total_isoform isoform_name sample_from sample_compared_to mapped_start isoform_sequence selected_alignments
NC_060925.1 255178 288416 4 PB.6.2 HG004 HG002 255173 GGATTATCCGGAGCCAAGGTCCGCTCGGGTGAGTGCCCTCCGCTTTTT 0.02_HG002_PB.6.2_3=6I1=3I1286=11I
NC_060925.1 255178 288416 4 PB.6.2 HG004 HG005 255173 GGATTATCCGGAGCCAAGGTCCGCTCGGGTGAGTGCCCTCCGCTTTTTG 0.02_HG002_PB.6.2_3=6I1=3I1286=11
```

### Deployment

Eventually, `pip install isocomp`. But not yet.

## DEPENDENCIES

@@ -187,6 +192,17 @@ pip install poetry
# and continue with the development install below
```
## GitHub Codespace for Development

To use Codespaces for development purposes, do the following:

1. Fork the repo.
2. Switch to the 'develop' branch.
   - **NOTE**: if you plan to code/add a feature, create a branch from the 'develop' branch, switch to it, and then continue with the steps below.
3. Click the green 'code' button. **But**, rather than copying the HTTPS or SSH link, click the tab that says "Codespace".
4. Click the button that says "create codespace on develop". Go make some tea -- it takes ~5 minutes or so to set up the environment. Once it is set up, you will have a fully functioning VS Code environment with all the dependencies installed. Start running the tests, set some breakpoints, take a look around!

### Development

Install [poetry](https://python-poetry.org/) and consider setting [the configuration
