adding citations and updating changelog #51

Merged
merged 4 commits into from
Sep 1, 2023
32 changes: 32 additions & 0 deletions CHANGELOG
@@ -0,0 +1,32 @@
# 0.3.0

This is the result of the 2023 hackathon. Current features:

1. The [Coordinate submodule](src/isocomp/Coordinates/) creates windows based
on overlapping transcripts in the concatenated set of input sequences.
2. The [Compare submodule](src/isocomp/Compare/) outputs unique transcripts
in three steps: first, a transcript that is alone in its overlap bin is
unique; next, a transcript that does not share its exact start/end points
with another transcript in the bin is unique; finally, transcripts that do
share the same start/end are compared pair-wise.
3. The command-line tools are functional. On a 16-core machine on DNAnexus,
runtime is ~15 minutes, using less than 7 GB on 16 CPUs.
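
The windowing and the first two comparison steps described above can be sketched as follows. This is a minimal illustration, not the isocomp implementation: the `Transcript` record, the function names, the closed-interval overlap test, and the choice to ignore strand are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    # Hypothetical record; field names are illustrative, not isocomp's schema.
    sample: str
    tx_id: str
    chrom: str
    start: int
    end: int
    seq: str


def make_windows(transcripts):
    """Group transcripts into bins of transitively overlapping intervals.

    Assumes closed intervals and ignores strand.
    """
    windows = []
    current, current_chrom, current_end = [], None, None
    for tx in sorted(transcripts, key=lambda t: (t.chrom, t.start, t.end)):
        if current and tx.chrom == current_chrom and tx.start <= current_end:
            # Overlaps the running window: extend it.
            current.append(tx)
            current_end = max(current_end, tx.end)
        else:
            # No overlap (or new chromosome): close the window, start another.
            if current:
                windows.append(current)
            current, current_chrom, current_end = [tx], tx.chrom, tx.end
    if current:
        windows.append(current)
    return windows


def classify_window(window):
    """Split one window into unique transcripts and pairs needing comparison."""
    # Step 1: a transcript alone in its bin is unique.
    if len(window) == 1:
        return {"unique": list(window), "pairs_to_compare": []}
    # Step 2: group by exact start/end; singletons are unique.
    by_coords = defaultdict(list)
    for tx in window:
        by_coords[(tx.start, tx.end)].append(tx)
    unique, pairs = [], []
    for group in by_coords.values():
        if len(group) == 1:
            unique.extend(group)
        else:
            # Step 3 input: every pair sharing the same start/end.
            pairs.extend(
                (a, b) for i, a in enumerate(group) for b in group[i + 1:]
            )
    return {"unique": unique, "pairs_to_compare": pairs}
```

Each pair in `pairs_to_compare` would then go to a sequence comparison step, for example an edit-distance aligner such as edlib, which CITATIONS.md lists as a dependency.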

## Caveats

1. The output should be considered an intermediate result. It is unparsed and
not immediately useful to anyone. However, it does contain good information.

2. We are not conducting exon-level coordinate matching on the transcripts. We
are therefore doing sequence comparison on transcripts which are not actually
the same (e.g., transcripts from the same individual with different exon usage),
and we are not reporting the wealth of information that we could extract using
the interval data alone.

## Future directions

1. The Coordinate submodule should create an interval tree structure from the
input GTF files using exon coordinates. Exons should be labelled with the
transcript and gene IDs.
2. The interval tree can then be used to more finely compare intervals and
label different TSS/TTS, exon usage, intron retention, etc.
3. The output format(s) must be refined.
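
One possible shape for the exon-level interval tree proposed above is sketched here. It is an assumption-laden illustration rather than a planned design: the `Exon` record, `parse_gtf_exons`, and `IntervalNode` are hypothetical names, the tree is a simple centered interval tree for a single chromosome, and the attribute parsing handles only quoted `key "value";` pairs.

```python
import re
from typing import NamedTuple


class Exon(NamedTuple):
    # Hypothetical record labelling each exon with its gene and transcript.
    chrom: str
    start: int  # 1-based, inclusive, as in GTF
    end: int    # 1-based, inclusive
    gene_id: str
    transcript_id: str


# Matches quoted GTF attributes such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_exons(lines):
    """Yield an Exon for every 'exon' feature line in GTF text."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        yield Exon(fields[0], int(fields[3]), int(fields[4]),
                   attrs.get("gene_id", ""), attrs.get("transcript_id", ""))


class IntervalNode:
    """Centered interval tree over a non-empty list of exons (one chromosome)."""

    def __init__(self, exons):
        # Use the median endpoint as the center, so at least one exon spans it.
        points = sorted(p for e in exons for p in (e.start, e.end))
        self.center = points[len(points) // 2]
        self.here = [e for e in exons if e.start <= self.center <= e.end]
        left = [e for e in exons if e.end < self.center]
        right = [e for e in exons if e.start > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def overlapping(self, start, end):
        """Return exons overlapping the closed interval [start, end]."""
        hits = [e for e in self.here if e.start <= end and start <= e.end]
        if self.left is not None and start < self.center:
            hits.extend(self.left.overlapping(start, end))
        if self.right is not None and end > self.center:
            hits.extend(self.right.overlapping(start, end))
        return hits
```

Because every hit carries its `transcript_id` and `gene_id`, a query over one transcript's exons returns the overlapping exons of other transcripts, which is the information needed to label differences in exon usage or boundaries.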
17 changes: 17 additions & 0 deletions CITATIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# If you use this repo, please cite:

>Qiu, Y., Liew, C. S., Mateusiak, C., Kesharwani, R., Gu, B., Raza, M. S., Biederstedt, E., Yaman, U., Al Nahid, A., Tat, T., Modha, S., & Kubica, J. (2023). Isocomp. Carnegie Mellon, University of Nebraska-Lincoln, Washington University, Baylor College of Medicine, University of Southern California, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, HMS, UK Dementia Research Institute, University College London, Shahjalal University of Science and Technology, Houston Methodist Research Institute, Theolytics Limited, University of Warsaw. https://github.com/collaborativebioinformatics/isocomp

## Significant dependencies

### BioPython

> Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878

### edlib

> Martin Šošić, Mile Šikić; Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 2017 btw753. doi: 10.1093/bioinformatics/btw753

### PyRanges

> Endre Bakken Stovner, Pål Sætrom, PyRanges: efficient comparison of genomic intervals in Python, Bioinformatics, Volume 36, Issue 3, February 2020, Pages 918–919, https://doi.org/10.1093/bioinformatics/btz615