Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import commit history from slam_3UIs #46

Merged
merged 63 commits into from
Nov 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
63 commits
Select commit Hold shift + click to select a range
c90f64d
initial commit before ian started messing with it
IanSudbery Aug 29, 2023
9e9ab36
first commit
jjriley1 Jan 23, 2024
b677c81
following receipt of seq data
jjriley1 Jan 23, 2024
a0f720b
Fixed errors re salmon index and quant
jjriley1 Jan 23, 2024
526e4a3
Added script to assign reads to spliced and unspliced isoform, and ge…
jjriley1 Feb 7, 2024
5e2ca09
Updated comments.
jjriley1 Feb 7, 2024
0b9338f
Changed RI assignment to only look at JC, and only focus on 3' junction
jjriley1 Feb 7, 2024
1d57d8a
Updated 3UI info call to use bed6 not default bed
jjriley1 Feb 9, 2024
25d2624
updated retained read assignment to use exon counts. add chr and stra…
jjriley1 Feb 9, 2024
d99cb74
Added dapars scripts and config for gene level APA across diff
Mar 12, 2024
86120d9
Added scripts and pipeline method to get pval for difference between …
Mar 12, 2024
31f8ace
Added script to get pval for differences in each isoform over differe…
Mar 12, 2024
677369e
Added script to get pval for gene level stability differences over di…
Mar 12, 2024
a711d90
Script to assign reads at gene level
Mar 12, 2024
5732b39
Script to get pvals for differences in isoform ratios over time
Mar 12, 2024
69508bb
Pipeline updated to do gene level analysis, and get pvals for between…
Mar 12, 2024
18eb393
updated pipeline and config
jjriley1 Jun 28, 2024
118400e
updated following code review with IS
jjriley1 Jun 28, 2024
8d0432e
updated following code review with IS pt2
jjriley1 Jun 28, 2024
9778367
script to do conversions for all introns, not just 3utr, and output
jjriley1 Jun 28, 2024
77cab20
script to summarize pct_converted instead of doing in each script
jjriley1 Jun 28, 2024
bd50b1b
script to do decay curves and pvals for all introns not just 3utr
jjriley1 Jun 28, 2024
59479bf
Move text from README.txt > README.md
ns-rse Oct 31, 2024
c53e078
Merge pull request #2 from jjriley1/ns-rse/demo
ns-rse Oct 31, 2024
0d00ec9
Test to see how branch switching works
JackJRiley Oct 31, 2024
07ddc16
Removed unnecessary pipeline functions and corresponding scripts
JackJRiley Oct 31, 2024
36bb006
Added script description
JackJRiley Oct 31, 2024
9491d0f
Removed non-accessed imports
JackJRiley Oct 31, 2024
fb93e86
chore: info added to README.md; tidying Python code
ns-rse Nov 1, 2024
304ab81
Merge pull request #3 from jjriley1/ns-rse/tidying-20241101
ns-rse Nov 1, 2024
44d4512
Changed hardcoded input files to test set
JackJRiley Nov 11, 2024
93794bb
Merge branch 'jjriley1/1-tidy-up' of github.com:jjriley1/slam_3UIs in…
JackJRiley Nov 11, 2024
ebebfce
removed blank lines
JackJRiley Nov 11, 2024
cf24f4d
refactor: Switching to f-strings and removing CamelCase
ns-rse Nov 14, 2024
888cd9c
Initial commit
ns-rse Oct 16, 2024
73c6ad7
ci: Workflow to add issues to project automatically
ns-rse Oct 16, 2024
ae4769e
chore: Ignore PyChrm,VSCode,Emacs,OSX files
ns-rse Oct 16, 2024
bd71766
chore: Adds CODE_OF_CONDUCT.md
ns-rse Oct 16, 2024
d99a13f
chore: Adds MIT License and CITATION.cff
ns-rse Oct 17, 2024
1818566
doc: First draft of README
ns-rse Oct 17, 2024
0419d8c
Fix error in badges of README.md
ns-rse Oct 17, 2024
f58587f
Adding missing line to header of README.md
ns-rse Oct 17, 2024
0e51ed2
packaging: Adds initial pyproject.toml
ns-rse Oct 17, 2024
714e7cb
chore: Adding bug report and feature request issue templates
ns-rse Oct 17, 2024
8046c97
bug(issuetemplates): Fixing issues in issue templates
ns-rse Oct 17, 2024
51fa6a2
bug(issuetemplate): Another attempt at fixing Bug Report template
ns-rse Oct 17, 2024
a5f317b
docs(readthedocs): Adds configuration file for readthedocs
ns-rse Oct 17, 2024
eda6cd5
package: Initial dependencies added to pyproject.toml
ns-rse Oct 18, 2024
6081459
ci(pre-commit): Adding pre-commit, markdownlint-cli2 & pylint conf
ns-rse Oct 23, 2024
b5c2be5
style: Linting existing files with pre-commit hooks
ns-rse Oct 23, 2024
9a918bc
docs: Setting up Sphinx documentation
ns-rse Oct 23, 2024
1a9b6eb
ci: Adding multiversionopts
ns-rse Oct 25, 2024
bab8ef9
ci/docs: Move sphinx conf.py to work with CI; tweak sphinx workflow
ns-rse Oct 25, 2024
8104ac8
ci: remove run from step
ns-rse Oct 25, 2024
2b643c3
ci: Move docs/conf.py > docs/source/conf.py
ns-rse Oct 25, 2024
f942c76
ci: add explicit config file option for sphinx-multiversion
ns-rse Oct 25, 2024
706a870
ci: correct path to conf.py in .readthedocs.yaml
ns-rse Oct 25, 2024
dd47d41
ci: restrict tmate to only run if there is a failure
ns-rse Oct 25, 2024
1d04575
admin: Links to contributing from README.md
ns-rse Oct 25, 2024
3224f4d
build: minimal set of dependencies
ns-rse Nov 1, 2024
bc1a639
ci: Add workflow for testing action
ns-rse Nov 1, 2024
011822a
feature: Initial import of basic pipeline from jjriley1/slam_3UIs
ns-rse Nov 15, 2024
f2817a6
Merge branch 'main' into ns-rse/31-import-slam_3UIs
ns-rse Nov 20, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
177 changes: 177 additions & 0 deletions isoslam/pipeline_slam_3UIs.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
"""
This pipeline takes in mapped RNA seq reads in `.bam`, sorts and
annotates them, then passes them detects and quantifies the number
of 4sU-linked nucleotide conversions (T>C or A>G), exclusing SNPs.
"""

import os
import sys

import ruffus
from cgatcore import pipeline as P

# Read in pipeline.yml
PARAMS = P.get_parameters(
[
"%s/pipeline.yml" % os.path.splitext(__file__)[0],
"../pipeline.yml",
"pipeline.yml",
]
)


@ruffus.follows(ruffus.mkdir("name_sorted_bams"))
@ruffus.transform(
"bam/*.bam",
ruffus.regex("bam/(.+).bam"),
r"name_sorted_bams/\1.sorted.bam",
)
def name_sort_bams(infile, outfile):
"""
Takes `.bam` file and sorts it in name order, therefore reads from the same pair appear sequentially
"""
statement = f"samtools sort -n {infile} -o {outfile}"
P.run(statement, job_memory="8G")


@ruffus.follows(ruffus.mkdir("read_assignments"))
@ruffus.transform(
name_sort_bams,
ruffus.regex("name_sorted_bams/(.+).sorted.bam"),
r"read_assignments/\1/\1.sorted.assigned.bam",
)
def feature_count_read_assignments(infile, outfile):
"""
Takes `.bam` file and adds 2 tags to it:
- XT tag tells us the transcript the reads map to
- XS tag tells us which strand the gene is transcribed from

NB The `-p` flag is required for SubRead v2.0.6, may need to make this flexible.
"""
annotation = PARAMS["transcript_gtf"]
outdir = os.path.dirname(outfile)
rename_outfile = outfile.replace(".assigned.bam", "")
statement = (
"featureCounts"
"-p "
f"-a {annotation} "
f"-o {outdir}/counts.txt "
"-T 2 "
"-R BAM"
f"{infile} &&"
"mv %(rename_outfile)s.bam.featureCounts.bam %(outfile)s"
)
P.run(statement, job_threads=2, job_memory="8G")


@ruffus.transform(
"input_bams.dir/*no4sU*.bam",
ruffus.regex("input_bams.dir/(.+)_no4sU_(.+).bam"),
r"snp_vcf/\1.vcf.gz",
)
def VCF_from_no4sU_control(infile, outfile):
"""
Takes the negative control `.bam` file (via the no4sU regex) and creates a SNP VCF file
using Varscan. It looks for differences between the `.bam` file and the genome `.fasta`
file to detect differences, where these are always present at the same position (above
a treshold, see Varscan docs) it is classed as a SNP and included in output VCF file.
"""
outfile_uncompressed = outfile.replace(".vcf.gz", ".vcf")
genome_fasta = PARAMS["genome_fasta"]
statement = (
f"samtools mpileup -B -A -f {genome_fasta} {infile} | "
f"varscan mpileup2snp --variants 1 --output-vcf 1 > {outfile_uncompressed} && "
# f"sort -k1,1 -k2,2n {outfile_uncompressed} -o {outfile_uncompressed} &&"
f"bcftools view {outfile_uncompressed} -Oz -o {outfile}"
)
P.run(statement, job_threads=12, job_memory="1G")


@ruffus.transform(VCF_from_no4sU_control, ruffus.suffix(".vcf.gz"), ".vcf.gz.tbi")
def index_VCF(infile, outfile):
"""
Indexes the VCF file so it can be read with Pysam, generates the .vcf.gz.tbi file
"""
statement = f"tabix -p vcf {infile}"
P.run(statement)


@ruffus.follows(index_VCF, ruffus.mkdir("all_introns_counts_and_info"))
@ruffus.transform(
feature_count_read_assignments,
ruffus.regex("read_assignments/(.+)/(.+)_(.+)_(.+)_(.+)_(.+).sorted.assigned.bam"),
ruffus.add_inputs(r"snp_vcf/\2.vcf.gz"),
r"all_introns_counts_and_info/\1.tsv",
)
def all_introns_counts_and_info(infiles, outfile):
"""
Takes in the sorted and feature assigned `.bam` file from previous steps and passes them to a
python script that uses pysam (python wrapper for htslib). This script iterates over the `.bam`
file pair-by-pair (representing the sequencing insert), determines whether the read-pair shows
evidence of intron splicing/retention and assigns these to specific events by referencing the
`.gtf` and `.bed` files, and XT tag from feature_count_read_assignments. Next, the script uses the
XS tag from featureCountsReadAssignment to assign each read in the pair as the forward or reverse
read, relative to the direction of transcription. Finally, it looks for T>C in FW read, A>G in RV
read, checks these are not present in the SNP VCF file, and outputs metadata on each read-pair
about it's event assignment, number of conversions, coverage etc.
"""
bamfile, vcffile = infiles
annotation = PARAMS["transcript_gtf"]
utron_bed = PARAMS["all_introns_bed6"]
script_path = os.path.dirname(os.path.abspath(__file__)) + "/pipeline_slam_3UIs/all_introns_counts_and_info.py"

statement = (
f"python {script_path} -b {bamfile} "
f"-g {annotation} "
f"-o {outfile} "
f"-vcf {vcffile} "
f"-bed {utron_bed} -v5"
)
P.run(statement, job_memory="16G")


@ruffus.follows(ruffus.mkdir("all_introns_counts_and_info/summarized"))
@ruffus.collate(
all_introns_counts_and_info,
ruffus.regex("(.+)/(.+)_.+(hr|4sU).+"),
r"\1/summarized/\2_summarized.tsv",
)
def summarize_all_introns_counts(infiles, outfile):
"""
R script to summarize the number of converted counts from each gene in each sample relative to
unconverted counts, therefore we calcualte the "percent_converted" for each gene in each sample.
"""
input_folder = os.path.dirname(infiles[0])
day_regex = os.path.basename(outfile).split("_")[0]
script_path = os.path.dirname(os.path.abspath(__file__)) + "/pipeline_slam_3UIs/summarize_counts.R"
statement = f"Rscript {script_path} {input_folder} {day_regex} {outfile}"

P.run(statement, job_memory="64G")


@ruffus.follows(
name_sort_bams,
feature_count_read_assignments,
VCF_from_no4sU_control,
index_VCF,
all_introns_counts_and_info,
summarize_all_introns_counts,
)
def full():
"""
Utility function to run the whole pipeline via:
python pipeline_slam_3UIs.py make full -v5
"""
pass


### MISC ###
def main(argv=None):
if argv is None:
argv = sys.argv
P.main(argv)


if __name__ == "__main__":
sys.exit(P.main(sys.argv))
############
Loading
Loading