elengss/PacBio_HepB

PacBio_HepB

This is a repository of the scripts used to prepare the paper.
It aligns PacBio Iso-Seq reads from hepatitis B infected/transfected hepatocyte cell lines and performs differential gene/transcript expression analysis and chimera detection.
The directories are as follows:
Reference - stores the hepatitis B FASTA and GTF files. The human files are large and can be downloaded from the GENCODE website (further details are in the ReadMe in that directory)
Data - this is where the raw fastq files are stored
Align_and_Quantify - this aligns the reads to viral and human genomes and quantifies them
Differential - this performs differential expression and gene set enrichment analysis
Chimera - this extracts soft-clipped reads, remaps them and then remaps them to the human genome
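
Before running anything, it can help to confirm that a working copy has the layout described above. The following is a minimal sketch and is not part of the repository: the function name check_layout is hypothetical, and it assumes the five directories sit directly under the directory passed in.

```shell
# Hypothetical helper: report which of the expected directories are missing.
# Assumption: the directories listed in this ReadMe sit directly under "$root".
check_layout() {
    local root="${1:-.}"
    local missing=0
    local dir
    for dir in Reference Data Align_and_Quantify Differential Chimera; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $dir"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every directory was found
    [ "$missing" -eq 0 ]
}

# Usage (from the top level of the working copy):
# check_layout . && echo "layout looks complete"
```

The function prints one line per missing directory and returns a non-zero status if any are absent, so it can be chained in front of other commands.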

The shell scripts set a variable called "home" to define the directory paths used.
There is a small utility "Findhome.sh" which can find the value for that variable on the system running these scripts.
On most systems scripts are run by adding "./" in front of the script name, for example: ./Findhome.sh
Many Linux systems require the script be declared as executable using the "chmod" command, for example: chmod +x Findhome.sh
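
To illustrate how a path variable like "home" is typically used, here is a minimal sketch. It does not reproduce Findhome.sh; the function name find_home, the variables reference_dir and data_dir, and the choice to resolve the current directory are all assumptions made for illustration.

```shell
# Hypothetical sketch: resolve an absolute directory path to use as "home".
# Assumption: "home" should point at the top level of the working copy.
find_home() {
    # Change into the given directory (default: current) in a subshell
    # and print its absolute, symlink-resolved path
    (cd "${1:-.}" && pwd -P)
}

home="$(find_home .)"

# Build other paths relative to "home" (directory names as listed in this ReadMe)
reference_dir="$home/Reference"
data_dir="$home/Data"

echo "home=$home"
```

Deriving every other path from a single variable means that only one line needs to change when the scripts are moved to a different system.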


Packages used

  1. R Version 4.1.2 with Bioconductor (https://www.bioconductor.org/)
  2. Samtools (http://www.htslib.org/)
  3. pbmm2 (https://github.com/PacificBiosciences/pbmm2)

Samtools and pbmm2 are both downloadable as conda packages:

conda install -c bioconda samtools
conda install -c bioconda pbmm2

On some systems the conda environment has to be activated before running the shell scripts. On our system the command is:
conda activate XXXX
where "XXXX" is the name of the conda environment
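
After activating the environment, a quick check that the tools are visible on the PATH can save a failed run. This is a minimal sketch, not part of the repository; the function name require_tools is hypothetical.

```shell
# Hypothetical pre-flight check: confirm that command-line tools are on the PATH.
require_tools() {
    local status=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            # Report missing tools on stderr and remember the failure
            echo "not found: $tool" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (after "conda activate XXXX"):
# require_tools samtools pbmm2
```

The function returns a non-zero status if any tool is missing, so a script can stop early instead of failing partway through an alignment.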

More details on the Conda and Anaconda package management system can be found here: https://anaconda.org


Bioconductor packages:
Rsubread_2.8.0
edgeR_3.36.0
limma_3.50.0
GenomicAlignments_1.30.0
Rsamtools_2.10.0
Biostrings_2.62.0
XVector_0.34.0
SummarizedExperiment_1.24.0
Biobase_2.54.0
MatrixGenerics_1.6.0
matrixStats_0.61.0
GenomicRanges_1.46.0
GenomeInfoDb_1.30.0
IRanges_2.28.0
S4Vectors_0.32.0
BiocGenerics_0.40.0
clusterProfiler_4.2.2

They can be installed and loaded with these commands in R:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
bio_pkgs <- c("Rsubread","edgeR","limma","GenomicAlignments","Rsamtools","Biostrings","XVector","SummarizedExperiment","Biobase","MatrixGenerics","matrixStats","GenomicRanges","GenomeInfoDb","IRanges","S4Vectors","BiocGenerics","clusterProfiler")
BiocManager::install(bio_pkgs)
invisible(lapply(bio_pkgs, library, character.only = TRUE))

Cut and paste the above commands into R or RStudio to install and load all the required packages.

There are currently (December 2022) some problems running edgeR on M1 and M2 Macs. edgeR requires Fortran libraries to run, which means a new installation of the gcc package that provides them. These libraries are in addition to the compilers installed with Xcode; details are here: https://mac.r-project.org/tools/
