Releases: iquasere/MOSCA
Merging of paired reads when no assembly is performed
Previously, MOSCA called genes directly from the preprocessed reads.
Now, it merges paired-end reads first, and then calls the genes on the merged reads.
When calling genes, MOSCA still treats the data as reads (-complete=0), not complete genomes (-complete=1).
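As a rough illustration of the flag mentioned above, the sketch below builds a FragGeneScan command line for reads versus assembled sequences. It assumes the run_FragGeneScan.pl wrapper is the entry point; the function name, file names and the choice of training models are assumptions made for the example, not taken from MOSCA's code.

```python
def fraggenescan_command(sequences, out_prefix, assembled):
    """Build a gene calling command (hypothetical helper, for illustration only)."""
    # Reads (including merged paired-end reads) are not complete genomes
    complete = 1 if assembled else 0
    # Assumption: training model chosen by input type; MOSCA may pick differently
    train = "complete" if assembled else "illumina_10"
    return [
        "run_FragGeneScan.pl",
        f"-genome={sequences}",
        f"-out={out_prefix}",
        f"-complete={complete}",
        f"-train={train}",
    ]


# Merged reads, no assembly performed
print(" ".join(fraggenescan_command("merged_reads.fasta", "fgs", assembled=False)))
```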
Update on sortmerna functions
SortMeRNA databases have been updated, and are now provided as a tar file containing multiple database files. Each of these databases can be used separately for a specific type of search. MOSCA now provides the sortmerna_database parameter, which sets which database will be used:
- if fast, MOSCA will use the smr_v4.3_fast_db.fasta database.
- if default, MOSCA will use the smr_v4.3_default_db.fasta database.
- if sensitive, MOSCA will use the smr_v4.3_sensitive_db.fasta database.
- if sensitive_with_rfam, MOSCA will use the smr_v4.3_sensitive_db_rfam_seeds.fasta database.
Only one database file can be used at a time.
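The mapping between the sortmerna_database values and the database files can be pictured as a simple lookup. This is a minimal sketch: the function name and the databases_dir argument are hypothetical, and only the option names and file names come from the notes above.

```python
SORTMERNA_DATABASES = {
    "fast": "smr_v4.3_fast_db.fasta",
    "default": "smr_v4.3_default_db.fasta",
    "sensitive": "smr_v4.3_sensitive_db.fasta",
    "sensitive_with_rfam": "smr_v4.3_sensitive_db_rfam_seeds.fasta",
}


def select_sortmerna_database(option, databases_dir="rRNA_databases"):
    """Return the path of the single database file for the given option."""
    if option not in SORTMERNA_DATABASES:
        valid = ", ".join(SORTMERNA_DATABASES)
        raise ValueError(f"Unknown sortmerna_database '{option}'. Valid options: {valid}")
    # databases_dir is an assumed location, not MOSCA's actual directory layout
    return f"{databases_dir}/{SORTMERNA_DATABASES[option]}"


print(select_sortmerna_database("sensitive_with_rfam"))
```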
minimum_read_length parameter split for MG and MT
Now, the minimum length of reads for further analysis can be set with the minimum_mg_read_length and minimum_mt_read_length parameters.
Added minimum_envs folder and contents
It holds the commands and resources for updating the environments when needed.
Also, some fixes
- Converting readcounts (for MG and MT) to int was turning them all into zeros (because they are normalized). MOSCA now keeps them as float.
- Blocked the print of MOSCA's TXT logo. Don't know why it doesn't work on the tests.
- Fix on Summary Report: rows now have information for both "Name" and "Sample" levels (before, there were rows for "Name" and rows for "Sample").
- Another fix on Summary Report: counting annotated genes was not done properly.
- When not performing assembly, General Report was not importing the readcounts correctly. Now, it does.
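The readcounts fix in the list above comes down to integer truncation. The values below are invented for illustration and assume, as the note implies, that normalized readcounts fall below 1.

```python
# Made-up normalized readcounts, all below 1
normalized_readcounts = [0.12, 0.87, 0.45]

# int() truncates towards zero, so every value is lost
as_int = [int(value) for value in normalized_readcounts]

# Keeping them as float preserves the normalized values
as_float = [float(value) for value in normalized_readcounts]

print(as_int)    # [0, 0, 0]
print(as_float)  # [0.12, 0.87, 0.45]
```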
Added default parameters JSON
I hadn't updated MOSCA's recipe in Bioconda to include the new default_config.json file. This release has no code updates, but serves to include the file in MOSCA's recipe.
Default parameters, input sanitization and final reports updates
MOSCA now has default parameters
These default parameters are set by the default_config.json file.
Input quality checking
Implemented checking of invalid names in experiments - names can't start with a number, even a float (e.g., 5AA or .5Name).
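A check of this kind could look like the sketch below. The regular expression and the function name are assumptions; the only rule taken from the notes is that a name may not start with a number, including a float written with a leading dot.

```python
import re

# Assumption: "starts with a number" means a leading digit, optionally preceded by a dot
STARTS_WITH_NUMBER = re.compile(r"^\.?\d")


def is_valid_name(name):
    """Return True when the experiment name does not start with a number."""
    return STARTS_WITH_NUMBER.match(name) is None


for name in ("5AA", ".5Name", "Sample5"):
    print(name, is_valid_name(name))
```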
Updates on final reports
Renamed Protein report to General report.
New report - Expression. This report includes only expressed genes.
Technical report was renamed to Versions. It is also exported as EXCEL now, because it carries information on every environment.
Implemented minimum value imputation
Implemented for MP analysis, but it is not yet available as an option. For now, it is a feature in preparation.
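The notes do not describe how the imputation is carried out. One common reading of minimum value imputation is to replace each missing value in a sample with the smallest value observed in that sample; the sketch below follows that reading and is not necessarily what MOSCA does.

```python
def impute_minimum(values):
    """Replace missing values (None) with the minimum observed value."""
    observed = [value for value in values if value is not None]
    if not observed:
        # Nothing to impute from; return the values unchanged
        return list(values)
    minimum = min(observed)
    return [minimum if value is None else value for value in values]


# Hypothetical spectra counts for one sample, with two missing proteins
print(impute_minimum([3.0, None, 1.5, None]))
```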
No more build_deps in Dockerfile
It's no longer needed, conda handles it all.
Dependencies update
- Fixed snakemake version to <8 - some of its new functionality is incompatible with MOSCA's implementation.
- Added pandas as dependency - mosca.py now has functions that require it.
- Updated to newest versions of UPIMAPI, reCOGnizer and KEGGCharter - this allowed removing the parameters related to database download.
Blocked MGMT test
GitHub Actions doesn't provide enough disk space for it.
Also, several fixes
- Fix on DE handling multiple samples
- Fix on KEGGCharter handling multiple samples
- Fix on multi_sheet excel handling multiple samples and numbering
- Fix on converting RAW spectra to MGF outside a container environment
- MOSCA now prints snakemake command properly
- Fix on adding normalized matrices to entry report
- Several fixes on summary report
- Necessary repairs to EC numbers and KEGG IDs, as those come from UPIMAPI in a format incompatible with KEGGCharter
- Fix on inputting mods to generate_parameters_file function
Reintroduction of MOSCA into Bioconda
Since MOSCA 1.3.6, the list of dependencies of the pipeline has become too complex for conda to manage.
This release makes use of snakemake environments to simplify the minimal environment required to install MOSCA. MOSCA now only requires snakemake.
Now MOSCA uses snakemake's rules
All the rules have been moved to corresponding .smk files. This has greatly simplified the main script.
Script files can no longer be run through the command line, however. The interface is now snakemake directly.
First step into producing a web-service.
Added schema for validating config.json
config.schema.yaml checks whether all required information is present, and in the correct format, in the input config file.
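To give an idea of what such a check involves, here is a minimal sketch in plain Python. Snakemake provides snakemake.utils.validate for validating a config against a schema file, but these notes do not say how MOSCA calls it; the required keys and types below are hypothetical and stand in for whatever config.schema.yaml actually defines.

```python
# Hypothetical subset of required keys and their expected Python types
REQUIRED_KEYS = {
    "output": str,
    "experiments": list,
}


def find_config_errors(config):
    """Return a list of problems found in the config dictionary."""
    errors = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected_type):
            errors.append(f"{key} should be of type {expected_type.__name__}")
    return errors


print(find_config_errors({"experiments": "not a list"}))
```

Collecting every problem before reporting, rather than stopping at the first one, lets a user fix the whole config file in one pass.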
New parameter metaproteomics_add_reference_proteomes: new option for skipping the search for reference proteomes of the organisms identified. This saves a lot of time during Peptide-to-Spectrum Matching.
Tests have been reformatted
The complete MGMP test has been reintroduced; however, it still fails because of excessive disk usage. It'll be a problem for another time.
Several fixes and improvements
params.method was not being correctly read in de_analysis.R.
config.json is now explicitly required.
The tmp directory used when handling SortMeRNA is now created inside the SortMeRNA output directory.
Removed pandas warnings concerning reading files without low_memory=False.
Memory allocated in metaproteomics is now in G instead of M.
Removed UPIMAPI apt dependencies - they are no longer needed.
Fix on reading method for normalization.
Fix on parsing conditions in de_analysis.R.
Metaproteogenomics - a new level of omics analyses
New workflow of metaproteomics analyses, based on metagenomics (MG) results.
This new layer of analysis allows inputting spectra - both in raw and standard formats - to MOSCA for metaproteomics (MP) analysis.
MOSCA's MP workflow is as follows:
1. Database construction
A database is built from MG results, aiming to include all sequences that can possibly be in the datasets. This includes:
- the genes identified by FragGeneScan on the MG gene calling step
- reference proteomes retrieved from UniProt of the taxa identified in the annotation step with UPIMAPI
- the cRAP database
- the protease sequence - for now, Trypsin is the only sequence available automatically; all others must be provided manually
This database will then be submitted for a first round of Peptide-to-Spectrum matching with SearchCLI and PeptideShaker. All proteins with at least one Peptide-to-Spectrum match (PSM) are collected for the final database - the metaproteogenomics database.
2. Peptide-to-Spectrum matching
SearchCLI is used for obtaining PSMs from inputted spectra, using as reference the database constructed in the previous step. SearchCLI is used with three search engines - X!Tandem, MyriMatch and MS-GF+. More engines might be added in the future.
3. Protein inference
PeptideShaker is used for protein inference and quantification, based on spectra counts. PSMs are selected at a 5 % local False Discovery Rate. Only peptides with two or more PSMs, and only proteins with two or more identified peptides, are selected for further analysis.
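The two-level filter described above can be sketched as follows. PeptideShaker applies its own logic internally; this example only illustrates the thresholds, assumes the PSMs have already passed the FDR cut-off, and ignores complications such as peptides shared between proteins. The function name and input format are hypothetical.

```python
from collections import defaultdict


def select_proteins(psms, min_psms=2, min_peptides=2):
    """psms: iterable of (protein, peptide) pairs, one pair per PSM."""
    # Count PSMs supporting each peptide of each protein
    psm_counts = defaultdict(int)
    for protein, peptide in psms:
        psm_counts[(protein, peptide)] += 1

    # Keep peptides with enough PSMs
    peptides_per_protein = defaultdict(set)
    for (protein, peptide), count in psm_counts.items():
        if count >= min_psms:
            peptides_per_protein[protein].add(peptide)

    # Keep proteins with enough retained peptides
    return {
        protein: sorted(peptides)
        for protein, peptides in peptides_per_protein.items()
        if len(peptides) >= min_peptides
    }


psms = [
    ("P1", "a"), ("P1", "a"), ("P1", "b"), ("P1", "b"),
    ("P2", "c"), ("P2", "c"), ("P2", "d"),
]
print(select_proteins(psms))  # {'P1': ['a', 'b']}
```

Here P2 is dropped because peptide d has a single PSM, leaving P2 with only one retained peptide.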
4. Normalization, imputation and differential protein expression analysis
Spectracounts are normalized with Variance Stabilizing Normalization. Missing values are imputed using Local Least Squares Imputation.
Normalized and imputed spectra counts are then submitted for differential protein expression analysis with Reproducibility-Optimized Test Statistics. Log2foldchange and p-values are retrieved for reporting.
5. Metabolic pathway representation and final reports
All following steps are performed as closely as possible to the metatranscriptomics (gene expression) analysis.
Metabolic maps are built with KEGGCharter, showing protein expression levels from MP and genomic potential from MG.
Final reports include all results from MG, and report on differential expression analysis of proteins.
Other updates
MOSCA's workflow has grown by around 40 %.
MOSCA is now compatible, through UPIMAPI, with the UniProt updates from six months ago. This includes parsing the taxonomic columns, so that taxonomic kronas can still be represented.
Snakemake conda environments are now used, instead of one single environment. This has made it possible again to build MOSCA's environments, and may signal the return of MOSCA to Bioconda.
Re-added KEGGCharter to workflow
KEGGCharter is again run from "MOSCA_Entry_Report".
Changed its output filename in the rule because the tool now only outputs in TSV.
Also some fixes in environment.yml
- fixed perl version
- added subversion
Stand-alone metatranscriptomics workflow implemented
Metatranscriptomics can be used as reference without metagenomics
- If MG is not inputted, MT will be used for the MG part of MOSCA's workflow - assembly, binning, gene calling and annotation.
- Trinity and RNAspades now available as assembler options
- rule join_reads now considers the possibility of MT as reference
Changes in config.json
- experiments.tsv integrated into config.json as a parameter (list of dictionaries)
- adapted config.json column names to MOSGUITO
- New parameter - "suffix"
  - This parameter allows specifying a suffix that follows the _R1/_R2 special characters in file names. MOSCA will consider that those characters are followed by the "suffix" (e.g., _L001 would serve for the files mg_R1_L001.fq and mg_R2_L001.fq)
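The effect of the "suffix" parameter on file names can be illustrated with the sketch below. Both helper functions are hypothetical and only mirror the example given above (mg_R1_L001.fq and mg_R2_L001.fq); they are not MOSCA's own parsing code.

```python
def paired_file_names(prefix, suffix="", extension=".fq"):
    """Build the forward and reverse file names for a given prefix and suffix."""
    return tuple(f"{prefix}_R{number}{suffix}{extension}" for number in (1, 2))


def sample_prefix(file_name, suffix=""):
    """Recover the part of the file name before _R1/_R2 followed by the suffix."""
    for tag in ("_R1", "_R2"):
        marker = tag + suffix
        if marker in file_name:
            return file_name.split(marker)[0]
    return None


print(paired_file_names("mg", "_L001"))         # ('mg_R1_L001.fq', 'mg_R2_L001.fq')
print(sample_prefix("mg_R2_L001.fq", "_L001"))  # mg
```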
Adaptations for new versions of tools
- SortMeRNA 4 fully implemented
- Always gzips SortMeRNA output
- UPIMAPI used directly instead of DIAMOND
- MOSCA now accepts UPIMAPI's three options for database: "taxids", "uniprot" or "swissprot"
- Small adjustment on CI to allow running reCOGnizer with mini cdd.tar.gz
- Fixed krona version (to 2.5) for compatibility with MaxBin2 - MaxBin2 dependencies present problems with higher versions, and krona's more recent versions would force the installation of those broken dependencies
Added technical files, removed old scripts
- added .gitignore
- join_information.py deprecated, replaced by mosca_tools functions and rules in Snakefile
Changes in environment and CI files
- install.bash no longer installs mamba
- added gmcloser to environment.yml
- added simplified cdd.tar.gz for CI
- added test for complete workflow of MOSCA
- new default for max-ref-number with metaquast - is now 0 to allow running CI
Miscellaneous fixes
- fix on snakefile - checks if "Name" in "experiments" is ""
- bins and DE results go to the folders of their respective "samples"
- several fixes on reporting
- fix on alignment functions in mosca_tools.py
- fix on de_analysis.R
- fix on obtaining directories for Illumina adapters and rRNA databases on preprocessing step
Fixed high quality bins evaluation
MOSCA was evaluating the high quality bins incorrectly.
Best probability threshold is now written at the end of iterative binning.
Assigned one thread fewer to the quantification rule in Snakefile.
- Allows upimapi to run simultaneously.
metaSPAdes upgraded to version 3.15 so that it does not run out of memory.
Fixed some bugs in name assignment.
Iterative binning for best binning
do_iterative_binning option now available!
- Iterative binning cycles between MaxBin and CheckM - MaxBin obtains the bins, CheckM checks their quality
- Iterative binning cycles through many probability thresholds to determine the value that gives the best binning
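The cycle described above can be pictured as a search over probability thresholds. In this sketch, run_binning and evaluate_bins are hypothetical callables standing in for MaxBin and CheckM, and the scoring criterion (here, simply "higher is better") is an assumption; the notes do not state how MOSCA ranks binnings.

```python
def iterative_binning(thresholds, run_binning, evaluate_bins):
    """Try each probability threshold and keep the one with the best score."""
    best_threshold, best_score = None, float("-inf")
    for threshold in thresholds:
        bins = run_binning(threshold)   # stand-in for a MaxBin run
        score = evaluate_bins(bins)     # stand-in for a CheckM evaluation
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Stubbed example: pretend the score is the number of high quality bins
fake_scores = {0.5: 3, 0.7: 5, 0.9: 4}
best = iterative_binning(
    fake_scores,
    run_binning=lambda threshold: threshold,
    evaluate_bins=lambda bins: fake_scores[bins],
)
print(best)  # (0.7, 5)
```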
New option for differential expression - minimum_fold_change!
- Determines padj for up or down expression beyond the minimum fold change, instead of just a difference from 0
Can now be installed from source code
Automatic setup from source code is now functional, and suggested installation method is through the bash script.