diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
new file mode 100644
index 00000000..51c54025
--- /dev/null
+++ b/.github/CONTRIBUTING.md
@@ -0,0 +1,72 @@
+# buisciii-tools: Contributing Guidelines
+
+## Contribution workflow
+
+If you'd like to write or modify some code for buisciii-tools, the standard workflow is as follows:
+
+1. Check that there isn't already an issue about your idea in the [buisciii-tools issues](https://github.com/BU-ISCIII/buisciii-tools/issues) to avoid duplicating work. **If there isn't one already, please create one so that others know you're working on this**.
+2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [buisciii-tools repository](https://github.com/BU-ISCIII/buisciii-tools/) to your GitHub account.
+3. Make the necessary changes/additions within your forked repository, following the [code style guidelines](#code-style-guidelines).
+4. Update the [`CHANGELOG`](../CHANGELOG.md) file according to your changes, in the appropriate section ([X.X.Xhot] or [X.X.Xdev]). You should register your changes regarding:
+   1. Added enhancements
+   2. Template changes
+   3. Fixes
+   4. Removed stuff
+   5. Requirements added or version updates
+5. Update any documentation as needed.
+6. [Submit a Pull Request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request) against the `develop` or `hotfix` branch and send the URL to the #pipelines-dev channel in Slack (if you are not in the Slack channel, just wait for the PR to be reviewed and rebased).
+
+If you're not used to this workflow with git, you can start with:
+
+- Some [docs in the bu-isciii wiki](https://github.com/BU-ISCIII/BU-ISCIII/wiki/Github--gitflow).
+- [Some slides](https://docs.google.com/presentation/d/1PruqGxPQVxtNcuEbOd86mylXorgYIU5a/edit?pli=1#slide=id.p1) (in Spanish).
+- Some generic [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests).
+- Even their [excellent `git` resources](https://try.github.io/).
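+A minimal sketch of that fork-and-PR loop on the command line (assuming `origin` points to your fork and `upstream` to the BU-ISCIII repository; the user and branch names are only examples):
+
+```bash
+# one-time setup: clone your fork and register the upstream repository
+git clone https://github.com/<your-username>/buisciii-tools.git
+cd buisciii-tools
+git remote add upstream https://github.com/BU-ISCIII/buisciii-tools.git
+
+# start a feature branch from the latest develop (or hotfix, for template changes)
+git fetch upstream
+git checkout -b my-feature upstream/develop
+
+# ...edit code, CHANGELOG.md and docs, then publish the branch and open the PR...
+git commit -am "Describe your change"
+git push origin my-feature
+```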
+### buisciii-tools repo branches
+
+The buisciii-tools repo works with a three-branch scheme: two regular branches, `main` and `develop`, and a third one created for hot fixes, `hotfix`. This last one is meant for changes in the **services templates**.
+
+- `main`: stable code, only for releases.
+- `develop`: new code development for the different modules.
+- `hotfix`: bug fixing and/or template additions/modifications (bash scripts in the `templates` folder).
+
+You always need to submit your PR against `develop` or `hotfix`, depending on the nature of your changes. Once approved, these changes must be **`rebased`** so we do not create unwanted empty merges.
+
+## Tests
+
+When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests.
+Typically, pull requests are only fully reviewed when these tests are passing, though of course we can help out before then.
+
+There are typically two types of tests that run:
+
+### Lint tests
+
+We use black and flake8 linting based on PEP8 guidelines for Python coding. You can check more information [here](https://github.com/BU-ISCIII/BU-ISCIII/wiki/Python#linting).
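+A quick local check before pushing (a sketch, assuming both tools are installed in your environment):
+
+```bash
+pip install black flake8   # if not already available
+black --check .            # list files black would reformat (drop --check to apply)
+flake8 .                   # report PEP8/style violations
+```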
+### Code tests
+
+TODO. NOT YET IMPLEMENTED.
+Anyhow, you should always submit locally tested code!!
+
+### New version bumping and release
+
+In order to create a new release, you need to follow these steps:
+
+1. Set the new version according to [semantic versioning](https://semver.org/). In our particular case, changes in the `hotfix` branch bump the PATCH version (the third number), and changes in `develop` typically bump the MINOR version, unless the developing team decides otherwise.
+2. Create a PR bumping the new version against `hotfix` or `develop`. For bumping a new version, just change [this line](https://github.com/BU-ISCIII/buisciii-tools/blob/615f1390d96cd6c8168acebc384289520a3cd728/setup.py#L5) with the new version (see the sketch after these steps).
+3. Once that PR is merged, create via web another PR against `main` (from `develop` or `hotfix` accordingly). This PR needs 2 approvals.
+4. [Create a new release](https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository), copying the appropriate notes from the `CHANGELOG`.
+5. Once the release is approved and merged, you're all set!
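+A sketch of step 2 (the target version here is only an example; note that `__version__` in `bu_isciii/__main__.py` carries the same string and is bumped together with `setup.py` in this changeset):
+
+```bash
+# bump both version strings for a hypothetical PATCH release
+sed -i 's/^version = ".*"/version = "1.0.2"/' setup.py
+sed -i 's/__version__ = ".*"/__version__ = "1.0.2"/' bu_isciii/__main__.py
+git commit -am "Bump version to 1.0.2"
+```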
+
+PRs from one branch to another, like in a release, should be **`merged`**, not rebased, so we avoid conflicts and the branch merge is correctly visualized in the commit history.
+
+> A new PR for the `develop` branch will be automatically generated if the changes came from `hotfix`, so everything stays properly synced.
+
+### Code style guidelines
+
+We follow PEP8 conventions as code style guidelines; please check [here](https://github.com/BU-ISCIII/BU-ISCIII/wiki/Python#pep-8-guidelines-read-the-full-pep-8-documentation) for more detail.
+
+## Getting help
+
+For further information/help, please ask on the `#pipelines-dev` Slack channel or write us an email ([bioinformatica@isciii.es](mailto:bioinformatica@isciii.es))!
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
new file mode 100644
index 00000000..0fc42167
--- /dev/null
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,21 @@
+
+
+## PR checklist
+
+- [ ] This comment contains a description of changes (with reason).
+- [ ] Make sure your code lints (`black` and `flake8`).
+- If a new template was added, make sure:
+  - [ ] The template's schema is added in `templates/services.json`.
+  - [ ] The template pipeline's documentation is added in `assets/reports/md/template.md`.
+  - [ ] The results documentation in `assets/reports/results/template.md` is updated.
+- [ ] `CHANGELOG.md` is updated.
+- [ ] `README.md` is updated (including new tool citations and authors/contributors).
+- [ ] If a new user was added to the SFTP, make sure you added it to `templates/sftp_user.json`.
diff --git a/.github/workflows/update_branches.yml b/.github/workflows/update_branches.yml
new file mode 100644
index 00000000..706fba2c
--- /dev/null
+++ b/.github/workflows/update_branches.yml
@@ -0,0 +1,30 @@
+name: Create Pull Request from main to develop and/or hotfix
+
+on:
+  push:
+    branches:
+      - main
+
+env:
+  GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} # setting GH_TOKEN for the entire workflow
+
+jobs:
+  create-and-auto-merge-pr:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v2
+
+      - name: Install GitHub CLI
+        run: |
+          sudo apt update
+          sudo apt install gh
+
+      - name: Create Pull Request to develop
+        run: |
+          gh pr create --base develop --head main --title "Merge changes from main into develop" --body "Automatically created pull request to merge changes from main into develop."
+
+      - name: Create Pull Request to hotfix
+        run: |
+          gh pr create --base hotfix --head main --title "Merge changes from main into hotfix" --body "Automatically created pull request to merge changes from main into hotfix."
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100755
index 00000000..59cd25c6
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,126 @@
+# bu-isciii tools Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [2.0.0dev] - 2024-0X-0X : https://github.com/BU-ISCIII/buisciii-tools/releases/tag/2.0.0
+
+### Credits
+
+Code contributions to the release:
+
+- [Sara Monzón](https://github.com/saramonzon)
+- [Sarai Varona](https://github.com/svarona)
+- [Pablo Mata](https://github.com/Shettland)
+- [Guillermo Gorines](https://github.com/GuilleGorines)
+
+### Template fixes and updates
+
+- Added templates:
+  - freebayes
+
+### Modules
+
+#### Added enhancements
+
+- Added credential parameters: --api_user, --api_password and --cred_file
+- Made modules create folder paths automatically from the DB
+- Added finish module
+- Added JSON files: sftp_user.json
+- Added delivery Jinja templates
+
+#### Fixes
+
+#### Changed
+
+- Fixed API requests to fit the new database format
+- Updated README
+
+#### Removed
+
+### Requirements
+
+- Added PyYAML
+
+## [1.0.2hot] - 2024-0X-0X : https://github.com/BU-ISCIII/buisciii-tools/releases/tag/1.0.2
+
+### Credits
+
+Code contributions to the hotfix:
+
+### Template fixes and updates
+
+### Modules
+
+#### Added enhancements
+
+#### Fixes
+
+#### Changed
+
+#### Removed
+
+### Requirements
+
+## [1.0.1] - 2024-02-01 : https://github.com/BU-ISCIII/buisciii-tools/releases/tag/1.0.1
+
+### Credits
+
+Code contributions to the hotfix:
+
+- [Pablo Mata](https://github.com/Shettland)
+- [Jaime Ozaez](https://github.com/jaimeozaez)
+- [Sara Monzón](https://github.com/saramonzon)
+- [Sarai Varona](https://github.com/svarona)
+- [Daniel Valle](https://github.com/Daniel-VM)
+
+### Template fixes and updates
+
+- Added a new line in `buisciii_tools/bu_isciii/templates/viralrecon/ANALYSIS/lablog_viralrecon`, in order to automatically rename the `ANALYSIS0X_MAG` directory with the current date.
+- Introduced handling of flu-C in `buisciii_tools/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/` `lablog` and `create_irma_stats.sh`
+- Small changes to `buisciii_tools/bu_isciii/templates/viralrecon/RESULTS/viralrecon_results` for BLAST and the new excel_generator.py
+- Introduced better error handling in excel_generator.py. Now it can also be used for single files
+- Brought back `PASS_ONLY` to exometrio's `exomiser_configfile.yml`
+- [#187](https://github.com/BU-ISCIII/buisciii-tools/pull/187) - Added a new template for bacterial assembly, allowing for short, long and hybrid assembly.
+- [#190](https://github.com/BU-ISCIII/buisciii-tools/pull/190) - Renamed some misleading variables in create_summary_report from the viralrecon template and fixed a small typo in a regex in excel_generator.py
+- [#192](https://github.com/BU-ISCIII/buisciii-tools/pull/192) - Small changes in excel_generator.py to automatically merge pangolin/nextclade tables when more than 1 reference is found
+
+### Modules
+
+#### Added enhancements
+- Added CHANGELOG
+- Added template for Pull Request
+- Added Contributing guidelines
+- Added GitHub action to sync branches
+
+#### Fixes
+
+#### Changed
+
+#### Removed
+
+### Requirements
+
+## [1.0.0] - 2024-01-08 : https://github.com/BU-ISCIII/buisciii-tools/releases/tag/1.0.0
+
+### Credits
+
+Code contributions to the initial release:
+
+- [Sara Monzón](https://github.com/saramonzon)
+- [Sarai Varona](https://github.com/svarona)
+- [Guillermo Gorines](https://github.com/GuilleGorines)
+- [Pablo Mata](https://github.com/Shettland)
+- [Luis Chapado](https://github.com/luissian)
+- [Erika Kvalem](https://github.com/ErikaKvalem)
+- [Alberto Lema](https://github.com/Alema91)
+- [Daniel Valle](https://github.com/Daniel-VM)
+- [Fernando Gomez](https://github.com/FGomez-Aldecoa)
+
diff --git a/bu_isciii/__main__.py b/bu_isciii/__main__.py
index 79044def..9dcbbb25 100644
--- a/bu_isciii/__main__.py
+++ b/bu_isciii/__main__.py
@@ -55,7 +55,7 @@ def run_bu_isciii():
     )
     # stderr.print("[green] `._,._,'\n", highlight=False)
-    __version__ = "1.0.0"
+    __version__ = "1.0.1"
     stderr.print(
         "[grey39] BU-ISCIII-tools version {}".format(__version__), highlight=False
     )
diff --git a/bu_isciii/autoclean_sftp.py b/bu_isciii/autoclean_sftp.py
index c740d342..274a1515 100644
--- a/bu_isciii/autoclean_sftp.py
+++ b/bu_isciii/autoclean_sftp.py
@@ -120,9 +120,9 @@ def get_sftp_services(self):
             # Get sftp-service last modification
             service_finder = LastMofdificationFinder(sftp_service_fullPath)
             service_last_modification = service_finder.find_last_modification()
-            self.sftp_services[
-                sftp_service_fullPath
-            ] = service_last_modification
+            self.sftp_services[sftp_service_fullPath] = (
+                service_last_modification
+            )
         if len(self.sftp_services) == 0:
             sys.exit(f"No services found in {self.path}")
diff --git a/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/create_irma_stats.sh b/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/create_irma_stats.sh
old mode 100644
new mode 100755
index 89e072a5..93f0ffec
--- a/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/create_irma_stats.sh
+++ b/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/create_irma_stats.sh
@@ -1 +1,33 @@
-echo -e "sample_ID\tTotalReads\tMappedReads\tFlu_type\tReads_HA\tReads_MP\tReads_NA\tReads_NP\tReads_NS\tReads_PA\tReads_PB1\tReads_PB2" > irma_stats.txt; cat ../samples_id.txt | while read in; do paste <(echo ${in}) <(grep '1-initial' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '3-match' ${in}/tables/READ_COUNTS.txt | cut -f2) <(paste <(grep '4-[A-B]_HA' ${in}/tables/READ_COUNTS.txt | cut -f1 | cut -d '_' -f1,3 | cut -d '-' -f2) <(grep '4-[A-B]_NA' ${in}/tables/READ_COUNTS.txt | cut -f1 | cut -d '_' -f3) | tr '\t' '_') <(grep '4-[A-B]_HA' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_MP' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_NA' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_NP' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_NS' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_PA' 
${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_PB1' ${in}/tables/READ_COUNTS.txt | cut -f2) <(grep '4-[A-B]_PB2' ${in}/tables/READ_COUNTS.txt | cut -f2); done >> irma_stats.txt + +echo -e "sample_ID\tTotalReads\tMappedReads\tFlu_type\tReads_HA\tReads_MP\tReads_NA\tReads_NP\tReads_NS\tReads_PA\tReads_PB1\tReads_PB2" > irma_stats.txt + +cat ../samples_id.txt | while read in +do +SAMPLE_ID=$(echo ${in}) +TOTAL_READS=$(grep '1-initial' ${in}/tables/READ_COUNTS.txt | cut -f2) +MAPPEDREADS=$(grep '3-match' ${in}/tables/READ_COUNTS.txt | cut -f2) +FLU_TYPE=$(paste <(grep '4-[A-C]_MP' ${in}/tables/READ_COUNTS.txt | cut -f1 | cut -d '_' -f1 | cut -d '-' -f2) <(grep '4-[A-B]_HA' ${in}/tables/READ_COUNTS.txt | cut -f1 | cut -d '_' -f3 | cut -d '-' -f2) <(grep '4-[A-B]_NA' ${in}/tables/READ_COUNTS.txt | cut -f1 | cut -d '_' -f3) | tr '\t' '_') +HA=$(grep '4-[A-C]_HA' ${in}/tables/READ_COUNTS.txt | cut -f2) +MP=$(grep '4-[A-C]_MP' ${in}/tables/READ_COUNTS.txt | cut -f2) +NA=$(grep '4-[A-C]_NA' ${in}/tables/READ_COUNTS.txt | cut -f2) +NP=$(grep '4-[A-C]_NP' ${in}/tables/READ_COUNTS.txt | cut -f2) +NS=$(grep '4-[A-C]_NS' ${in}/tables/READ_COUNTS.txt | cut -f2) +PA=$(grep '4-[A-C]_PA' ${in}/tables/READ_COUNTS.txt | cut -f2) +PB1=$(grep '4-[A-C]_PB1' ${in}/tables/READ_COUNTS.txt | cut -f2) +PB2=$(grep '4-[A-C]_PB2' ${in}/tables/READ_COUNTS.txt | cut -f2) +#In case of Influenza C in samples: +HE=$(grep '4-C_HE' ${in}/tables/READ_COUNTS.txt | cut -f2) +if [[ -n "$HE" ]]; then + LINE=$(paste <(echo $SAMPLE_ID) <(echo $TOTAL_READS) <(echo $MAPPEDREADS) <(echo $FLU_TYPE) <(echo $HA) <(echo $MP) <(echo $NA) <(echo $NP) <(echo $NS) <(echo $PA) <(echo $PB1) <(echo $PB2) <(echo $HE)) +else + LINE=$(paste <(echo $SAMPLE_ID) <(echo $TOTAL_READS) <(echo $MAPPEDREADS) <(echo $FLU_TYPE) <(echo $HA) <(echo $MP) <(echo $NA) <(echo $NP) <(echo $NS) <(echo $PA) <(echo $PB1) <(echo $PB2)) +fi + +echo "$LINE" >> irma_stats.txt + +done + +ANY_C=$(grep "C_" irma_stats.txt) +if [[ -n "$ANY_C" ]]; then + sed -i 's/Reads_PB2/Reads_PB2\tReads_HE/g' irma_stats.txt +fi diff --git a/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/lablog b/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/lablog old mode 100644 new mode 100755 index 33f3a273..540640fe --- a/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/lablog +++ b/bu_isciii/templates/IRMA/ANALYSIS/ANALYSIS01_FLU_IRMA/04-irma/lablog @@ -15,12 +15,21 @@ echo "cat HA_types.txt | while read in; do mkdir \${in}; done" >> _03_post_proce echo "mkdir B" >> _03_post_processing.sh +echo "mkdir C" >> _03_post_processing.sh + echo "ls */*.fasta | cut -d '/' -f2 | cut -d '.' -f1 | cut -d '_' -f1,2 | sort -u | grep 'A_' > A_fragment_list.txt" >> _03_post_processing.sh echo "ls */*.fasta | cut -d '/' -f2 | cut -d '.' -f1 | cut -d '_' -f1,2 | sort -u | grep 'B_' > B_fragment_list.txt" >> _03_post_processing.sh -echo 'cat HA_types.txt | while read type; do grep ${type} irma_stats.txt | cut -f1 | while read sample; do cat A_fragment_list.txt | while read fragment; do if test -f ${sample}/${fragment}*.fasta; then cat ${sample}/${fragment}*.fasta | sed "s/^>/\>${sample}_/g" | sed 's/_H1//g' | sed 's/_H3//g' | sed 's/_N1//g' | sed 's/_N2//g'; fi >> ${type}/${fragment}.txt; done; done; done' >> _03_post_processing.sh +echo "ls */*.fasta | cut -d '/' -f2 | cut -d '.' 
-f1 | cut -d '_' -f1,2 | sort -u | grep 'C_' > C_fragment_list.txt" >> _03_post_processing.sh
+
+echo 'cat HA_types.txt | while read type; do grep ${type} irma_stats.txt | cut -f1 | while read sample; do cat A_fragment_list.txt | while read fragment; do if test -f ${sample}/${fragment}*.fasta; then cat ${sample}/${fragment}*.fasta | sed "s/^>/\>${sample}_/g" | sed 's/_H1//g' | sed 's/_H3//g' | sed 's/_N1//g' | sed 's/_N2//g' | sed s@-@/@g | sed s/_A_/_/g ; fi >> ${type}/${fragment}.txt; done; done; done' >> _03_post_processing.sh
+
+echo 'grep -w 'B__' irma_stats.txt | cut -f1 | while read sample; do cat B_fragment_list.txt | while read fragment; do if test -f ${sample}/${fragment}*.fasta; then cat ${sample}/${fragment}*.fasta | sed "s/^>/\>${sample}_/g" | sed s/_H1//g | sed s/_H3//g | sed s/_N1//g | sed s/_N2//g | sed s@-@/@g | sed s/_B_/_/g ; fi >> B/${fragment}.txt; done; done' >> _03_post_processing.sh
+
+echo 'grep -w 'C__' irma_stats.txt | cut -f1 | while read sample; do cat C_fragment_list.txt | while read fragment; do if test -f ${sample}/${fragment}*.fasta; then cat ${sample}/${fragment}*.fasta | sed "s/^>/\>${sample}_/g" | sed s/_H1//g | sed s/_H3//g | sed s/_N1//g | sed s/_N2//g | sed s@-@/@g | sed s/_C_/_/g ; fi >> C/${fragment}.txt; done; done' >> _03_post_processing.sh
-echo 'grep -w 'B_' irma_stats.txt | cut -f1 | while read sample; do cat B_fragment_list.txt | while read fragment; do if test -f ${sample}/${fragment}*.fasta; then cat ${sample}/${fragment}*.fasta | sed "s/^>/\>${sample}_/g" | sed s/_H1//g | sed s/_H3//g | sed s/_N1//g | sed s/_N2//g; fi >> B/${fragment}.txt; done; done' >> _03_post_processing.sh
+echo 'cat ../samples_id.txt | while read in; do cat ${in}/*.fasta | sed "s/^>/\>${in}_/g" | sed 's/_H1//g' | sed 's/_H3//g' | sed 's/_N1//g' | sed 's/_N2//g' | sed 's@-@/@g' | sed 's/_A_/_/g' | sed 's/_B_/_/g' | sed 's/_C_/_/g' >> all_samples_completo.txt; done' >> _03_post_processing.sh
-echo 'cat ../samples_id.txt | while read in; do cat ${in}/*.fasta | sed "s/^>/\>${in}_/g" | sed 's/_H1//g' | sed 's/_H3//g' | sed 's/_N1//g' | sed 's/_N2//g' >> all_samples_completo.txt; done' >> _03_post_processing.sh
+echo 'sed -i "s/__//g" irma_stats.txt' >> _03_post_processing.sh
+echo 'sed -i "s/_\t/\t/g" irma_stats.txt' >> _03_post_processing.sh
\ No newline at end of file
diff --git a/bu_isciii/templates/IRMA/RESULTS/irma_results b/bu_isciii/templates/IRMA/RESULTS/irma_results
old mode 100644
new mode 100755
index 4c910758..a2a5bb33
--- a/bu_isciii/templates/IRMA/RESULTS/irma_results
+++ b/bu_isciii/templates/IRMA/RESULTS/irma_results
@@ -7,3 +7,4 @@ ln -s ../../ANALYSIS/*_MET/99-stats/multiqc_report.html ./krona_results.html
 ln -s ../../ANALYSIS/*FLU_IRMA/04-irma/all_samples_completo.txt .
 ln -s ../../ANALYSIS/*FLU_IRMA/04-irma/A_H* .
 ln -s ../../ANALYSIS/*FLU_IRMA/04-irma/B .
+ln -s ../../ANALYSIS/*FLU_IRMA/04-irma/C .
\ No newline at end of file
diff --git a/bu_isciii/templates/assembly/ANALYSIS/ANALYSIS01_ASSEMBLY/lablog b/bu_isciii/templates/assembly/ANALYSIS/ANALYSIS01_ASSEMBLY/lablog
index 9503c3f9..bd8f8549 100644
--- a/bu_isciii/templates/assembly/ANALYSIS/ANALYSIS01_ASSEMBLY/lablog
+++ b/bu_isciii/templates/assembly/ANALYSIS/ANALYSIS01_ASSEMBLY/lablog
@@ -1,31 +1,97 @@
-echo "Do you want to save trimmed reads in outdir?"
+# Function to print colored text
+print_color() {
+    case "$2" in
+        "red")
+            echo -e "\e[1;31m$1\e[0m"
+            ;;
+        "green")
+            echo -e "\e[1;32m$1\e[0m"
+            ;;
+        "blue")
+            echo -e "\e[1;34m$1\e[0m"
+            ;;
+        *)
+            echo "$1"
+            ;;
+    esac
+}
-read -p 'Write y or n: ' trimmed
+# Function to prompt with color
+prompt_with_color() {
+    read -p "$(print_color $1 'blue') $2" response
+}
-TRIMMED=$(echo "${trimmed}" | tr '[:upper:]' '[:lower:]')
+# Select assembly mode
+assembly_options=("short" "long" "hybrid")
+print_color "Indicate the preferred assembly mode:" 'blue'
+select ASSEMBLY_MODE in "${assembly_options[@]}"; do
+    if [ -n "$ASSEMBLY_MODE" ]; then
+        if [ $ASSEMBLY_MODE == "short" ]; then
+            ASSEMBLER="unicycler"
+        elif [ "$ASSEMBLY_MODE" == "long" ] || [ "$ASSEMBLY_MODE" == "hybrid" ]; then
+            ASSEMBLER="dragonflye"
+        fi
+        break
+    else
+        print_color "Invalid input. Please select a valid option." 'red'
+    fi
+done
+print_color "Selected assembly mode: $ASSEMBLY_MODE" 'green'
-if [ "$TRIMMED" == "yes" ] || [ "$TRIMMED" == "y" ]
-then SAVETRIMMED="True"
-else SAVETRIMMED="False"
-fi
+# Select whether to save trimmed reads
+trim_options=("Yes" "No")
+print_color "Do you want to save trimmed reads in outdir?" 'blue'
+select TRIMMED in "${trim_options[@]}"; do
+    if [ -n "$TRIMMED" ]; then
+        # map the menu choice to the pipeline's boolean flag
+        if [ "$TRIMMED" == "Yes" ] || [ "$TRIMMED" == "y" ]; then
+            SAVETRIMMED="true"
+        else
+            SAVETRIMMED="false"
+        fi
-echo "Is gram positive or negative?"
+        break
+    else
+        print_color "Invalid input. Please select a valid option." 'red'
+    fi
+done
+print_color "Save trimmed reads: $TRIMMED" 'green'
-read -p 'Write + or -: ' grammtype
+# Select Prokka gram type
+gram_options=("+" "-" "skip")
-if [ "$grammtype" != "-" ] && [ "$grammtype" != "+" ]
-then
-    echo "The given param: $grammtype does not match any of the accepted params ('+' or '-')"
-    exit 1
-fi
+print_color "Is gram positive or negative?" 'blue'
+select GRAMTYPE in "${gram_options[@]}"; do
+    if [ -n "$GRAMTYPE" ]; then
+        if [ "$GRAMTYPE" != "skip" ]; then
+            PROKKA_ARGS="--prokka_args '--gram ${GRAMTYPE}'"
+        fi
+        break
+    else
+        print_color "Invalid input. Please select a valid option." 'red'
+    fi
+done
+print_color "Selected Prokka gram type: $GRAMTYPE" 'green'
+
+
+# SETUP INPUT SAMPLE SHEET
 ln -s ../00-reads .
 ln -s ../samples_id.txt .
-echo "sample,fastq_1,fastq_2" > samplesheet.csv -cat samples_id.txt | while read in; do echo "${in},00-reads/${in}_R1.fastq.gz,00-reads/${in}_R2.fastq.gz"; done >> samplesheet.csv +echo "ID,R1,R2,LongFastQ,Fast5,GenomeSize" > samplesheet.csv +cat samples_id.txt | while read in; do + if [ "$ASSEMBLY_MODE" == "short" ]; then + echo "${in},00-reads/${in}_R1.fastq.gz,00-reads/${in}_R2.fastq.gz,NA,NA,NA"; + elif [ "$ASSEMBLY_MODE" == "long" ]; then + echo "${in},NA,NA,00-reads/${in}.fastq.gz,NA,NA"; + elif [ "$ASSEMBLY_MODE" == "hybrid" ]; then + echo "${in},00-reads/${in}_R1.fastq.gz,00-reads/${in}_R2.fastq.gz,00-reads/${in}.fastq.gz,NA,NA"; + else + echo "Format not recognized for the sample : ${in}."; + fi +done >> samplesheet.csv -#module load Nextflow/21.10.6 singularity scratch_dir=$(echo $PWD | sed "s/\/data\/bi\/scratch_tmp/\/scratch/g") cat < assembly.sbatch @@ -38,20 +104,28 @@ cat < assembly.sbatch #SBATCH --output $(date '+%Y%m%d')_assembly01.log #SBATCH --chdir $scratch_dir -export NXF_OPTS="-Xms500M -Xmx4G" - -nextflow run /scratch/bi/pipelines/BU_ISCIII-bacterial-assembly/main.nf \\ - -c ../../DOC/hpc_slurm_assembly.config \\ - --input samplesheet.csv \\ - --outdir ./ \\ - --cut_mean_quality 20 \\ - --qualified_quality_phred 20 \\ - --gram ${grammtype} \\ - --reference_outdir ../../REFERENCES \\ - --save_trimmed ${SAVETRIMMED} \\ - --kmerfinder_bacteria_database '/data/bi/references/kmerfinder/20190108_stable_dirs/bacteria' \\ - --reference_ncbi_bacteria '/data/bi/references/bacteria/latest_db/assembly_summary_bacteria.txt' \\ - -resume +# module load Nextflow/23.10.0 singularity +export NXF_OPTS="-Xms500M -Xmx8G" + +nextflow run /data/bi/pipelines/nf-core-bacass/main.nf \\ + -c ../../DOC/hpc_slurm_assembly.config \\ + -profile singularity \\ + --input samplesheet.csv \\ + --outdir ./ \\ + --assembly_type ${ASSEMBLY_MODE} \\ + --assembler ${ASSEMBLER} \\ + --skip_polish true \\ + --save_trimmed ${SAVETRIMMED} \\ + --fastp_args '--qualified_quality_phred 20 --cut_mean_quality 20' \\ + --skip_kraken2 true \\ + --skip_kmerfinder false \\ + --kmerfinderdb /data/bi/references/kmerfinder/20190108_stable_dirs/bacteria \\ + --ncbi_assembly_metadata /data/bi/references/bacteria/20191212/assembly_summary_bacteria.txt \\ + ${PROKKA_ARGS} \\ + -resume + EOF echo "sbatch assembly.sbatch" > _01_nf_assembly.sh + + diff --git a/bu_isciii/templates/assembly/ANALYSIS/lablog_assembly b/bu_isciii/templates/assembly/ANALYSIS/lablog_assembly index bcbdefd6..c5e90e0e 100644 --- a/bu_isciii/templates/assembly/ANALYSIS/lablog_assembly +++ b/bu_isciii/templates/assembly/ANALYSIS/lablog_assembly @@ -1,4 +1,21 @@ mkdir -p 00-reads -cd 00-reads; cat ../samples_id.txt | xargs -I % echo "ln -s ../../RAW/%_*R1*.fastq.gz %_R1.fastq.gz" | bash; cd - -cd 00-reads; cat ../samples_id.txt | xargs -I % echo "ln -s ../../RAW/%_*R2*.fastq.gz %_R2.fastq.gz" | bash; cd - -mv ANALYSIS01_ASSEMBLY $(date '+%Y%m%d')_ANALYSIS01_ASSEMBLY +cd 00-reads + +# Loop through each file in the directory +while IFS= read -r sample; do + # Extract the file name with&without extension + filename_noext=$(basename -s .fastq.gz ../../RAW/${sample}*) + + ### Check if the file is a short read or long read + for fileitem in $filename_noext; do + if [[ "$fileitem" =~ _R[12] ]]; then + ln -s -f ../../RAW/${sample}*_R1*.fastq.gz ${sample}_R1.fastq.gz + ln -s -f ../../RAW/${sample}*_R2*.fastq.gz ${sample}_R2.fastq.gz + elif [[ ! 
"$fileitem" =~ _R[12] ]]; then + ln -s -f ../../RAW/${sample}.fastq.gz ${sample}.fastq.gz + fi + done +done < ../samples_id.txt + +cd - +mv ANALYSIS01_ASSEMBLY "$(date '+%Y%m%d')_ANALYSIS01_ASSEMBLY" diff --git a/bu_isciii/templates/assembly/DOC/hpc_slurm_assembly.config b/bu_isciii/templates/assembly/DOC/hpc_slurm_assembly.config index 73bfc79b..04dddf4d 100644 --- a/bu_isciii/templates/assembly/DOC/hpc_slurm_assembly.config +++ b/bu_isciii/templates/assembly/DOC/hpc_slurm_assembly.config @@ -1,26 +1,231 @@ -conda { - enabled = true - autoMounts = true -} +/* + HPC XTUTATIS CONFIGURATION +*/ singularity { - enabled = true - autoMounts = true + enabled = true + autoMounts = true + singularity.cacheDir = '/data/bi/pipelines/singularity-images' } process { - executor = 'slurm' - queue = 'middle_idx' - conda = '/data/bi/pipelines/miniconda3/envs/assembly' - errorStrategy = { task.exitStatus in [140,143,137,138,104,134,139] ? 'retry' : 'finish'; task.exitStatus in [1,4,255] ? 'ignore' : 'finish' } - maxRetries = 1 + executor = 'slurm' + queue = 'middle_idx' + jobName = { "$task.name - $task.hash" } + conda = null + + errorStrategy = { task.exitStatus in [140,143,137,138,104,134,139] ? 'retry' : 'finish'; task.exitStatus in [1,4,255] ? 'ignore' : 'finish' } + maxRetries = 1 + maxErrors = '-1' + + withName:PROKKA { + container = 'https://zenodo.org/records/10496286/files/bioconda_prokka_v1.14.6_signalp_v4.1.sif?download=1' + errorStrategy = { task.exitStatus in [2] ? 'retry' : 'finish'} + maxRetries = 2 maxErrors = '-1' - withName:KMERFINDER { - container = '/scratch/bi/singularity-images/kmerfinder_v3.0.2.sif' - } + } } + params { max_memory = 376.GB max_cpus = 32 max_time = '48.h' } + +/* + CUSTOM OUTPUT FOLDER STRUCTURE -- modules.config +*/ +params { publish_dir_mode = 'copy' } +process { + withName: '.*:.*:FASTQ_TRIM_FASTP_FASTQC:FASTQC_RAW' { + publishDir = [ + [ + path: { "${params.outdir}/01-processing/fastqc/raw" }, + pattern: "*.{json,html}", + mode: params.publish_dir_mode + ], + [ + path: { "${params.outdir}/01-processing/fastqc/raw/zips" }, + pattern: "*.zip", + mode: params.publish_dir_mode + ] + ] + } + withName: '.*:.*:FASTQ_TRIM_FASTP_FASTQC:FASTP' { + publishDir = [ + [ + path: { "${params.outdir}/01-processing/fastp" }, + mode: params.publish_dir_mode, + enabled: params.save_trimmed + ], + [ + path: { "${params.outdir}/01-processing/fastp" }, + mode: params.publish_dir_mode, + pattern: "*.{json,html}" + ], + [ + path: { "${params.outdir}/01-processing/fastp/logs" }, + mode: params.publish_dir_mode, + pattern: "*.log" + ] + ] + } + withName: '.*:.*:FASTQ_TRIM_FASTP_FASTQC:FASTQC_TRIM' { + publishDir = [ + [ + path: { "${params.outdir}/01-processing/fastqc/trim" }, + pattern: "*.{json,html}", + mode: params.publish_dir_mode + ], + [ + path: { "${params.outdir}/01-processing/fastqc/trim/zips" }, + pattern: "*.zip", + mode: params.publish_dir_mode + ] + ] + } + withName: 'NANOPLOT' { + publishDir = [ + path: { "${params.outdir}/01-processing/nanoplot" }, + pattern: "*.txt", + mode: params.publish_dir_mode + ] + } + withName: 'PYCOQC' { + publishDir = [ + path: { "${params.outdir}/01-processing/pycoqc" }, + mode: params.publish_dir_mode + ] + } + withName: 'PORECHOP_PORECHOP' { + publishDir = [ + [ + path: { "${params.outdir}/01-processing/porechop" }, + pattern: "*.fastq.gz", + mode: params.publish_dir_mode, + enabled: params.save_trimmed + ], + [ + path: { "${params.outdir}/01-processing/porechop/logs" }, + pattern: "*.log", + mode: params.publish_dir_mode, + ] + ] + } + 
withName: '.*:.*:KMERFINDER_SUBWORKFLOW:FIND_DOWNLOAD_REFERENCE' { + publishDir = [ + path: { "${params.outdir}/../../REFERENCES" }, + pattern: "*.{fna,gff}.gz", + mode: params.publish_dir_mode, + saveAs: { filename -> + if (filename.equals('versions.yml')){ + null + } else { + "${refmeta.toString().replace(' ', '_')}/${filename}" + } + } + ] + } + withName: '.*:.*:KMERFINDER_SUBWORKFLOW:KMERFINDER' { + publishDir = [ + path: { "${params.outdir}/02-taxonomy_contamination/kmerfinder/${meta.id}" }, + mode: params.publish_dir_mode + ] + } + withName: '.*:.*:KMERFINDER_SUBWORKFLOW:KMERFINDER_SUMMARY' { + publishDir = [ + path: { "${params.outdir}/99-stats" }, + mode: params.publish_dir_mode + ] + } + withName: 'KRAKEN2|KRAKEN2_LONG' { + publishDir = [ + path: { "${params.outdir}/02-taxonomy_contamination/kraken2" }, + mode: params.publish_dir_mode + ] + } + withName: 'UNICYCLER|CANU|MINIASM|DRAGONFLYE' { + publishDir = [ + path: { "${params.outdir}/03-assembly/${params.assembler}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> + if (filename.endsWith('.scaffolds.fa.gz') || + filename.endsWith('.contigs.fasta.gz') || + filename.endsWith('.contigs.fa') || + filename.endsWith('.fasta.gz')) { + "${meta.id}.fasta.gz" + } else { + null + } + } + ] + } + withName: 'RACON|MEDAKA|NANOPOLISH' { + publishDir = [ + path: { "${params.outdir}/03-assembly/${params.assembler}/${params.polish_method}" }, + mode: params.publish_dir_mode + ] + } + withName: 'QUAST|QUAST_BYREFSEQID' { + publishDir = [ + path: { "${params.outdir}/03-assembly/quast" }, + mode: params.publish_dir_mode, + saveAs: { filename -> + if (filename.equals('versions.yml') || filename.endsWith('.tsv')){ + null + } else if (filename.startsWith('GCF')){ + "per_reference_reports/${filename}" + } + else if (!filename.startsWith('GCF')) { + "global_${filename}" + } + } + ] + } + withName: 'PROKKA' { + ext.args = { + [ + '--force', + params.prokka_args ? "${params.prokka_args}" : '' + ].join(' ').trim() + } + publishDir = [ + path: { "${params.outdir}/05-annotation/prokka" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + withName: 'BAKTA_BAKTA' { + ext.args = { + [ + '--force', + params.bakta_args ? "${params.bakta_args}" : '' + ].join(' ').trim() + } + publishDir = [ + path: { "${params.outdir}/05-annotation/bakta/${meta.id}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + withName: 'MULTIQC' { + publishDir = [ + [ + path: { "${params.outdir}/99-stats/multiqc" }, + mode: params.publish_dir_mode, + saveAs: { filename -> + if (filename.equals('versions.yml') || filename.endsWith('.csv')) { + null + } else { + filename + } + } + ], + [ + path: { "${params.outdir}/99-stats" }, + mode: params.publish_dir_mode, + pattern: "*.csv" + ] + ] + } +} diff --git a/bu_isciii/templates/assembly/RESULTS/lablog_assembly_results b/bu_isciii/templates/assembly/RESULTS/lablog_assembly_results index f112c0b5..508d1d55 100644 --- a/bu_isciii/templates/assembly/RESULTS/lablog_assembly_results +++ b/bu_isciii/templates/assembly/RESULTS/lablog_assembly_results @@ -1,12 +1,32 @@ DELIVERY_FOLDER="$(date '+%Y%m%d')_entrega" mkdir $DELIVERY_FOLDER -mkdir "${DELIVERY_FOLDER}/assembly" +mkdir $DELIVERY_FOLDER/assembly # Assembly service cd $DELIVERY_FOLDER/assembly -ln -s ../../../ANALYSIS/*ASSEMBLY/99-stats/MultiQC/multiqc_report.html . -ln -s ../../../ANALYSIS/*ASSEMBLY/99-stats/kmerfinder.csv . 
-ln -s ../../../ANALYSIS/*ASSEMBLY/03-assembly/unicycler assemblies -ln -s ../../../ANALYSIS/*ASSEMBLY/03-assembly/quast_results/latest*/report.html quast_report.html + +# Links to reports +ln -s ../../../ANALYSIS/*ASSEMBLY/99-stats/multiqc/multiqc_report.html . +ln -s ../../../ANALYSIS/*ASSEMBLY/99-stats/summary_assembly_metrics_mqc.csv . +ln -s ../../../ANALYSIS/*ASSEMBLY/99-stats/kmerfinder_summary.csv . +ln -s ../../../ANALYSIS/*ASSEMBLY/03-assembly/quast/global_report/report.html quast_global_report.html + +# Links to per reference reports +for dir in ../../../ANALYSIS/*ASSEMBLY/03-assembly/quast/per_reference_reports/*; do + base=$(basename "$dir") + if compgen -G "$dir" > /dev/null; then + ln -s "$dir/report.html" "quast_${base}_report.html" + fi +done + +# Links to assemblies +assembly_dirs=(unicycler dragonflye canu miniasm) +for tool in "${assembly_dirs[@]}"; do + path="../../../ANALYSIS/*ASSEMBLY/03-assembly/${tool}" + if compgen -G "$path" > /dev/null; then + find $path -type d -exec ln -nsf {} assemblies \; + break + fi +done cd - diff --git a/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/exomiser_configfile.yml b/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/exomiser_configfile.yml index b3b9d027..84bd59f2 100644 --- a/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/exomiser_configfile.yml +++ b/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/exomiser_configfile.yml @@ -24,7 +24,7 @@ analysis: MITOCHONDRIAL: 100.0 } #FULL or PASS_ONLY - analysisMode: PASS_ONLY + #analysisMode: PASS_ONLY #Possible frequencySources: #Thousand Genomes project http://www.1000genomes.org/ # THOUSAND_GENOMES, diff --git a/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/lablog b/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/lablog index dcb39acd..2dd494ea 100644 --- a/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/lablog +++ b/bu_isciii/templates/exometrio/ANALYSIS/ANALYSIS01_EXOME/03-annotation/lablog @@ -76,8 +76,6 @@ echo "srun --partition short_idx --mem 100G --time 2:00:00 --chdir /data/bi/pipe #--------------------------------------------------------------------------------------------------------- -echo "srun --chdir /tmp --partition short_idx --nodelist ${EXOMISER_NODE} rm spring.log &" > _05_filter_heritance.sh - ## Lablog to modify the output reported by exomiser and create a final file with a personalized format. 
# Grep variant id for each inheritance model cat inheritance_types.txt | xargs -I % echo "grep 'PASS' ./exomiser/exomiser_%.variants.tsv | awk '{print \$1\"_\"\$2\"_\"\$3\"_\"\$4}' > ./id_%.txt " >> _05_filter_heritance.sh @@ -91,4 +89,4 @@ echo "rm id_*" >> _05_filter_heritance.sh cat inheritance_types.txt | xargs -I % echo "rm ./vep_annot_%.txt" >> _05_filter_heritance.sh # annot_all table is huge, lets shrink it a little bit -echo "srun --partition short_idx --chdir ${scratch_dir} --output logs/COMPRESS_ALL.log --job-name COMPRESS_ANNOT_ALL gzip variants_annot_all.tab &" >> _05_filter_heritance.sh \ No newline at end of file +echo "srun --partition short_idx --chdir ${scratch_dir} --output logs/COMPRESS_ALL.log --job-name COMPRESS_ANNOT_ALL gzip variants_annot_all.tab &" >> _05_filter_heritance.sh diff --git a/bu_isciii/templates/services.json b/bu_isciii/templates/services.json index f1e1758d..59a451a7 100755 --- a/bu_isciii/templates/services.json +++ b/bu_isciii/templates/services.json @@ -6,7 +6,7 @@ "order": 1, "begin": "", "end": "", - "description": "Nextflow assembly pipeline to assemble bacterial genomes", + "description": "nf-core/bacass: Simple bacterial assembly and annotation pipeline", "clean": { "folders":["01-preprocessing/trimmed_sequences"], "files":[] diff --git a/bu_isciii/templates/viralrecon/ANALYSIS/create_summary_report.sh b/bu_isciii/templates/viralrecon/ANALYSIS/create_summary_report.sh index 4ed9b192..f5734984 100644 --- a/bu_isciii/templates/viralrecon/ANALYSIS/create_summary_report.sh +++ b/bu_isciii/templates/viralrecon/ANALYSIS/create_summary_report.sh @@ -22,18 +22,18 @@ do reads_hostR1=$(cat ${arr[1]}*/kraken2/${arr[0]}.kraken2.report.txt | grep -v 'unclassified' | cut -f3 | awk '{s+=$1}END{print s}') reads_host_x2=$(echo $((reads_hostR1 * 2)) ) - perc_mapped=$(echo $(awk -v v1=$total_reads -v v2=$reads_host_x2 'BEGIN {print (v2*100)/v1}') ) + perc_host=$(echo $(awk -v v1=$total_reads -v v2=$reads_host_x2 'BEGIN {print (v2*100)/v1}') ) reads_virus=$(cat ${arr[1]}*/variants/bowtie2/samtools_stats/${arr[0]}.sorted.bam.flagstat | grep '+ 0 mapped' | cut -d ' ' -f1) unmapped_reads=$(echo $((total_reads - (reads_host_x2+reads_virus))) ) perc_unmapped=$(echo $(awk -v v1=$total_reads -v v2=$unmapped_reads 'BEGIN {print (v2/v1)*100}') ) - n_count=$(cat %Ns.tab | grep -w ${arr[0]} | grep ${arr[1]} | cut -f2) + ns_10x_perc=$(cat %Ns.tab | grep -w ${arr[0]} | grep ${arr[1]} | cut -f2) missense=$(LC_ALL=C awk -F, '{if($10 >= 0.75)print $0}' ${arr[1]}*/variants/ivar/variants_long_table.csv | grep ^${arr[0]}, | grep 'missense' | wc -l) - Ns_10x_perc=$(zcat ${arr[1]}*/variants/ivar/consensus/bcftools/${arr[0]}.filtered.vcf.gz | grep -v '^#' | wc -l) + vars_in_cons10x=$(zcat ${arr[1]}*/variants/ivar/consensus/bcftools/${arr[0]}.filtered.vcf.gz | grep -v '^#' | wc -l) lineage=$(cat ${arr[1]}*/variants/ivar/consensus/bcftools/pangolin/${arr[0]}.pangolin.csv | tail -n1 | cut -d ',' -f2) @@ -47,5 +47,5 @@ do analysis_date=$(date '+%Y%m%d') # Introduce data row into output file - echo -e "${RUN}\t${USER}\t${HOST}\t${arr[1]}\t${arr[0]}\t$total_reads\t$reads_hostR1\t$reads_host_x2\t$perc_mapped\t$reads_virus\t$reads_virus_perc\t$unmapped_reads\t$perc_unmapped\t$medianDPcov\t$cov10x\t$Ns_10x_perc\t$missense\t$n_count\t$lineage\t$read_length\t$analysis_date" >> mapping_illumina_$(date '+%Y%m%d').tab + echo -e 
"${RUN}\t${USER}\t${HOST}\t${arr[1]}\t${arr[0]}\t$total_reads\t$reads_hostR1\t$reads_host_x2\t$perc_host\t$reads_virus\t$reads_virus_perc\t$unmapped_reads\t$perc_unmapped\t$medianDPcov\t$cov10x\t$vars_in_cons10x\t$missense\t$ns_10x_perc\t$lineage\t$read_length\t$analysis_date" >> mapping_illumina_$(date '+%Y%m%d').tab done diff --git a/bu_isciii/templates/viralrecon/ANALYSIS/lablog_viralrecon b/bu_isciii/templates/viralrecon/ANALYSIS/lablog_viralrecon index bdd1e200..f533dd36 100644 --- a/bu_isciii/templates/viralrecon/ANALYSIS/lablog_viralrecon +++ b/bu_isciii/templates/viralrecon/ANALYSIS/lablog_viralrecon @@ -63,4 +63,5 @@ rm create_summary_report.sh rm deduplicate_long_table.sh rm percentajeNs.py rm _02_create_run_percentage_Ns.sh -cd 00-reads; cat ../samples_id.txt | xargs -I % echo "ln -s ../../RAW/%_*R1*.fastq.gz %_R1.fastq.gz" | bash; cat ../samples_id.txt | xargs -I % echo "ln -s ../../RAW/%_*R2*.fastq.gz %_R2.fastq.gz" | bash; cd - \ No newline at end of file +mv DATE_ANALYSIS0X_MAG $(date '+%Y%m%d')_ANALYSIS0X_MAG +cd 00-reads; cat ../samples_id.txt | xargs -I % echo "ln -s ../../RAW/%_*R1*.fastq.gz %_R1.fastq.gz" | bash; cat ../samples_id.txt | xargs -I % echo "ln -s ../../RAW/%_*R2*.fastq.gz %_R2.fastq.gz" | bash; cd - diff --git a/bu_isciii/templates/viralrecon/RESULTS/excel_generator.py b/bu_isciii/templates/viralrecon/RESULTS/excel_generator.py index 55655950..b33c761c 100755 --- a/bu_isciii/templates/viralrecon/RESULTS/excel_generator.py +++ b/bu_isciii/templates/viralrecon/RESULTS/excel_generator.py @@ -4,29 +4,25 @@ from typing import List, Dict # conda activate viralrecon_report -"""Usage: python excel_generator.py ./reference.tmp""" +"""Standard usage: python excel_generator.py -r ./reference.tmp""" +"""Single csv to excel usage: python excel_generator.py -s csv_file.csv""" parser = argparse.ArgumentParser( description="Generate excel files from viralrecon results" ) parser.add_argument( - "reference_file", + "-r", + "--reference_file", type=str, help="File containing the references used in the analysis", ) - -args = parser.parse_args() - -print( - "Extracting references used for analysis and the samples associated with each reference\n" +parser.add_argument( + "-s", + "--single_csv", + type=str, + default="", + help="Transform a single csv file to excel format. 
Omit rest of processes", ) -with open(args.reference_file, "r") as file: - references = [line.rstrip() for line in file] - print(f"\nFound {len(references)} references: {str(references).strip('[]')}") - -reference_folders = {ref: str("excel_files_" + ref) for ref in references} -samples_ref_files = { - ref: str("ref_samples/samples_" + ref + ".tmp") for ref in references -} +args = parser.parse_args() def concat_tables_and_write(csvs_in_folder: List[str], merged_csv_name: str): @@ -91,39 +87,88 @@ def excel_generator(csv_files: List[str]): print(f"File {file} does not exist, omitting...") continue print(f"Generating excel file for {file}") - output_name = str(file.split(".csv")[0] + ".xlsx") + output_name = os.path.splitext(os.path.basename(file))[0] + ".xlsx" # workbook = openpyxl.Workbook(output_name) if "nextclade" in str(file): - pd.read_csv(file, sep=";", header=0).to_excel(output_name, index=False) + table = pd.read_csv(file, sep=";", header=0) elif "illumina" in str(file): table = pd.read_csv(file, sep="\t", header=0) table["analysis_date"] = pd.to_datetime( table["analysis_date"].astype(str), format="%Y%m%d" ) - table.to_excel(output_name, index=False) - elif "assembly" in str(file): - pd.read_csv(file, sep="\t", header=0).to_excel(output_name, index=False) + elif "assembly" in str(file) or ".tsv" in str(file) or ".tab" in str(file): + table = pd.read_csv(file, sep="\t", header=0) else: - pd.read_csv(file).to_excel(output_name, index=False) - return file - - -# Merge pangolin and nextclade csv files separatedly and create excel files for them -merge_lineage_tables(reference_folders, samples_ref_files) -for reference, folder in reference_folders.items(): - print(f"Creating excel files for reference {reference}") - csv_files = [file.path for file in os.scandir(folder) if file.path.endswith(".csv")] - excel_generator(csv_files) - -# Merge all the variant long tables into one and convert to excel format -variants_tables = [ - table.path for table in os.scandir(".") if "variants_long_table" in table.path -] -concat_tables_and_write( - csvs_in_folder=variants_tables, merged_csv_name="variants_long_table.csv" -) -pd.read_csv("variants_long_table.csv").to_excel("variants_long_table.xlsx", index=False) + try: + table = pd.read_csv(file) + except pd.errors.EmptyDataError: + print("Could not parse table from ", str(file)) + continue + table = table.drop(["index"], axis=1, errors="ignore") + table.to_excel(output_name, index=False) + return + -# Create excel files for individual tables -result_tables = ["mapping_illumina.csv", "assembly_stats.csv", "pikavirus_table.csv"] -excel_generator(result_tables) +def single_csv_to_excel(csv_file: str): + try: + excel_generator([csv_file]) + except FileNotFoundError as e: + print(f"Could not find file {e}") + + +def main(args): + if args.single_csv: + # If single_csv is called, just convert target csv to excel and skip the rest + print("Single file convertion selected. 
Skipping main process...") + single_csv_to_excel(args.single_csv) + exit(0) + + print( + "Extracting references used for analysis and the samples associated with each reference\n" + ) + with open(args.reference_file, "r") as file: + references = [line.rstrip() for line in file] + print(f"\nFound {len(references)} references: {str(references).strip('[]')}") + + reference_folders = {ref: str("excel_files_" + ref) for ref in references} + samples_ref_files = { + ref: str("ref_samples/samples_" + ref + ".tmp") for ref in references + } + + if len(references) > 1: + # Merge pangolin and nextclade csv files separatedly and create excel files for them + merge_lineage_tables(reference_folders, samples_ref_files) + for reference, folder in reference_folders.items(): + print(f"Creating excel files for reference {reference}") + csv_files = [ + file.path for file in os.scandir(folder) if file.path.endswith(".csv") + ] + excel_generator(csv_files) + + # Merge all the variant long tables into one and convert to excel format + variants_tables = [ + table.path for table in os.scandir(".") if "variants_long_table" in table.path + ] + try: + concat_tables_and_write( + csvs_in_folder=variants_tables, merged_csv_name="variants_long_table.csv" + ) + except FileNotFoundError as e: + print(str(e)) + print("Merged variants_long_table.csv might be empty") + + # Create excel files for individual tables + valid_extensions = [".csv", ".tsv", ".tab"] + rest_of_csvs = [ + file.path + for file in os.scandir(".") + if any(file.path.endswith(ext) for ext in valid_extensions) + ] + link_csvs = [file for file in rest_of_csvs if os.path.islink(file)] + broken_links = [file for file in link_csvs if not os.path.exists(os.readlink(file))] + valid_csvs = [file for file in rest_of_csvs if file not in broken_links] + excel_generator(valid_csvs) + + +if __name__ == "__main__": + main(args) diff --git a/bu_isciii/templates/viralrecon/RESULTS/viralrecon_results b/bu_isciii/templates/viralrecon/RESULTS/viralrecon_results index 24125304..1b9f2275 100644 --- a/bu_isciii/templates/viralrecon/RESULTS/viralrecon_results +++ b/bu_isciii/templates/viralrecon/RESULTS/viralrecon_results @@ -8,7 +8,6 @@ mkdir mapping_consensus mkdir variants_annot mkdir assembly_spades mkdir abacas_assembly -mkdir blast mkdir ref_samples #Setting up folder and files required for excel_generator.py @@ -27,13 +26,14 @@ ln -s ../../ANALYSIS/*/assembly_stats.csv ./assembly_stats.csv ln -s ../../ANALYSIS/*/01-PikaVirus-results/all_samples_virus_table_filtered.csv ./pikavirus_table.csv #conda activate viralrecon_report -echo "python ./excel_generator.py ./references.tmp" > _01_generate_excel_files.sh +echo "python ./excel_generator.py -r ./references.tmp" > _01_generate_excel_files.sh #Cleaning temp files and broken symbolic links echo "find . -xtype l -delete" > _02_clean_folders.sh echo 'for dir in */; do find ${dir} -xtype l -delete; done' >> _02_clean_folders.sh echo "find . 
-type d -empty -delete" >> _02_clean_folders.sh echo 'cat references.tmp | while read in; do cp excel_files_${in}/*.xlsx ./ ;done' >> _02_clean_folders.sh echo 'cat references.tmp | while read in; do rm -rf excel_files_${in}; done' >> _02_clean_folders.sh +echo 'cat references.tmp | while read in; do rm ${in}_variants_long_table.xlsx; done' >> _02_clean_folders.sh echo "rm references.tmp" >> _02_clean_folders.sh echo "rm -rf ref_samples/" >> _02_clean_folders.sh echo "rm ./*.csv" >> _02_clean_folders.sh @@ -45,4 +45,4 @@ cd mapping_consensus; cat ../../../ANALYSIS/*/samples_ref.txt | while read in; d cd variants_annot; cat ../../../ANALYSIS/*/samples_ref.txt | while read in; do arr=($in); ln -s ../../../ANALYSIS/*/*${arr[1]}*/variants/ivar/snpeff/${arr[0]}.snpsift.txt ./${arr[0]}_${arr[1]}.snpsift.txt; done; cd - cd assembly_spades; rsync -rlv ../../../ANALYSIS/*/*/assembly/spades/rnaviral/*.scaffolds.fa.gz .; gunzip *.scaffolds.fa.gz; cd - cd abacas_assembly; cat ../../../ANALYSIS/*/samples_ref.txt | while read in; do arr=($in); ln -s ../../../ANALYSIS/*/*${arr[1]}*/assembly/spades/rnaviral/abacas/${arr[0]}.abacas.fasta ./${arr[0]}_${arr[1]}.abacas.fasta; done; cd - -cd blast; ln -s ../../../ANALYSIS/*/*.blast.filt.header.xlsx .; cd - \ No newline at end of file +ln -s ../../ANALYSIS/*/all_samples_filtered_BLAST_results.xlsx . diff --git a/setup.py b/setup.py index 22c87dbd..901abc6b 100644 --- a/setup.py +++ b/setup.py @@ -2,7 +2,7 @@ from setuptools import setup, find_packages -version = "0.0.1" +version = "1.0.1" with open("README.md") as f: readme = f.read()