Error in bin/explode #22
Dear Pierre,
Thank you very much for your reply. I think I managed to install it properly. I still got the 2nd error at first, but with the corrected command, I ran the example test with no problems.
I wanted to ask you a question. I want to try with my dataset seriously, but I wanted to consult you first.
I am concerned that my N50 is pretty low (~1.5-2 kb), so I feel like I need to tune some parameters to deal with this in particular. Since the tool loads the reads in batches, I read through the defaults, and processing 500 at a time sounds like a small number. My dataset is ~20 GB and I have access to a cluster. Would you recommend any parameter configuration to optimize correction efficiency, and any particular node/memory configuration?
Thank you very much;
Juan Pablo Aguilar Cabezas
Ecology and Evolutionary Biology Ph.D. Student
Department of Biological Sciences
Ohio University, Athens OH
…________________________________
From: Pierre Morisse <[email protected]>
Sent: Wednesday, December 9, 2020 5:05 AM
To: morispi/CONSENT <[email protected]>
Cc: Aguilar Cabezas, Juan Pablo <[email protected]>; Author <[email protected]>
Subject: Re: [morispi/CONSENT] Error in bin/explode (#22)
Hello,
Multiple errors here.
1. ./CONSENT-correct: line 189: /fs/scratch/PHS0338/appz/CONSENT/bin/explode: No such file or directory
This seems to indicate CONSENT was improperly built. Did you run the `install.sh` script after cloning? If so, could you try to reclone or pull the latest version and try again? The sources for the explode subprograms are in the src directory and should compile with no issue.
2. ./CONSENT-correct: line 39: nproc: command not found
This comes from the way you modified your PATH environment variable. Indeed, you set `export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2`, which erased everything else you had in your PATH. To correct this, you should replace this line with `export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2:$PATH`; the `:$PATH` at the end indicates that you wish to also keep what was initially in your PATH.
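(As a quick sketch of the fix above, with a made-up /opt/minimap2 path: prepending keeps everything that was already in PATH, so tools like nproc stay reachable.)

```shell
# Hypothetical path, for illustration only: prepend minimap2 to PATH
# instead of replacing PATH entirely.
export PATH="/opt/minimap2:$PATH"

# The original entries are still there after the new one:
case "$PATH" in
  /opt/minimap2:*) echo "PATH prepended, existing entries kept" ;;
  *)               echo "PATH was overwritten" ;;
esac
```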
Keep me updated if this solves your issues. :)
Best,
Pierre
Hi Pierre,
I think that I did something wrong. I did as you said and used the command:
export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2:$PATH
and then ./install.sh.
I am using --minimapIndex 5G and I got an error in minimap2.
File: minimap_2010
Message:
./CONSENT-correct: line 187: minimap2: command not found
I am freaking excited because I ran a quick assembly using my uncorrected reads and got a dramatic improvement in contiguity: from 300K fragments to 1.3K! Thus, I believe that correcting the reads will dramatically increase the contiguity further, and perhaps I will end up with a close-to-chromosome-scale genome assembly!
Thank you very much!
…________________________________
From: Pierre Morisse <[email protected]>
Sent: Thursday, December 10, 2020 5:39 AM
To: morispi/CONSENT <[email protected]>
Cc: Aguilar Cabezas, Juan Pablo <[email protected]>; Author <[email protected]>
Subject: Re: [morispi/CONSENT] Error in bin/explode (#22)
Hi,
Great to hear the example tests ran with no problem.
I already ran experiments on low-N50 reads (around 1-2 kb too), and CONSENT behaved quite well. The overlapping step is actually crucial for CONSENT, and since Minimap2 behaves well with low-N50 reads, there should be no issue.
For the second part of your question, I think you are talking about the minimapIndex parameter? If so, I believe the reads are actually not loaded in batches in the way you think. It's actually 500M bases that are loaded at a time, and not only 500 reads. Moreover, this is actually just a parameter which is passed to Minimap2, and modifying it does not impact the results. The only thing it changes is the number of reads that are loaded into memory at once when performing mapping. This means, for instance, when setting it to 500M, the first 500M bases of your reads are loaded into memory, and all other reads are mapped to these indexed reads. The next batch of 500M bases is then loaded, and so forth. The only impact this parameter has is actually on runtime and memory consumption. The higher you set it, the quicker mapping will be performed, but the more RAM you will use. Conversely, the lower you set it, the less RAM you will use, but the slower mapping will be.
For information, I ran a test on a full human genome (about 131 GB of reads), and set minimapIndex to 5G. Minimap2 displayed a peak of 70 GB of RAM, and the actual CONSENT correction step displayed a peak of 45 GB.
So, for your dataset, I believe you can leave the parameters to their default value, except for minimapIndex which you could increase a bit. I believe that, if you can, asking for around 100GB and as many threads as you want (CONSENT is highly parallelized and processes one read per thread, so the more threads you use, the quicker it will be) should work well with minimapIndex set to 5G.
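(As a rough back-of-the-envelope sketch of the batching behaviour described above, using shell arithmetic and loosely treating the ~131 GB file size of the human-genome test as a proxy for base count: the minimapIndex value fixes how many mapping passes are made over the reads.)

```shell
# Rough sketch: number of index batches = ceil(total_bases / index_size).
# Each batch is loaded into RAM and all reads are mapped against it, so a
# larger index means fewer passes (faster) but more memory.
total_bases=131000000000   # ~131 GB of reads, as in the human-genome test
index_size=5000000000      # --minimapIndex 5G
batches=$(( (total_bases + index_size - 1) / index_size ))
echo "$batches batches"    # 27 passes over the read set
```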
Don't hesitate to update me if I got your question wrong.
Best,
Pierre
I just checked and it does contain the minimap2 executable:
[jpac1984@owens-login03 minimap2]$ pwd
/fs/scratch/PHS0338/appz/minimap2
[jpac1984@owens-login03 minimap2]$ ./minimap2 -h
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
  Indexing:
    -H           use homopolymer-compressed k-mer (preferrable for PacBio)
    -k INT       k-mer size (no larger than 28) [15]
    -w INT       minimizer window size [10]
    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE []
  Mapping:
...
Will try the other options to see if I can get it to work.
…________________________________
From: Pierre Morisse <[email protected]>
Sent: Thursday, December 10, 2020 1:17 PM
To: morispi/CONSENT <[email protected]>
Cc: Aguilar Cabezas, Juan Pablo <[email protected]>; Author <[email protected]>
Subject: Re: [morispi/CONSENT] Error in bin/explode (#22)
minimap2: command not found means that the minimap2 executable could not be found in your PATH.
Does this directory /fs/scratch/PHS0338/appz/minimap2 actually contain the minimap2 executable?
If not, I would advise to:
1. Make sure the install.sh script from CONSENT runs until the end with no error
2. Check the minimap2 subfolder in CONSENT, and verify it is not empty and does contain minimap2 executable
3. Try to run minimap2 from here, just launching ./minimap2 to see if it works
4. Change the export to export PATH=/absolute/path/to/CONSENT/minimap2/:$PATH, where /absolute/path/to/CONSENT/ should, of course, be replaced by the actual location of your CONSENT directory.
Moreover, maybe I wasn't clear, but the export part should be added to your launch scripts, and not just ran before running the install script. As you mentioned in your first message, you should thus have something like:
#!/bin/bash
#SBATCH --mem=120gb
#SBATCH --ntasks=28
#SBATCH --job-name=CONSENT-ONT-VALID
#SBATCH --time=12:00:00
#SBATCH --account=PHS0338
export PATH=/absolute/path/to/CONSENT/minimap2/:$PATH
./CONSENT-correct -j 28 --in /fs/scratch/PHS0338/data/ONTq_combine.fasta --out ONT-V-corr.fasta --type ONT
in every one of your launch scripts. Adding the export line is important so that the minimap2 executable can be properly found.
That's great to hear my tool could be used for such a purpose, and give satisfying results! :)
Keep me updated if you still struggle to run.
Best,
Pierre
Hi Pierre,
I tried both ways: the minimap2 executable is present both in its install directory and in the CONSENT directory, and both run. I put that command in the script as you mentioned and got the same error.
I didn't respond because I was dealing with health issues last week. Here is the update.
I tried running CONSENT using the following code:
export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2:$PATH
./CONSENT-correct -j 28 --in /fs/scratch/PHS0338/data/ONTq_combine.rename.fasta --out ONTq_CONSENT.fasta --type ONT --minimapIndex 5G
with 28 cores and 100G, and I got the following error:
./CONSENT-correct: line 202: 25058 Illegal instruction (core dumped) $LRSCf/bin/CONSENT-correction -a $tmpdir/"$alignments" -s "$minSupport" -S "$maxSupport" -l "$windowSize" -k "$merSize" -c "$commonKMers" -A "$minAnchors" -f "$solid" -m "$windowOverlap" -j "$nproc" -r "$reads" -M "$maxMSA" -p "$LRSCf" >> "$out"
It got up to the core.25058 file.
The last time, with
export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2:$PATH
./CONSENT-correct -j 28 --in /fs/scratch/PHS0338/data/ONTq_combine.rename.fasta --out ONTq_CONSENT.fasta --type ONT --minimapIndex 5G
it reached up to Alignments_28512.paf.
Hopefully, we can get to the point that I can finally use CONSENT!
Thank you very much;
Hello Pierre,
Hope you are doing well. I wanted to let you know that I have an idea of what was going on before. I believe that somehow the program was outputting files with the same name for a given dataset, and therefore the previous run was interfering with the next run; that was basically the issue.
I wanted to let you know that, God knows why, I have successfully run a CONSENT job! I feel pretty amazing. I assigned it 24 hours; it ran for ~18 hrs and I was scared it would run out of time and all that time would be wasted. The cluster customer-service staff were off and could not extend the runtime, but nonetheless, IT FINISHED! I just CAN'T BELIEVE IT!
I wanted to say that I ran a previous test with a small dataset and incredibly, contrary to the expectation of shortening the reads, it actually extended the max read length by about 4 kbp! I would like to know why that would be the case.
Look at the corrected dataset:
min_len avg_len max_len
11 2,173.7 87,691
Uncorrected:
min_len avg_len max_len
100 1,372.7 84,432
For the final, full dataset:
min_len avg_len max_len
10 1,713.8 260,673
and the uncorrected:
min_len avg_len max_len
100 1,173.5 266,311
Maybe you can explain this result a little better. The size of the corrected dataset is more than twice the size of the uncorrected dataset! For me this is awesome, because in some weird dimension, for no reason, I have more reads and coverage and CORRECTED READS! But... I would not expect this under any circumstance...
The size of the corrected is 79,618,172,283
while that of the uncorrected is 34,524,922,373
One point: I wanted to include some quality information, but the uncorrected dataset is a FASTQ file, while the corrected one is a FASTA file.
Finally, I found a paper where you reviewed error-correction methods, and I wanted to consult you. I was thinking of perhaps trying one or two other error-correction algorithms and merging those datasets for the final genome assembly, like a data-augmentation strategy. Since not all errors would be corrected by every single method, merging the different results would incorporate information from multiple strategies and increase the number of errors corrected.
I have a dataset of about ~20X of Nanopore reads, with read lengths up to 400 kbp (I don't know why those don't show up in my uncorrected dataset), and I have about 65X of 100 bp short reads.
I would appreciate it if you could suggest one or two methods that would be the best to try given the characteristics of my dataset. I know that some methods fail with long reads, but at the same time I do not want to shorten my long reads. I was pretty shocked that I have had reads of 1 Mbp+, but Guppy failed to base-call them and they did not show up as pass calls...
Looking forward to hearing from you soon!
Sincerely;
…________________________________
From: Pierre Morisse <[email protected]>
Sent: Wednesday, December 16, 2020 4:23 AM
To: morispi/CONSENT <[email protected]>
Cc: Aguilar Cabezas, Juan Pablo <[email protected]>; Author <[email protected]>
Subject: Re: [morispi/CONSENT] Error in bin/explode (#22)
Hi,
I hope you're doing better! Health should always be more important than work.
Well, it's weird that you still get the error when adding the command to the script. I've been running CONSENT myself on different clusters and never encountered the problem. But great to see you managed to find a workaround. Another more robust solution could be to try and modify the minimap2 command on lines 185 and 187 of CONSENT-correct, and replace them with the complete path to your minimap2 executable. I believe this should also fix the issue, and could avoid the need to specify the path to minimap2 each and every time you run CONSENT.
For the illegal instruction / core dumped, I'm afraid this error is pretty data dependent. Maybe it could have something to do with the way your data is formatted. Maybe your reads file contains empty reads, or something like that. Could you paste me what head /fs/scratch/PHS0338/data/ONTq_combine.rename.fasta reports?
Otherwise, if your data is public, could you send me a link to download it? This would greatly help me locate the error more precisely and quickly push an update to deal with it. Maybe a few bugs are still present in CONSENT, but it's hard to tell with no access to the data causing the issue, unfortunately.
Best,
Pierre
Hello Pierre,
I just wanted to follow up. I found that the last run didn't finish, so I reran it; it took ~6 days to correct a 59 GB dataset.
There are two weird points that I wanted to mention.
First, as in the failed run, the final corrected dataset turned into a 100G dataset!
Second, the uncorrected dataset was limited to fragments at least 500 bp long:
min_len avg_len max_len
500 1,696.4 266,311
whereas the corrected dataset contains very small fragments:
min_len avg_len max_len
10 1,743.6 260,137
Lastly, sadly, the largest fragment of the uncorrected dataset was shortened.
Best regards, and Merry Christmas;
Hello,
Best wishes for this new year! Hope it brings you better things than 2020. :)
Great to hear you managed to run CONSENT properly.
This can happen for two reasons. 1) The original uncorrected read contained lots of deletions, and correction added missing bases and extended the reads. 2) The correction process, whether with MSA or with DBG traversal, can slightly extend the corrected read further than its original extremities. Moreover, correction tends not to output very short reads (since they cannot be corrected), and thus impacts the average length of the corrected reads: the fewer short reads you have, the longer your average length will be.
I'm afraid this actually is abnormal behaviour. Judging by the figures you provided, it looks like some reads are corrected and output multiple times, resulting in a much larger corrected reads file containing multiple occurrences of the same reads. You could check that by running `grep ">" correctedReads.fasta | sort | uniq -c | sort -n` and checking whether anything other than 1s appears.
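(To illustrate how that grep/sort/uniq duplicate-header check behaves, here is a toy run on a hypothetical file where one read header appears twice.)

```shell
# Build a tiny FASTA where read1 appears twice, then count header occurrences.
printf '>read1\nACGT\n>read2\nACGT\n>read1\nTTTT\n' > toy.fasta
grep ">" toy.fasta | sort | uniq -c | sort -n
# Any count above 1 (here, 2 for >read1) means a read was output more than once.
rm toy.fasta
```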
Indeed, like the vast majority (if not all) self-correctors, CONSENT does not make use of the quality information in the FASTQ files, and thus does not output it.
This is a tough question given the large number of error-correction tools that exist. A 20x coverage of ONT reads is unfortunately pretty low, and can impact how well self-correction methods perform. I would advise using either Canu or FLAS if you want to try another self-correction method. Canu usually deals better with high error rates, but is usually much slower than FLAS. If I had to pick a tool myself, I think I'd go for FLAS first, see how the results look, and try out Canu if the quality is not high enough. Since you also have access to short reads, I'd also advise you to try out a hybrid correction tool in addition. HG-CoLoR shows very good performance in terms of reduction of the error rate, and thus leads to very high-quality corrected reads, but can be pretty slow to run. FMLRC and especially LoRDEC are usually much faster, but lead to slightly lower-quality corrected reads than HG-CoLoR. If you have time on your hands, I'd pick HG-CoLoR; otherwise, I'd probably go for LoRDEC if I wanted to perform quick correction.
Lastly, sadly, the largest fragment size of the uncorrected dataset was shortened.
Conversely to what I explained before, correction can also sometimes shorten the reads. This is especially the case when a given read contains a lot of insertion errors. Correction will remove these erroneous bases, and will thus shorten the read. This is normal behaviour. Given the figures, your longest read lost 2.25% of its length, which does not seem like a lot to me, especially if it contained a lot of insertions.
Hope this answers most of your questions!
Best,
Pierre
Hi Pierre
It is very healthy to put work absolutely aside during holidays and truly relax. That you are capable of doing that is truly remarkable.
I wanted to ask you a question. Hopefully you can help me.
I told you that I have ~20x of ONT reads, but I believe that following the standard preprocessing is somehow hurting me, and it is not coherent given that I will do error correction afterwards.
My preprocessing is the following: after sequencing, I do basecalling with Guppy, and I followed the recommendation from the ONT people to do quality filtering, only keeping reads with Qscore > 7. For example, I was super excited when, during sequencing, I got reads > 1 Mbp!
Also, I do barcode and adapter removal with Porechop, which also checks for internal adapters and does trimming/splitting.
I believe that you have more experience than me, is this workflow correct? If not, what do you recommend?
Talking with the ONT people, they have mentioned that it is possible for one read to get sequenced immediately after another, so internal adapter checking is OK.
Looking forward to hearing from you soon!
Thank you very much!
…________________________________
From: Pierre Morisse <[email protected]>
Sent: Wednesday, January 6, 2021 4:31 AM
To: morispi/CONSENT <[email protected]>
Cc: Aguilar Cabezas, Juan Pablo <[email protected]>; Author <[email protected]>
Subject: Re: [morispi/CONSENT] Error in bin/explode (#22)
Hello,
Best wishes for this new year! Hope it brings you better things than 2020. :)
Sorry for not answering before, but I try to totally cut out work during my holidays.
Great to hear you managed to run CONSENT properly.
I wanted to say that I ran a previous test with a small dataset and incredibly, contrary to the expectation of shortening the reads, it actually did extend the max read length for about 4k bps! I would like to know why that would have been the case.
This can happen for two reasons. 1) The original uncorrected read contained lots of deletions, and correction added missing bases and extended the reads. 2) The correction process, whether with MSA or with DBG traversal, can slightly extend the corrected read further than its original extremities. Moreover, correction tends not to output very short reads (since they cannot be corrected), and thus impacts the average length of the corrected reads: the fewer short reads you have, the longer your average length will be.
Maybe you can explain this result a little better. The size of the corrected dataset is more than twice the size of the uncorrected dataset! For me this is awesome because in some weird dimension for no reason I have more reads and coverage and CORRECTED READS! But... I would not even expect this- just under any circumstance...
I'm afraid this actually is abnormal behaviour. Judging by the figures you provided, it looks like some reads are corrected and output multiple times, resulting in a much larger corrected reads file containing multiple occurrences of the same reads. You could check that by running `grep ">" correctedReads.fasta | sort | uniq -c | sort -n` and checking whether any count other than 1 appears. This was an issue in a previous version of CONSENT that has since been fixed, so this is pretty strange. Are you sure you are using the latest version?
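The duplicate-header check suggested above can be sketched on a toy FASTA (the file name and contents here are made up purely for illustration):

```shell
# Toy FASTA with a deliberately duplicated header, to illustrate the check.
cat > corrected_demo.fasta <<'EOF'
>read1
ACGT
>read2
GGCC
>read1
ACGT
EOF

# Any count greater than 1 indicates a read that was output more than once.
grep ">" corrected_demo.fasta | sort | uniq -c | sort -n
```

On a healthy corrected-reads file, every header should appear with a count of exactly 1.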
One point is that I wanted to include some quality information: the uncorrected dataset is a FASTQ file, while the corrected one is a FASTA file.
Indeed, like the vast majority (if not all) self-correctors, CONSENT does not make use of the quality information in the FASTQ files, and thus does not output it.
I would appreciate it if you could suggest one or two methods that would be best to try given the characteristics of my dataset. I know that some methods fail with long reads, but at the same time I do not want to shorten my long reads. I was pretty shocked that I had reads of 1 Mbp+ but Guppy failed to base-call them, and they did not show up as passed calls...
This is a tough question given the large number of error correction tools that exist. A 20x coverage of ONT reads is unfortunately pretty low, and can impact how well self-correction methods perform. I would advise you to use either Canu or FLAS if you want to try another self-correction method. Canu usually deals better with high error rates, but is usually much slower than FLAS. If I had to pick a tool myself, I think I'd go for FLAS first, see how the results look, and try out Canu if the quality is not high enough.
Since you also have access to short reads, I'd also advise you to try out a hybrid correction tool in addition. HG-CoLoR shows very good performance in terms of reduction of the error rate, and thus leads to very high-quality corrected reads, but can be pretty slow to run. FMLRC and especially LoRDEC are usually much faster, but lead to slightly lower quality in corrected reads than HG-CoLoR. If you have time on your hands, I'd pick HG-CoLoR; otherwise, I'd probably go for LoRDEC if I wanted to perform a quick correction.
Second, the uncorrected dataset was limited to fragments of at least 500 bp:
min_len  avg_len  max_len
500      1,696.4  266,311
whereas the corrected dataset contains very short fragments:
min_len  avg_len  max_len
10       1,743.6  260,137
Lastly, sadly, the longest fragment of the uncorrected dataset was shortened.
Contrary to what I explained before, correction can also sometimes shorten the reads. This is especially the case when a given read contains a lot of insertion errors: correction removes these erroneous bases, and thus shortens the read. This is normal behaviour. Given the figures, your longest read lost about 2.3% of its length, which does not seem like a lot to me, especially if it contained a lot of insertions.
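As a sanity check, the shrinkage of the longest read can be recomputed from the max_len values quoted in the tables above (awk handles the floating-point arithmetic):

```shell
# Back-of-the-envelope check of the max-length figures quoted in this thread.
awk 'BEGIN {
    uncorrected_max = 266311
    corrected_max   = 260137
    loss_pct = 100 * (uncorrected_max - corrected_max) / uncorrected_max
    printf "Longest read shortened by %.2f%%\n", loss_pct
}'
```

This yields roughly 2.3%, consistent with the point that the loss is small.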
Hope this answers most of your questions!
Best,
Pierre
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
Hello, Unfortunately, I do not have any experience with basecalling / quality filtering and such things. Hope you'll find a fitting answer quickly! Best,
Hi,
I am running CONSENT to correct some Nanopore reads and I encountered some errors.
First, I tried running directly in the command line in Linux with 4 cores and 4 GB each with the following command:
./CONSENT-correct -j 8 --in /users/PHS0338/jpac1984/local/src/Porechop/DemulT7BAR.fasta --out demult.fasta --type ONT
and the logs were:
[M::worker_pipeline::831.184*0.97] mapped 93033 sequences
[M::main] Version: 2.17-r974-dirty
[M::main] CMD: minimap2 -k15 -w5 -m100 -g10000 -r2000 --max-chain-skip 25 --dual=yes -PD --no-long-join -t8 -I1G /users/PHS0338/jpac1984/local/src/Porechop/DemulT7BAR.fasta /users/PHS0338/jpac1984/local/src/Porechop/DemulT7BAR.fasta
[M::main] Real time: 831.641 sec; CPU: 806.339 sec; Peak RSS: 8.466 GB
[Tue Dec 8 20:56:10 EST 2020] Sorting the overlaps
./CONSENT-correct: line 189: /fs/scratch/PHS0338/appz/CONSENT/bin/explode: No such file or directory
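The "No such file or directory" error for bin/explode suggests the build never completed. This can be checked for before launching a run; a minimal sketch, assuming the directory layout from the error message (check_explode is an illustrative helper name, not part of CONSENT itself):

```shell
# check_explode: verify a CONSENT checkout contains the built "explode" binary.
# (Illustrative helper; if it is missing, install.sh needs to be re-run.)
check_explode() {
    if [ -x "$1/bin/explode" ]; then
        echo "explode found"
    else
        echo "explode missing -- re-run install.sh in $1"
    fi
}

# Path taken from the error message above:
check_explode /fs/scratch/PHS0338/appz/CONSENT
```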
Then, since I have a bigger dataset, I submitted a job to the cluster:
#!/bin/bash
#SBATCH --mem=120gb
#SBATCH --ntasks=28
#SBATCH --job-name=CONSENT-ONT-VALID
#SBATCH --time=12:00:00
#SBATCH --account=PHS0338
export PATH=$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2
./CONSENT-correct -j 28 --in /fs/scratch/PHS0338/data/ONTq_combine.fasta --out ONT-V-corr.fasta --type ONT
and I got the following error:
./CONSENT-correct: line 39: nproc: command not found
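A note on this `nproc: command not found` error: the `export PATH=...` line in the script above replaces the existing PATH instead of prepending to it, so the standard binary directories (and with them nproc, from GNU coreutils) are no longer searched. A minimal sketch of a safer export, using the same directories:

```shell
# Prepend the minimap2 directories while KEEPING the existing PATH,
# so standard utilities such as nproc stay reachable.
export PATH="$PWD/minimap2:/fs/scratch/PHS0338/appz/minimap2:$PATH"

# nproc should now resolve again and print the number of available cores.
nproc
```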
I want to use all the cores requested.
Thank you very much;