
UnicodeDecodeError during linked_fragment_file generation #77

Open
chunlinxiao opened this issue Jun 30, 2019 · 27 comments

@chunlinxiao

Traceback (most recent call last):
  File "./HapCUT2/utilities/LinkFragments.py", line 492, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "./HapCUT2/utilities/LinkFragments.py", line 327, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "./HapCUT2/utilities/LinkFragments.py", line 212, in read_fragment_matrix
    for line in fm:
  File "./HapCUT2/myenv3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 644: invalid start byte

Step 1, running extractHAIRS to generate the unlinked_fragment_file, is okay.

But step 2, running LinkFragments.py to generate the linked_fragment_file, fails with the error above.

Do you have any idea? Thanks.

@FrickTobias

I have the same problem.

I've also identified and removed the row causing the error (by opening the unlinked.txt file in Python and using f.seek(7010) and f.read(10)), but that just produced the same error at position 7021 instead of 7010, so it seems there is a bigger problem here.
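If it helps anyone else triage this, here is a minimal sketch that scans the whole fragment file and reports every offset that is not valid UTF-8, rather than stopping at the first one (it assumes the file is named unlinked.txt, as above, and reads it fully into memory):

with open("unlinked.txt", "rb") as f:
    data = f.read()

pos = 0
while pos < len(data):
    try:
        # try to decode everything from the current position onward
        data[pos:].decode("utf-8")
        break  # the rest of the file decodes cleanly
    except UnicodeDecodeError as e:
        bad = pos + e.start
        print(f"invalid byte 0x{data[bad]:02x} at offset {bad}")
        pos = bad + 1  # skip the bad byte and keep scanning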

error message

cat BLR_out/hapcut2_linkfragments.log 
Linking 10X fragments on chromosome: chr1
Traceback (most recent call last):
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 486, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 321, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/LinkFragments.py", line 206, in read_fragment_matrix
    for line in fm:
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 7021: invalid start byte

Commands used to get input files

freebayes \
    -f genome_reference \
    input.bam \
    vars.vcf
vcffilter \
    -f "AC > 0" \
    vars.vcf \
    > vars-filt.vcf
extractHAIRS \
         --10X 1 \
         --bam input.bam \
         --VCF vars-filt.vcf \
         --out unlinked.txt
LinkFragments.py \
         --bam input.bam \
         -v vars-filt.vcf \
         --fragments unlinked.txt \
         --out linked.txt

@pjedge
Collaborator

pjedge commented Nov 25, 2019

What happens if you change directory to the "utilities" directory and run test_LinkFragment.py? (Please change the value of EXTRACTHAIRS variable in the test script to the path of the extractHAIRS binary).
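For reference, one way to run it (the binary path is a placeholder; use whatever which extractHAIRS prints on your system):

cd HapCUT2/utilities
# edit the EXTRACTHAIRS variable inside test_LinkFragment.py first, e.g.:
#   EXTRACTHAIRS = "/path/to/HapCUT2/build/extractHAIRS"   # placeholder path
pytest test_LinkFragment.py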

@FrickTobias

FrickTobias commented Nov 26, 2019

Thanks for taking the time to answer.

I'm running it through conda, so I'm not 100% sure my workaround is equivalent to your description. Therefore I'll include a description of what I did.

Results

It seems to finish without any errors.

cd HapCUT2-1.2/utilities/
pytest .
pytest HapCUT2-1.2/utilities/
================================================ test session starts =================================================
platform linux -- Python 3.6.7, pytest-5.2.2, py-1.8.0, pluggy-0.12.0
rootdir: /domus/h1/tobiasf
collected 31 items                                                                                                   

HapCUT2-1.2/utilities/test_LinkFragment.py ...............................                                     [100%]

================================================= 31 passed in 1.20s =================================================

What I did

To run test_LinkFragment.py I downloaded the latest release's source files (v1.2) and then changed the EXTRACTHAIRS variable to the output of which extractHAIRS. Everything was run in the same environment as in the original post, and I also double-checked which HapCUT2 versions are distributed through conda.

conda search -f hapcut2
Loading channels: done
# Name                       Version           Build  Channel             
hapcut2                          1.1      hed695b0_0  bioconda            
hapcut2                          1.2      hed50d52_0  bioconda

@pjedge
Collaborator

pjedge commented Nov 26, 2019

Gotcha. So this bug is not caught by the existing tests. Can you show me which row of the unlinked fragment file was "causing" the crash?

@FrickTobias

From the first error where it couldn't decode position 7010:

with open("unlinked.txt", "rb") as f:
    f.seek(7100)
    f.read(60)
b'0 F\n1 A00621:130:HN5HWDSXX:4:2467:8359:1595 2 AAAACAACGATCGT'

@pjedge
Collaborator

pjedge commented Nov 27, 2019

This is very helpful. Would you mind seeking to a little earlier, and reading more, so that I can have the equivalent of 5-10 lines from the fragment file around that spot?

@pjedge
Collaborator

pjedge commented Nov 27, 2019

I think I reproduced this bug with a 10x dataset of my own. Should be fixed soon.

@FrickTobias

FrickTobias commented Nov 27, 2019

That's great, I'll leave the closing of the issue to you when it's been implemented. Let me know if you need something more.

Thank you so much for the help! If I could rate this interaction I'd give it a 10/10!

@FrickTobias

> This is very helpful. Would you mind seeking to a little earlier, and reading more, so that I can have the equivalent of 5-10 lines from the fragment file around that spot?

You'll find the row posted before in the middle of the dump, starting with 1 A00621:130:HN5HWDSXX:4:2467:8359:1595.

with open("unlinked.txt" ,"rb") as f:
... f.seek(6500)
... f.read(1000)
...
6500
b':HN5HWDSXX:4:2549:22426:20071 2 CTTGTGTGCTTCCTTGTTAG -1 1122 00 88\n1 A00621:130:HN5HWDSXX:4:1226:5538:10708 2 AAAACAACGATCGTTCTGCG -1 1132 0 F\n1 A00621:130:HN5HWDSXX:4:2229:6560:19304 2 TTAAGAAAGGCGTTAGCAAC -1 1134 0 F\n1 A00621:130:HN5HWDSXX:4:1557:6650:27007 2 CTTCGAACCTTGCGCACTTC -1 1141 0 F\n1 A00621:130:HN5HWDSXX:4:2365:19479:22169 2 CTAATTAATTTCGGTGTGTG -1 1145 0 F\n1 A00621:130:HN5HWDSXX:4:1502:12246:4695 2 GGTGGAAGTAAGCGCAGTAA -1 1145 00 FF\n1 A00621:130:HN5HWDSXX:4:2207:31331:23375 2 GAACGTTATTCGTGTAGAAA -1 1151 0 F\n1 A00621:130:HN5HWDSXX:4:2509:13485:19022 2 CTTAGGTCCGTCGTTCCTCA -1 1168 0 F\n1 A00621:130:HN5HWDSXX:4:2467:8359:1595 2 AAAACAACGATCGTTCTGCG -1 1178 00000000 FFFFFFFF\n1 A00621:130:HN5HWDSXX:4:2628:4589:9439_FILTERED 2 NULL -1 1189 0 :\n1 A00621:130:HN5HWDSXX:4:2314:5529:24220 2 GTAGGTAGGAAACTCCTTCC -1 1197 10 F:\n1 A00621:130:HN5HWDSXX:4:1145:28646:27743 2 GTACTTTATGTCTGCGCGCC -1 1198 0 <\n1 A00621:130:HN5HWDSXX:4:2341:10375:19210 2 CACGTGCATACAGTCGTATG -1 1198 0 ?\n1 A0062'
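To see exactly where the undecodable byte sits in a region like this, one option (still assuming the unlinked.txt name) is to decode the same window with errors="backslashreplace", which renders bad bytes as \xNN escapes instead of raising:

with open("unlinked.txt", "rb") as f:
    f.seek(6500)
    chunk = f.read(1000)
# stray non-utf-8 bytes show up as \xNN escapes in the decoded text
print(chunk.decode("utf-8", errors="backslashreplace"))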

@pjedge
Collaborator

pjedge commented Nov 27, 2019

So I didn't actually reproduce this same issue, it was a separate utf-8 issue that looked similar.

Would it be possible for you to send your entire unlinked.txt (just that file) to me? edge dot peterj at gmail dot com

@FrickTobias

Alright, I've mailed the file to you.

subject: HapCUT2 issue #77 file
sender: [email protected]

@FrickTobias

A little bit down the road...

Fixed by two commits:

There was also a downstream problem with buffer sizes, which wasn't part of the original issue but was patched in order to verify the initial utf-8 fix. It turned out to be rooted in a hard-coded buffer size that needed to be increased.
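A quick way to sanity-check whether a fragment file has lines long enough to hit a fixed-size buffer is to measure its longest line (a sketch, again assuming the unlinked.txt name; binary mode avoids tripping over the same utf-8 error):

with open("unlinked.txt", "rb") as f:
    longest = max(len(line) for line in f)
print(f"longest line: {longest} bytes")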

I've just finished running my data through all steps without any further hiccups, so I consider the issue closed.

Thanks to @pjedge for his excellent perseverance in resolving this issue.

@FrickTobias

<Anyone who has author privileges, feel free to close this issue>

@vibansal vibansal closed this as completed Feb 7, 2020
@chunlinxiao
Author

thanks

@chunlinxiao
Author

still have the same error:

LinkFragments.py 10X ...
Linking 10X fragments on chromosome: scaffold_1
Traceback (most recent call last):
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 486, in <module>
    link_fragments(args.fragments,args.VCF,args.bam_file, args.outfile, args.distance, args.single_SNP_frags)
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 321, in link_fragments
    flist = read_fragment_matrix(hairs_file,vcf_file,chrom_filter=curr_chrom)
  File "/usr/local/HapMAP2/1.2/utilities/LinkFragments.py", line 206, in read_fragment_matrix
    for line in fm:
  File "/usr/local/HapMAP2/1.2/venv/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 4911: invalid start byte

@vibansal vibansal reopened this Feb 20, 2020
@chunlinxiao
Author

Our systems have the v1.2 release installed, which may not include all the recent fixes. I guess that may be why I still got the same error.

@DrPintoThe2nd

Hi,

I hit this same issue on the conda install, and when I tried a quick install from GitHub directly to troubleshoot, make generates this error:

gcc -g -O3 -Wall -D_GNU_SOURCE -Ihtslib -c hairs-src/bamread.c -lhts -o build/bamread.o
In file included from hairs-src/bamread.c:1:0:
hairs-src/bamread.h:10:10: fatal error: htslib/sam.h: No such file or directory
 #include "htslib/sam.h"
          ^~~~~~~~~~~~~~
compilation terminated.
Makefile:33: recipe for target 'build/bamread.o' failed
make: *** [build/bamread.o] Error 1

htslib v1.9 is in my path.

Best,
bjp

@vibansal
Owner

vibansal commented Apr 9, 2020

If you add the actual path to 'htslib' to the Makefile (variable HTSLIB), the code should compile.
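For example (the path is a placeholder for wherever htslib lives on your system), the variable can be set in the Makefile or overridden on the command line:

# in the Makefile:
#   HTSLIB = /path/to/htslib
# or equivalently, without editing the Makefile:
make HTSLIB=/path/to/htslib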

I pushed a minor change to LinkFragments.py that could potentially fix the UnicodeDecodeError. If you could please try it and let me know, that would be really helpful.
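For anyone stuck on an older release in the meantime, a common local workaround for this class of error (my assumption only, not necessarily what the patch does) is to open the fragment file with a tolerant error handler in read_fragment_matrix:

# hypothetical workaround, not necessarily the actual patch:
# drop undecodable bytes instead of raising UnicodeDecodeError
fm = open(hairs_file, "r", encoding="utf-8", errors="ignore")

Note that errors="ignore" silently discards the bad bytes, so it only makes sense for getting past the crash while diagnosing where they come from.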

@DrPintoThe2nd

I downloaded the new LinkFragments.py and successfully installed htslib and HapCUT2 from GitHub, and everything appears to be working! I'll post confirmation when it finishes completely, but it has passed the place where it used to error.

Thanks so much for your quick response and fix to this issue!! Really needed a win today!
Best,
bjp

@DrPintoThe2nd

Hi Bansal lab,
It appears that LinkFragments.py has been successfully patched, although it's still running (going on 10 days) and is only about a third of the way through the genome (judging purely by scaffold number). Is this a normal run-time for this script?
Thanks again!
Best,
bjp

@pjedge
Collaborator

pjedge commented Apr 19, 2020 via email

@DrPintoThe2nd

1.8 Gb genome, ~3,000 scaffolds (with a scaffold L90 of 112) and ~1.7 billion 10X reads.

Currently approaching scaffold 1,000 without any notable increase in speed (although these scaffolds are mighty small).

@olavurmortensen

I'm experiencing the same error as @chunlinxiao and @FrickTobias with v1.3.

I installed HapCUT2 v1.3.2 via conda (https://anaconda.org/bioconda/hapcut2). It's installed in one of my projects. The recipe for the bioconda package can be found here: https://bioconda.github.io/recipes/hapcut2/README.html

Just to be clear, I don't think this is an issue with the bioconda recipe. The recipe YAML clearly uses v1.3.2 as its source.

@vibansal
Owner

Can you please share a dataset for reproducing this error? I can provide an upload link if needed. Thanks.

@olavurmortensen

olavurmortensen commented Feb 19, 2021

@vibansal If I use the method used above to find the particular line that causes the problem, will that suffice? I cannot share the full dataset.

EDIT:

The "method used above" being the following.

with open("unlinked.txt", "rb") as f:
    f.seek(7100)
    f.read(60)

@vibansal
Owner

Yes, having a few lines around the problematic line will be helpful.

@olavurmortensen

Hi, sorry about leaving this hanging. The problem solved itself. Most likely there was some versioning confusion.

You can close this issue.
