Error when running prepDE.py #238

ms-gx · 2019-09-17T22:47:21Z

I get the following error when running prepDE.py on stringtie estimate output:

Traceback (most recent call last):
  File "./prepDE.py", line 255, in <module>
    geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'STRG.337.1'

I only get this for one sample out of fifteen. I run the HISAT2 > Stringtie workflow as described in Pertea et al. (2016). I want to feed this into edgeR. Is there another option in case I can not resolve this problem?

Additional question: What is the difference between MSTRG and STRG?

The text was updated successfully, but these errors were encountered:

ttgump · 2019-10-14T20:35:58Z

It seems this issue appears in the new version. There is no problem for Stringtie 1.3.x.

gpertea · 2019-10-14T23:14:33Z

@ttgump, that is incorrect (I hope!), prepDE.py should work with version 2.0 just the same, I tested it on the original RNA-Seq protocol data (chrX) and it worked as expected.

[ 10/23 EDIT ] @ttgump you were in fact correct, apologies for the above, it turns out that the chrX data set was too simple to expose the problem in stringtie 2.0 which indeed outputs additional STRG entries when the -e option is given -- which should not happen. A fix is in the works.

Admittedly it might be tricky to get it running properly, especially when a parent folder is provided as input (instead of a space delimited table with sampleID and GTF_path), as it pulls in other .gtf files in that directory tree that you might not have wanted to use for prepDE.py.

For the regular, full stringtie->prepDE pipeline, prepDE is expecting to see a MSTRG prefix for most transcripts - because this prefix represents the transcripts "merged" from multiple samples, and the estimation run should be performed against that common set of "merged" transcripts. The fact that you get an error for a STRG prefix there seems to indicate that prepDE.py is trying to process one of the "initial" (1st phase ) output files of stringtie instead.

Perhaps in the estimation phase you forgot to provide the -e option ? That would explain why your transcript has a STRG prefix - because that prefix is assigned when the -e option is not given and thus StringTie assembled the read alignments into newly reconstructed transcripts, and those get the STRG prefix.

For further diagnosing prepDE running issues I'd suggest to take a look at #234 (comment) where I posted a modified version of the prepDE.py script which should help with finding potential issues with the input GTF files (if they were not all generated as expected). Let me know if you encounter any problems running that version on your data.

yan-cheng-lin · 2019-10-16T02:38:01Z

Hi,
I got the same error when using the new version 2.0.3, but version 1.3.6 is OK. My data is RNA-Seq of a tomato bi-parental population. Therefore, I only used the RNA-Seq data of the parental lines (parents) to generate merged gtf, and used it to estimate the gene expression of the progenies (offspring). Then, I got the same problem when prepDE.py running through some progenies. Should I generate and merge gtfs for all samples? In addition, I also found that the gtf generate from these two versions were quite different, could you explain the main difference?
Thank you !!

ttgump · 2019-10-16T04:26:36Z

@gpertea
When I use new version 2.0, the prepDE.py print the error message:
Traceback (most recent call last):
File "prepDE.py", line 255, in
geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'ENST00000647043.1'
Then I transfer to use version 1.3.7, it works well. I have spent one afternoon to figure this out.😂

gpertea · 2019-10-16T13:07:25Z

Then that's a problem we'd surely like to fix. Stringtie v2 did improve assembly accuracy in our tests (to answer @andy3404's question), but there were sections of the code which were a complete rewrite compared to v1.3 (some of them due to adding support for long reads) and that accounts for the differences in the output and occasionally, it seems that the output when -e option is used (which means "only estimate the abundance, do not assemble"), in v2 may sometimes "drop" (i.e. not report) some redundant (or uncovered) reference transcripts which in turn makes prepDE.py fail.

But that should not happen and we would like to fix it. It would be very helpful if any of you can provide us with such an example, where running stringtie v2 with -e option seems to have dropped (not report) a transcript from the merge file that was given with -G option.

As an example: @ttgump, before that error message you should have seen the sample name printed there, right? (it would be helpful if you had shown the entire prepDE message here)
Was that the 2nd sample which was printed there as being processed by prepDE ? Because that means that the first sample is actually the one missing that ENST00000647043.1, so the problem would be with the stringtie -e run for that first sample (even though prepDE does not complain about that one), because that was the one dropping ENST00000647043.1 from the output of stringtie -e, even though that transcript is in the -G reference file. Do you think you can identify that first sample and perhaps share with me the corresponding .bam file and the file that you used with -G when you ran stringtie -e on that sample? That would allow us to properly debug this issue.

ttgump · 2019-10-16T19:27:13Z

Yes. The error happens on the second sample. How could I share the bam file with you? It is several Gbs. Thanks.

…

On Oct 16, 2019, at 8:07 AM, Geo Pertea ***@***.***> wrote: Then that's a problem we'd surely like to fix. Stringtie v2 did improve assembly accuracy in our tests (to answer @andy3404's question), but there were sections of the code which were a complete rewrite compared to v1.3 (some of them due to adding support for long reads) and that accounts for the differences in the output and occasionally, it seems that the output when -e option is used (which means "only estimate the abundance, do not assemble"), in v2 may sometimes "drop" (i.e. not report) some redundant (or uncovered) reference transcripts which in turn makes prepDE.py fail. But that should not happen and we would like to fix it. It would be very helpful if any of you can provide us with such an example, where running stringtie v2 with -e option seems to have dropped (not report) a transcript from the merge file that was given with -G option. As an example: @ttgump, before that error message you should have seen the sample name printed there, right? (it would be helpful if you had shown the entire prepDE message here) Was that the 2nd sample which was printed there as being processed by prepDE ? Because that means that the first sample is actually the one missing that ENST00000647043.1, so the problem would be with the stringtie -e run for that first sample (even though prepDE does not complain about that one), because that was the one dropping ENST00000647043.1 from the output of stringtie -e, even though that transcript is in the -G reference file. Do you think you can identify that first sample and perhaps share with me the corresponding .bam file and the file that you used with -G when you ran stringtie -e on that sample? That would allow us to properly debug this issue. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

Gin-Wang · 2019-10-22T14:56:03Z

i met same question using prepDE.py updated 8 days ago. Here is the error printed in the screen.

Error: could not locate transcript STRG.2470.1 entry for sample neg2
Traceback (most recent call last):
File "6.prepDE.py", line 281, in
geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'STRG.2470.1'

maybe the length of gtfs are not equal to each other?

gpertea · 2019-10-22T15:15:39Z

@Gin-Wang did you run the new prepDE.py with the -v option ? That should show the name/path of each sample as it is processed. I'd be interested in the BAM file corresponding to the entry which appears to be successfully processed by prepDE before the sample that caused the error. For example, in your case, the sample neg2 seems to have made prepDE fail, but assuming you have a sample neg1 just before that -- it would be great if you could share the .bam file corresponding to that neg1 sample (not neg2 !). Additionally, I'd need the -G reference file that was used for all those samples with stringtie -e , before you ran prepDE. The name of that file can be found in the first line of the GTF being processed by prepDE for each sample.

@ttgump and others willing/able to share such BAM file, could you upload the file to Google Drive or a similar service which allows sharing the download link (so you can email/share the download link with [email protected]). Alternatively, send me an email so I can send you instructions for uploading the files to our FTP server, though lately we've had some problems with large uploads on that ftp server, but we could give it a try.

Gin-Wang · 2019-10-22T15:56:14Z

@gpertea thanks for your answer, i tried it with "-v" just now and met the similar question and here is the log:

python2 6.prepDE.py -i 5.ballgown -v

processing sample neg1 from file 5.ballgown/neg1/output_merge.gtf
processing sample pos1 from file 5.ballgown/pos1/output_merge.gtf
Error: could not locate transcript STRG.10345.1 entry for sample pos1
Traceback (most recent call last):
File "6.prepDE.py", line 281, in
geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'STRG.10345.1'

i'd like to share with you the files as you mentioned above, but it may take a while. Actually it printed similar error when i put my pos1 and neg1 files and those bam files are smaller than others, so i will upload those files.
Thanks for your answers and i will share you the links to download the files as soon as possible.

gpertea · 2019-10-22T16:12:29Z

Thank you for your help with this problem. From that log it looks like the files that I would need should be shown in the first line of the file:
./5.ballgown/neg1/output_merge.gtf
That line should show the stringtie command line that was used to generate this output_merge.gtf file. That should show the .bam file that I would need in order to investigate this, and also the reference (guide file) that follows the -G option on that line -- I would need that too.

To repeat and clarify: I would NOT need multiple .bam files, just one should suffice (the one for the neg1 sample in your example here, the one for which prepDE did NOT fail).

LunaPatton · 2019-10-23T01:51:36Z

@gpertea today I ran the new script and met the simliar problem like before. Attach the code below:

#!/bin/sh
#PBS -q long
#PBS -l nodes=1:ppn=1
cd /media/disk1/sijf/yanze/RNA-Seq/stringtie_file/abundances
python2.7 /media/disk1/sijf/yanze/RNA-Seq/software/stringtie/prepDE.py -i count_list.txt -v 2>&1 | tee prepDE.log

And the error is :

processing sample 120076-1_a from file 120076-1_a.gtf
processing sample 120076-2_a from file 120076-2_a.gtf
processing sample 130490-1_a from file 130490-1_a.gtf
processing sample 130490-2_a from file 130490-2_a.gtf
processing sample 140191-1_a from file 140191-1_a.gtf
processing sample 140191-2_a from file 140191-2_a.gtf
processing sample 140195-1_a from file 140195-1_a.gtf
processing sample 140195-2_a from file 140195-2_a.gtf
processing sample 140386-1_a from file 140386-1_a.gtf
processing sample 140386-2_a from file 140386-2_a.gtf
Error: could not locate transcript MSTRG.2936.2 entry for sample 140386-2_a
Traceback (most recent call last):
File "/media/disk1/sijf/yanze/RNA-Seq/software/stringtie/prepDE.py", line 281, in
geneDict.setdefault(geneIDs[i],{}) #gene_id
KeyError: 'MSTRG.2936.2'

Then I view the prepDE.log and nothing in this file. By the way, when I ran the old verison of this script, the error was same as I ran today. Could you tell me reason and solve it?
Thanks a lot : )

gpertea · 2019-10-23T22:32:01Z

Thank you so much @Gin-Wang I took a look at the data you uploaded and I can confirm now that the problem is real and correctly reported by users in this thread (and in #234), stringtie 2.0 with the -e option outputs a few STRG entries which should NOT be there, as the output should only contain the MSTRG entries that were given in the -G file.

We will fix this -e problem soon and report back here, until then please use stringtie v1.3.6 for quantification via prepDE (for short reads there shouldn't be a significant difference).

Gin-Wang · 2019-10-24T14:19:34Z

i'm glad that it may help, and thanks again for the great software stringtie and your work

gpertea · 2019-10-29T20:59:24Z

The main author fixed this issue in v2.0.4 which was just released (see https://github.com/gpertea/stringtie/releases/tag/v2.0.4). Now stringtie v2.0.4 should behave as expected with the -e option (just like v1.3.x) so the quantification pipeline using prepDE.py should no longer stumble at the 2nd file.

MarinaMann · 2019-11-08T22:08:12Z

Hi, I thought I'd add my 2 cents to this discussion since it's still listed as "open".

I was getting the same error messages that everyone has been showing above. As suggested, I counted lines in each of my ballgown/sample/sample.gtf files using wc -l. Each of these files should have had the same number of lines as the "stringtie_merged.gtf" file created during stringtie --merge, but did not. I was also following the protocol outlined by Pertea et al., 2016, which used stringtie v1.2.2 for their analysis.

However, I am using two versions of stringtie: v1.3.5, and v2.0.3.
I found a solution to the problem of the following:
In Pertea et al., they use 'stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/sample1/sample1.gtf sample1.sorted.bam'.

BUT, for both of the stringtie versions I have been using, this command only works as expected if the elements are shuffled:
'stringtie sample1.sorted.bam -G stringtie_merged.gtf -l sample1 -o ballgown/sample1/sample1.gtf -p 8 -e -B'.

It seems that following the published command line causes the newer versions of the program to miss some element of the input, which allows the output to not reply on -e as it should. Even though the man/help page of later versions of stringtie clearly shows the proper organization of input files being different than that used by Pertea et al., I think this is not something people would typically think to look at if they're just following the published protocol verbatim. @gpertea do you think you could build in an error that tells people the organization of their input order is wrong if their ballgown directory file line counts don't match the merged-files line count?

I am super happy that this program exists at all, and I also very happy with the amount of feedback and help available on this github issues page. Please keep up the good work, and I'd love to know if my solution is consistent on your end.
Thanks.

gpertea · 2019-11-09T01:34:29Z

@MarinaMann, thank you for the nice words but that problem is actually unrelated to the stringtie -e issue in this thread.
In fact the order of the command line parameters should never matter for StringTie (due to the way that argument parser is designed). If it did make a difference as it seemed to have happened in your case, that would have been an ugly bug as we've never intended it! But this was something else, a bit more insidious.. see below.

Not sure if it's visible in your browser, but do you notice any difference between these two lines below?

–e –B
-e -B

(in my browser it looks like the dashes in the first line are a bit longer and closer to the letter that follows them, compared to those on the second line).

There seems to have been a editing/typographical (?) error which affected our protocol paper when it was submitted for publishing, where sometimes the plain 'dash' or 'minus' character ('-') in our command lines got unexpectedly replaced with the "em dash" typographical symbol ('–'). I guess some word processing/publishing software like Microsoft Word even makes this kind of substitution automatically in some contexts. The issue was previously reported when users copied gffcompare commands from the protocol paper (see gpertea/gffcompare#3) but now I see it affected other command lines too..
Sorry about that confusion. Unfortunately this means that one should not be using simple copy&paste from the paper (or if one does, one should edit the line before running it in order to delete all the dashes and replace them with the regular '-' character)

nurie05 · 2023-06-20T18:09:33Z

Hello, I'm finding the same issue when using version 2.2.1., the obtained GTF files when using -e -B had less transcripts than the file use as -G and give problems when using prepDE.py . However, version 2.1.4 seems to work just fine.

gpertea · 2023-06-20T18:31:18Z

@nurie05, could you please tell me:

were you running the two versions (v2.2.1 and v2.1.4) on long reads (-L option) or on short read data ?
is the version 2.2.1 that you were using, from the releases page? (as opposed to built from source from the current github head?)
Thank you!

nurie05 · 2023-06-21T08:10:05Z

@gpertea regarding you questions:

I was running short reads.
The version 2.2.1 that I was using was built following the steps indicated in the github page.
I can provide more information or files in case you need it. Thank you for looking at it!

MoonyScotch · 2023-06-23T20:57:15Z

Hello, I am getting the same problem when trying to use prepDE.py3 on assembly with -L.

My error is:
Error: could not locate transcript ENST00000662089.1 entry for sample <sample_name> Traceback (most recent call last): File "/<path_to_file>/prepDE.py3", line 282, in <module> geneDict[geneIDs[i]][s[0]]+=v[s[0]] KeyError: '<sample_name>'

My estimation command is stringtie2 -o $assembled -p 2 -m 100 -L -e -b $stats -G /<path_to_reference>/gencode.v36.primary_assembly.annotation.gtf $BAM
My quantification command is python /<path_to_file>/prepDE.py3 -i ${SAMPLE}

Could you please offer suggestions for how to troubleshoot please? I have successfully used prepDE.py3 on short read data from the same sample without an issue. Is this error specific to -L option?

MoonyScotch · 2023-06-30T17:11:58Z

Hi @gpertea, I just want to update that I figured out the issue with my problem. It turns out, when I used stringtie 2.2.1 from git clone on the github page, it installed a version that is not fully updated. When I reinstalled from your website, or downloaded directly from the latest release, the error I reported above is resolved. It is slightly misleading that both ways of installations gave stringtie 2.2.1. Hopefully this can help others solve their errors too.

gpertea · 2023-06-30T17:26:48Z

Thank you @MoonyScotch for the update - actually the github version is a couple of commits ahead of the "release" 2.2.1 version, and those changes made the -e option work more consistently for long/hybrid reads quantification, but (re-)introduced this issue for short reads.. as you and @nurie05 reported. Still haven't had the time to look more closely into that problem to solve it.
But indeed, as you observed, using the older 2.2.1 "official" release works OK for short reads (but creates a similar issue for long/hybrid input..).

smallfishcui · 2024-05-24T12:03:38Z

Hi, I am using the v2.2.3, and the problem persists. Should I switch back to 2.2.1? I m using only short reads.

thanks,
Cui

gpertea · 2024-05-27T21:39:26Z

@smallfishcui indeed unfortunately I can confirm that this issue is back in v2.2.3 -- when -e is used with short reads, not all reference transcripts are reported in the output.
If you are using only short reads, I advise switching back to v2.2.1 until this short-read quantification issue is resolved. Sorry about the inconvenience.

gpertea mentioned this issue Oct 29, 2019

preDE.py stops at the second file #234

Closed

Pallavimore mentioned this issue Oct 31, 2019

problem with stringtie prepDE.py #246

Closed

ms-gx closed this as completed Jan 7, 2020

gpertea reopened this Jun 20, 2023

gpertea mentioned this issue Jun 26, 2024

Change between 2.2.1 and 2.2.2 on gene abundance estimation with '-e' parameter #434

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error when running prepDE.py #238

Error when running prepDE.py #238

ms-gx commented Sep 17, 2019 •

edited

Loading

ttgump commented Oct 14, 2019

gpertea commented Oct 14, 2019 •

edited

Loading

yan-cheng-lin commented Oct 16, 2019

ttgump commented Oct 16, 2019

gpertea commented Oct 16, 2019

ttgump commented Oct 16, 2019 via email

Gin-Wang commented Oct 22, 2019

gpertea commented Oct 22, 2019 •

edited

Loading

Gin-Wang commented Oct 22, 2019 •

edited

Loading

gpertea commented Oct 22, 2019

LunaPatton commented Oct 23, 2019

gpertea commented Oct 23, 2019 •

edited

Loading

Gin-Wang commented Oct 24, 2019

gpertea commented Oct 29, 2019

MarinaMann commented Nov 8, 2019

gpertea commented Nov 9, 2019

nurie05 commented Jun 20, 2023

gpertea commented Jun 20, 2023

nurie05 commented Jun 21, 2023

MoonyScotch commented Jun 23, 2023 •

edited

Loading

MoonyScotch commented Jun 30, 2023

gpertea commented Jun 30, 2023

smallfishcui commented May 24, 2024

gpertea commented May 27, 2024

Error when running prepDE.py #238

Error when running prepDE.py #238

Comments

ms-gx commented Sep 17, 2019 • edited Loading

ttgump commented Oct 14, 2019

gpertea commented Oct 14, 2019 • edited Loading

yan-cheng-lin commented Oct 16, 2019

ttgump commented Oct 16, 2019

gpertea commented Oct 16, 2019

ttgump commented Oct 16, 2019 via email

Gin-Wang commented Oct 22, 2019

gpertea commented Oct 22, 2019 • edited Loading

Gin-Wang commented Oct 22, 2019 • edited Loading

gpertea commented Oct 22, 2019

LunaPatton commented Oct 23, 2019

gpertea commented Oct 23, 2019 • edited Loading

Gin-Wang commented Oct 24, 2019

gpertea commented Oct 29, 2019

MarinaMann commented Nov 8, 2019

gpertea commented Nov 9, 2019

nurie05 commented Jun 20, 2023

gpertea commented Jun 20, 2023

nurie05 commented Jun 21, 2023

MoonyScotch commented Jun 23, 2023 • edited Loading

MoonyScotch commented Jun 30, 2023

gpertea commented Jun 30, 2023

smallfishcui commented May 24, 2024

gpertea commented May 27, 2024

ms-gx commented Sep 17, 2019 •

edited

Loading

gpertea commented Oct 14, 2019 •

edited

Loading

gpertea commented Oct 22, 2019 •

edited

Loading

Gin-Wang commented Oct 22, 2019 •

edited

Loading

gpertea commented Oct 23, 2019 •

edited

Loading

MoonyScotch commented Jun 23, 2023 •

edited

Loading