RepeatMasker complains about invalid species search term (but runs at Gitpod) #136

Open
trypchromics opened this issue Sep 9, 2024 · 6 comments

Comments

@trypchromics

trypchromics commented Sep 9, 2024

Hi TobyBaril.

I would like to ask for your help with a RepeatMasker error.

We are trying to annotate a new genome assembly of Trypanosoma cruzi, whose NCBI TaxID is 5693. Earl Grey was installed through conda on our Linux server. When I execute the command:

earlGrey -g input/genome.fasta -s tryCru-Dm28c -t 88 -r 5693 -o output

I get the following output:

================================================================================================

...
<<< Running Initial Mask with Known Repeats >>>
RepeatMasker version 4.1.5
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]

Using Master RepeatMasker Database: /home/pires/local/src/anaconda/3-2024.06-1/envs/earlGrey/share/RepeatMasker/Libraries/RepeatMaskerLib.h5
Title : Dfam
Version : 3.7
Date : 2023-01-11
Families : 19,768

Species/Taxa Search:
Trypanosoma cruzi [NCBI Taxonomy ID: 5693]
Lineage: root;cellular organisms;Eukaryota;Discoba;Euglenozoa;
Kinetoplastea;Metakinetoplastina;Trypanosomatida
9 families in ancestor taxa; 0 lineage-specific families

analyzing file /storage/zuleika/volume3/project/jcunha/hiChromatin/project/tryCru-Dm28c2018-lcc2024/genomeAnnotation/0-transposableElements/earlGrey-repeatMaskerSearchTerm/input/genome.fasta.prep

Checking for E. coli insertion elements

Checking for E. coli insertion elements
identifying Simple Repeats in batch 2 of 548
identifying matches to 5693 sequences in batch 2 of 548
identifying Simple Repeats in batch 1 of 548
identifying matches to 5693 sequences in batch 1 of 548
identifying Simple Repeats in batch 2 of 548
identifying Simple Repeats in batch 1 of 548

...

Checking for E. coli insertion elements
identifying Simple Repeats in batch 547 of 548
identifying Simple Repeats in batch 548 of 548
identifying matches to 5693 sequences in batch 548 of 548
identifying Simple Repeats in batch 548 of 548

No repetitive sequences were detected in /storage/zuleika/volume3/project/jcunha/hiChromatin/project/tryCru-Dm28c2018-lcc2024/genomeAnnotation/0-transposableElements/earlGrey-repeatMaskerSearchTerm/input/genome.fasta.prep
ERROR: RepeatMasker failed, please check logs. This is likely because of an invalid species search term, if issue persists please use NCBI Taxids (E.G Drosophila is replaced with 7125)

================================================================================================

The curious thing is that, while I was searching for a solution, I reached the following page:
https://tehub.org/tutorials/docs/earlgrey
which recommends running Earl Grey from Gitpod (by the way, this is a great alternative way to run Earl Grey that is not documented here on GitHub). If I run the same command on Gitpod, RepeatMasker completes normally and the error does not occur:

================================================================================================
Checking for E. coli insertion elements
identifying Simple Repeats in batch 548 of 548
identifying matches to 5693 sequences in batch 548 of 548
identifying Simple Repeats in batch 548 of 548
processing output:
cycle 1 .....................................
cycle 2 .....................................
cycle 3 ...................................
cycle 4 ...................................
cycle 5
cycle 6 ...................................
cycle 7 ...................................
cycle 8 .................................
cycle 9 .................................
cycle 10 .................................
Generating output... ................................
masking
done

          )  (
         (   ) )
         ) ( (
       _______)_
    .-'---------|  
   ( C|/\/\/\/\/|
    '-./\/\/\/\/|
     '_________'
      '-------'
    <<< Detecting Novel Repeats >>>

================================================================================================

Can you help me figure out what is wrong with the RepeatMasker installed in our conda environment?

Thanks in advance.

--
David da Silva Pires

@TobyBaril
Owner

Hi,

This is a strange case, as an invalid search term will kill a RepeatMasker job before it starts analysing contigs. Here, the final lines of the log on your server are identifying Simple Repeats in batch 548 of 548 followed by No repetitive sequences were detected in /storage/zuleika/volume3/project/jcunha/hiChromatin/project/tryCru-Dm28c2018-lcc2024/genomeAnnotation/0-transposableElements/earlGrey-repeatMaskerSearchTerm/input/genome.fasta.prep, so it seems that RepeatMasker worked through all batches but did not manage to find any repeats. Have you checked this genome file to see if it looks correct? It should be identical to your original input, but with shortened chromosome names (ctg_1 etc). If this file doesn't look correct, there may have been an issue resolving file paths or storage locations, which can sometimes happen on complex HPCs.

In this case, there are also only 9 ancestral TE families, so it might be beneficial to skip the initial RepeatMasker step. This provides better information for the de novo annotation, which in turn can generate better and more representative consensus sequences with better divergence estimates (the consensus sequences then represent the TEs present in the genome analysed, rather than similar families from other species, which might make some TEs look older than they actually are). To start, I would recommend comparing the genome.fasta.prep files between the Gitpod install and the server install to check for inconsistencies that could result in RepeatMasker failing to find repeats.
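
One way to run that comparison is sketched below in Python. This is only an illustration: the function name `fasta_sequence_digest` and the example file locations are hypothetical, and it assumes both copies of genome.fasta.prep have been brought onto the same machine. It hashes only the sequence lines, so differences in contig names or line wrapping do not change the result.

```python
import hashlib
from pathlib import Path


def fasta_sequence_digest(path):
    """Return (number of records, SHA-256 of the sequence data) for a FASTA file.

    Header lines (starting with '>') are counted but their text is not hashed,
    so renamed contigs do not change the digest. Line wrapping is ignored too.
    """
    digest = hashlib.sha256()
    records = 0
    with Path(path).open("rb") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(b">"):
                records += 1
                # Mark the record boundary so bases moved between contigs are detected.
                digest.update(b">")
            else:
                digest.update(line)
    return records, digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical paths): python compare_prep.py server.prep gitpod.prep
    if len(sys.argv) == 3:
        first, second = (fasta_sequence_digest(p) for p in sys.argv[1:3])
        print("identical sequences" if first == second else "sequences differ")
```

If the two digests match, the sequence content is the same and the input file is unlikely to be the cause.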

Cheers,

Toby

P.S. The Gitpod information can be found in the README of this repository, just above the Recommended Installation with Conda or Mamba section. I've added a new hyperlink at the top to make this clearer.

@TobyBaril
Owner

Hi!

In addition to the above, there can sometimes be issues on systems where conda doesn't play nicely with other system installations of RepeatMasker. With the conda environment active, I would recommend checking which version/installation of RepeatMasker is being called, in case other installations are interfering with the conda one.
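
A minimal Python sketch of that check follows. The helper name `find_on_path` is hypothetical. It lists every executable named RepeatMasker in PATH order, so the first hit is the one that would run and any later hits are shadowed copies. The `CONDA_PREFIX` variable is the one conda sets when an environment is active.

```python
import os
from pathlib import Path


def find_on_path(program, path_value=None):
    """List every executable called `program`, in PATH order.

    The first entry is the one the shell would run; later entries are shadowed.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = Path(directory) / program
        if candidate.is_file() and os.access(candidate, os.X_OK):
            hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    # Run this with the Earl Grey environment active (requires Python 3.9+).
    prefix = os.environ.get("CONDA_PREFIX")
    for hit in find_on_path("RepeatMasker"):
        inside = prefix is not None and Path(hit).resolve().is_relative_to(
            Path(prefix).resolve()
        )
        print(hit, "(inside active env)" if inside else "(outside active env)")
```

If the first line printed is marked as outside the active environment, a system installation is taking precedence over the conda one.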

@trypchromics
Author

trypchromics commented Sep 17, 2024

Hi Toby.

Thank you very much for the help.

I checked, and the file genome.fasta.prep is identical to genome.fasta both on my server and on Gitpod (except for the chromosome names which, as you said, were changed to ctg_1 through ctg_99).

Since Trypanosoma cruzi is a non-model organism, I was surprised to discover that RepeatMasker recognized the NCBI TaxID 5693. That is why I would like to test Earl Grey with the -r option (besides running with default parameters and comparing the results). By the way, I really appreciated your observation about the number of ancestral TE families. Since I am not used to this kind of pipeline, I had no idea that 9 was a small number. I know that the Earl Grey documentation is not meant to explain the basics of TEs, but a note about this number in the explanation of the -r option might help inexperienced users like me.

Sorry for not seeing the Gitpod information in the GitHub README file. The installation is very well documented!

I tried some other things to solve the problem of the option -r:

  1. First, I thought that EDTA, installed under the same Anaconda installation, might be interfering with Earl Grey (even though each program was installed in a different environment). I completely removed my Anaconda installation and reinstalled Earl Grey in a fresh install of Anaconda. The same error occurred.

  2. Then, I thought that the version of Ubuntu Server might be the cause. Gitpod is using Ubuntu 22.04.4 and our server is using Ubuntu 20.04.6. But then the system administrator told me that he was able to run Earl Grey with -r on the same genome, using miniconda3 instead of Anaconda. This reminded me that while installing Earl Grey under Anaconda, when I used the command:
    mamba --channel conda-forge --channel bioconda earlgrey=4.4.4
    I got a message saying that Python 3.12 was a pinned package and that Earl Grey would not be installed because of its dependency on Python 3.9. So, I tried the command:
    mamba --channel conda-forge --channel bioconda earlgrey=4.4.5 python=3.9
    and the installation completed (but then I got the error described above).
    Following the sysadmin's suggestion, I removed my Anaconda installation and tried miniconda3. I was able to install Earl Grey without specifying python=3.9, but the same error occurs when I use the -r option.

I am trying to contact the sysadmin again to see if my commands are identical to his. Maybe I should clone his miniconda environment, but then we would still not know what is causing the error.

I also checked that our server has a system installation of RepeatMasker, but it is version 4.1.6, which differs from the version shown in the log that I copied into this issue (4.1.5). Anyway, the sysadmin managed to run Earl Grey even with this system installation of RepeatMasker present, so I have no idea what the problem is or what else I should try. :-(

Let me know if you have any other ideas. I am keen to solve this problem and run Earl Grey with the -r option on our server instead of on Gitpod.

One last thing that I noticed: I tried the new 4.4.5 version, but if I run just the command earlGrey, without any option, the output reports 4.4.4 instead of 4.4.5.

I am very grateful for all your help. Thanks again.

--
David da Silva Pires

@TobyBaril
Owner

Hi David,

This is really strange, but it does point to the issue being related to your specific conda configuration. As the issue is with RepeatMasker, I'm wondering whether this could be linked to conflicting perl installations...
If you run which perl in the base environment, and then again with the earlgrey environment active, you should get paths to different installations. If the path is the same, the global perl installation is overriding that of the conda environment, which could be upsetting RepeatMasker.

One potential way around this is to alter which perl installation is called by RepeatMasker, which does involve reconfiguring RepeatMasker within the conda environment to use the correct perl installation.
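
Before reconfiguring anything, it may help to see which perl the RepeatMasker script asks for. The Python sketch below assumes that the RepeatMasker executable found on PATH is a Perl script whose first line is a `#!` (shebang) line; the helper name `script_interpreter` is hypothetical. A shebang of `/usr/bin/env perl` resolves perl through PATH at run time, whereas an absolute path pins one specific installation.

```python
import shutil


def script_interpreter(script_path):
    """Return the interpreter command from a script's '#!' line, or None."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


if __name__ == "__main__":
    # Run this with the Earl Grey environment active.
    location = shutil.which("RepeatMasker")
    if location is None:
        print("RepeatMasker is not on PATH")
    else:
        print(location, "->", script_interpreter(location))
        print("perl on PATH ->", shutil.which("perl"))
```

If the interpreter reported for RepeatMasker points outside the conda environment while perl on PATH points inside it, the two disagree and the script is not using the environment's perl.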

@trypchromics
Author

Hello Toby,

I checked the Perl installation as you asked, and it seems that this is not the problem either:

[10:03:49] pires@vital:/ $ which perl
/usr/bin/perl
[10:03:56] pires@vital:/ :) $ perl -v

This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi
(with 60 registered patches, see perl -V for more detail)

Copyright 1987-2019, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

[10:03:58] pires@vital:/ :) $ mamba activate
(base) [10:04:06] pires@vital:/ $ which perl                                                                                                                           
/usr/bin/perl
(base) [10:04:09] pires@vital:/ $ perl -v                                                                                                                              

This is perl 5, version 30, subversion 0 (v5.30.0) built for x86_64-linux-gnu-thread-multi
(with 60 registered patches, see perl -V for more detail)

Copyright 1987-2019, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

(base) [10:04:14] pires@vital:/ $ mamba activate earlGrey
(earlGrey) [10:04:27] pires@vital:/ $ which perl             
/home/pires/local/src/miniconda3/2024-09-17/envs/earlGrey/bin/perl
(earlGrey) [10:04:31] pires@vital:/ $ perl -v                

This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Copyright 1987-2021, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

As you can see, the perl installations in base and earlGrey are different, as expected. The path looks OK to me.

I also compared the lists of installed packages (from conda list --explicit) on Gitpod, after creating a new workspace with Earl Grey 4.4.5, and on my server, after updating it with mamba update --all. Some package versions differ, but nothing that looks like the cause of the problem.

I am still waiting for an answer from the sysadmin to see what he did differently from me, and I will keep you informed as soon as I make any progress.
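
For anyone repeating this comparison, a small Python sketch is given below. It assumes the output of conda list --explicit from each machine has been saved as text, and that each non-comment line is a package URL ending in a name-version-build file name. The function names are hypothetical.

```python
def parse_explicit(text):
    """Map package name -> (version, build) from `conda list --explicit` output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@"):
            continue  # skip comments and the @EXPLICIT marker
        # Drop an optional '#<hash>' suffix, then keep only the file name.
        filename = line.split("#", 1)[0].rsplit("/", 1)[-1]
        for extension in (".tar.bz2", ".conda"):
            if filename.endswith(extension):
                filename = filename[: -len(extension)]
                break
        # File names follow name-version-build; the name itself may contain '-'.
        parts = filename.rsplit("-", 2)
        if len(parts) != 3:
            continue
        name, version, build = parts
        packages[name] = (version, build)
    return packages


def diff_environments(text_a, text_b):
    """Return {name: (entry_a, entry_b)} for packages that differ; None means absent."""
    a, b = parse_explicit(text_a), parse_explicit(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Calling `diff_environments` on the two saved listings returns only the packages whose version or build differs, which is easier to scan than the full lists.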

Thank you for all the help.

--
David da Silva Pires

@trypchromics
Author

Hi Toby.

I still cannot say what the problem is with Earl Grey when I run it with the -r option on our server. I copied configuration files such as .bashrc and .profile from the sysadmin's user account, overwriting my own, but the problem persists.

I removed my Anaconda installation and I am using miniconda now, but nothing has changed.

In another attempt, I tried to reinstall RepeatMasker, and there seems to be a problem with the Dfam library:

$ mamba install --force-reinstall repeatmasker                                                                                                   
                                                                               

Looking for: ['repeatmasker']                                                  

pkgs/r/linux-64                                               No change
pkgs/main/noarch                                              No change
pkgs/r/noarch                                                 No change
pkgs/main/linux-64                                   6.6MB @  11.2MB/s  0.6s
bioconda/noarch                                      4.4MB @   5.9MB/s  0.6s
bioconda/linux-64                                    4.7MB @   4.6MB/s  0.9s
conda-forge/noarch                                  16.9MB @  11.8MB/s  1.4s
conda-forge/linux-64                                39.0MB @  23.3MB/s  1.7s

Pinned packages:                                                               
  - python 3.9.*                                                               


warning  libmamba Invalid package cache, file '/home/pires/local/src/miniconda3/2024-09-17/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_1/share/RepeatMasker/Libraries/Dfam.h5' has incorrect size
Transaction                                                                    

  Prefix: /home/pires/local/src/miniconda3/2024-09-17/envs/earlgrey

  Updating specs:                                                              

   - repeatmasker                                                              


  Package         Version  Build             Channel        Size
──────────────────────────────────────────────────────────────────
  Reinstall:                                                                   
──────────────────────────────────────────────────────────────────

  o repeatmasker    4.1.5  pl5321hdfd78af_1  bioconda     Cached

  Summary:                                                                     

  Reinstall: 1 packages                                                        

  Total download: 0 B                                                          

──────────────────────────────────────────────────────────────────


Confirm changes: [Y/n]                                                         

Downloading and Extracting Packages:                                           

Preparing transaction: done                                                    
Verifying transaction: /                                                       
SafetyError: The package for repeatmasker located at /home/pires/local/src/miniconda3/2024-09-17/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_1                                                                                                                                                                                   
appears to be corrupted. The path 'share/RepeatMasker/Libraries/Dfam.h5'
has an incorrect size.                                                         
  reported size: 68 bytes                                                      
  actual size: 5861387976 bytes                                                


done                                                                           
Executing transaction: done                                                    

The interesting thing is that after the forced reinstallation the problem persists (the same message is shown if I execute the reinstall command again).

As a last attempt, I will ask the sysadmin to remove the system RepeatMasker installation from the server, to test whether there is some kind of version conflict, but I have no idea what else I could try if this doesn't work. :-(

I'll keep you updated on the ongoing tests.
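
The size check that produces this SafetyError can be reproduced by hand to see which files in the cached package differ from what the package metadata records. The Python sketch below is an illustration only: it assumes the extracted package directory contains an info/paths.json file whose entries carry `_path` and `size_in_bytes` keys, and the function name `size_mismatches` is hypothetical.

```python
import json
from pathlib import Path


def size_mismatches(package_dir):
    """Compare recorded and actual file sizes for one extracted conda package.

    Returns a list of (relative_path, recorded_size, actual_size) tuples;
    actual_size is None when the file is missing.
    """
    package_dir = Path(package_dir)
    metadata = json.loads((package_dir / "info" / "paths.json").read_text())
    mismatches = []
    for entry in metadata.get("paths", []):
        recorded = entry.get("size_in_bytes")
        if recorded is None:
            continue  # no size recorded for this entry
        target = package_dir / entry["_path"]
        actual = target.stat().st_size if target.is_file() else None
        if actual != recorded:
            mismatches.append((entry["_path"], recorded, actual))
    return mismatches
```

Pointing it at the package directory named in the error message would show whether Dfam.h5 is the only file affected.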

P.S.: Now I am trying the new version v5.0.0. I would like to note two typos:

  1. At the "-l" documentation, change "inital" to "initial".
  2. The "-e" documentation is not indented together with the other options.

Thank you very much for the great software. Best regards.

--
David
