rsync_from_ncbi.pl: FTP connection error: [Net::FTP] Timeout #895

Open
DukeVimes opened this issue Dec 5, 2024 · 16 comments

Comments

@DukeVimes

DukeVimes commented Dec 5, 2024

Using Kraken version 2.1.3
kraken2-build --standard --use-ftp --threads 24 --db test-2024-12-05

I get:

Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
Step 1/2: Performing ftp file transfer of requested files
rsync_from_ncbi.pl: FTP connection error: [Net::FTP] Timeout

Without the --use-ftp option I can't even download the taxon map.

I reduced rsync_from_ncbi.pl to the part that creates the FTP connection, which does seem to work (at least I was able to download check.txt):

#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;
use Getopt::Std;
use Net::FTP;
use List::Util qw/max/;

my $PROG = "TEST";
my $SERVER = "ftp.ncbi.nlm.nih.gov";
my $SERVER_PATH = "/genomes";
my $FTP_USER = "anonymous";
my $FTP_PASS = "kraken2download";

sub ftp_connection {
    my $ftp = Net::FTP->new($SERVER, Passive => 1)
        or die "$PROG: FTP connection error: $@\n";
    $ftp->login($FTP_USER, $FTP_PASS)
        or die "$PROG: FTP login error: " . $ftp->message() . "\n";
    $ftp->binary()
        or die "$PROG: FTP binary mode error: " . $ftp->message() . "\n";
    $ftp->cwd($SERVER_PATH)
        or die "$PROG: FTP CD error: " . $ftp->message() . "\n";
    return $ftp;
}

my $ftp = ftp_connection();
warn "$PROG: got an FTP connection\n";
# Fetch a small file to confirm that data transfers work, not just the login.
$ftp->get('check.txt')
    or warn "$PROG: unable to download check.txt: " . $ftp->message() . "\n";
warn "$PROG: ftp message: " . $ftp->message() . "\n";
$ftp->quit;
@DukeVimes
Author

I tried to see why rsync isn't working for me. With telnet ftp.ncbi.nlm.nih.gov 873 I get a connection to the NCBI rsync server, which indicates to me that port 873 is reachable; nevertheless, I temporarily disabled the firewall. Alas, rsync -v --list-only rsync://ftp.ncbi.nlm.nih.gov/genomes doesn't return either.
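
The telnet test above can be repeated for several ports at once. The following is a minimal sketch, not part of Kraken 2; the function names `port_open` and `check_ports` are made up for illustration. Like telnet, it only shows that a TCP connection can be opened. It says nothing about whether the rsync or FTP session behind that port completes.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_ports(host, ports, timeout=5.0):
    """Map each port number to True (reachable) or False (not reachable)."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `check_ports("ftp.ncbi.nlm.nih.gov", [21, 443, 873])` would probe the FTP, HTTPS and rsync ports in one call.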

@ch4rr0
Collaborator

ch4rr0 commented Dec 6, 2024

Hello,

Can you give k2 a try? It is a Python script that we are working on to replace the current suite of wrapper scripts.

You can download a library from NCBI like so:

k2 download-library --db <db> --library <library> --threads <n>

The --threads parameter is specific to k2 and will specify the number of connections used to fetch accession files and the number of processes used for post-processing said files.

To build a database you can run: k2 build --db <db> --threads <threads>

The standard database can be built using the command: k2 build --standard --db standard --threads <n>

N.B. the k2 script that shipped with the current release of Kraken 2 was very much a work in progress. If you are planning on using the script, please fetch the most recent version from the Kraken 2 master/main branch.

As always, feedback is very much appreciated.

@DukeVimes
Author

I tried, but this led to multiple follow-up problems (most probably due to myself).
First I naively cloned master/HEAD and ran k2 build --standard --db /mnt/db/db_acc/k2_test --threads 8.

This successfully downloaded a lot of files, but couldn't find the k2mask process.
Indeed, k2mask isn't in the path.

Next I tried a fresh, complete installation (without conda) using ./install_kraken2.sh, which resulted in "Kraken 2 installation complete". I symlinked kraken2, kraken2-build and kraken2-inspect into the path.

But now ./k2 build --standard --db /mnt/db/db_acc/k2_test --threads 8 fails immediately with http.client.BadStatusLine: c8 in client.py _read_status:

Traceback (most recent call last):
  File "/mnt/miniconda/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/db/db_acc/k2_bin/kraken2/scripts/./k2", line 1357, in http_download_file2
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/mnt/miniconda/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/mnt/miniconda/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/mnt/miniconda/lib/python3.12/http/client.py", line 313, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: c8

@ch4rr0
Collaborator

ch4rr0 commented Dec 9, 2024

Ah, it is trying to use a masker that we added to kraken2 in the last release. You can either build it yourself and copy it to a location in your $PATH or use the --no-masking flag to skip the masking process.

Did you experience any issues with downloading the libraries?
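
The choice described above (put the masker on $PATH, or pass --no-masking) can be automated in a wrapper. This is only a sketch: the helper name `k2_build_args` is hypothetical, and it uses just the flags and the `k2mask` binary name that appear in this thread.

```python
import shutil


def k2_build_args(db, threads, standard=True, masker="k2mask"):
    """Assemble a `k2 build` command line as a list of arguments.

    --no-masking is appended when the masker binary is not found on PATH.
    """
    args = ["k2", "build"]
    if standard:
        args.append("--standard")
    args += ["--db", db, "--threads", str(threads)]
    if shutil.which(masker) is None:
        # Masker not installed or not on PATH: skip the masking step.
        args.append("--no-masking")
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`.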

@mohitsharma-123

mohitsharma-123 commented Dec 10, 2024

Unable to download the kraken2 standard database using
kraken2-build --standard --threads 96 --db kraken2db
or
kraken2-build --standard --threads 96 --db kraken2db/ --use-ftp

@mohitsharma-123

(kraken2) mohitsharma@deep:/data/mohitsharma$ kraken2-build --standard --threads 96 --db kraken2db
Downloading nucleotide gb accession to taxon map...rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.13): Connection timed out (110)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.12): Connection timed out (110)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.31): Connection timed out (110)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.10): Connection timed out (110)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.11): Connection timed out (110)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (130.14.250.7): Connection timed out (110)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::31): Network is unreachable (101)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::11): Network is unreachable (101)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::12): Network is unreachable (101)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::13): Network is unreachable (101)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::10): Network is unreachable (101)
rsync: [Receiver] failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::7): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(139) [Receiver=3.3.0]

@mohitsharma-123

(kraken2) mohitsharma@deep:~$ kraken2-build --standard --threads 96 --db kraken2_db/ --use-ftp
Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
Step 1/2: Performing ftp file transfer of requested files
rsync_from_ncbi.pl: unable to download all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz on try 1 of 5: Scanning file for viruses.
There'll be a delay while we scan for viruses.
Opening BINARY mode data connection for all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz (1321507 bytes)
Scanning for viruses.
Scanning for viruses.
Idle timeout (60 seconds): closing control connection

rsync_from_ncbi.pl: unable to download all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz on try 2 of 5: [Net::FTP] Connection closed
rsync_from_ncbi.pl: unable to download all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz on try 3 of 5: [Net::FTP] Connection closed
rsync_from_ncbi.pl: unable to download all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz on try 4 of 5: [Net::FTP] Connection closed
rsync_from_ncbi.pl: unable to download all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz on try 5 of 5: [Net::FTP] Connection closed
rsync_from_ncbi.pl: unable to download ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/238/205/GCF_023238205.1_ASM2323820v1/GCF_023238205.1_ASM2323820v1_genomic.fna.gz
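
In the log above, the first attempt ends with the server closing the control connection after the 60-second idle timeout, and tries 2 to 5 then fail at once with "Connection closed". That pattern suggests the retries ran on the session that was already closed; this is an inference from the log, not something checked against rsync_from_ncbi.pl. A minimal sketch of a retry loop that opens a fresh session per attempt, using Python's ftplib, follows. The names `download_with_reconnect` and `ncbi_connect` are hypothetical; the host, login and path are the ones from the test script at the top of this thread.

```python
import ftplib


def ncbi_connect():
    """Open an anonymous FTP session (values taken from the script above)."""
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov", timeout=60)
    ftp.login("anonymous", "kraken2download")
    ftp.cwd("/genomes")
    return ftp


def download_with_reconnect(connect, remote_path, local_path, tries=5):
    """Download remote_path, calling connect() anew for every attempt.

    Returns the number of the attempt that succeeded.
    """
    last_error = None
    for attempt in range(1, tries + 1):
        ftp = None
        try:
            ftp = connect()
            with open(local_path, "wb") as out:
                ftp.retrbinary(f"RETR {remote_path}", out.write)
            return attempt
        except ftplib.all_errors as exc:
            last_error = exc
        finally:
            if ftp is not None:
                ftp.close()  # drop the session; never reuse it
    raise RuntimeError(
        f"unable to download {remote_path} after {tries} tries"
    ) from last_error
```

Passing the connection factory as an argument keeps the retry logic independent of the server, so it can be exercised without network access.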

@ch4rr0
Collaborator

ch4rr0 commented Dec 13, 2024

Did you try the https://github.com/DerrickWood/kraken2/blob/master/scripts/k2 recommended earlier in this thread?

@mohitsharma-123

mohitsharma-123 commented Dec 17, 2024 via email

@DukeVimes
Author

Using the k2 script and the option --no-masking, I get the following results:

Under library I see only

  • archaea
  • bacteria

I try to build the database on the cluster, and I add the installation directory to the PATH using:

LSBATCH: User input
export PATH=/mnt/db/db_acc/kraken2_standard/kraken2:$PATH &&    /mnt/db/db_acc/kraken2_standard/kraken2/k2 build --standard --no-masking --db /mnt/db/db_acc/kraken2_standard/new/dataset --threads 128 &&    /mnt/db/db_acc/kraken2_standard/kraken2/k2 inspect --db /mnt/db/db_acc/kraken2_standard/new/dataset --threads 128

Despite using the --no-masking option, the job fails basically immediately with the "BadStatusLine c8":

Traceback (most recent call last):
  File "/mnt/db/db_acc/kraken2_standard/kraken2/k2", line 3592, in <module>
    k2_main()
    ~~~~~~~^^
  File "/mnt/db/db_acc/kraken2_standard/kraken2/k2", line 3561, in k2_main
    build_standard_database(args)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/mnt/db/db_acc/kraken2_standard/kraken2/k2", line 2387, in build_standard_database
    download_taxonomy(args)
    ~~~~~~~~~~~~~~~~~^^^^^^
  File "/mnt/db/db_acc/kraken2_standard/kraken2/k2", line 1539, in download_taxonomy
    download_count = future.result()
  File "/home/refdata_acc/.conda/envs/kraken2/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/refdata_acc/.conda/envs/kraken2/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
http.client.BadStatusLine: c8
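
The traceback above shows the exception being re-raised by future.result() in the main process, without naming the file that failed. A small sketch of how such failures could be collected per URL instead of aborting the whole run is shown here. It is not how k2 is written: `run_downloads` and the `fetch` callable are hypothetical, and a thread pool is used for brevity although the earlier traceback in this thread shows a process-pool worker.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_downloads(fetch, urls, max_workers=4):
    """Call fetch(url) for every URL and separate successes from failures.

    Returns (results, failures): two dicts keyed by URL.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() re-raises whatever the worker raised.
                results[url] = future.result()
            except http.client.HTTPException as exc:
                # BadStatusLine is a subclass of HTTPException.
                failures[url] = exc
    return results, failures
```

Printing the keys of `failures` would then show which download triggered the BadStatusLine.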

@ch4rr0
Collaborator

ch4rr0 commented Dec 18, 2024

This seems like an issue with downloading the taxonomy. It is not related to the --no-masking option.

@mohitsharma-123 -- your output seems different from the recent version of the k2 script. It is very likely that you are using the old script that uses FTP for downloading from NCBI.

@ch4rr0
Collaborator

ch4rr0 commented Dec 18, 2024

@DukeVimes -- I pushed a commit fixing the download-taxonomy issue. Thank you for your feedback!

@DukeVimes
Author

I am on commit 7221593 of the repo (Nov. 19th), which is the current HEAD.
I used ./install_kraken2.sh /mnt/db/db_acc/kraken2_standard/kraken2
The files in that directory are from December 18th; the md5sum of k2 is 7af7f691671e222e0509d64fcebb5e21

On the LSF cluster, the line /mnt/db/db_acc/kraken2_standard/kraken2/k2 build --standard --no-masking --db /mnt/db/db_acc/kraken2_standard/new/dataset --threads 128 is executed.
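
A checksum like the one quoted above is a quick way to tell whether two installations run the same k2 script. As a minimal sketch (the function name `file_md5` is made up for illustration), the same value that md5sum prints can be computed from Python:

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `file_md5("/mnt/db/db_acc/kraken2_standard/kraken2/k2")` with the digest of a freshly pulled copy would show whether the installed script is out of date.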

@ch4rr0
Oh, sorry I got confused about which conversation we are having. I'll pull and try it again.
I would like to thank you (and the whole team) for the very prompt responses!

@mohitsharma-123

I installed kraken2 using mamba, then used the k2 script with the command k2 build --standard --db standard --threads 96

@DukeVimes
Author

@ch4rr0
Very happy to report that with the newest commit the download apparently ran through! Thank you very much!
I have a follow-up question: how long is estimate_capacity supposed to run? It hit the 12-hour wall-time limit on the queue I was using :-/, despite 128 threads...

@ch4rr0
Collaborator

ch4rr0 commented Dec 20, 2024

Unfortunately that was due to a bug that I just now pushed a fix for. I apologize for the inconvenience.
