HTTP: get more precise remote file size from content-length header instead of http.group.file.size #50

Open
abretaud opened this issue Aug 23, 2024 · 3 comments

Comments

@abretaud
Member

For example, at https://ftp.ncbi.nlm.nih.gov/blast/db/ the file sizes in the listing are given in KB/MB/GB, so they are only approximate.
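To illustrate the loss of precision, here is a minimal sketch. It assumes the listing prints sizes like `1.2M` with 1024-based units; `parse_listing_size` is a hypothetical helper, not code from biomaj-download.

```python
import re

# Assumption: the HTML listing shows sizes such as "512", "34K", "1.2M", "3.4G"
# with 1024-based multipliers. parse_listing_size is a hypothetical helper.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_listing_size(text):
    """Convert a human-readable size from a directory listing to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]?)B?\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.upper()])


exact_size = 1_300_000                     # real size of the file in bytes
listed_size = parse_listing_size("1.2M")   # best we can recover from the listing
sizes_agree = listed_size == exact_size    # False: the rounding is not reversible
```

A 1,300,000-byte file is displayed as roughly `1.2M`, and converting that back to bytes cannot recover the original value.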

The problem is that this size (together with the last modification date) is used to check whether files are already present in the offline dir (e.g. after a failed bank update attempt). Since the approximate size does not match the size of the local file, all files get redownloaded on each bank update attempt, even if some are already present locally.
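Roughly, the check behaves like the sketch below. This is an illustration only, not the actual biomaj-download logic; `already_downloaded`, the file name and the timestamp are made up for the example.

```python
import os
import tempfile


def already_downloaded(offline_dir, name, expected_size, expected_mtime):
    """Return True if a local file matches the expected size and date.

    Hypothetical helper: compares the size in bytes and the modification
    time (truncated to whole seconds) against the values from the listing.
    """
    path = os.path.join(offline_dir, name)
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return False
    return info.st_size == expected_size and int(info.st_mtime) == int(expected_mtime)


with tempfile.TemporaryDirectory() as offline_dir:
    # Simulate a file left over from a failed update attempt.
    path = os.path.join(offline_dir, "example.tar.gz")
    with open(path, "wb") as handle:
        handle.write(b"\0" * 1_300_000)
    os.utime(path, (1_724_371_200, 1_724_371_200))  # arbitrary timestamp

    # Exact size: the file is recognized as already present.
    exact_match = already_downloaded(offline_dir, "example.tar.gz", 1_300_000, 1_724_371_200)
    # Size recovered from a rounded "1.2M" listing entry: no match, so redownload.
    rounded_match = already_downloaded(offline_dir, "example.tar.gz", 1_258_291, 1_724_371_200)
```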

Probably needs to be done around https://github.com/genouest/biomaj-download/blob/master/biomaj_download/download/curl.py#L375

It's already implemented in the direct http downloader https://github.com/genouest/biomaj-download/blob/master/biomaj_download/download/direct.py#L233

Not sure if we want it to be configurable. Maybe just do it when http.group.file.size=-1 is set in the properties file?
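Something along these lines, as a sketch only. It uses `urllib` from the standard library rather than the curl-based downloader, and `size_from_headers`, `remote_size` and `resolve_size` are hypothetical names. Treating `-1` as "ask the server" is the proposal above, not existing behavior.

```python
import urllib.request


def size_from_headers(headers):
    """Return Content-Length as an int, or None if it is missing or invalid."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


def remote_size(url, timeout=30):
    """Issue a HEAD request and read the size from the response headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return size_from_headers(response.headers)


def resolve_size(url, size_group, listed_size, fetch=remote_size):
    """Pick the size source for one remote file.

    Assumption: size_group mirrors http.group.file.size, and -1 means
    "ignore the listing and use Content-Length". If the server sends no
    usable header, fall back to the size parsed from the listing.
    """
    if size_group == -1:
        exact = fetch(url)
        if exact is not None:
            return exact
    return listed_size
```

The `fetch` parameter is only there so the selection logic can be exercised without a network call.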

@mboudet
Member

mboudet commented Sep 13, 2024

Should be closed by #51

@osallou
Contributor

osallou commented Sep 13, 2024

Group file size is (also) obtained during the listing step, and used with the date to decide whether a file should be downloaded (the data provided depends on protocols and server config).

Using content-length will be exact, but it may differ from the listing info, in which case the file will be downloaded again on the next check.
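A small sketch of that concern, assuming a simple size-plus-date comparison. The record shape and `needs_download` are hypothetical, not the actual biomaj comparison code.

```python
# Hypothetical record shape: {"size": <bytes>, "date": "YYYY-MM-DD"}.
def needs_download(remote, stored):
    """Return True when the remote record differs from what was stored."""
    if stored is None:
        return True
    return remote["size"] != stored["size"] or remote["date"] != stored["date"]


# Size saved after a download that used the exact Content-Length value.
stored = {"size": 1_300_000, "date": "2024-08-23"}
# Same file on the next check, with the size parsed from a rounded "1.2M" listing entry.
from_listing = {"size": 1_258_291, "date": "2024-08-23"}
# Same file on the next check, with the size taken from Content-Length again.
from_header = {"size": 1_300_000, "date": "2024-08-23"}

mixed_sources = needs_download(from_listing, stored)  # sizes disagree
same_source = needs_download(from_header, stored)     # sizes agree
```

The comparison only stays stable if both sides of it get their size from the same source.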

@mboudet
Member

mboudet commented Sep 13, 2024

Hmm. The listing step uses unix file information I think? For the files already on disk.
In any case, directly parsing the HTML info will break every time, since sizes are usually not printed in bytes.

Hopefully content-length will work a bit better, but we'll see.
