Invalid mirror reported with repoquery --location #1673

Open
praiskup opened this issue Sep 3, 2024 · 15 comments

@praiskup
Member

praiskup commented Sep 3, 2024

Note this output:

[root@ip-172-30-2-126 /]# dnf5 repoquery --location bash-0:5.2.32-1.fc41.x86_64
Updating and loading repositories:
 fedora                                 100% | 236.2 KiB/s |  27.2 KiB |  00m00s
Repositories loaded.
https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os/Packages/b/bash-5.2.32-1.fc41.x86_64.rpm

The bash package with Release=1 is installed there, so I asked DNF where it comes from. It reports a cloudfront.net URL, but that mirror already provides the Release=2 package.

It seems that DNF picked a wrong mirror to load (outdated) metadata from, and then assumed that all the packages in the bad metadata are also provided by the (up-to-date) cloudfront mirror, which is no longer the case.

See also: https://pagure.io/fedora-infrastructure/issue/12163

This happens with:

rpm-sequoia-1.7.0-2.fc41.x86_64
rpm-libs-4.19.92-6.fc41.x86_64
rpm-build-libs-4.19.92-6.fc41.x86_64
libsolv-0.7.30-1.fc41.x86_64
libdnf5-5.2.5.0-2.fc41.x86_64
libdnf5-cli-5.2.5.0-2.fc41.x86_64
dnf5-5.2.5.0-2.fc41.x86_64
dnf5-plugins-5.2.5.0-2.fc41.x86_64
rpm-4.19.92-6.fc41.x86_64

It also happens with DNF4 on F40 in the same location (EC2 box, us-east-1).

@ppisar
Contributor

ppisar commented Sep 3, 2024

I don't understand what bug you are reporting.

"dnf5 repoquery --location" does not report where a package was installed from. It reports where a package would be downloaded from. DNF does not validate whether the file exists there. It simply picks a file path from the cached repository data and appends it to a mirror URL returned in a cached mirror manager response.

Naturally, if the content of the mirror has changed in the meantime, the printed URL will be invalid. DNF cannot know until it tries to request that URL from the server.
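
The composition described above can be sketched in a few lines of Python. This is only an illustration, not DNF's implementation: the helper name `package_url` is hypothetical, and the mirror base and package path are taken from the output earlier in this thread.

```python
from urllib.parse import urljoin


def package_url(mirror_base: str, location_href: str) -> str:
    """Join a mirror base URL with a package's relative path from repo metadata."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so normalize the base first.
    base = mirror_base if mirror_base.endswith("/") else mirror_base + "/"
    return urljoin(base, location_href)


# Values copied from the thread; nothing here contacts the mirror.
mirror = "https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os"
href = "Packages/b/bash-5.2.32-1.fc41.x86_64.rpm"
print(package_url(mirror, href))
```

No request is made at any point, which is why the printed URL can point at a file the mirror no longer serves.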

What behavior do you expect?

@praiskup
Member Author

praiskup commented Sep 3, 2024

This is not a race condition :)

> "dnf5 repoquery --location" does not report where a package was installed from. It reports where a package would be downloaded from.

Yes, except that it would not, and the report was wrong. The package was available on a different mirror, not on the one in the reported URL.

> Naturally, if the content of the mirror has changed in the meantime, the printed URL will be invalid. DNF cannot know until it tries to request that URL from the server.

No content has changed in the meantime. Both before and after asking dnf5, d2lzkl7pfhq30w.cloudfront.net did not provide the reported URL; it had moved to a different metadata version a long time before. I also ran this query:

[root@ip-172-30-2-126 ~]# dnf repoquery --disablerepo '*' --enablerepo=hell --repofrompath=hell,https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os/ -a | grep bash
Added hell repo from https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os/
Last metadata expiration check: 0:00:59 ago on Tue 03 Sep 2024 01:03:30 PM UTC.
argbash-0:2.10.0-15.fc41.noarch
augeas-bash-completion-0:1.14.1-2.fc41.noarch
autorandr-bash-completion-0:1.13.3-5.fc41.noarch
bash-0:5.2.32-2.fc42.x86_64
bash-argsparse-0:1.8-5.fc41.noarch
bash-color-prompt-0:0.5-2.fc41.noarch
bash-completion-1:2.13-2.fc41.noarch
bash-completion-devel-1:2.13-2.fc41.noarch
bash-devel-0:5.2.32-2.fc42.x86_64

At that point in time, the cloudfront repo was correctly providing newer metadata and DNF did not know that.

> What behavior do you expect?

DNF should do some basic validation of mirrors, and provide a valid URL (race conditions are acceptable of course).
If one mirror is chosen for reading the metadata, --location results shouldn't be mixed up with mirrors that provide a different version of metadata (newer or older).

@praiskup
Member Author

praiskup commented Sep 3, 2024

One more example:

[root@ip-172-30-2-126 /]# curl https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os/repodata/repomd.xml https://mirror.slu.cz/fedora/linux/development/rawhide/Everything/x86_64/os/repodata/repomd.xml 2>/dev/null | grep revision
  <revision>1725345239</revision>
  <revision>1725258838</revision>

These mirrors are obviously desynced. RPMs provided by the first mirror may not
necessarily be part of the second one.
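
The comparison above can also be done programmatically. The sketch below, in Python, reads the `<revision>` element of two repomd.xml documents and reports whether they match. The XML snippets are trimmed-down stand-ins carrying the two revision values shown above, the function names are hypothetical, and the code assumes the revision is an integer timestamp, as it is on these Fedora mirrors.

```python
import xml.etree.ElementTree as ET

# Default XML namespace used by repomd.xml.
REPO_NS = "{http://linux.duke.edu/metadata/repo}"


def repomd_revision(repomd_xml: str) -> int:
    """Extract the <revision> value from the text of a repomd.xml file."""
    root = ET.fromstring(repomd_xml)
    node = root.find(f"{REPO_NS}revision")
    if node is None or not (node.text or "").strip():
        raise ValueError("repomd.xml has no <revision> element")
    # Assumption: the revision is a numeric timestamp.
    return int(node.text.strip())


def mirrors_in_sync(revisions: dict) -> bool:
    """True when every mirror reports the same revision."""
    return len(set(revisions.values())) <= 1


# Trimmed-down stand-ins for the two documents fetched with curl above.
CLOUDFRONT = '<repomd xmlns="http://linux.duke.edu/metadata/repo"><revision>1725345239</revision></repomd>'
SLU = '<repomd xmlns="http://linux.duke.edu/metadata/repo"><revision>1725258838</revision></repomd>'

revisions = {
    "d2lzkl7pfhq30w.cloudfront.net": repomd_revision(CLOUDFRONT),
    "mirror.slu.cz": repomd_revision(SLU),
}
print(revisions, "in sync:", mirrors_in_sync(revisions))
```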

Other questions might come up, like

  • why DNF would pick an older-than-necessary mirror
  • why the old (desynced) mirror was not dropped by the mirror manager

But these are orthogonal to this ticket; these problems happen from time to time and DNF should deal with them.

@ppisar
Contributor

ppisar commented Sep 3, 2024

DNF obtains a list of up-to-date mirrors from a mirror manager. If there are outdated mirrors on the list, it's a bug in the mirror manager. DNF relies on the list and exploits it for parallel fetching from multiple mirrors. If a download fails, DNF retries from another mirror.

So yes, "dnf repoquery --location" can return a URL of a nonexistent document.

I think we could change "dnf repoquery --location" to always use the mirror it downloaded the repository from. The original URL is cached somewhere, I believe.

But I don't believe that DNF should actively recheck that a given mirror contains the same revision of the repository. It would be too expensive. Or would you expect "dnf repoquery --location" to do a GET/HEAD request on every invocation?
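
For a sense of what such a check would involve, here is a minimal sketch of a HEAD probe in Python. It is not something DNF does today; `url_exists` and its `opener` parameter are hypothetical names, and the cost concern above still applies because every call is one network round trip.

```python
import urllib.error
import urllib.request


def url_exists(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server answers with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.URLError:
        # HTTPError (e.g. 404) is a subclass of URLError, so missing files and
        # unreachable hosts both end up here.
        return False
```

The `opener` argument exists only so the function can be exercised without network access; a consumer of --location output would call `url_exists(url)` with the default.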

@praiskup
Member Author

praiskup commented Sep 3, 2024

> I think we could change "dnf repoquery --location" to always use the mirror it downloaded the repository from. The original URL is cached somewhere, I believe.

That sounds better! By the way, how are the particular repositories picked (the one to read metadata from and the one reported)? If there's no GET/HEAD checking, is there some intentional round-robin mechanism?

> Or would you expect "dnf repoquery --location" to do a GET/HEAD request on every invocation?

For particular RPMs? Probably not (even though our use case would be OK with that). For repomd.xml? Maybe. There's something DNF can do for potential --location consumers: such a tool has no info about other potential mirrors, so it cannot implement a fallback itself.

@ppisar
Contributor

ppisar commented Sep 3, 2024

> how are the particular repositories picked (the one to read metadata from and the one reported)?
> If there's no GET/HEAD checking, is there some intentional round-robin mechanism?

If I remember correctly, metadata are fetched from the first item on the list returned by a mirror manager. The mirror manager sorts the list based on the location of the client, excluding outdated mirrors. The mirror manager's reply also contains the current repository revision. DNF then checks that the mirror contains that revision. If it doesn't, the next mirror on the list is tried.
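
The selection logic as recalled above might look roughly like this in Python. It is a sketch of the described behavior, not of librepo's code: `pick_metadata_mirror` and the `fetch_revision` callback are hypothetical, and the mirror names and revisions reuse the values from the curl output earlier in the thread.

```python
from typing import Callable, Iterable, Optional


def pick_metadata_mirror(
    mirrors: Iterable[str],
    expected_revision: int,
    fetch_revision: Callable[[str], int],
) -> Optional[str]:
    """Return the first mirror whose revision matches the one announced by the mirror manager."""
    for mirror in mirrors:
        try:
            revision = fetch_revision(mirror)
        except OSError:
            # Unreachable mirror: move on to the next one on the list.
            continue
        if revision == expected_revision:
            return mirror
    return None


# Stand-in for real HTTP fetches, using the revisions seen in the thread.
known = {"mirror.slu.cz": 1725258838, "d2lzkl7pfhq30w.cloudfront.net": 1725345239}
chosen = pick_metadata_mirror(
    ["mirror.slu.cz", "d2lzkl7pfhq30w.cloudfront.net"],
    expected_revision=1725345239,
    fetch_revision=known.__getitem__,
)
print(chosen)
```

With this logic the desynced mirror is skipped even though it is first on the list, which is why the reported symptom is surprising if the check really works this way.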

Regarding downloading packages, I don't know. I believe that the mirrors are tried in the same order as on the list. I have no idea whether some kind of round robin is employed. Maybe the list is already randomized by the mirror manager. But I really don't know. If you are interested, read the librepo sources.

@praiskup
Member Author

praiskup commented Sep 4, 2024

The way you describe how the mirroring works, it seems robust. If DNF checks the revision, I'm curious how the problem could appear.

@praiskup
Member Author

A similar problem was reported by @kdudka for the epel-7 (yum-utils!) build chroot; occasionally, an outdated EPEL 7 mirror is chosen for loading metadata, and the corresponding RPMs are not available (e.g. failed attempts to install the wrong epel-rpm-macros-7.23, while we should install epel-rpm-macros-7.38).

@kontura
Contributor

kontura commented Sep 27, 2024

I don't believe dnf stores where the current metadata were downloaded from, and even if it did, I don't think it would be that useful. The mirror where the metadata were last downloaded from is just as likely to have moved on as any other.

> Note this output:
>
> [root@ip-172-30-2-126 /]# dnf5 repoquery --location bash-0:5.2.32-1.fc41.x86_64
> Updating and loading repositories:
>  fedora                                 100% | 236.2 KiB/s |  27.2 KiB |  00m00s
> Repositories loaded.
> https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os/Packages/b/bash-5.2.32-1.fc41.x86_64.rpm

In the provided output, is the updated fedora repo the only available repo? Is it the rawhide repo?
I wonder whether this could be caused by stale metadata.

Can you reproduce it with --refresh?

@praiskup
Member Author

> The mirror where the metadata were last downloaded from is just as likely to have moved on as any other.

I reported this when I was debugging mock --calculate-build-dependencies, which itself has metadata_expire=0. I did a lot of experiments after that in the same (EC2) location. I also tried DNF4 and DNF5 in different containers at that time, and the result was the same. In other words, the cache was at most a few minutes old (but originally in Mock I bet just a few seconds), and the mirror problem lasted tens of minutes at least.

The problem disappeared, and I can't reproduce it now.

> even if it did I don't think it would be that useful

We use metadata_expire=0. Yes, still racy, but there would be at least a bit of a chance of providing correct info in this situation. Of course, if we could pick the mirror with the greatest revision number the race would be non-existent.
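
Picking the mirror with the greatest revision could be sketched as follows in Python. DNF offers no such option as far as this thread establishes; `freshest_mirror` is a hypothetical helper, and the revisions are assumed to have been collected already, for example from each mirror's repomd.xml.

```python
def freshest_mirror(revisions: dict) -> str:
    """Return the mirror URL that reports the highest repository revision."""
    if not revisions:
        raise ValueError("no mirrors to choose from")
    return max(revisions, key=revisions.get)


# Revisions as observed with curl earlier in the thread.
observed = {
    "https://d2lzkl7pfhq30w.cloudfront.net/pub/fedora/linux/development/rawhide/Everything/x86_64/os/": 1725345239,
    "https://mirror.slu.cz/fedora/linux/development/rawhide/Everything/x86_64/os/": 1725258838,
}
print(freshest_mirror(observed))
```

Reading metadata and reporting package URLs from that same mirror would keep the two consistent, at the price of one repomd.xml request per candidate mirror.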

@ppisar
Contributor

ppisar commented Sep 27, 2024

> I don't believe dnf stores where the current metadata were downloaded from

I thought that "dnf repo info" shows real data. You are right. It simply shows the first item on the mirror list.

@kontura
Contributor

kontura commented Sep 30, 2024

Interesting, by any chance do you have the logs (dnf5.log) from when it was reproducible? It lists all the mirrors so it should clarify what is going on.

I have found that the Fedora metalink contains alternates which have checksums for older metadata, so it's possible to download them from mirrors that have not been updated.

(On a side note: while librepo takes the alternates into account and happily downloads such metadata, dnf doesn't read them on subsequent runs and considers such metadata out of date. Incidentally, this might be good behavior? Though it could lead to repeated downloads of the same metadata.)

I can see one possibility for how this bug could happen with metadata_expire=0: is the fastestmirror option set? It reorders the mirrors in librepo, which could then download older metadata, but repoquery doesn't know about this and picks the first metalink mirror.
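
To make the suspected mismatch concrete, here is a small Python model of it. It is a simplification of the hypothesis above, not of librepo itself: the function name and the latency figures are made up for illustration.

```python
def metadata_and_reported_mirror(metalink_order, latency_ms, fastestmirror):
    """Return (mirror used to download metadata, mirror used in the --location URL)."""
    if fastestmirror:
        # Assumed behavior: the download list is re-sorted by measured latency.
        download_order = sorted(metalink_order, key=lambda mirror: latency_ms[mirror])
    else:
        download_order = list(metalink_order)
    # Per the hypothesis, repoquery ignores the reordering and always
    # uses the first entry of the metalink.
    return download_order[0], metalink_order[0]


metalink = ["d2lzkl7pfhq30w.cloudfront.net", "mirror.slu.cz"]
latency = {"d2lzkl7pfhq30w.cloudfront.net": 40.0, "mirror.slu.cz": 15.0}  # hypothetical values
print(metadata_and_reported_mirror(metalink, latency, fastestmirror=True))
print(metadata_and_reported_mirror(metalink, latency, fastestmirror=False))
```

In the first case the two mirrors differ, so a stale metadata source paired with a fresh reported mirror would produce exactly the kind of dead URL from the original report.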

@kontura
Contributor

kontura commented Oct 4, 2024

@praiskup ^

@praiskup
Member Author

praiskup commented Oct 4, 2024

Unfortunately, I don't have access to the dnf5.log; it is challenging to get it in our testing-farm test suite, and this issue is rarely reproducible. I'm waiting for it to happen again. For the affected Mock tests, we haven't set fastestmirror=1. The alternates thing is really interesting! If that attribute is causing trouble, can we have a new knob in DNF to ignore its effects?

@kontura
Contributor

kontura commented Oct 7, 2024

> The alternates thing is really interesting! If that attribute is causing trouble, can we have a new knob in DNF to ignore its effects?

We probably could (it would require extending librepo), but I don't think it is the root cause given that metadata_expire=0 is set. Though it is hard to tell what is going on without additional information.

I would like to wait for at least the logs, if it is possible to get them once it happens again.
