Caching mirrors #304

Open · brianjmurrell opened this issue Dec 20, 2021 · 16 comments
@brianjmurrell

There are tools that allow one to run a mirror on demand as a caching proxy. Such mirrors appear virtually complete, but in reality they are only as complete as the set of files/packages that their clients have requested.

The problem with this is that report_mirror may report only a partial listing, even though the mirror is effectively complete from the perspective of its clients.

It would be useful if mirrormanager and/or report_mirror could be informed of this so that partial report_mirror runs don't invalidate the mirror. In the short term, report_mirror could be modified to fake the output of a full mirror, for example.

@adrianreber
Member

There is the "Always up to date" flag. We use it in our Fedora and CentOS Stream setups to mark primary mirrors and CloudFront-based caching setups.

[screenshot: host settings showing the "Always up to date" checkbox]

Does that help?

@brianjmurrell
Author

That might, if I could find where it's set. I've looked in both my site and host configuration and don't see that option.

@adrianreber
Member

> That might, if I could find where it's set. I've looked in both my site and host configuration and don't see that option.

It is an admin only option.

@brianjmurrell
Author

Admin for the [private] mirror? I should be admin for my own private mirror. I can change many other aspects of my site and host.

@adrianreber
Member

Ah, a private mirror. Right. No, it is a mirrormanager instance admin-only option. For private mirrors you cannot set it yourself.

Letting private mirror owners set it themselves doesn't sound problematic, but it is currently not implemented.

@brianjmurrell
Author

So, back to the original problem description. :-)

@adrianreber
Member

You could fake it by creating the directory tree with empty directories. report_mirror only scans directories (if I remember correctly) to reduce I/O load. In theory, a directory tree with no files should lead to an up-to-date mirror via report_mirror.

@brianjmurrell
Author

Indeed. But what exactly would such an empty directory tree need to contain? And it changes from release to release (e.g. the introduction of Modular content), so what needs to be created differs on a per-release basis.

This issue is all about resolving this problem given those constraints.

@brianjmurrell
Author

I suppose one could do something like:

# rsync -ai -f"+ */" -f"- *" rsync://<mirror>/fedora/releases/35 /path/to/local/mirror/fedora/releases/

to mimic the directory structure (adding -n would turn it into a dry run that creates nothing). It just sucks that this needs to be a manual operation for every new release of a distro.
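For example, the directory-only copy could be wrapped in a small script. This is only a sketch; the mirror URL and local path shown in the comments are placeholders that would need to be adjusted for a real setup:

```shell
#!/bin/sh
# Sketch: copy only the directory structure (no files) of a release tree,
# so that report_mirror sees a "complete" directory layout.
# The rsync filters keep every directory ("+ */") and drop every file ("- *").

mirror_tree() {
    src="$1"   # e.g. rsync://<mirror>/fedora/releases/35
    dst="$2"   # e.g. /path/to/local/mirror/fedora/releases/35
    mkdir -p "$dst"
    rsync -a -f'+ */' -f'- *' "$src/" "$dst/"
}

# Hypothetical usage for each new release:
# mirror_tree "rsync://<mirror>/fedora/releases/35" "/path/to/local/mirror/fedora/releases/35"
```

Run once per new release; the resulting tree contains every directory of the release but no files.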

@ott
Contributor

ott commented Aug 19, 2024

Mirror servers that are implemented as caching HTTP proxy servers require proper HTTP caching metadata in response headers from upstream servers, or manual rules, to be effective. So Fedora's upstream servers would need to serve their files with proper caching metadata, for example Cache-Control headers.

@brianjmurrell
Author

I'm not sure what you are implying or trying to say @ott, but I can tell you from lots of experience -- both professional (i.e. at $dayjob, where we do a shedload of RPM mirroring -- think CI systems that get re-installed for every CI run) and personal -- that HTTP caching of Fedora mirrors works quite well.

@ott
Contributor

ott commented Sep 7, 2024

@brianjmurrell I have also worked with HTTP caching semantics. I can assure you, by experience and with reference to the HTTP specifications, that what you are claiming is, generally speaking, not true.

HTTP caching semantics are specified in RFC 9111. A list of conditions for caching a response is found in section 3. There might be exceptions, but the mirrors that I have seen use the default configuration of their web server and do not serve files with headers that are relevant for HTTP caching, except for the Last-Modified header on GET requests. As a result, section 4.2.2 (heuristic freshness) applies.

I can assure you that you never want an HTTP cache to use heuristic freshness. It leads to caching anomalies in web browsers and can break websites. While dnf does not seem to apply heuristic freshness or any sophisticated caching, HTTP reverse proxy servers might. So you have to configure an HTTP reverse proxy server to always validate responses unless told otherwise by the origin server.
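To illustrate the "always validate" setup described above, a hypothetical nginx sketch (the upstream name and cache path are placeholders, and this is an illustration rather than a tested configuration) might look like:

```nginx
# Sketch: cache mirror content but force revalidation instead of
# relying on heuristic freshness.
proxy_cache_path /var/cache/nginx/mirror keys_zone=mirror:10m max_size=50g;

server {
    listen 80;

    location / {
        proxy_pass http://upstream-mirror.example.org;
        proxy_cache mirror;
        # Treat cached 200 responses as fresh only briefly, since the
        # origin sends no Cache-Control/Expires headers.
        proxy_cache_valid 200 1m;
        # Afterwards, revalidate with If-Modified-Since / If-None-Match
        # instead of re-downloading; a 304 refreshes the cache entry.
        proxy_cache_revalidate on;
    }
}
```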

So a caching HTTP reverse proxy server would have to check, with the help of the ETag and Last-Modified headers, whether the requested resource has changed on the origin server. In most cases it will not have, and the origin server will return a response with HTTP status code 304. However, this does not relieve the origin server in the same way as a second origin server with the same data would: it still has to access the filesystem, and given that most packages are quite small, the file contents could have been read in the same filesystem access, especially with hard disks. It does reduce the data transfer volume, however. For large files this might be a worthwhile trade-off between simplicity, freshness and resource savings. For many small files it is not such an easy decision.

As most files are never changed on a mirror, it would be possible to use different HTTP caching semantics, for example via the Cache-Control header. Unfortunately, a package that is declared to be immutable or that has a long expiration time cannot be easily revoked or retracted. So if a package needed to be retracted, for example because it was considered malware, was otherwise illegal, or violated the distribution's statutes, doing so would be difficult or impossible, and certainly a tedious manual process. So I don't think that it would be a good idea.
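As a concrete (and again hypothetical) nginx sketch of such differentiated Cache-Control semantics, an origin could mark package payloads as long-lived while keeping metadata revalidatable; the retraction problem described above is exactly why the long-lived/immutable variant is risky:

```nginx
# Sketch: different Cache-Control policies for packages vs. repodata.
location ~ \.rpm$ {
    # Package files never change under the same name, but "immutable"
    # makes retracting a package from downstream caches nearly impossible.
    add_header Cache-Control "public, max-age=31536000, immutable";
}

location ~ /repodata/ {
    # Repository metadata changes often; allow only a short lifetime
    # and force revalidation afterwards.
    add_header Cache-Control "public, max-age=300, must-revalidate";
}
```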

It might be possible to periodically clean the cache and to re-cache repository metadata immediately after the cache has been cleared, to make a best effort to mimic the behaviour of mirror servers that are periodically synchronized with rsync. Another possibility would be to validate the packages against the repository metadata and to remove invalid packages from the cache. However, both possibilities are specific to package managers and distributions, and I can imagine that there are tricky corner cases.

@brianjmurrell
Author

@ott I don't doubt your experience. But I also don't doubt my literally decades of experience with running RPM/DEB mirror caching proxies with great success. You can continue to tell me that it won't work, but again, I have decades of experience with them telling me it does work.

To be perfectly clear, I am not referring to using generic HTTP proxies like Squid Cache but purpose-built RPM/DEB (and potentially more package formats) proxy caches like Nexus Repository Manager, Artifactory and even the much more lightweight pkg-cacher and/or AptProxy.

@ott
Contributor

ott commented Sep 7, 2024

As I said, specialized HTTP caches that can use repository metadata might work, but you did not mention such software in your statement. I can't comment on whether the mentioned software works correctly, though. My experience is limited to apt-cacher-ng, and I'm not even sure that it works correctly. Moreover, software that is written in an interpreted language and has to serve every request might not be suitable for high-volume mirrors.

@brianjmurrell
Author

> As I said, specialized HTTP caches that can use repository metadata might work but you did not mention such software in your statement.

I didn't need to. I didn't come here to report problems with my caching mirror software, so it was not relevant. The discussion of it only arose because you started positing that it doesn't/won't/can't work -- quite off-topic for this issue, I might add.

> Moreover, software that is written in an interpreted language and that has to serve every request, might not be suitable to high-volume mirrors.

I never said I was operating a high-volume mirror with an interpreted-language solution. Quite the opposite, in fact: the solution that I operate with an interpreted-language tool is a private mirror for a small network, so it works sufficiently well.

While I appreciate your perspective on proxy/caching RPM repos, given that I have been doing what I am doing with proxy/caching repos quite successfully for many, many years, you are not going to convince me that it can't work and that I should stop doing it.

@ott
Contributor

ott commented Sep 7, 2024

It was not my intention to tell you how you should operate your private mirror servers. I also did not mean to criticize you.

Your first comment in this issue did not mention that your feature request is limited to your private mirror servers. So I was trying to highlight some problems that could result from people running public mirror servers that are actually caching HTTP reverse proxy servers. My only goal was to prevent problems for Fedora that could result from giving people the means to operate caching HTTP reverse proxy servers as mirror servers without understanding why this is commonly not done (at least for public mirror servers) and is not as easy as it might seem.

However, if the scope of this issue is now limited to private mirror servers only, I can also unsubscribe from it and not interfere or bother here. I think I have also said more or less what can be said about this topic.
