Fixed and redefined total peering time (%<tt) #1828
Conversation
[http::]<tt was empty for connections bumped at SslBump step 2 or 3 (both fwd and tunnel code paths) because HierarchyLogEntry::startPeerClock() was not called for pinned connections. Also adjusted the [http::]<tt definition to account for the peer lookup stage (which runs in parallel with connect attempts). Also fixed [http::]<tt for peek-and-splice step 2/3 configurations when a TLS error initiates switchToTunnel(): in that case, the logged time covered only the interval up to the failed TLS attempt and did not include the rest of the tunnel communication.
and a couple of polishing changes.
Also documented what should happen when the %code is used before the timer stops. XXX: That is not what happens right now.
The only stopPeerClock() left is in FwdState::stopAndDestroy(), which covers all cases (success, error) when communication with the last peer terminates.
In this temporary approach the time is taken from HttpRequest::hier. Also polished and simplified.
* do not dereference a possibly nil al->request
* optimized %<tt formatting code
We do not need this flag anymore since Stopwatch can track multiple resumes and pauses.
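For readers unfamiliar with that class, here is a minimal, hypothetical sketch of the pause/resume accounting being relied on; the names and details below are assumptions for illustration and do not necessarily match the actual Squid Stopwatch interface.

    // Hypothetical sketch only; not Squid's actual Stopwatch API.
    #include <chrono>

    class Stopwatch
    {
    public:
        using Clock = std::chrono::steady_clock;

        // (Re)start measuring; safe to call repeatedly.
        void resume() {
            if (!running) {
                startedAt = Clock::now();
                running = true;
                everRan = true;
            }
        }

        // Stop measuring and add the elapsed interval to the running total.
        void pause() {
            if (running) {
                accumulated += Clock::now() - startedAt;
                running = false;
            }
        }

        // Total measured time across all resume/pause cycles so far.
        Clock::duration total() const {
            auto t = accumulated;
            if (running)
                t += Clock::now() - startedAt;
            return t;
        }

        // Whether the stopwatch has ever been resumed.
        bool ran() const { return everRan; }

    private:
        Clock::duration accumulated = Clock::duration::zero();
        Clock::time_point startedAt;
        bool running = false;
        bool everRan = false;
    };

Because the accumulation lives inside the stopwatch itself, callers no longer need a separate "timer already started" flag.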
It turned out that we cannot guarantee that condition when switchToTunnel() is initiated by noteTakeServerConnectionControl(). In this case, FwdState is destroyed after TunnelStateData, and therefore the timer is still 'running' in the TunnelStateData destructor.
thus addressing problems outlined in a693219.
The TODO looks irrelevant, since we do not (re)start peer clock in TunnelStateData::startConnecting(), but still need to reset hier.peer_reply_status there.
... documented in the previous branch commit XXX. I am not sure this complexity is warranted, but storing a raw pointer to a disappearing object (for some definition of "disappearing") did not feel right either. This is the best solution I could find.
It is difficult to justify paying significant price for supporting future hypothetical use cases, especially when we can add support for other use cases (and make the class more complex) as needed instead.
They are no longer needed, but it is probably best to remove them when refactoring this code to replace member assignments with member initialization code. While at it, we should consider either adding RefCount and CbcPointer methods to guarantee object existence (similar to std::optional::value() and std::vector::at()) OR a global function providing the same guarantees for any pointer-like object.
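To illustrate the "global function" alternative mentioned above, here is a hypothetical sketch; the name DereferenceOrThrow and the exception type are made up for illustration and do not exist in the tree.

    // Hypothetical helper, not existing Squid code: dereference any pointer-like
    // object, throwing instead of invoking undefined behaviour when it is nil.
    #include <memory>
    #include <stdexcept>

    template <typename Pointer>
    auto &DereferenceOrThrow(const Pointer &p)
    {
        if (!p)
            throw std::runtime_error("dereferencing a nil pointer");
        return *p;
    }

    // usage sketch
    int main()
    {
        const auto sp = std::make_shared<int>(42);
        return DereferenceOrThrow(sp) == 42 ? 0 : 1;
    }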
LGTM.
Strawman argument there. Other criteria are irrelevant. Merge approval requires one of: This PR has (still today) not yet achieved any of those. Marking it for automation to take over (ie no more human input required) is wrong.
... OR, one could follow the github discussion notifications.
I did request more review ... by setting that "waiting-for-more-reviewers" label which you cleared. It should have more eyes-on but a) anyone will do, and b) I might not have time, and c) not strongly enough to block the merge more than the 10+ day limit imposes. To put it another way; what we have going on in this PR with the two of you working on that long set of changes (and Factory internal reviews?) is essentially the same as one author writing code, submitting a PR and marking it cleared for merge immediately without review.
IIRC, that is a new take we have not discussed. AIUI, we allow anyone to set the label IF (and only IF) the procedure requirements for setting it are met. Most of the time it is one of us in the core team doing the review - which means that (B) criterion immediately passes and we can set it on a pass vote.
That is abusing Anubis and should ideally not be happening. Reviewers should at least wait until the automated tests have passed before spending time on review. A portability failure may cause major changes to the logic/design and re-reviewing such things is becoming a noticeable waste of our time.
Rule of thumb is that authors should not be clearing their own PRs; reviewers should be doing that. Though after the 10-day criteria passes it is reasonable to see that happen.
src/cf.data.pre
starts with the first connect request (or write I/O)
sent to the first selected peer. The timer stops
with the last I/O with the last peer.
[http::]<tt Total peering time in milliseconds.
"peering time" is not a defined concept, and is quite obscure - not being exclusive to peers, nor requiring their use at all. AFAIK other documentation calls this "forwarding" not "peering", so:
Suggested change:
- [http::]<tt Total peering time in milliseconds.
+ [http::]<tt Total time in milliseconds spent forwarding to origin servers or peers.
Squid documentation and code uses "peer" to mean "origin server or cache_peer". The term "peering" itself is also used. I think it is a better approach than spelling out "origin server or cache_peer" or forgetting one or the other. All of this is inconsistent and poorly defined, of course, but this PR is not making things worse. Arguably, it even improves the situation a bit.
That said, I have changed that brief <tt description to use the terminology you prefer: "Total time spent forwarding to origin servers or cache_peers (milliseconds)." I hope this addresses your concern.
If that were true, you would not have insisted on me writing "server or peer" in all previous PRs.
Alex: I have changed that brief <tt description to use the terminology you prefer: "Total time spent forwarding to origin servers or cache_peers (milliseconds)." I hope this addresses your concern.
Amos: If that were true, you would not have insisted on me writing "server or peer" in all previous PRs.
I do not know what "it" refers to, but can we focus on moving forward this stalled PR rather than on my alleged past actions in other PRs, please? AFAICT, this change request has been addressed two months ago, using phrasing nearly identical to the one you have suggested. Please resolve this specific change request and, if possible, dismiss your negative PR review.
src/cf.data.pre
When Squid re-forwards the request (e.g., due to an HTTP 304
response indicating that the cached response is stale), the
Small problem:
- We document elsewhere (eg FAQ) that Squid does not follow 3xx status from servers, and
- a 304 status indicates that the response is fresh
Can you point me at the logic where this example is happening in latest Squid?
Reforwarding cases correspond to two processMiss() calls in clientReplyContext::handleIMSReply(). There is also an XXX there that (if addressed) would probably result in another processMiss() call. However, the example text does not match those cases. I fixed that bad example.
Between those two timer events, Squid may perform DNS lookups,
query external ACL helpers, adapt responses using pre-cache
RESPMOD services, and participate in other concurrent
secondary activities. Most secondary activities increase
peering time. In some cases, a secondary activity may start
before the timer starts or end after the timer stops, leading
All of this is essentially saying that this timer follows wall-time clock, not a CPU-cycles clock. This list of things that can happen parallel or in sequence with an HTTP response arriving is already quite long and likely to have bitrot quite easily.
Can you come up with a simpler way to convey the concept without listing specifics of other Squid feature implementations?
Amos: All of this is essentially saying that this timer follows wall-time clock, not a CPU-cycles clock.

I strongly disagree. The text gives the reader a good idea of what activities are included in %<tt measurements. Many admins do not realize what Squid does when peering. Many admins do not realize that many of the peering activities may happen in parallel with other peering activities. Most do not realize that a secondary activity such as a DNS lookup may start before and/or end after peering. Specific examples help with all of that (and more).

Amos: This list of things that can happen parallel or in sequence with an HTTP response arriving is already quite long and likely to have bitrot quite easily.

The list has 3 specific examples. I disagree that three is "quite long". There is no significant danger of bitrot here because those three examples are very unlikely to go away, and the list is explicitly not exclusive.

Amos: Can you come up with a simpler way to convey the concept without listing specifics of other Squid feature implementations?

The current version is the best I can come up with. Its (moderate, IMHO) level of complexity is necessitated by the complexity of the thing being measured; a simpler version is already available as the brief %<tt description. Specific examples are a good thing in this context IMO.
Nobody likes the text, but it's a text and thus better than nothing.
I'll take a stab at improving it in a separate PR eventually, and we can discuss things there.
src/cf.data.pre
When Squid re-forwards the request (e.g., due to an HTTP 304
response indicating that the cached response is stale), the
timer may restart. In this case, the new measurement is added
to the value accumulated from previous forwarding attempts.
The time interval between forwarding attempts is not added
to the final result.
FWIW; I believe it was clearly implied by the old text stating the value as covering time from start of I/O with peer A to end of I/O with peer B.
I do not think we need to go into the level of internal logic this 3rd paragraph has. All reasons for any re-forwarding, re-tries, and redirection are covered by and/or already alluded to by the new paragraph 2 statement "(including any retries)".
// XXX: al->hier.totalPeeringTime is not updated until prepareLogWithRequestDetails().
// TODO: Avoid the need for updates by keeping totalPeeringTime (or even ALE::hier) in one place.
const auto &timer = (!al->hier.totalPeeringTime.ran() && al->request) ?
    al->request->hier.totalPeeringTime : al->hier.totalPeeringTime;
Do you have some cases where al->hier is set without al->request being set?
AFAIK we should not have hierarchy contact without a request message, and the ALE should be set up with that. Perhaps it is possible to improve this code by just accessing the al->request->hier record and treating al->hier as deprecated?
outtv.tv_sec = std::chrono::duration_cast<std::chrono::seconds>(duration).count();
const auto totalUsec = std::chrono::duration_cast<std::chrono::microseconds>(duration);
outtv.tv_usec = (totalUsec % std::chrono::microseconds(1s)).count();
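For context, a self-contained sketch of the split this snippet performs (the sample duration value is invented, and a POSIX struct timeval is assumed): whole seconds go into tv_sec, and the sub-second remainder, expressed in microseconds, goes into tv_usec.

    // Standalone illustration of the seconds/microseconds split above;
    // the sample duration is an assumption for demonstration only.
    #include <chrono>
    #include <cstdio>
    #include <sys/time.h>

    int main()
    {
        using namespace std::chrono_literals;
        const std::chrono::microseconds duration = 2s + 345678us;

        struct timeval outtv;
        outtv.tv_sec = std::chrono::duration_cast<std::chrono::seconds>(duration).count();
        const auto totalUsec = std::chrono::duration_cast<std::chrono::microseconds>(duration);
        outtv.tv_usec = (totalUsec % std::chrono::microseconds(1s)).count();

        // prints "2.345678 seconds"
        std::printf("%ld.%06ld seconds\n", static_cast<long>(outtv.tv_sec), static_cast<long>(outtv.tv_usec));
        return 0;
    }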
IMO code should not be doing this conversion at all. In here at least it should be displaying the duration object, not a timeval.
sorry, misclick on the above.
I will not go into details of the actual criteria in my response; the ABC version above differs from the official one in many ways, but I do not want this discussion to digress into those completely unrelated details (that are important in other contexts, but not in clearing M-cleared-for-merge misunderstandings).
There is absolutely nothing wrong with merging a PR approved by at least one core developer after a 10-day timeout without any further human input. Which is exactly what would have happened here if all core developers disappeared right after this PR was approved by me.

Very roughly speaking, the GitHub+Anubis merging algorithm is this:

    if (A and B and C and ... Y and Z) {
        merge PR
    }

The M-cleared-for-merge label is just one of the many conditions that GitHub and Anubis check before merging a PR. They include the conditions derived from the A), B), and C) approval conditions you have listed above (e.g., the 10-day timeout) as well as conditions like "all required tests passed" or "PR title is short enough". Adding the M-cleared-for-merge label satisfies exactly one of those A..Z conditions. Nothing more! It does not overwrite any voting requirements or accelerate any timeouts. None of that!

Side note: I have probably mentioned this before somewhere, but reacting to M-cleared-for-merge is an optional Anubis feature. Anubis will work correctly -- obeying all Squid Project PR merging requirements it is told to obey -- with that feature turned off. There are large, busy open source projects where merge bots do not wait for any such label to be set; they merge immediately when all labeling-unrelated conditions (e.g., voting and testing) are satisfied. We enabled this Anubis feature just as an extra layer of protection or safety mechanism, to reduce chances of accidental PR mergers (because the Squid merge process involves timeouts and other conditions that most humans can occasionally misjudge, and because humans occasionally misclick).
In the scope of my assertion about negative side effects:
S-waiting-for-more-reviewers label is not related to my rule of thumb quoted above. That rule is about a reviewer that needs to review a PR (but cannot do it now) or a person that wants another (specific) person to review a PR. It is not about attracting attention of other potential reviewers. All open PRs invite reviews by default (no special labeling is required or wanted!), so I recommend adding a brief GitHub comment when adding S-waiting-for-more-reviewers label to clarify what you expect to happen (i.e. why and how we should deviate from that natural default). In most cases, that label should not be added to PRs without negative votes that are approved by a core developer. And in exceptional cases, an explanation is warranted.
(a) is true for pretty much any PR (no special labeling is needed), (b) is not reflected well by S-waiting-for-more-reviewers -- the label is meant to encourage reviews by others, and (c) is not going to work by labeling PRs: There is currently no mechanism to block a PR for 10 days. If you want to propose new mechanisms to better address that use case, let's discuss, but please do not remove M-cleared-for-merge labels from unblocked PRs approved by another core developer.
Correction: Not "without review" but "without independent review" (or some such). PR code was reviewed by two developers, one of which is a core developer. As for the rest of it, yes, one can view the situation that way, but the PR still has to obey the 10 day timeout, like any other PR with just one core developer approval. There is nothing exceptional going on here! M-cleared-for-merge label simply tells other core developers (and Anubis) that this PR has been checked by a core developer (i.e. the human assigning that label) and will be auto-merged if it satisfies all the usual criteria, including timeouts. Nothing more.
Absolutely! That is (one of the reasons) why we have a 10 day timeout. That timeout balances the need for a second pair of core developer eyes with the need to make progress when core developers cannot be compelled to post prompt reviews. M-cleared-for-merge has no effect on that (or any other) timeout.
What is a "new take", that review requests block PRs? That condition was in place since 2018 when the very first version of Anubis was deployed. Anubis even adds "waiting for requested reviews" PR Approval status to GitHub checks in that case (if all other conditions are satisfied). The impact of review requests is also explicitly disclosed in Anubis documentation, of course. It is difficult for me to imagine a reasonable scheme that would not block a PR when there is a pending core developer review request. AFAICT, such an implementation would require us to vote negatively just to say "please wait for my review". Not a good UX for a very common PR state!
AFAIK, only core developers (and GitHub accounts with similar privileges) can set PR labels. The M-cleared-for-merge label should be set when the person setting it believes this PR can be merged after satisfying all merge preconditions (voting, timeouts, tests, etc.). It is just an extra safety check. The label is also occasionally useful when, for example, a core developer wants to approve a PR (to explicitly record their vote and support) but does not want the PR to be merged (for now); those exceptional cases need a GitHub comment explaining them, of course.
Yes, I have set M-cleared-for-merge label after a review.
Why?! I do not see any signs of abuse here.
I disagree that they should wait (in most cases), but if somebody wants to wait, they certainly can. There are no rules against that!
Agreed. Each reviewer has to balance various risks/costs/factors. I see no relevance of any of this to M-cleared-for-merge. That label is not about review timing.
Labeling a PR with M-cleared-for-merge does not preclude others from reviewing that PR! That PR will wait at least 10 days. I do not understand why you want me to manually track which PRs have passed that 10-day criteria (so that I can label them then). Doing so will waste my time and delay merging of eligible PRs, with no positives to offset that harm (that cannot be accomplished by other/better means without these negative side effects). If changing the spelling of M-cleared-for-merge would help, please propose better alternatives.
@yadij, I hope I have addressed all your change requests. Please re-review.
Amos: FWIW; I believe it was clearly implied by the old text stating the value as covering time from start of I/O with peer A to end of I/O with peer B.
This text is not about switching from peer A to peer B (i.e. retrying peers using one FwdState instance). It is about re-forwarding requests using two FwdState instances. In some of those two-FwdState cases, Squid will use peer A, then peer B, and then peer A again and peer B again. The original description did not document what would happen in those cases. An admin could guess correctly, but they had similar if not greater chances of guessing wrong.
Amos: I do not think we need to go into the level of internal logic this 3rd paragraph has. All reasons for any re-forwarding, re-tries, and redirection are covered by and/or already alluded to by the new paragraph 2 statement "(including any retries)".
I think we do need to detail that because paragraph 2 only documents that the timer starts at A and stops at B. That documentation is not wrong (for some definition of a "retry"[1]), but it does not cover what happens when there are two pairs of A and B events (A1 and B1 plus A2 and B2). Interpreting those two pairs as A1 and B2 would be wrong -- the timer is not running between B1 and A2.
If you would like, we can replace "starts" and "stops" in paragraph 2 with "starts (or resumes)" and "pauses", but I am not sure it will improve things -- I think it is best to keep that initial paragraph simple and document special cases later. And even if we do make those replacements in paragraph 2, I would still keep this paragraph (in some shape or form) to explain what is going on in those special (but not so rare in collapsed forwarding environments) re-forwarding cases. That said, if you insist, I will remove this paragraph. Your call.
Footnote [1]: Related Squid terminology is inconsistent, but we tend to call attempts within a single FwdState instance "retries" (e.g., connect_retries and retry_on_error).
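To make that accounting concrete with invented numbers: if forwarding attempt 1 runs the timer from t=0s to t=2s (B1 minus A1 = 2s) and attempt 2 runs it from t=5s to t=6s (B2 minus A2 = 1s), then %<tt should log 3000 milliseconds; interpreting the pair as A1-to-B2 would wrongly give 6000 milliseconds, because the timer is not running between B1 and A2.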
Amos: Do you have some cases where al->hier is set without al->request being set?

I am not aware of any.

Amos: AFAIK we should not have hierarchy contact without a request message, and the ALE should be set up with that. Perhaps it is possible to improve this code by just accessing the al->request->hier record and treating al->hier as deprecated?
I hope there are better ways to address this problem than "treating ALE::hier as deprecated" (which, AFAICT, means ignoring ALE::hier when logging %<tt).
The corresponding official code uses ALE::hier instead of ALE::request. Doing the opposite without serious analysis/tests/etc. is risky (and this PR does not take that risk). ALE::hier is used in several code contexts, including contexts unrelated to peering time. Ignoring it here may introduce bugs and/or more inconsistencies. Let this problem be handled in a dedicated PR (which will focus on addressing these issues properly rather than taking unnecessary risks).
I agree that no conversion to timeval should be necessary, but improving Format::assemble() to handle std::chrono durations without such conversion should be done in a PR dedicated to that enhancement. This PR does not even introduce an std::chrono conversion to Format::assemble() -- it already exists in official Format::assemble() code.
While the documentation can probably be improved - no one really likes it - it's not bad enough for me to reject the PR.
There will be followup PRs to do the followup; that's good enough for me.
As authorised by Amos, I am satisfied that the PR is good enough to go ahead, with followups.
This change adjusts the `%[http::]<tt` definition to include the peer selection stage (with any relevant DNS lookups). That stage runs in parallel with connection establishment attempts, so excluding portions of that stage made interpreting logged delays more difficult (and code more complex).

Also documented how cache refresh (TCP_REFRESH_MODIFIED) affects `%<tt`. Fetching a new version of the stale object creates a second FwdState object, effectively repeating peer selection activities and request forwarding attempts. `%<tt` includes the time spent by those activities.

The following three bugs were fixed (at least):

* Squid logged dash `%<tt` values for pinned connections (e.g., connections bumped at SslBump step 2 or 3) because Squid did not call HierarchyLogEntry::startPeerClock() for pinned connections.

* Logged `%<tt` values for connections spliced at SslBump step 2 or 3 included TCP connection establishment time and TLS Server Hello fetching time (if any) but were missing the rest of Squid-peer communication (i.e. when Squid was tunneling bytes to and from the peer) because Squid stopped counting after calling switchToTunnel().

* Squid logged dash `%<tt` values for external_acl_type and most other directives that support logformat codes because the ALE::hier data member is unset until access logging, when prepareLogWithRequestDetails() copies HttpRequest::hier to ALE. This is a short-term fix. A proper one deserves a dedicated change (that should also address similar concerns for other HierarchyLogEntry-derived %codes).