
Fixed and redefined total peering time (%<tt) #1828

Conversation

Contributor

@rousskov rousskov commented Jun 1, 2024

This change adjusts %[http::]<tt definition to include peer selection
stage (with any relevant DNS lookups). That stage runs in parallel with
connection establishment attempts, so excluding portions of that stage
made interpreting logged delays more difficult (and code more complex).

Also documented how cache refresh (TCP_REFRESH_MODIFIED) affects %<tt.
Fetching a new version of the stale object creates a second FwdState
object, effectively repeating peer selection activities and request
forwarding attempts. %<tt includes the time spent by those activities.
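
For illustration, here is a hypothetical squid.conf sketch (not part of this PR) that logs the redefined %<tt next to the total response time %tr; the format name "peering" and the log path are arbitrary, and the other codes are borrowed from the stock "squid" logformat:

logformat peering %ts.%03tu %6tr %<tt %>a %Ss/%03>Hs %<st %rm %ru %Sh/%<a
access_log /var/log/squid/access.log peering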

The following three bugs were fixed (at least):

  • Squid logged dash %<tt values for pinned connections (e.g.,
    connections bumped at SslBump step 2 or 3) because Squid did not call
    HierarchyLogEntry::startPeerClock() for pinned connections.

  • Logged %<tt values for connections spliced at SslBump step 2 or 3
    included TCP connection establishment time and TLS Server Hello
    fetching time (if any) but were missing the rest of Squid-peer
    communication (i.e. when Squid was tunneling bytes to and from the
    peer) because Squid stopped counting after calling switchToTunnel().

  • Squid logged dash %<tt values for external_acl_type and most other
    directives that support logformat codes because ALE::hier data member
    is unset until access logging, when prepareLogWithRequestDetails()
    copies HttpRequest::hier to ALE. This is a short-term fix. A proper
    one deserves a dedicated change (that should also address similar
    concerns for other HierarchyLogEntry-derived %codes).

eduard-bagdasaryan and others added 30 commits March 25, 2024 01:30
%[http::]<tt was empty for configurations bumped at SslBump step 2 or 3
(both fwd and tunnel) because HierarchyLogEntry::startPeerClock()
was not called for pinned connections.

Also adjusted the %[http::]<tt definition to account for the peer lookup
stage (which runs in parallel with connect attempts).

Also fixed %[http::]<tt for peek-and-splice step 2/3 configurations when
a TLS error initiates switchToTunnel(). In this case, the logged
time covered only the interval up to the failed TLS attempt and
did not include the rest of the tunnel communication.
Also documented what should happen when the %code is used before
the timer stops. XXX: That is not what happens right now.
The only stopPeerClock() left is in FwdState::stopAndDestroy(),
which covers all cases (success, error) when communication
with the last peer terminates.
In this temporary approach the time is taken from HttpRequest::hier.
Also polished and simplified.
* do not dereference a possibly nil al->request
* optimized %<tt formatting code
We do not need this flag anymore since Stopwatch can track
multiple resumes and pauses.
It turned out that we cannot guarantee that condition when
switchToTunnel() was initiated by noteTakeServerConnectionControl(). In
this case, FwdState is destroyed after TunnelStateData, and therefore
the timer is still 'running' in TunnelStateData destructor.
thus addressing problems outlined in a693219.
The TODO looks irrelevant, since we do not (re)start peer clock
in TunnelStateData::startConnecting(), but still need to
reset hier.peer_reply_status there.
... documented in the previous branch commit XXX.

I am not sure this complexity is warranted, but storing a raw pointer to
a disappearing object (for some definition of "disappearing") did not
feel right either. This is the best solution I could find.
It is difficult to justify paying significant price for supporting
future hypothetical use cases, especially when we can add support for
other use cases (and make the class more complex) as needed instead.
They are no longer needed, but it is probably best to remove them when
refactoring this code to replace member assignments with member
initialization code. While at it, we should consider either adding
RefCount and CbcPointer methods to guarantee object existence (similar
to std::optional::value() and std::vector::at()) OR a global function
providing the same guarantee for any pointer-like object.
@rousskov rousskov added the S-could-use-an-approval An approval may speed this PR merger (but is not required) label Jun 4, 2024
Contributor

@eduard-bagdasaryan eduard-bagdasaryan left a comment


LGTM.

@yadij
Contributor

yadij commented Jun 8, 2024

Removing the merge tag since this has not cleared the merge procedure criteria (10 days or 1+ non-author votes)

Re-adding M-cleared-for-merge label. It does not make sense to remove that label for the reason specified above because the label does not violate (and is independent from) other "procedure criteria", including various timeouts.

Strawman argument there. Other criteria are irrelevant.

Merge approval requires one of:
A) 3 total votes (anyone)
B) 1 non-author core vote
C) 10 days since last code update to PR

This PR has (still today) not yet achieved any of those. Marking it for automation to take over (ie no more human input required) is wrong.

Removing the label (for that reason) has negative side effects: It forces folks to needlessly remember to add it later, wasting human time and increasing PR merging delays.

... OR, one could follow the github discussion notifications.

If you need to review a PR, request your own review. If you want another person to review a PR, request their review.

I did request more review ... by setting that "waiting-for-more-reviewers" label which you cleared. It should have more eyes-on but a) anyone will do, and b) I might not have time, and c) not strongly enough to block the merge more than the 10+ day limit imposes.

To put it another way; what we have going on in this PR with the two of you working on that long set of changes (and Factory internal reviews?) is essentially the same as one author writing code, submitting a PR and marking it cleared for merge immediately without review.
While I do trust both of you not to be malicious, it is still good/best practice to have someone unfamiliar with the changes eyeball it before approval is granted.

Any request for review blocks the PR from merging (and one may remove the M-cleared-for-merge label in that case because it does little good when the PR is waiting for review(s) -- the last reviewer can clear the PR for merging as needed).

IIRC, that is a new take we have not discussed.

AIUI, we allow anyone to set the label IF (and only IF) the procedure requirements for setting it are met. Most of the time it is one of us in the core team doing the review, which immediately satisfies that (B) criterion, so we can set it on a pass vote.

In other cases, a PR with a positive vote (implicit or explicit) and without negative ones, may have an M-cleared-for-merge label while it is waiting for various timeouts to expire (or other significant PR events).

That is abusing Anubis and should ideally not be happening. Reviewers should at least wait until the automated tests have passed before spending time on review. A portability failure may cause major changes to the logic/design and re-reviewing such things is becoming a noticeable waste of our time.
[ yes I am guilty of doing this early-review, and pay the price in wasted time ]

If you do not want to merge your own PR until somebody else (anybody) reviews it, do not clear it for merge, add a S-waiting-for-more-reviewers label, and a comment explaining this (rare) situation. That is what that label is for -- attract other reviewers to a PR that (effectively) does not have enough votes to be merged.

Rule of thumb is that authors should not be clearing their own PRs, reviewers should be doing that. Though after the 10-day criteria passes it is reasonable to see that happen.

src/cf.data.pre Outdated
starts with the first connect request (or write I/O)
sent to the first selected peer. The timer stops
with the last I/O with the last peer.
[http::]<tt Total peering time in milliseconds.
Contributor


"peering time" is not a defined concept, and is quite obscure - not being exclusive to peers, nor requiring their use at all. AFAIK other documentation calls this "forwarding" not "peering", so:

Suggested change
[http::]<tt Total peering time in milliseconds.
[http::]<tt Total time in milliseconds spent forwarding to origin servers or peers.

Contributor Author


Squid documentation and code use "peer" to mean "origin server or cache_peer". The term "peering" itself is also used. I think it is a better approach than spelling out "origin server or cache_peer" or forgetting one or the other. All of this is inconsistent and poorly defined, of course, but this PR is not making things worse. Arguably, it even improves the situation a bit.

That said, I have changed that brief <tt description to use the terminology you prefer: "Total time spent forwarding to origin servers or cache_peers (milliseconds)." I hope this addresses your concern.

Contributor


If that were true, you would not have insisted on me writing "server or peer" in all previous PRs.

Contributor Author


Alex: I have changed that brief <tt description to use the terminology you prefer: "Total time spent forwarding to origin servers or cache_peers (milliseconds)." I hope this addresses your concern.

Amos: If that were true, you would not have insisted on me writing "server or peer" in all previous PRs.

I do not know what "it" refers to, but can we focus on moving this stalled PR forward rather than on my alleged past actions in other PRs, please? AFAICT, this change request was addressed two months ago, using phrasing nearly identical to the one you suggested. Please resolve this specific change request and, if possible, dismiss your negative PR review.

src/cf.data.pre Outdated
Comment on lines 5019 to 5020
When Squid re-forwards the request (e.g., due to an HTTP 304
response indicating that the cached response is stale), the
Contributor


Small problem:

  • We document elsewhere (eg FAQ) that Squid does not follow 3xx status from servers, and
  • 304 status indicates that response is fresh

Can you point me at the logic where this example is happening in latest Squid?

Contributor Author


Reforwarding cases correspond to two processMiss() calls in clientReplyContext::handleIMSReply(). There is also an XXX there that (if addressed) would probably result in another processMiss() call. However, the example text does not match those cases. I fixed that bad example.

Comment on lines +5005 to +5010
Between those two timer events, Squid may perform DNS lookups,
query external ACL helpers, adapt responses using pre-cache
RESPMOD services, and participate in other concurrent
secondary activities. Most secondary activities increase
peering time. In some cases, a secondary activity may start
before the timer starts or end after the timer stops, leading
Contributor


All of this is essentially saying that this timer follows wall-time clock, not a CPU-cycles clock. This list of things that can happen parallel or in sequence with an HTTP response arriving is already quite long and likely to have bitrot quite easily.
Can you come up with a simpler way to convey the concept without listing specifics of other Squid feature implementations?

Contributor Author


All of this is essentially saying that this timer follows wall-time clock, not a CPU-cycles clock

I strongly disagree. The text gives the reader a good idea of what activities are included in %<tt measurements. Many admins do not realize what Squid does when peering. Many admins do not realize that many of the peering activities may happen in parallel with other peering activities. Most do not realize that a secondary activity such as a DNS lookup may start before and/or end after peering. Specific examples help with all of that (and more).

This list of things that can happen parallel or in sequence with an HTTP response arriving is already quite long and likely to have bitrot quite easily.

The list has 3 specific examples. I disagree that three is "quite long". There is no significant danger of bitrot here because those three examples are very unlikely to go away, and the list is explicitly not exclusive.

Can you come up with a simpler way to convey the concept without listing specifics of other Squid feature implementations?

The current version is the best I can come up with. Its (moderate, IMHO) level of complexity is necessitated by the complexity of the thing being measured; a simpler version is already available as the brief %<tt description. Specific examples are a good thing in this context IMO.

Contributor


Nobody likes the text, but it's a text and thus better than nothing.
I'll take a stab at improving it in a separate PR eventually, and we can discuss things there.

src/cf.data.pre Outdated
Comment on lines 5019 to 5024
When Squid re-forwards the request (e.g., due to an HTTP 304
response indicating that the cached response is stale), the
timer may restart. In this case, the new measurement is added
to the value accumulated from previous forwarding attempts.
The time interval between forwarding attempts is not added
to the final result.
Contributor


FWIW; I believe it was clearly implied by the old text stating the value as covering time from start of I/O with peer A to end of I/O with peer B.

I do not think we need to go into the level of internal logic this 3rd paragraph has. All reasons for any re-forwarding, re-tries, and redirection are covered by and/or already alluded to by the new paragraph 2 statement "(including any retries)"

Comment on lines +638 to +641
// XXX: al->hier.totalPeeringTime is not updated until prepareLogWithRequestDetails().
// TODO: Avoid the need for updates by keeping totalPeeringTime (or even ALE::hier) in one place.
const auto &timer = (!al->hier.totalPeeringTime.ran() && al->request) ?
al->request->hier.totalPeeringTime : al->hier.totalPeeringTime;
Contributor


Do you have some cases where the al->hier is set without al->request being set?

AFAIK we should not have hierarchy contact without a request message, and the ALE should be set up with that. Perhaps it is possible to improve this code by just accessing the al->request->hier record and treating al->hier as deprecated?

Comment on lines +645 to +647
outtv.tv_sec = std::chrono::duration_cast<std::chrono::seconds>(duration).count();
const auto totalUsec = std::chrono::duration_cast<std::chrono::microseconds>(duration);
outtv.tv_usec = (totalUsec % std::chrono::microseconds(1s)).count();
Contributor


IMO the code should not be doing this conversion at all. Here, at least, it should display the duration object, not a timeval.

yadij
yadij previously requested changes Jun 8, 2024
Contributor

@yadij yadij left a comment


sorry, misclick on the above.

@yadij yadij added S-waiting-for-author author action is expected (and usually required) and removed M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Jun 8, 2024
@rousskov
Contributor Author

rousskov commented Jun 10, 2024

Merge approval requires one of:
A) 3 total votes (anyone)
B) 1 non-author core vote
C) 10 days since last code update to PR

I will not go into details of the actual criteria in my response; the ABC version above differs from the official one in many ways, but I do not want this discussion to digress into those completely unrelated details (that are important in other contexts, but not in clearing M-cleared-for-merge misunderstandings).

This PR has (still today) not yet achieved any of those. Marking it for automation to take over (ie no more human input required) is wrong.

There is absolutely nothing wrong with merging a PR approved by at least one core developer after a 10-day timeout without any further human input. Which is exactly what would have happened here if all core developers disappeared right after this PR was approved by me.

Very roughly speaking, GitHub+Anubis merging algorithm is this:

if (A and B and C and ... Y and Z) {
    merge PR
}

The M-cleared-for-merge label is just one of the many conditions that GitHub and Anubis check before merging a PR. They include conditions derived from the A), B), and C) approval criteria you listed above (e.g., the 10-day timeout) as well as conditions like "all required tests passed" or "PR title is short enough". Adding the M-cleared-for-merge label satisfies exactly one of those A..Z conditions. Nothing more! It does not override any voting requirements or accelerate any timeouts. None of that!

Side note: I have probably mentioned that before somewhere, but reacting to M-cleared-for-merge is an optional Anubis feature. Anubis will work correctly -- obeying all Squid Project PR merging requirements it is told to obey -- with that feature turned off. There are large, busy open source projects where merge bots do not wait for any such label to be set; they merge immediately when all labeling-unrelated conditions (e.g., voting and testing) are satisfied. We enabled this Anubis feature just as an extra layer of protection or safety mechanism to reduce chances of accidental PR mergers (because Squid merge process involves timeouts and other conditions that most humans can occasionally misjudge and because humans occasionally misclick).

Removing the label (for that reason) has negative side effects: It forces folks to needlessly remember to add it later, wasting human time and increasing PR merging delays.

... OR, one could follow the github discussion notifications.

In the scope of my assertion about negative side effects:

  • Following notifications is not an option today: There are simply no notifications for an expired timeout.

  • Following notifications is not a good goal for future work: We do not want to create an interface where a human action is always required after a notification about an expired timeout pops up because such an interface will always create completely unnecessary human work and merging delays.

If you need to review a PR, request your own review. If you want another person to review a PR, request their review.

I did request more review ... by setting that "waiting-for-more-reviewers" label

S-waiting-for-more-reviewers label is not related to my rule of thumb quoted above. That rule is about a reviewer that needs to review a PR (but cannot do it now) or a person that wants another (specific) person to review a PR. It is not about attracting attention of other potential reviewers.

All open PRs invite reviews by default (no special labeling is required or wanted!), so I recommend adding a brief GitHub comment when adding S-waiting-for-more-reviewers label to clarify what you expect to happen (i.e. why and how we should deviate from that natural default). In most cases, that label should not be added to PRs without negative votes that are approved by a core developer. And in exceptional cases, an explanation is warranted.

[This PR] should have more eyes-on but a) anyone will do, and b) I might not have time, and c) not strongly enough to block the merge more than the 10+ day limit imposes.

(a) is true for pretty much any PR (no special labeling is needed), (b) is not reflected well by S-waiting-for-more-reviewers -- the label is meant to encourage reviews by others, and (c) is not going to work by labeling PRs: There is currently no mechanism to block a PR for 10 days. If you want to propose new mechanisms to better address that use case, let's discuss, but please do not remove M-cleared-for-merge labels from unblocked PRs approved by another core developer.

To put it another way; what we have going on in this PR with the two of you working on that long set of changes (and Factory internal reviews?) is essentially the same as one author writing code, submitting a PR and marking it cleared for merge immediately without review.

Correction: Not "without review" but "without independent review" (or some such). PR code was reviewed by two developers, one of which is a core developer.

As for the rest of it, yes, one can view the situation that way, but the PR still has to obey the 10 day timeout, like any other PR with just one core developer approval. There is nothing exceptional going on here! M-cleared-for-merge label simply tells other core developers (and Anubis) that this PR has been checked by a core developer (i.e. the human assigning that label) and will be auto-merged if it satisfies all the usual criteria, including timeouts. Nothing more.

While I do trust both of you not to be malicious, it is still good/best practice to have someone unfamiliar with the changes eyeball it before approval is granted.

Absolutely! That is (one of the reasons) why we have a 10 day timeout. That timeout balances the need for a second pair of core developer eyes with the need to make progress when core developers cannot be compelled to post prompt reviews. M-cleared-for-merge has no effect on that (or any other) timeout.

Any request for review blocks the PR from merging (and one may remove the M-cleared-for-merge label in that case because it does little good when the PR is waiting for review(s) -- the last reviewer can clear the PR for merging as needed).

IIRC, that is a new take we have not discussed.

What is a "new take", that review requests block PRs? That condition was in place since 2018 when the very first version of Anubis was deployed. Anubis even adds "waiting for requested reviews" PR Approval status to GitHub checks in that case (if all other conditions are satisfied). The impact of review requests is also explicitly disclosed in Anubis documentation, of course.

It is difficult for me to imagine a reasonable scheme that would not block a PR when there is a pending core developer review request. AFAICT, such an implementation would require us to vote negatively just to say "please wait for my review". Not a good UX for a very common PR state!

AIUI, we allow anyone to set the label IF (and only IF) the procedure requirements for setting it are met.

AFAIK, only core developers (and GitHub accounts with similar privileges) can set PR labels.

M-cleared-for-merge label should be set when the person setting it believes this PR can be merged after satisfying all merge preconditions (voting, timeouts, tests, etc.). It is just an extra safety check.

The label is also occasionally useful when, for example, a core developer wants to approve a PR (to explicitly record their vote and support) but does not want the PR to be merged (for now); those exceptional cases need a GitHub comment explaining them, of course.

Most of the time it is one of us in the core team doing the review

Yes, I have set M-cleared-for-merge label after a review.

In other cases, a PR with a positive vote (implicit or explicit) and without negative ones, may have an M-cleared-for-merge label while it is waiting for various timeouts to expire (or other significant PR events).

That is abusing Anubis and should ideally not be happening.

Why?! I do not see any signs of abuse here.

Reviewers should at least wait until the automated tests have passed before spending time on review.

I disagree that they should wait (in most cases), but if somebody wants to wait, they certainly can. There are no rules against that!

A portability failure may cause major changes to the logic/design and re-reviewing such things is becoming a noticeable waste of our time.

Agreed. Each reviewer has to balance various risks/costs/factors. I see no relevance of any of this to M-cleared-for-merge. That label is not about review timing.

If you do not want to merge your own PR until somebody else (anybody) reviews it, do not clear it for merge, add a S-waiting-for-more-reviewers label, and a comment explaining this (rare) situation. That is what that label is for -- attract other reviewers to a PR that (effectively) does not have enough votes to be merged.

Rule of thumb is that authors should not be clearing their own PRs, reviewers should be doing that. Though after the 10-day criteria passes it is reasonable to see that happen.

Labeling a PR with M-cleared-for-merge does not preclude others from reviewing that PR! That PR will wait at least 10 days. I do not understand why you want me to manually track which PRs have passed that 10-day criteria (so that I can label them then). Doing so will waste my time and delay merging of eligible PRs, with no positives to offset that harm (that cannot be accomplished by other/better means without these negative side effects).

If changing the spelling of M-cleared-for-merge would help, please propose better alternatives.

Contributor Author

@rousskov rousskov left a comment


@yadij, I hope I have addressed all your change requests. Please re-review.




src/cf.data.pre Outdated
Comment on lines 5019 to 5024
When Squid re-forwards the request (e.g., due to an HTTP 304
response indicating that the cached response is stale), the
timer may restart. In this case, the new measurement is added
to the value accumulated from previous forwarding attempts.
The time interval between forwarding attempts is not added
to the final result.
Contributor Author

@rousskov rousskov Jun 10, 2024


FWIW; I believe it was clearly implied by the old text stating the value as covering time from start of I/O with peer A to end of I/O with peer B.

This text is not about switching from peer A to peer B (i.e. retrying peers using one FwdState instance). It is about re-forwarding requests using two FwdState instances. In some of those two-FwdState cases, Squid will use peer A, then peer B, and then peer A again and peer B again. The original description did not document what would happen in those cases. An admin could guess correctly, but they had similar if not greater chances of guessing wrong.

I do not think we need to go into the level of internal logic this 3rd paragraph has. All reasons for any re-forwarding, re-tries, and redirection are covered by and/or already alluded to by the new paragraph 2 statement "(including any retries)"

I think we do need to detail that because paragraph 2 only documents that the timer starts at A and stops at B. That documentation is not wrong (for some definition of a "retry" [1]), but it does not cover what happens when there are two pairs of A and B events (A1 and B1 plus A2 and B2). Interpreting those two pairs as A1 and B2 would be wrong -- the timer is not running between B1 and A2.

If you would like, we can replace "starts" and "stops" in paragraph 2 with "starts (or resumes)" and "pauses", but I am not sure it will improve things -- I think it is best to keep that initial paragraph simple and document special cases later. And even if we do make those replacements in paragraph 2, I would still keep this paragraph (in some shape or form) to explain what is going on in those special (but not so rare in collapsed forwarding environments) re-forwarding cases. That said, if you insist, I will remove this paragraph. Your call.

Footnotes

  1. Related Squid terminology is inconsistent, but we tend to call attempts within a single FwdState instance "retries" (e.g., connect_retries and retry_on_error).
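
To make the accumulation rule above concrete, here is a small self-contained C++ sketch (illustrative only; PeeringTimer is a made-up class, not Squid's Stopwatch) of a pause/resume timer that adds the A1..B1 and A2..B2 intervals while excluding the gap between the two forwarding attempts:

#include <chrono>
#include <iostream>
#include <thread>

// Made-up pause/resume timer; Squid's Stopwatch plays a similar role.
class PeeringTimer {
    using Clock = std::chrono::steady_clock;
public:
    void resume() { start = Clock::now(); running = true; }
    void pause() {
        if (running)
            total += Clock::now() - start;
        running = false;
    }
    std::chrono::milliseconds totalMsec() const {
        return std::chrono::duration_cast<std::chrono::milliseconds>(total);
    }
private:
    Clock::time_point start;
    Clock::duration total{};
    bool running = false;
};

int main() {
    using namespace std::chrono_literals;
    PeeringTimer tt;

    tt.resume();                        // A1: first FwdState starts forwarding
    std::this_thread::sleep_for(30ms);  // peer selection, connect, peer I/O
    tt.pause();                         // B1: last I/O with the last peer

    std::this_thread::sleep_for(20ms);  // gap between attempts: not counted

    tt.resume();                        // A2: second FwdState (cache refresh)
    std::this_thread::sleep_for(40ms);  // repeated peer selection and forwarding
    tt.pause();                         // B2: last peer I/O

    // The accumulated value corresponds to (B1-A1) + (B2-A2), roughly 70 ms here,
    // not B2-A1 (which would be roughly 90 ms).
    std::cout << tt.totalMsec().count() << " ms\n";
}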

Comment on lines +638 to +641
// XXX: al->hier.totalPeeringTime is not updated until prepareLogWithRequestDetails().
// TODO: Avoid the need for updates by keeping totalPeeringTime (or even ALE::hier) in one place.
const auto &timer = (!al->hier.totalPeeringTime.ran() && al->request) ?
al->request->hier.totalPeeringTime : al->hier.totalPeeringTime;
Contributor Author


Do you have some cases where the al->hier is set without al->request being set?

I am not aware of any.

AFAIK we should not have hierarchy contact without a request message, and the ALE should be set up with that. Perhaps it is possible to improve this code by just accessing the al->request->hier record and treating al->hier as deprecated?

I hope there are better ways to address this problem than "treating ALE::hier as deprecated" (which,
AFAICT, means ignoring ALE::hier when logging %<tt).

The corresponding official code uses ALE::hier instead of ALE::request. Doing the opposite without serious analysis/tests/etc. is risky (and this PR does not take that risk). ALE::hier is used in several code contexts, including contexts unrelated to peering time. Ignoring it here may introduce bugs and/or more inconsistencies. Let this problem be handled in a dedicated PR (which will focus on addressing these issues properly rather than taking unnecessary risks).

Comment on lines +645 to +647
outtv.tv_sec = std::chrono::duration_cast<std::chrono::seconds>(duration).count();
const auto totalUsec = std::chrono::duration_cast<std::chrono::microseconds>(duration);
outtv.tv_usec = (totalUsec % std::chrono::microseconds(1s)).count();
Contributor Author


I agree that no conversion to timeval should be necessary, but improving Format::assemble() to handle std::chrono durations without such conversion should be done in a PR dedicated to that enhancement. This PR does not even introduce an std::chrono conversion to Format::assemble() -- it already exists in official Format::assemble() code.
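
For reference, here is a self-contained C++ sketch of the same split the excerpt above performs: a std::chrono duration is divided into whole seconds plus the remaining sub-second microseconds of a POSIX struct timeval (the sample duration value is arbitrary):

#include <chrono>
#include <cstdio>
#include <sys/time.h> // struct timeval (POSIX)

int main() {
    using namespace std::chrono;
    using namespace std::chrono_literals;

    const auto duration = 2345678us; // e.g., a measured total peering time

    struct timeval outtv;
    // whole seconds
    outtv.tv_sec = duration_cast<seconds>(duration).count();
    // microseconds left over after removing the whole seconds
    const auto totalUsec = duration_cast<microseconds>(duration);
    outtv.tv_usec = (totalUsec % microseconds(1s)).count();

    // prints "2.345678 seconds"
    std::printf("%ld.%06ld seconds\n",
                static_cast<long>(outtv.tv_sec),
                static_cast<long>(outtv.tv_usec));
}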

@rousskov rousskov requested a review from yadij June 10, 2024 14:59
@rousskov rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Jun 10, 2024
@squid-anubis squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels Aug 9, 2024
Contributor

@kinkie kinkie left a comment


While the documentation can probably be improved - no one really
likes it - it's not bad enough for me to reject the PR.
There will be follow-up PRs to address it; that's good enough for me.

@kinkie kinkie dismissed yadij’s stale review August 20, 2024 20:36

As authorised by Amos, I am satisfied that the PR is good enough to go ahead, with follow-ups

@kinkie kinkie removed the request for review from yadij August 20, 2024 20:36
@kinkie kinkie added M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Aug 20, 2024
squid-anubis pushed a commit that referenced this pull request Aug 20, 2024
@squid-anubis squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label Aug 20, 2024
@squid-anubis squid-anubis added M-merged https://github.com/measurement-factory/anubis#pull-request-labels and removed M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels labels Aug 21, 2024
kinkie pushed a commit to kinkie/squid that referenced this pull request Oct 12, 2024