-
Notifications
You must be signed in to change notification settings - Fork 528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed and redefined total peering time (%<tt) #1828
Changes from all commits
b2193b4
2313710
8688bb5
9ee4c5a
c834331
a71a763
182b682
a2c176f
c2637ca
94bb99b
0a2b9b4
9e192dc
c8ac2b7
5fb6a31
86033b5
9340dac
8092565
a693219
b9347ee
8c86590
80eed21
a9c2c01
d4b420d
fe72ba9
fa32752
95c3e8a
48b31b2
f3ee47b
95f59c5
8755344
2ccfd0a
02cf233
c7236dc
4443bdf
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -49,6 +49,40 @@ void ResetMarkingsToServer(HttpRequest *, Comm::Connection &); | |
|
||
class HelperReply; | ||
|
||
/// Eliminates excessive Stopwatch pause() calls in a task with multiple code | ||
/// locations that pause a stopwatch. Ideally, there would be just one such | ||
/// location (e.g., a task class destructor), but current code idiosyncrasies | ||
/// necessitate this state. For simplicity sake, this class currently manages a | ||
/// Stopwatch at a hard-coded location: HttpRequest::hier.totalPeeringTime. | ||
class PeeringActivityTimer | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have actually started with a much more generic ActivityTimer class (even suitable for src/base/), but ran into concerns that TunnelStateData creates its ActivityTimer data member with an address of PR's PeeringActivityTimer is simple and reliable (no raw pointers). It can be generalized if we need to support other HierarchyLogEntry or even ALE members (making code more complex). |
||
{ | ||
public: | ||
PeeringActivityTimer(const HttpRequestPointer &); ///< resumes timer | ||
~PeeringActivityTimer(); ///< \copydoc stop() | ||
|
||
/// pauses timer if stop() has not been called | ||
void stop() | ||
{ | ||
if (!stopped) { | ||
timer().pause(); | ||
stopped = true; | ||
} | ||
} | ||
|
||
private: | ||
/// managed Stopwatch object within HierarchyLogEntry | ||
Stopwatch &timer(); | ||
|
||
/// the owner of managed HierarchyLogEntry | ||
HttpRequestPointer request; | ||
|
||
// We cannot rely on timer().ran(): This class eliminates excessive calls | ||
// within a single task (e.g., an AsyncJob) while the timer (and its ran() | ||
// state) may be shared/affected by multiple concurrent tasks. | ||
/// Whether the task is done participating in the managed activity. | ||
bool stopped = false; | ||
}; | ||
|
||
class FwdState: public RefCountable, public PeerSelectionInitiator | ||
{ | ||
CBDATA_CHILD(FwdState); | ||
|
@@ -212,6 +246,9 @@ class FwdState: public RefCountable, public PeerSelectionInitiator | |
/// Whether the entire reply (including any body) was written to Store. | ||
/// The string literal value is only used for debugging. | ||
const char *storedWholeReply_; | ||
|
||
/// Measures time spent on selecting and communicating with peers. | ||
PeeringActivityTimer peeringTimer; | ||
}; | ||
|
||
class acl_tos; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4995,10 +4995,34 @@ DOC_START | |
[http::]<pt Peer response time in milliseconds. The timer starts | ||
when the last request byte is sent to the next hop | ||
and stops when the last response byte is received. | ||
[http::]<tt Total time in milliseconds. The timer | ||
starts with the first connect request (or write I/O) | ||
sent to the first selected peer. The timer stops | ||
with the last I/O with the last peer. | ||
[http::]<tt Total time spent forwarding to origin servers or | ||
cache_peers (milliseconds). | ||
|
||
The timer starts when Squid decides to forward the request (to | ||
an origin server or cache_peer) and peer selection begins. The | ||
timer stops when relevant forwarding activities (including any | ||
retries) end. | ||
|
||
Between those two timer events, Squid may perform DNS lookups, | ||
query external ACL helpers, adapt responses using pre-cache | ||
RESPMOD services, and participate in other concurrent | ||
secondary activities. Most secondary activities increase | ||
peering time. In some cases, a secondary activity may start | ||
before the timer starts or end after the timer stops, leading | ||
Comment on lines
+5006
to
+5011
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All of this is essentially saying that this timer follows wall-time clock, not a CPU-cycles clock. This list of things that can happen parallel or in sequence with an HTTP response arriving is already quite long and likely to have bitrot quite easily. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I strongly disagree. The text gives the reader a good idea of what activities are included in
The list has 3 specific examples. I disagree that three is "quite long". There is no significant danger of bitrot here because those three examples are very unlikely to go away, and the list is explicitly not exclusive.
The current version is the best I can come up with. Its (moderate, IMHO) level of complexity is necessitated by the complexity of the thing being measured; a simpler version is already available as a brief There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nobody likes the text, but it's a text and thus better than nothing. |
||
to misleading results of simple computations like %<tt - %dt. | ||
Comment on lines
+5010
to
+5012
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A secondary activity may start before the transaction peering timer when a transaction is joining an already pending secondary activity. For example, an HTTP request handled by FwdState ( Similarly, a secondary activity may continue beyond a peering transaction phase. For example, a DNS A lookup may still be pending while the transaction proceeds with the DNS AAAA lookup result received earlier (due to Happy Eyeballs optimizations). These natural imperfect/unpredictable concurrent activity/transaction overlaps imply that |
||
|
||
If this logformat %code is used before its timer starts, the | ||
corresponding measurement has no value (and the %code expands | ||
to a single dash ("-") character). | ||
|
||
If this code is used while its timer is running, the time | ||
spent so far is used as the measurement value. | ||
|
||
When Squid re-forwards the request (e.g., after certain cache | ||
revalidation failures), the timer may restart. In this case, | ||
the new measurement is added to the value accumulated from | ||
previous forwarding attempts. The time interval between | ||
forwarding attempts is not added to the final result. | ||
|
||
Squid handling related format codes: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -635,9 +635,16 @@ Format::Format::assemble(MemBuf &mb, const AccessLogEntry::Pointer &al, int logS | |
break; | ||
|
||
case LFT_TOTAL_SERVER_SIDE_RESPONSE_TIME: { | ||
struct timeval totalResponseTime; | ||
if (al->hier.totalResponseTime(totalResponseTime)) { | ||
outtv = totalResponseTime; | ||
// XXX: al->hier.totalPeeringTime is not updated until prepareLogWithRequestDetails(). | ||
// TODO: Avoid the need for updates by keeping totalPeeringTime (or even ALE::hier) in one place. | ||
const auto &timer = (!al->hier.totalPeeringTime.ran() && al->request) ? | ||
al->request->hier.totalPeeringTime : al->hier.totalPeeringTime; | ||
Comment on lines
+638
to
+641
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Implementing this TODO in this PR would create a lot of out-of-scope noise. That refactoring is also likely to require making some difficult design decisions and may result in other bug fixes or surprises. We should do it in a dedicated PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you have some cases where the AFAIK we should not have hierarchy contact without a request message, and the ALE should be setup with that. Perhapse it is possible to improve this code by just accessing the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I am not aware of any.
I hope there are better ways to address this problem than "treating ALE::hier as deprecated" (which, The corresponding official code uses ALE::hier instead of ALE::request. Doing the opposite without serious analysis/tests/etc. is risky (and this PR does not take that risk). ALE::hier is used in several code contexts, including contexts unrelated to peering time. Ignoring it here may introduce bugs and/or more inconsistencies. Let this problem be handled in a dedicated PR (which will focus on addressing these issues properly rather than taking unnecessary risks). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this PR is the right place to deduplicate |
||
if (timer.ran()) { | ||
using namespace std::chrono_literals; | ||
const auto duration = timer.total(); | ||
outtv.tv_sec = std::chrono::duration_cast<std::chrono::seconds>(duration).count(); | ||
const auto totalUsec = std::chrono::duration_cast<std::chrono::microseconds>(duration); | ||
outtv.tv_usec = (totalUsec % std::chrono::microseconds(1s)).count(); | ||
Comment on lines
+645
to
+647
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This code converts an std::chrono::duration to struct timeval. Eduard has evaluated runtime efficiency of this code and confirmed that its optimized performance is equivalent to traditional manual division using magic constants. We will submit a dedicated followup PR offering a reusable version of this conversion, replacing this and other then-current code doing this conversion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO code should not be doing this conversion at all. In here at least it should be displaying the duration object, not a timeval. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that no conversion to timeval should be necessary, but improving Format::assemble() to handle std::chrono durations without such conversion should be done in a PR dedicated to that enhancement. This PR does not even introduce an std::chrono conversion to Format::assemble() -- it already exists in official Format::assemble() code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd resist scope creep here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Done in #1915. |
||
doMsec = 1; | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -211,6 +211,9 @@ class TunnelStateData: public PeerSelectionInitiator | |
/// over the (encrypted, if needed) transport connection to that cache_peer | ||
JobWait<Http::Tunneler> peerWait; | ||
|
||
/// Measures time spent on selecting and communicating with peers. | ||
PeeringActivityTimer peeringTimer; | ||
|
||
void copyRead(Connection &from, IOCB *completion); | ||
|
||
/// continue to set up connection to a peer, going async for SSL peers | ||
|
@@ -310,7 +313,7 @@ TunnelStateData::serverClosed() | |
{ | ||
server.noteClosure(); | ||
|
||
request->hier.stopPeerClock(false); | ||
peeringTimer.stop(); | ||
|
||
finishWritingAndDelete(client); | ||
} | ||
|
@@ -386,14 +389,25 @@ TunnelStateData::deleteThis() | |
delete this; | ||
} | ||
|
||
// TODO: Replace with a reusable API guaranteeing non-nil pointer forwarding. | ||
/// safely extracts HttpRequest from a never-nil ClientHttpRequest pointer | ||
static auto & | ||
guaranteedRequest(const ClientHttpRequest * const cr) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are lots of places in Squid code where we store a raw pointer and use it for configuring other members. Some of them assert that the pointer is not nil (and violate data member initialization guidelines to inject those assertions). Some just assume that the pointer is not nil (and occasionally crash). I think we can do better. STL offers some relevant guidance with throwing (but slower) std::optional::value() and std::vector::at() interfaces, but raw pointers will need a different solution. An investigation of various options and a dedicated PR is warranted here. That PR will, among other things, allow us to remove this guaranteedRequest() helper and simplify/improve constructors like TunnelStateData constructor below (that contains similar assertions). |
||
{ | ||
Assure(cr); | ||
Assure(cr->request); | ||
return *cr->request; | ||
} | ||
|
||
TunnelStateData::TunnelStateData(ClientHttpRequest *clientRequest) : | ||
startTime(squid_curtime), | ||
destinations(new ResolvedPeers()), | ||
destinationsFound(false), | ||
committedToServer(false), | ||
n_tries(0), | ||
banRetries(nullptr), | ||
codeContext(CodeContext::Current()) | ||
codeContext(CodeContext::Current()), | ||
peeringTimer(&guaranteedRequest(clientRequest)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could simplify a little (eliminating referencing/dereferencing) by making guaranteedRequest() return a pointer: its very name states that there is an existing request. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, we could, but I did not want a function that guarantees request existence to return a pointer to that existing request: If the object is guaranteed to exist, its accessor should return a reference rather than a pointer. Either way, this code is problematic as detailed in #1828 (comment). I hope it will go away with changes discussed in that change request. Said that, if others insist, I will refactor the helper function to remove this |
||
{ | ||
debugs(26, 3, "TunnelStateData constructed this=" << this); | ||
client.readPendingFunc = &tunnelDelayedClientRead; | ||
|
@@ -477,8 +491,7 @@ TunnelStateData::retryOrBail(const char *context) | |
|
||
/* bail */ | ||
|
||
if (request) | ||
request->hier.stopPeerClock(false); | ||
peeringTimer.stop(); | ||
|
||
// TODO: Add sendSavedErrorOr(err_type type, Http::StatusCode, context). | ||
// Then, the remaining method code (below) should become the common part of | ||
|
@@ -1403,8 +1416,7 @@ TunnelStateData::sendError(ErrorState *finalError, const char *reason) | |
{ | ||
debugs(26, 3, "aborting transaction for " << reason); | ||
|
||
if (request) | ||
request->hier.stopPeerClock(false); | ||
peeringTimer.stop(); | ||
|
||
cancelStep(reason); | ||
|
||
|
@@ -1438,15 +1450,12 @@ TunnelStateData::cancelStep(const char *reason) | |
void | ||
TunnelStateData::startConnecting() | ||
{ | ||
if (request) | ||
request->hier.startPeerClock(); | ||
|
||
assert(!destinations->empty()); | ||
assert(!transporting()); | ||
|
||
delete savedError; // may still be nil | ||
savedError = nullptr; | ||
request->hier.peer_reply_status = Http::scNone; // TODO: Move to startPeerClock()? | ||
request->hier.peer_reply_status = Http::scNone; | ||
|
||
const auto callback = asyncCallback(17, 5, TunnelStateData::noteConnection, this); | ||
const auto cs = new HappyConnOpener(destinations, callback, request, startTime, n_tries, al); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would prefer to rely exclusively on RAII (provided by PeeringActivityTimer) instead of sprinkling code with these stop() calls. Unfortunately, FwdState objects are reference-counted, and there is no strong/obvious/maintained guarantee that a FwdState object does not survive long after stopAndDestroy() has been called. Thus, we should call stop() explicitly. RAII will still kick in if we miss a destruction code path, of course.