Chunked put starvation fix craig #169

Open
wants to merge 10 commits into development
Conversation

charmster
Contributor

Build on the work that @edmc-ss did to solve the chunked put context hangs.

Change the way "open log segments" are associated with inodes so that open log segments can be flushed without taking the inode lock. In particular, this fixes some additional hangs that we have seen with chunked put context "starvation". The fs unit tests now pass with SwiftClient.ChunkedConnectionPoolSize set to either 1 or 2.

I think this also fixes a couple of potential races in doReadPlan() and with chunked put connections changing between "Active" and "Inactive" as retries occur.

I think we can discard the separate chunked put connections for checkpoint operations, but I need to look at that code more to be sure.
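The core idea above can be illustrated with a minimal sketch: "open log segment" state lives in a file-system-wide cache guarded by its own mutex, so a flusher can Close() a segment without taking any inode lock. All identifiers here (openLogSegment, openLogSegmentCache, flushInode) are illustrative, not the actual ProxyFS names.

```go
package main

import "sync"

type openLogSegment struct {
	inodeNumber uint64
	closed      bool
}

// Close stands in for flushing the segment's chunked PUT.
func (s *openLogSegment) Close() { s.closed = true }

type openLogSegmentCache struct {
	sync.Mutex
	segments map[uint64]*openLogSegment // keyed by inode number
}

// flushInode closes and evicts the open segment for an inode, if any.
// Only the cache's own mutex is held; no inode lock is required.
func (c *openLogSegmentCache) flushInode(inodeNumber uint64) bool {
	c.Lock()
	segment, ok := c.segments[inodeNumber]
	if ok {
		delete(c.segments, inodeNumber)
	}
	c.Unlock()
	if !ok {
		return false
	}
	segment.Close()
	return true
}
```

Because the segment is removed from the cache before Close() is called, a concurrent writer simply misses the cache and opens a fresh segment, never blocking on the flush.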

Ed McClanahan and others added 10 commits August 6, 2018 12:47
This will be used to re-implement the enforcement of the MaxFlushTime
logic (currently up in fs) as well as a new "lighter-weight" mini-flush
mechanism (triggered by the MaxFlushTime expiration as well as the
ChunkedPutContext quota exhaustion/starvation upcall).
The LRU inFlightLogSegment will simply be Close()'d thus freeing up
the Chunked PUT Context it was consuming. This is done synchronously
with respect to the starvation condition being detected.
Changed the code in swiftclient that does chunked put to re-open the
connection directly rather than re-opening by giving up its connection
slot via releaseChunkedPutConnect()/acquireChunkedPutConnect().
If the first attempt at a Chunked Put operation failed it would be
retried, but the first failure did not count against the retry limit
and we would not see a message that the retry succeeded.  Fix that by
moving the first call to closeHelper() inside the request function and
special-casing the retry() logic so it's not called on the first attempt.

More explicitly, previously if there was a failure in SendChunk() we
would see the failure in the log:
> time="2018-08-04T03:24:58.549673Z" level=error msg="swiftclient.chunkedPutContext.SendChunk(\"TestAccount/TestContainer/FooBarCopy\") got writeHTTPPutChunk() error" error="writeHTTPPutChunk() simulated error" function=SendChunk goroutine=6 package=swiftclient

but if the _first_ retry succeeded, we would not see a message by
RequestWithRetry() saying that the request had succeeded:
> time="2018-08-04T03:24:58.803157Z" level=info msg="retry.RequestWithRetry(): swiftclient.chunkedPutContext.Close(\"TestAccount/TestContainer/FooBarCopy\") succeeded after 2 attempts in 0.253 sec" function=RequestWithRetry goroutine=6 package=swiftclient pid=2204

although we would see the message if it succeeded after 3 or more
attempts.

(logger.PanicfWithError()) because one of them started happening (due
to a bug in my changes) and it wasn't very noticeable without a panic.
Most Swift client operations (GET/PUT/POST/HEAD) on objects, containers,
etc. will be retried if they fail, and if the retry succeeds then it's
not an actual error (no error was returned to the caller).

Change error messages in these situations to warning messages. For
example, where we would previously have seen:

> time="2018-08-04T03:34:24.003931Z" level=error msg="swiftclient.accountHead(\"TestAccount\") got readHTTPStatusAndHeaders() bad status" error="HEAD TestAccount returned HTTP StatusCode 500" function=accountHead goroutine=6 package=swiftclient
>                                    ^^^^^^^^^^^
> time="2018-08-04T03:34:24.256097Z" level=info msg="retry.RequestWithRetry(): swiftclient.accountHead(\"TestAccount\") succeeded after 2 attempts in 0.252 sec" function=RequestWithRetry goroutine=6 package=swiftclient pid=2449

indicating that the operation succeeded, we now see:

> time="2018-08-04T03:34:24.003931Z" level=warning msg="swiftclient.accountHead(\"TestAccount\") got readHTTPStatusAndHeaders() bad status" error="HEAD TestAccount returned HTTP StatusCode 500" function=accountHead goroutine=6 package=swiftclient
>                                    ^^^^^^^^^^^^^
> time="2018-08-04T03:34:24.256097Z" level=info msg="retry.RequestWithRetry(): swiftclient.accountHead(\"TestAccount\") succeeded after 2 attempts in 0.252 sec" function=RequestWithRetry goroutine=6 package=swiftclient pid=2449

which does not contain any "level=error" messages.

If all of the retries fail, we will eventually see a "level=error" message like this:

> time="2018-08-04T03:34:23.460393Z" level=error msg="retry.RequestWithRetry(): swiftclient.testRetry.request(3) failed after 5 attempts in 0.152 sec with retriable error" error="Simulate a retriable errror" function=RequestWithRetry goroutine=6 package=swiftclient
>                                    ^^^^^^^^^^^                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
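The policy above can be stated in one function: a failure that will be retried is logged at "warning", and "error" is reserved for the final failure after retries are exhausted. logLevelFor is a hypothetical helper, not a swiftclient function.

```go
package main

// logLevelFor picks the log level for a failed attempt: "warning" while a
// retry is still coming, "error" once retries are exhausted and the
// failure will actually be returned to the caller.
func logLevelFor(attempt, maxAttempts int) string {
	if attempt < maxAttempts {
		return "warning" // a retry may still succeed; not yet an error
	}
	return "error" // retries exhausted; the caller sees the failure
}
```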
The basic issue was that three test cases were launching chunked put
connections by writing to a file... but not flushing the file.
As a result, at the end of the test, where no proper shutdown
is performed, the chunked PUT connection was left "open". There
were three such test cases, meaning that when running the
entire test suite, subsequent tests would certainly exceed,
say, a limit of two.

Still, the question remains why the starvationCallback() didn't
take care of the situation. This calls for the missing starvation
test case.
Remove the calls to testVolumeHandle.Flush(fileInodeNumber, false)
in inode unit tests.
Separate flushing (Close()) of chunked put contexts so the inode lock
is not required.  chunkedPutConnectionPoolStarvationCallback() can
now flush a chunked put connection without needing any inode locks.

Divorce "open log segments", that is, "active" chunked put connections,
from inodes by removing the openLogSegment field and replacing
it with a file-system-wide "cache" (map) of open log segments,
globals.openLogSegmentCache.  If you want to write to a file
(doSendChunk()), then check globals.openLogSegmentCache for a
connection and, if present, use it.  If not, create a new one.
Change doFileInodeDataFlush() to do the necessary cleanup of chunked put
connections in `fileInode.inFlightLogSegmentMap`, including chunked puts
that failed.

Remove the calls to `openLogSegmentLRUTouch()` in `doReadPlan()` and
then delete openLogSegmentLRUTouch() because it's no longer used.  Calling
openLogSegmentLRUTouch() in doReadPlan() is likely a race condition
waiting to core dump, since the chunked put context in question may not be
in the Log Segment LRU in the first place.

Decrease the number of connections in the chunked put connect pool for
the inode and fs unit tests to "2" and "1" respectively.
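The lookup-or-create path described for doSendChunk() might be sketched like this: consult the file-system-wide open log segment cache for an active chunked PUT for this inode and reuse it, otherwise create a new one. The names (doSendChunk, openLogSegmentCache, logSegment) mirror the commit message but the code is illustrative, not the actual ProxyFS implementation.

```go
package main

import "sync"

type logSegment struct {
	inodeNumber uint64
	chunks      [][]byte
}

var (
	cacheLock           sync.Mutex
	openLogSegmentCache = make(map[uint64]*logSegment)
)

// doSendChunk appends buf to the inode's open log segment, creating the
// segment (and, in the real code, its chunked PUT connection) on demand.
func doSendChunk(inodeNumber uint64, buf []byte) *logSegment {
	cacheLock.Lock()
	segment, ok := openLogSegmentCache[inodeNumber]
	if !ok {
		segment = &logSegment{inodeNumber: inodeNumber}
		openLogSegmentCache[inodeNumber] = segment
	}
	cacheLock.Unlock()
	segment.chunks = append(segment.chunks, buf)
	return segment
}
```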
Fix a bug that caused us to hang forever if we hit a race condition in
acquireNonChunkedConnection() or acquireChunkedConnection() by reworking
the code to simplify the synchronization, at the expense of "fairness"
when we are in starvation conditions (we should have flushing at higher
levels that works hard to avoid starvation conditions in the first place).

Part of this change eliminated an endless series of calls to the chunked
put connection pool starvation routine; instead each requestor calls
the starvation callback only once (or again if it is woken up and no
connection slots are available).

This introduces the possibility of a hang if the starvation callback finds
no work to do and the inodes (requests) holding chunked put connections
are never flushed.  The correct solution is to introduce a background
flusher for "open log segments" that ensures that, even if idle, they are
eventually flushed to storage (preferably before the Swift server gives
up on them and fails the chunked put operation).  We need the chunked put
connection flusher regardless of whether there is a starvation condition,
so this doesn't really add a requirement for new code.
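A sketch of the reworked acquire path, under the assumptions above: each waiter invokes the starvation callback once before blocking on a condition variable (and again on each wakeup that finds no free slot), instead of the pool looping endlessly on the callback. connectionPool and its methods are illustrative names.

```go
package main

import "sync"

type connectionPool struct {
	lock               sync.Mutex
	cond               *sync.Cond
	free               int
	starvationCallback func()
}

func newConnectionPool(size int, starvation func()) *connectionPool {
	pool := &connectionPool{free: size, starvationCallback: starvation}
	pool.cond = sync.NewCond(&pool.lock)
	return pool
}

// acquire takes a connection slot. If none are free it calls the
// starvation callback once, re-checks, and only then waits; a wakeup
// that still finds no slot repeats the callback. No fairness is
// guaranteed among waiters under starvation.
func (p *connectionPool) acquire() {
	p.lock.Lock()
	for p.free == 0 {
		p.lock.Unlock()
		p.starvationCallback() // ask a flusher to free a connection
		p.lock.Lock()
		if p.free == 0 {
			p.cond.Wait()
		}
	}
	p.free--
	p.lock.Unlock()
}

func (p *connectionPool) release() {
	p.lock.Lock()
	p.free++
	p.cond.Signal()
	p.lock.Unlock()
}
```

As the commit notes, this simplification trades fairness for liveness: if the callback finds nothing to flush and no holder ever releases, a waiter can still hang, which is why a background flusher for open log segments is wanted regardless.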

Change chunkedPutConnectionPoolStarvationCallback() to Close() a
chunked put connection in a separate goroutine instead of doing
it itself (and blocking).  This decreases the latency of a call to
chunkedPutConnectionPoolStarvationCallback() to nearly zero, but also
means that the condition is probably not resolved by the time it returns.

Collapse the chunked and nonChunked connection pool code into a single set
of functions that operate on a *connectionPoolStruct.  This avoids bugs
being fixed in the chunked (or nonChunked) pool code and not the other.
This also increases unit test coverage by having all connection related
tests test the same set of code.