
Conversation

@garrettmoon

  1. Re-architected everything to be synchronous by default, so the asynchronous methods are simply wrapped versions of the synchronous ones (a rough sketch of this pattern follows below).
  2. Changed TMMemoryCache to use pthread_locks instead of dispatch_queues to protect resources (this is where we lose performance; context switching is faster than locks).
  3. Added an integration test which demonstrates the deadlocking (but passes with the above changes).

This results in a slight performance degradation, but it resolves deadlocking issues. Would love to get feedback on this or hear other ideas for resolving the issue.
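
Roughly, the wrapping pattern from point 1 looks like this (a hypothetical sketch with illustrative method and ivar names, not the exact API):

- (id)objectForKey:(NSString *)key
{
    // Synchronous by default: take the lock, read, release, return.
    [self lockCache];
    id object = _dictionary[key];
    [self unlockCache];
    return object;
}

- (void)objectForKey:(NSString *)key block:(void (^)(id object))block
{
    // The asynchronous variant is just the synchronous one wrapped in a
    // dispatch_async, so no thread ever blocks waiting for a signal.
    dispatch_async(_asyncQueue, ^{
        id object = [self objectForKey:key];
        if (block)
            block(object);
    });
}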

@irace
Contributor

irace commented Dec 4, 2014

@garrettmoon Are both 1 and 2 necessary to resolve the deadlocking issues?

@mkauppila mentioned this pull request Dec 4, 2014
@garrettmoon
Author

Yes, using dispatch semaphore signal/wait to create synchronous versions of asynchronous methods seems to be a surefire way to cause deadlocking when threads become scarce.
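
For illustration, the pattern I mean looks roughly like this (a sketch, not the exact code): the synchronous method waits on a semaphore that is only signaled from a block dispatched onto an internal queue, so if every GCD worker thread is already stuck in that wait, the block never runs and the wait never returns.

- (id)objectForKey:(NSString *)key
{
    __block id object = nil;
    dispatch_semaphore_t semaphore = dispatch_semaphore_create(0);

    // The asynchronous variant dispatches its work onto an internal queue;
    // when threads are scarce, that block may never get a chance to run.
    [self objectForKey:key block:^(id obj) {
        object = obj;
        dispatch_semaphore_signal(semaphore);
    }];

    // If the block above can't be scheduled, this wait never returns.
    dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER);
    return object;
}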

@garrettmoon
Author

As far as number 2 goes, relying on a queue to guard shared resources has the same issue with thread starvation. I've tried to contact Apple about this, but got no response.

I'm open to the idea that I'm missing something.

@irace
Contributor

irace commented Dec 4, 2014

So basically the semaphore is used to block the main thread but you're then unable to dispatch to the state queue, since there aren't any threads available, meaning the semaphore never gets signaled and the main thread waits forever?

Do you think there's anything that TMCache is specifically doing that results in threads becoming scarce in the first place?

@garrettmoon
Author

That's correct. It can happen off the main thread too, which locks up TMCache but not the main thread.

Here's the thing: TMCache isn't doing anything that would necessarily make threads scarce, but it's definitely possible to make threads scarce and trigger this by simply calling the synchronous method a couple of times in a dispatch_async. This is obviously a contrived example, but it exposes the issue.
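
Something like this is all it takes (a hypothetical usage example, assuming the current synchronous objectForKey: and the sharedCache singleton):

// Contrived, but enough to starve the pool: each block occupies a GCD worker
// thread, and the synchronous objectForKey: inside it waits for yet another
// block to run on an internal queue.
for (NSUInteger i = 0; i < 100; i++) {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        id object = [[TMCache sharedCache] objectForKey:@"someKey"];
        NSLog(@"fetched %@", object);
    });
}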

I'm of the opinion that a library shouldn't expose a way for users to shoot themselves in the foot like this. As a developer, I should be able to dispatch as many operations as I want without having to worry about this type of thing; I shouldn't have to marshal the number of things I'm putting on disparate queues in my application. For example, I could create an operation queue to limit the number of calls I make into TMCache to some arbitrary number, but if I have a library elsewhere with the same pattern, I'd have to marshal that one in conjunction with TMCache. On top of all of this, Grand Central Dispatch doesn't guarantee a number of available threads.

Basically, it seems unsafe to lock resources with queues.

@irace
Contributor

irace commented Dec 4, 2014

Basically, it seems unsafe to lock resources with queues.

Which is crazy if true considering it's exactly what Apple recommends in the “Eliminating Lock-Based Code” section of the Concurrency Programming Guide.

@garrettmoon
Author

I know. I thought the same thing when I read that.

@garrettmoon
Author

Like I said, I could be wrong about all this; I'm open to that idea. But give my integration test a try on the library as-is.

@garrettmoon
Author

When you try out testDeadlocks, be sure to pause and check out the stack trace. You'll notice bunches of threads all waiting.

@garrettmoon
Author

Although, rereading the document, it says you should be using dispatch semaphores, not queues, to regulate resources. If a dispatch_semaphore were used instead of pthread locks, it might not cause this issue. I decided on pthreads so that I could use a read/write lock.
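
For reference, the semaphore-based locking the guide suggests would look roughly like this (a sketch with an illustrative class name and ivars, not the actual TMMemoryCache code). Unlike a pthread read/write lock, a semaphore used this way doesn't distinguish readers from writers.

@interface SketchMemoryCache : NSObject
@end

@implementation SketchMemoryCache {
    dispatch_semaphore_t _lockSemaphore;
    NSMutableDictionary *_dictionary;
}

- (instancetype)init
{
    if ((self = [super init])) {
        // An initial value of 1 makes the semaphore behave like a mutex.
        _lockSemaphore = dispatch_semaphore_create(1);
        _dictionary = [[NSMutableDictionary alloc] init];
    }
    return self;
}

- (id)objectForKey:(NSString *)key
{
    dispatch_semaphore_wait(_lockSemaphore, DISPATCH_TIME_FOREVER);
    id object = _dictionary[key];
    dispatch_semaphore_signal(_lockSemaphore);
    return object;
}

- (void)setObject:(id)object forKey:(NSString *)key
{
    dispatch_semaphore_wait(_lockSemaphore, DISPATCH_TIME_FOREVER);
    _dictionary[key] = object;
    dispatch_semaphore_signal(_lockSemaphore);
}
@end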

@irace
Contributor

irace commented Dec 14, 2014

@garrettmoon I've talked this over with @jstn and we're definitely in favor of moving over to the synchronous methods being wrapped by the asynchronous ones, instead of the other way around. This should reduce the number of threads used, and as such, the number of deadlocks encountered.

That said, neither of us is thrilled about potentially using pthread locks instead of GCD queues. I'd be interested in seeing whether simply flipping the asynchronous/synchronous wrapping would have the desired effect without:

  • Making even broader changes to the codebase
  • Raising the barrier to entry by using a technology that most iOS developers aren't familiar with (and that Apple doesn't even seem to recommend for this purpose)

(We also noticed a potential race condition in your unit test, since the enumCount variable can be incremented by numerous threads without synchronization.)

@garrettmoon
Author

Thanks for catching that on the test, I updated it :)

So, I can tell you that switching to asynchronous methods wrapping synchronous ones was the first thing I tried (now I wish I hadn't squashed my commits). It may alleviate the issue, but it won't fix it.

I can't address the 'making even broader changes to the codebase' issue, I understand this is a frustratingly large change.

I can address the lack of knowledge about pthreads, though: it's very simple to switch to dispatch semaphores. In fact, I tried that out, and it surprisingly seems to perform better than pthreads. With this change, the performance is nearly what it was originally (it's ~2.65% slower than using queues). This is what the Apple docs actually recommend, not queues. I've pushed this change.

Anyway, I understand the desire to not rewrite half the library. Let me know what you all decide :)

@irace
Contributor

irace commented Feb 13, 2015

@garrettmoon Sorry that you felt the need to close this. I must yet again apologize for having been negligent/busy – I haven’t had nearly the free time necessary to perform the amount of testing it’d require for me to feel comfortable merging such a large change.

That said, I still think it’s the right change to make, and I would certainly like to, but definitely think you should move ahead with a local fork in order to not let us hold you up. This is the beauty of OSS 😄

To clarify, since you’re not using pthreads anymore, you’ve simply moved everything to be synchronous by default, correct? And are using semaphores to implement locking to keep things thread-safe? Do you think there’s much of a difference left between this PR and #37?

@garrettmoon
Author

Hmm, it's still significantly different from the PR on #37. Sadly, in our testing, that PR isn't even thread safe. It also won't solve the underlying issue entirely. We've decided that you're right, forking is probably best for all involved. You can read more about it here:

http://engineering.pinterest.com/post/112057905219/open-sourcing-pincache

@irace
Contributor

irace commented Feb 25, 2015

@garrettmoon Great blog post. I’m thrilled to see the changes hit production and am happy that this approach has been working out for you. As mentioned before, I think this is totally the right approach; we just don’t have the resources at the moment.

That said, I’m a little disappointed that it isn’t a proper GitHub fork of this repository, and that TMCache isn’t mentioned at all in the README 😢

@garrettmoon
Author

@irace Whoops, I went too far trying to make sure I wasn't treading on existing TM's. New README indicates that it's a fork in the first sentence. Sorry about that!

@irace
Contributor

irace commented Feb 26, 2015

@garrettmoon Thanks, Garrett! Out of curiosity, how come you chose not to make it an “official” GitHub fork?

@garrettmoon
Author

Entirely unintentional. I forked, moved to a private repo to work on it internally, and published from there. Apparently something got lost in the transition?

@irace
Contributor

irace commented Feb 26, 2015

@garrettmoon Are all of the changes that you ended up making (related to thread starvation, I know that PINCache has other changes as well) still encompassed in this branch? If so, can I ask you to reopen? If not, would you be able to add them here?

We’re considering putting TMCache into basically a “maintenance mode” – and suggesting new users use PINCache or something else under more active development – but this seems important enough to dedicate the resources towards getting it merged in and released before doing so.

@irace
Contributor

irace commented Mar 3, 2015

@garrettmoon Ping! 😄

@garrettmoon
Author

Sadly, no, it doesn't encompass all the changes, because I realized the same issues, while likely encountered less frequently, could affect the disk cache. In making those changes, I was required to modify the public-facing APIs.

So, the original pull request encompasses most of the changes and would address the issue in probably all but extreme circumstances. It probably also needs a bit of cleanup. Would you like me to open up a new pull request with them (I can't reopen the old, I removed my repo)?

It seems more prudent to simply point people who are seeing the issue to PINCache and not force others through the change?

@irace
Contributor

irace commented Mar 3, 2015

Would you like me to open up a new pull request with them (I can't reopen the old, I removed my repo)?

If it’s not too much work, this would be greatly appreciated.

It seems more prudent to simply point people who are seeing the issue to PINCache and not force others through the change?

We’re most likely going to do this for folks who would prefer to adopt something that’s going to be more regularly maintained and improved going forward. That said, if we’re going to “cease development” on TMCache I’d prefer it to be with most high-priority known issues having already been addressed. I can see people who already use it upgrading without coming here to read all about it.

@garrettmoon
Author

I'll speak with my colleagues about prioritizing the work.

@garrettmoon
Author

Hey Bryan,

Spoke with my colleagues: I'm happy to send you the original patch, but our position is that we'd rather contribute to the OSS community by working on PINCache. So it will be as-is, and we'd ask that you all make the effort to review it, fix any issues, and deal with the pull request. Feel free to send me an email if you'd like to pursue this, and I'll reply with the diff.

@liuliu

liuliu commented May 17, 2015

Is the underlying problem that GCD has an upper limit on how many threads it manages, and therefore we can reach that limit by dispatching a lot of waiting blocks on the global concurrent queue?

It seems this suggests that everyone should use serial queues throughout their implementations; otherwise, if you are loading 1,000 images on the global concurrent queue, you may run into a lot of problems yourself (and really, why would you want to do that?).
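
A minimal sketch of what I mean by sticking to a serial queue (illustrative only; the queue name and key are made up):

// Illustrative only: funneling the cache calls through one serial queue means
// at most one worker thread is blocked inside the cache at a time, so these
// blocks alone can't exhaust GCD's thread pool.
dispatch_queue_t imageQueue = dispatch_queue_create("com.example.image-loading", DISPATCH_QUEUE_SERIAL);
for (NSUInteger i = 0; i < 1000; i++) {
    dispatch_async(imageQueue, ^{
        id image = [[TMCache sharedCache] objectForKey:@"someKey"];
        NSLog(@"loaded %@", image);
    });
}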

@liuliu

liuliu commented May 17, 2015

To be more specific, you can make the test case @garrettmoon provided pass:

- (void)testDeadlocks
{
    NSString *key = @"key";
    NSUInteger objectCount = 1000;
    [self.cache setObject:[self image] forKey:key];
    dispatch_queue_t testQueue = dispatch_queue_create("test queue", DISPATCH_QUEUE_CONCURRENT);

    NSLock *enumCountLock = [[NSLock alloc] init];
    __block NSUInteger enumCount = 0;
    dispatch_group_t group = dispatch_group_create();
    for (NSUInteger idx = 0; idx < objectCount; idx++) {
        dispatch_group_async(group, testQueue, ^{
            [self.cache objectForKey:key];
            [enumCountLock lock];
            enumCount++;
            [enumCountLock unlock];
        });
    }

    dispatch_group_wait(group, [self timeout]);
    STAssertTrue(objectCount == enumCount, @"was not able to fetch 1000 objects, possibly due to deadlock.");
}

by doing:

- (void)testDeadlocks
{
    NSString *key = @"key";
    NSUInteger objectCount = 1000;
    [self.cache setObject:[self image] forKey:key];
    dispatch_queue_t testQueue = dispatch_queue_create("test queue", DISPATCH_QUEUE_CONCURRENT);
    // Targeting the overcommit global queue (the 0x2ull flag) lets GCD create
    // additional threads beyond its usual limit for blocks on testQueue.
    dispatch_set_target_queue(testQueue, dispatch_get_global_queue(0, 0x2ull));

    NSLock *enumCountLock = [[NSLock alloc] init];
    __block NSUInteger enumCount = 0;
    dispatch_group_t group = dispatch_group_create();
    for (NSUInteger idx = 0; idx < objectCount; idx++) {
        dispatch_group_async(group, testQueue, ^{
            [self.cache objectForKey:key];
            [enumCountLock lock];
            enumCount++;
            [enumCountLock unlock];
        });
    }

    dispatch_group_wait(group, [self timeout]);
    STAssertTrue(objectCount == enumCount, @"was not able to fetch 1000 objects, possibly due to deadlock.");
}

This enables the overcommit behavior on the testQueue, and GCD will actually create enough threads to cope with this particular situation. I am not sure why OVERCOMMIT (the 0x2ull parameter maps to https://libdispatch.macosforge.org/trac/browser/trunk/private/queue_private.h#L46) is not the default behavior, but the GCD designers may have some very good reasons. Can someone file a rdar to get clarification from them?

