Fix for crash when using NSAutoreleasePool on macOS #1414

MarkOates · 2023-03-01T23:08:48Z

Problem

This pull request fixes a crash that can occur when using NSAutoreleasePool during al_destroy_display in macOS. The crash appears inconsistently, at different times, with different types of crashes, and sometimes not at all - but only occurs after a call to al_destroy_display(). Also, the crashes only began to appear after I upgraded macOS to Ventura (13.1) on M1.

I believe it's the result of a double-free.

Possible Solution

To fix this issue, the line al_free(d) was removed from the end of al_destroy_display(). In this case, I'm guessing the declaration of ALLEGRO_DISPLAY_OSX_WIN* dpy = (ALLEGRO_DISPLAY_OSX_WIN*) d; at the top of the function within the scope of a NSAutoreleasePool may have resulted in the [pool drain] destroying dpy. Then, with al_free(d) having already destroyed the resource, the crash could possibly occur at some managed point after a pool drain, once the draining thread had a chance to address it.

Now, I'm not familiar with Objective-C and not entirely confidant with how Allegro 5 manages its memory and contexts internally, so this is just a guess to the solution/cause. Removing the al_free(d) (or removing the [pool drain] line entirely) did stop the crashes.

Recreating the Issue

This issue only appeared on macOS (Ventura 13.1, M1), with the fix here resulting in no more crashing.

Also, I tested the change on macOS (Monterey 12.5.1, Intel), which did not produce the issue before, and it still continues to run normally.

Easiest way to reproduce the problem (if you can reproduce the ✨secret hardware configuration✨) might be with this code below. Running the program will likely reproduce a crash 9 out of 10 runs.

#include <allegro5/allegro.h>
#include <iostream>

int main(int argc, char** argv)
{
   al_init();
   ALLEGRO_DISPLAY *display = nullptr;

   int attempts = 20;

   for (int attempt=0; attempt<attempts; attempt++)
   {
      std::cout << "creating display (attempt " << attempt << ")" << std::endl;
      display = al_create_display(1920, 1080);
      std::cout << "   display created" << std::endl;
      std::cout << "destroying display" << std::endl;
      al_destroy_display(display);
      std::cout << "   display destroyed" << std::endl;
   }

   std::cout << "Process finished." << std::endl;

   al_uninstall_system();
   return 0;
}

MarkOates · 2023-03-02T14:59:51Z

I've attached some crash report info here:

Zip contains the results of 10 runs. Each run that crashed (9 of 10) includes a backtrace of all the threads, and if a crash report was output to the terminal that is included as well.

The terminal's output of the crash reports did appear threaded, so it's possible there could be multiple crash reports threaded together in the output. Backtraces are clean though.

crash_reports.zip

allefant · 2023-03-02T16:11:33Z

How did you create the backtraces? E.g. the stack trace in run05-crash-report.txt does not match the backtrace in run05-backtrace.txt, ideally it would be the one from the crash but it seems to be from some time afterwards?

allefant · 2023-03-02T16:12:53Z

I usually do this:

lldb ./my-app
r
(wait for crash)
bt all

Then the console should contain the crash output as well as the backtraces.

MarkOates · 2023-03-02T16:29:22Z

lldb ./my-app
r
(wait for crash)
bt all

Yes, that's actually exactly what I did. The stack traces and crash reports do not match. 🤷‍♂️

Here are some screenshots of the terminal (different kinds of crashes, some with reports output to terminal, some not): more_crash_reports-screenshots.zip

It's honestly the weirdest problem I've ever observed/debugged.

pedro-w · 2023-03-02T22:16:54Z

I noticed the crash seems to happen quite often when creating the new display, in [NSOpenGLContext makeCurrentContext] where presumably making a new context current means making the old context not current, but the old context might have been destroyed by then. Is that what you're thinking? (sorry just trying to catch up)
At this point we're in the main thread and we don't do graphics on the main thread, so the reason for making the new context current is (maybe?) just so the calls to glGetString() will succeed in osxgl.m:1640-42.
I don't know if releasing a context when it's current automatically causes it to be made not current, so there are two possible things to look at

call [NSOpenGLContext clearCurrentContext] when destroying a display
make sure the release is called on the same thread as the alloc, currently it isn't as the former is on the user's thread and the latter is on the main thread

Could you try making some changes to osxgl.m as below and seeing if it makes any difference? Frustrating because I don't have your hardware and I can't ever get it to crash like this.

diff --git a/src/macosx/osxgl.m b/src/macosx/osxgl.m
index a4db583e0..69319365c 100644
--- a/src/macosx/osxgl.m
+++ b/src/macosx/osxgl.m
@@ -1774,9 +1774,11 @@ static void destroy_display(ALLEGRO_DISPLAY* d)
          [dpy->win close];
          dpy->win = nil;
       }
+      // Context was alloc'd on the main thread so should be released on the same thread
+      [NSOpenGLContext clearCurrentContext];
+      [dpy->ctx release];
    });
    _al_ogl_unmanage_extensions(&dpy->parent);
-   [dpy->ctx release];
    [dpy->cursor release];
    _al_event_source_free(&d->es);
    al_free(d->ogl_extras);

MarkOates · 2023-03-03T16:55:38Z

Could you try making some changes to osxgl.m as below and seeing if it makes any difference?

I tried out these changes but unfortunately it didn't seem to have any affect on the amount/number of crashes.

Frustrating because I don't have your hardware and I can't ever get it to crash like this.

I hear that. I wish there was something I could do to mitigate it. Bear in mind, I'd be more than happy to attempt any other ideas you may want to try out though.

pedro-w · 2023-03-05T13:36:23Z

I'm running a bit low on ideas. Still thinking about the fact the crash seems to come from with [NSOpenGLContext dealloc] could you try moving the call to [NSOpenGLContext clearCurrentContext] (line 1780) in your screenshot to the start of the function (just after the line unsigned int i)?

Because al_free is just a wrapper around free, if removing it fixes the crash it must be that something is holding a reference to part of the display structure which is modified (i.e. corrupted) by the action of free.

pedro-w · 2023-03-05T13:46:24Z

One other thing - I don't think the autorelease pool is needed in destroy_display, maybe try getting rid of it altogether - if it was needed you should see a message printed to the console when you run it

Remove al_free() possibly conflicting with pool drain

42588c0

MarkOates force-pushed the remove-double-free branch from b0f62ed to 42588c0 Compare March 28, 2023 15:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for crash when using NSAutoreleasePool on macOS #1414

Fix for crash when using NSAutoreleasePool on macOS #1414

MarkOates commented Mar 1, 2023

MarkOates commented Mar 2, 2023

allefant commented Mar 2, 2023

allefant commented Mar 2, 2023 •

edited

Loading

MarkOates commented Mar 2, 2023 •

edited

Loading

pedro-w commented Mar 2, 2023

MarkOates commented Mar 3, 2023

pedro-w commented Mar 5, 2023 •

edited

Loading

pedro-w commented Mar 5, 2023

Fix for crash when using NSAutoreleasePool on macOS #1414

Are you sure you want to change the base?

Fix for crash when using NSAutoreleasePool on macOS #1414

Conversation

MarkOates commented Mar 1, 2023

Problem

Possible Solution

Recreating the Issue

MarkOates commented Mar 2, 2023

allefant commented Mar 2, 2023

allefant commented Mar 2, 2023 • edited Loading

MarkOates commented Mar 2, 2023 • edited Loading

pedro-w commented Mar 2, 2023

MarkOates commented Mar 3, 2023

pedro-w commented Mar 5, 2023 • edited Loading

pedro-w commented Mar 5, 2023

allefant commented Mar 2, 2023 •

edited

Loading

MarkOates commented Mar 2, 2023 •

edited

Loading

pedro-w commented Mar 5, 2023 •

edited

Loading