Massive GDI (region) leak. Help needed. #11334

kirsan31 · 2024-05-08T13:42:17Z

.NET version

.NET 7 and .NET 8

Did it work in .NET Framework?

Yes

Did it work in any of the earlier releases of .NET Core or .NET 5+?

We didn't see these problems before .NET 7.

Issue description

For several months we have been trying to track down a huge GDI (region) leak in our app. The leak is critical because it can reach the GDI handle limit (10k) in one day.

The leaking GDI objects are regions.
This is not a managed leak: there are no leaking managed objects in the dumps.
The leak starts suddenly and most of the time continues until the limit is hit or the app is restarted, but sometimes it stops on its own.
Since the objects are regions, the leak is tied to the redrawing of elements, and most of the time it happens around RDP connect/disconnect.
We can't reproduce this leak :(
We once managed to catch this behavior using performance HUD. That is quite problematic, because HUD slows down the work PC. What we saw was very strange: it felt as if things that had never leaked before suddenly started to leak. Unfortunately, the call stacks saved by HUD (*.hudinsight) do not show method names when viewed (maybe someone knows how to overcome this?) :( Even the stacks I copied manually as text turned out to be truncated because of their size. So I will present here what I managed to get (sorry for this).

1. 479 leaked regions due to redrawing (almost all are system calls):

1.csv
2. 84 leaked regions when closing all open MDI child forms. Closing all windows is done through the menu; when the menu is opened, its child menu is filled with the open forms (15 in our case). A big part of the leak is in MenuStrip -> Control.SetBoundsCore -> SetWindowPos. The call stacks ToolStripDropDownItem.OnDropDownOpened -> EtwWriteTransfer and ToolStripDropDownItem.OnDropDownClosed -> EtwWriteTransfer are full.

2.csv
The logic in tsmiWidow_DropDownOpened populates the child DropDownItems with 15 (in this situation) items.
The logic in tsmiWidow_DropDownClosed clears all previously added items:

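// Remove items from the end until the drop-down is empty.
while (tmi.DropDownItems.Count > 0)
{
    ToolStripItem ti = tmi.DropDownItems[tmi.DropDownItems.Count - 1];
    ti.MouseDown -= ActivateW; // unsubscribe the handler
    var img = ti.Image;        // keep a reference so the image can be disposed after the item
    ti.Dispose();              // disposing the item also removes it from DropDownItems
    img?.Dispose();
}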

There are no managed leaks; all objects were properly deleted (this is not always 100% true, as I explain below).
It is very strange that the leak occurs both when adding and when removing elements. In 99% of cases everything works completely correctly.
While researching I found a small managed leak here:

This small managed leak is reproducible and can't lead to such catastrophic consequences. It can easily be fixed with a WeakReference here (I will open a PR later):

winforms/src/System.Windows.Forms/src/System/Windows/Forms/Input/MouseHoverTimer.cs

Lines 10 to 11 in 7504692

    
    // Consider - weak reference?
    private ToolStripItem? _currentItem;
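
A minimal sketch of the weak-reference idea, not the actual WinForms change: `Item` and `HoverTracker` are hypothetical stand-ins for `ToolStripItem` and `MouseHoverTimer`. Holding the current item through `WeakReference<T>` means the tracker by itself no longer keeps a removed item alive.

```csharp
using System;

// Hypothetical stand-in for ToolStripItem.
public sealed class Item
{
    public string Name { get; }
    public Item(string name) => Name = name;
}

// Hypothetical stand-in for MouseHoverTimer.
public sealed class HoverTracker
{
    // Weak reference: the tracker does not root the item it last saw.
    private WeakReference<Item>? _currentItem;

    // Remember the item the mouse is currently hovering over.
    public void Start(Item item) => _currentItem = new WeakReference<Item>(item);

    // Forget the current item.
    public void Cancel() => _currentItem = null;

    // Returns the tracked item, or null if none is set or it has been collected.
    public Item? Current =>
        _currentItem is not null && _currentItem.TryGetTarget(out Item? item) ? item : null;
}
```

Callers have to treat `Current` as possibly null at any time, which is the price of not rooting the item.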

3. 800+ leaked regions (maybe not all of the 937 are leaked), including points 1 and 2. This is also done through the menu: one submenu is filled with and cleared of 43 items in our case, exactly as in point 2. These are all the regions that performance HUD detected after closing all child windows. In this state our app normally consumes about 100 GDI objects, 10 of which are regions. In this situation there were about 5500 GDI objects, 5400 of which were regions.

all.csv

OS: Windows 10 Pro for Workstations 22H2.

In conclusion, it seems to me that the problem is somewhere in WinForms, or even in the OS. Any assistance in further investigation is greatly appreciated. 🙏

Steps to reproduce

--


JeremyKuhne · 2024-05-08T21:16:05Z

@kirsan31 I'm very interested in looking more deeply at this. Unfortunately, I'm tied up for a number of weeks on critical BinaryFormatter work. If you have some mitigations like the WeakReference piece, I'm happy to take a look at PRs there.

When the other work is done, I can try to see what I might be able to find out.

Also, have you been able to repro the same thing with .NET 9?

kirsan31 · 2024-05-09T07:13:37Z

@JeremyKuhne

> Also, have you been able to repro the same thing with .NET 9?

I can’t reproduce this at all (no matter how hard I try). It only happens on a work machine and only while it is in use. Moreover, leaks have never started immediately after launching the application, only after a few days. So my ability to experiment there is very limited, and I cannot use .NET 9 :(
I will continue my experiments...

@weltkante sorry to bother you, but maybe you have some ideas?

> I'm tied up for a number of weeks on critical BinaryFormatter work.

By the way, I have a question on this topic that no one has answered yet.

weltkante · 2024-05-09T08:36:27Z

> @weltkante sorry to bother you, but maybe you have some ideas?

No problem, unfortunately this is nothing I've come across in the past. So far I've always been able to rely on managed leaks and memory snapshots/dumps to compare, or being able to reproduce the problem locally and do a time travel debug trace for inspecting the unmanaged leaks. Seems like neither is an option for you.

If I had to diagnose this issue I'd probably try to isolate what affects it:

Make a dump of the leaking process from Task Manager and check for any unexpected third-party DLLs that may have injected themselves into the process.
Have a second machine set up and used. If it never happens on another machine, the machine may simply be broken or the Windows installation may be corrupted.
If possible, consider some sort of virtualization for the second setup for easy transfer between machines. I've been using Hyper-V based Windows Sandbox scripts lately to set up isolated applications in tricky cases, but there may be alternatives that are easier.

weltkante · 2024-05-09T08:49:42Z

Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.
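
Besides inspecting dumps, a rough in-process probe can hint at whether the finalizer thread is still making progress. The sketch below is only an illustration under assumptions: `FinalizerProbe` and `Canary` are hypothetical names, and a `false` result is a hint to look closer, not proof of a hang. It allocates a finalizable object, forces a collection, and waits a bounded time for the finalizer to run.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

public static class FinalizerProbe
{
    // Finalizable object whose only job is to signal that its finalizer ran.
    private sealed class Canary
    {
        private readonly ManualResetEventSlim _signal;
        public Canary(ManualResetEventSlim signal) => _signal = signal;
        ~Canary() => _signal.Set();
    }

    // Returns true if a freshly allocated canary was finalized within the timeout.
    public static bool IsFinalizerResponsive(TimeSpan timeout)
    {
        var signal = new ManualResetEventSlim(false);
        CreateCanary(signal);

        // Make the canary eligible for finalization. GC.WaitForPendingFinalizers is
        // deliberately not used, because it would block forever on a stuck finalizer thread.
        GC.Collect();
        return signal.Wait(timeout);
    }

    // Allocated in a separate, non-inlined method so no local keeps the canary alive.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CreateCanary(ManualResetEventSlim signal) => _ = new Canary(signal);
}
```

Calling `FinalizerProbe.IsFinalizerResponsive(TimeSpan.FromSeconds(10))` periodically and logging the result would show whether finalization stalls around the time the leak starts.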

kirsan31 · 2024-05-09T12:37:29Z

@weltkante

> Make a dump of the leaking process from Task Manager and check for any unexpected third-party DLLs that may have injected themselves into the process.

Nice idea, just checked: everything is OK here. But the probability was extremely low anyway, because a work PC has very tight restrictions on installed software and Internet use.

> Have a second machine set up and used. If it never happens on another machine, the machine may simply be broken or the Windows installation may be corrupted.
> If possible, consider some sort of virtualization for the second setup for easy transfer between machines. I've been using Hyper-V based Windows Sandbox scripts lately to set up isolated applications in tricky cases, but there may be alternatives that are easier.

Due to the specifics of our environment, there is no way for us to set up either a second physical machine or a virtual one. And users won’t approve of this; it’s easier for them to restart the application every few days :)

> Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.

There's nothing wrong there: other objects are finalized normally, and most regions too. Also, in dumps taken after a GC, the list of objects ready for finalization is empty and there is nothing extraordinary among the dead objects.

Thank you for your attention anyway 🙏

kirsan31 · 2024-05-10T13:29:03Z

A small update on what I found out.

As a result, native functions leak; very often it is SetWindowPos.
When running our two applications in parallel, the leaks are not the same. In one application every menu rebuild leaks, while in the other nothing leaks at all. It follows that the problem is not system-wide, but begins in a particular process after certain conditions are met (and can stop after certain conditions).

weltkante · 2024-05-10T14:18:11Z

> As a result, native functions leak; very often it is SetWindowPos.

Just as a side note, to keep other people reading this from drawing the wrong conclusions: SetWindowPos can trigger a lot of messages, including callbacks into managed code, so it's unlikely to be the direct cause of the leak.

kirsan31 · 2024-05-10T14:29:18Z

> SetWindowPos can trigger a lot of messages, including callbacks into managed code, so it's unlikely to be the direct cause of the leak.

Of course that's not the direct cause of the leak. And SetWindowPos really does trigger a lot of messages, but none of them calls back into managed code:

I pointed to this method as the most common last managed method in the call stacks.

weltkante · 2024-05-10T14:45:48Z

> but none of them calls back into managed code

Sounds weird, all WinForms controls should, at the very least, go through the managed message handler of the control. And SetWindowPos can (and usually does) trigger resize and redraw logic, both of which can have managed event handlers that need to be dispatched too, even if they are empty.

Anyways, just meant to say that seeing this method as the call root doesn't mean the problem is guaranteed to be in native code.

lonitra · 2024-07-23T18:45:48Z

@kirsan31 Do you think you could provide a consistent repro for us to investigate this?

kirsan31 · 2024-07-23T18:49:25Z

> @kirsan31 Do you think you could provide a consistent repro for us to investigate this?

I hope so... Currently the issue still exists (very sporadically) and I can't find the root cause :(

JeremyKuhne · 2024-07-24T23:18:10Z

@kirsan31 as soon as we get actionable stuff here we can assign it to whatever the current release is.

kirsan31 · 2024-08-06T10:49:56Z

Once again we were able to catch the leaks live and use performance HUD. What we found out:

Leaks are completely tied to the RDP session: they occur on connection, on disconnection, and when manipulating the RDP window; one of the leaks happened simply when minimizing the RDP window.
Leaks do not depend on the RDP client application.
Leaks do not depend on the client OS.
Two of our applications were running, and leaks were observed in both simultaneously.
As I already wrote, it feels like absolutely everything related to redrawing starts to leak. Moreover, this is not a constant process: not every connect/disconnect/change of the RDP window entails a leak.
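
To correlate the handle count with RDP events without running HUD all the time, one option is to sample the process's GDI handle count and log it. The sketch below is an assumption-laden illustration rather than anything used in this thread: `GdiHandleSampler` is a hypothetical helper built on the Win32 `GetGuiResources` function, which reports total GDI and USER object counts for a process but does not break them down by type, so it cannot single out regions.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class GdiHandleSampler
{
    private const uint GR_GDIOBJECTS = 0;  // number of GDI objects
    private const uint GR_USEROBJECTS = 1; // number of USER objects

    [DllImport("user32.dll", SetLastError = true)]
    private static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    // Returns the GDI and USER object counts of the current process,
    // or null when not running on Windows.
    public static (uint Gdi, uint User)? Sample()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return null;

        using Process self = Process.GetCurrentProcess();
        IntPtr handle = self.Handle;
        return (GetGuiResources(handle, GR_GDIOBJECTS), GetGuiResources(handle, GR_USEROBJECTS));
    }

    // Formats one log line, e.g. for a timer tick or a session-change notification.
    public static string FormatLine(string reason, (uint Gdi, uint User)? sample) =>
        sample is null
            ? $"{reason}: GDI/USER counts unavailable"
            : $"{reason}: GDI={sample.Value.Gdi} USER={sample.Value.User}";
}
```

Calling `Sample()` from a timer, and additionally from a `SystemEvents.SessionSwitch` handler, would show whether the count jumps right after a remote connect or disconnect.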

This time I copied all the stacks for both applications from performance HUD, in several passes (I will provide them all if necessary). But they don’t give anything new: all the stacks are some kind of drawing of a menu, tooltip, etc., and they always end with EtwWriteTransfer. Example:

leaked.mp4

From all this, and the fact that such behavior was not observed before .NET 7, I have only two possible explanations: either this is somehow a Windows bug/corruption (that appeared with some system update) or, after all, a regression in .NET.

Does anyone have any other ideas or tips?

P.S. Why do all messages go through the Office component ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?
//cc @JeremyKuhne

weltkante · 2024-08-06T12:45:27Z

> P.S. Why do all messages go through the Office component
> ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?

That's just an interface for Office/Visual Studio compatibility; the naming is historical. WinForms uses its own implementation if no Office/VS host is detected to provide the interface implementation.

JeremyKuhne · 2024-08-06T16:52:57Z

> ComponentManager.FPushMessageLoop?
>
> That's just an interface for Office/Visual Studio compatibility; the naming is historical. WinForms uses its own implementation if no Office/VS host is detected to provide the interface implementation.

Note that this will now be turned off by default (.NET 9). It can be turned back on with the "Switch.System.Windows.Forms.EnableMsoComponentManager" switch.

Even with the stub it was a fair amount of overhead for message processing. As we had to rewrite all of our COM for ComWrappers we took the opportunity to simplify the message loop.
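
For anyone who needs the old behavior back, a minimal sketch of setting the switch from code, assuming it is read through the standard `AppContext` switch mechanism and must be set before the first message loop starts (for example at the top of `Main`). `MessageLoopConfig` is a hypothetical helper; only the switch name comes from the comment above.

```csharp
using System;

public static class MessageLoopConfig
{
    // Switch name as quoted in this thread.
    public const string SwitchName = "Switch.System.Windows.Forms.EnableMsoComponentManager";

    // Assumption: call this before Application.Run so the setting is seen in time.
    public static void EnableMsoComponentManager() => AppContext.SetSwitch(SwitchName, true);

    // Reports whether the switch is currently set to true in this process.
    public static bool IsMsoComponentManagerEnabled() =>
        AppContext.TryGetSwitch(SwitchName, out bool enabled) && enabled;
}
```

`AppContext.TryGetSwitch` only reflects what has been set for the process; it does not tell whether WinForms has already read the value.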

kirsan31 added the untriaged The team needs to look at this issue in the next triage label May 8, 2024

lonitra added this to the .NET 9.0 milestone May 8, 2024

lonitra removed the untriaged The team needs to look at this issue in the next triage label May 8, 2024

elachlan added tenet-performance Improve performance, flag performance regressions across core releases 💥 regression-release Regression from a public release labels May 9, 2024

kirsan31 mentioned this issue May 11, 2024

Fix ToolStrip memory leak due to MouseHoverTimer and #4808 #11358

Merged

lonitra removed the 💥 regression-release Regression from a public release label Jul 23, 2024

lonitra added the 📭 waiting-author-feedback The team requires more information from the author label Jul 23, 2024

dotnet-policy-service bot added untriaged The team needs to look at this issue in the next triage and removed 📭 waiting-author-feedback The team requires more information from the author labels Jul 23, 2024

JeremyKuhne modified the milestones: .NET 9.0, Future Jul 24, 2024

JeremyKuhne removed the untriaged The team needs to look at this issue in the next triage label Jul 24, 2024
