With solo5, firewall runs out of memory after a while and can't recover #120
I managed to get memtrace (and memtrace_viewer) to work with MirageOS unikernels -- take a look at https://github.com/hannesm/memtrace-mirage; this may be useful for debugging this further (being able to see the OCaml view of which call stacks refer to how much memory) -- see https://blog.janestreet.com/finding-memory-leaks-with-memtrace/ for a great description of memtrace_viewer.
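Getting a trace out of a unikernel is close to upstream memtrace usage. A minimal sketch, assuming the stock file-based Memtrace API (memtrace-mirage's actual entry point differs, since a unikernel has no filesystem and ships the trace over the network instead; the context label and filename here are hypothetical):

```ocaml
let () =
  (* Sample roughly 1 in 10,000 allocations; each sampled allocation's
     call stack is recorded for memtrace_viewer to aggregate later. *)
  let _tracer =
    Memtrace.start_tracing
      ~context:(Some "qubes-mirage-firewall")  (* free-form label, hypothetical *)
      ~sampling_rate:1e-4
      ~filename:"firewall.ctf"
  in
  ()
```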
My current thinking is that […]. The second solution offered by @mato in #116 (comment) was to use […]. The easiest workaround may be to add the […].
It seems to fail very suddenly. Here's an example (I got it to print GC stats and heap stats once a minute; the unikernel had been running fine for several days before this):
So, it had 115 MB free. Then it tried to increase the OCaml heap by 1.5 MB. Then there was only 12 MB free?? When this happened, I was uploading a 30K file to gmail. When it happens, it always seems to have just under 10% free.
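A once-a-minute reporter of the kind described could look like this (a minimal Lwt sketch; `Gc.stat` is the standard library, while the `OS.Time` and `Duration` names are assumptions about the unikernel's environment):

```ocaml
open Lwt.Infix

let rec report_memory () =
  let g = Gc.stat () in
  (* The OCaml runtime's view of its own heap, in words. *)
  Logs.info (fun f ->
      f "GC: heap=%d words, free=%d words, top=%d words"
        g.Gc.heap_words g.Gc.free_words g.Gc.top_heap_words);
  OS.Time.sleep_ns (Duration.of_sec 60) >>= report_memory
```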
If you give it, say, 512MB of memory, and sample meminfo every 30s or so (both the raw numbers from GC and OS.MM), what does that look like over time? And, is there an amount of memory with which you reach a plateau and don't run out?
@mato: that's roughly what I'm doing already (except sampling every 60s rather than every 30s). I'd normally run the firewall with 32MB, but here I'm testing with 128MB, which should be way more than it needs. Memory use looks completely flat until it fails. I only showed the last two good reports before failure, but it continued like that for days. Note the reports above, where it went from 115 MB free to 12 MB free (and […]
Ah, so, it suffers from "sudden death" after working fine for days. That's very strange. No ideas off the top of my head, I'll think about it.
Between […], I'm not sure I understand correctly how and when the footprint value is updated, but it looks like we might have to expect success in this test: https://github.com/mirage/ocaml-freestanding/blob/master/nolibc/dlmalloc.i#L4775. This bug is difficult to trigger because we may have to wait a long time for it. I'm currently trying to evaluate the free memory available with […].
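The OCaml-side check being experimented with is presumably along these lines (a sketch; the `OS.Memory` module path and stat field names follow mirage-xen from memory and should be treated as assumptions):

```ocaml
(* Free fraction of the OCaml heap as the OS layer reports it. *)
let fraction_free stats =
  float_of_int stats.OS.Memory.free_words
  /. float_of_int stats.OS.Memory.heap_words

let () =
  let free = 100. *. fraction_free (OS.Memory.quick_stat ()) in
  Printf.printf "free: %.2f%%\n" free
```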
Update:
This morning I got the following in the unikernel logs:
So I still don't know what causes the bug, but when it appears the kernel runs at 100% CPU for 30 seconds and then goes back to normal.
After the previous […], I'm now trying to use […].
Update: Unfortunately, using […] I actually have the same memory reporting with […] and with […].
bit of a strange observation here. for the last 5 days, i have been running a mirage-fw with the following particulars there:
this fw has been stable for 5d now, and grepping the log there has been zero movement in the mem stats, they have all been "INF [memory_pressure] Writing meminfo: free 987MiB / 996MiB (99.16 %)". for comparison, another instance (no mem-reporting, started as dom0-supplied kernel, 64MB total) would reliably hit the OOM wall within very few hours. not fully sure what that means yet, some options include "it didnt see any peaks at all", "the peaks are too short to be seen with 10min reporting, and stay well below 1GB", or "the OOM doesnt happen when started through multiboot". going to crosstest with a different load profile // usecase and a different mem setting and/or bootpath in the backup role next.
i can rule out "something different about the used package versions" (caused by pinning solo5 to dev for multiboot or building outside docker in general): still no OOM when multibooted, and using it in "all roles" now. one unrelated looking crash:
no idea what that was about or what triggered it.
Any developments on this?
From #120 (comment) I hear: […]

Any updates on this @xaki23? The Netif error "Failure 'Netif: ERROR response'" is rather poor error handling; maybe @talex5 or @djs55 or @avsm have an idea what a good error handler should be? Maybe the netif should be removed? (Failing/exiting looks like the wrong strategy.)
"Netif: ERROR response" only appears in frontend.ml, so this means the failed connection is the one to sys-net, so recovery probably isn't possible. Though perhaps it could try reinitialising the interface. |
Here are the results of my investigation. Under the following conditions:
I managed to get the following results:
Here are the last logs of the firewall:
I managed to get these results both with the unikernel as a VM-kernel firewall and with the grub multiboot version (QubesOS/qubes-issues#6162). It occurs most of the time within 10 minutes when used as a VM-kernel; it takes a bit longer with the grub multiboot version. Could it be that CPU overload triggers some memory-balancing activity that disturbs the firewall?
Update: here is some news regarding this issue. #121 misled us because it seemed that the memory suddenly dropped below 10%, suggesting the unikernel was suffering from a sudden-death error. In fact, the stats value is simply not updated before every meminfo print.
We can see that the free memory slowly decreases until it reaches the 10% threshold (no sudden-death issue involved). In the following log, the free memory decreases (a line is printed every 2 seconds) while network traffic occurs:
The current footprint function used to estimate the free memory comes from #116 (comment), because […]. We have therefore tried to implement a replacement for this […]. The following log shows 3 different things:
In the previous log we can see that the Stat_slow (mallinfo) and Stat_alt (alternate used memory) values are almost the same, while the footprint value drops quickly. So far I'm still facing an issue: at some point we overestimate the free memory (or rather, we underestimate the used memory). I must be missing an allocation in dlmalloc or counting some frees twice:
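To make the alternate counter concrete, here is a minimal OCaml model of the idea (the real implementation lives in nolibc's C code; this is only an illustration): rather than trusting dlmalloc's footprint, keep a running total that every allocation and free adjusts in O(1), instead of mallinfo's slow walk over all heap chunks.

```ocaml
module Alt_stat = struct
  let used = ref 0
  (* Called from the malloc path with the chunk size actually reserved. *)
  let note_alloc bytes = used := !used + bytes
  (* Called from the free path; must be given the same chunk size back. *)
  let note_free bytes = used := !used - bytes
  let current () = !used
end
```

A missed allocation path (or a free counted twice) shows up exactly as described above: the counter drifts low and the free memory looks better than it really is.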
Despite that last issue, I think this track is promising. Any thoughts on it?
Nice work - and sorry about the misleading PR! I don't have a QubesOS machine any longer, but has anyone tried @hannesm's memory profiling mentioned in #120 (comment)? That will probably show what's taking up the space.
I can also confirm that this estimate is pretty good: I get the same free-memory percentage as mallinfo to two decimal places (an error of at most 6kB). I could also observe a moment when the fast estimate was too optimistic compared to the real free-memory value, but it came back to a good estimate within 20 seconds:
I will try to monitor this alternative quick stat function on a simpler unikernel. With that I guess I will be able to use the memstat profiler.
Update (with good news):
And add the following modification into […] (a sketch follows at the end of this comment):
It will probably need some tuning around the 60% and 10% values, but the idea is that we now have the real memory used; even if we're only a little below 60% free, the heap may be like gruyère (fragmented) and OCaml will have trouble getting blocks large enough for some data allocations. When I tried to be too eager with the first fraction, I ran into an issue where the heap grows really high (to 90+% of the heap+stack area) and triggers a page fault ([…]). Using the same protocol as @winux138, and with this memory-pressure policy change, I never get down to 25% free memory with […].
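A sketch of the two-threshold policy described: below 60% free, run a full major collection and trim; only report memory-critical if that still leaves under 10% free. Module and function names follow memory_pressure.ml by assumption, and `fraction_free` is the helper from the sketch earlier in this thread:

```ocaml
let status () =
  let stats = OS.Memory.quick_stat () in
  if fraction_free stats > 0.6 then `Ok
  else begin
    (* Compact the OCaml heap and hand unused pages back to the
       allocator before concluding that memory is really low. *)
    Gc.full_major ();
    OS.Memory.trim ();
    let stats = OS.Memory.quick_stat () in
    if fraction_free stats < 0.1 then `Memory_critical else `Ok
  end
```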
This is excellent news, thanks for your hard work on it @palainp. Could you open PRs for the projects involved? Then we can review, merge, and release :)
This unifies the three approaches to the memory map:
- multiboot
- start_info
- E820

Since the hypercall and E820 map work in both settings, cut the code complexity and always use the XENMEM_memory_map. See QubesOS/qubes-issues#6162 and mirage/qubes-mirage-firewall#120
First reported by @palainp in #116 (comment). I see this too now, after a few days.