
Why is dev.library.kiwix.org regularly extremely slow? #194

Open · kelson42 opened this issue on May 16, 2024 · 17 comments
Labels: question (Further information is requested)

Comments

@kelson42 (Contributor)

I guess the hardware is at its limit, but which services are mostly responsible for that?

kelson42 added the question label on May 16, 2024
@rgaudin (Member) commented May 16, 2024

It's not (entirely) a hardware issue; it's kiwix-serve crashing frequently. We chose to expose it directly to be aware of those things, but it seems that the higher number of test ZIMs has increased the instability.

I'd suggest we assess the need for each ZIM there and remove, or move elsewhere, those that don't need to be there (anymore).
This would greatly simplify any investigation.

We already have a ticket on libkiwix about those crashes.

If we want to rely on the dev library, kiwix-serve should not be exposed directly.

@benoit74 (Collaborator)

I don't think it is a hardware limitation either: library.kiwix.org is running on the same machine and is not experiencing much slowdown.

The dev library is currently serving 897 ZIMs, which should not be a concern (at least, I would expect kiwix-serve to be able to handle this number of ZIMs when run anywhere in the wild).

So while we all agree we could probably prune most of the ZIMs in this dev library, I don't consider this the right approach yet.

The current situation is rather a good opportunity to learn what is going wrong.

This is the memory consumption of kiwix-serve for the dev library (timezone is UTC):

[graph: kiwix-serve memory usage for the dev library]

As you see, it restarts many times per day. Some of these restarts (e.g. at 4am UTC this morning) are linked to a rolling update due to a new image being available (we use the nightly build, which is obviously rebuilt quite a lot), hence the short period of doubled RAM usage (Kubernetes starts the new, updated container before stopping the old one).

What is interesting to notice is that it seems to restart every time we get close to 1 GB of RAM ... which is the memory limit we've assigned to this container in Kubernetes. It does not look like an OOM kill, however: I do not find the usual logs stating this event. This is nevertheless a very significant difference from prod, which does not have any limit on memory consumption.
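
For reference, here is a minimal sketch (the namespace and label selector are hypothetical placeholders) of how one could check with the Kubernetes Python client whether the last container termination was actually reported as an OOM kill:

```python
# Minimal sketch, assuming cluster access and the official "kubernetes" Python client.
# The namespace and label selector below are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("zim", label_selector="app=library-dev")
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        last = status.last_state.terminated
        if last is not None:
            # A reason of "OOMKilled" would confirm the container was killed at its memory limit
            print(pod.metadata.name, status.name, last.reason, last.exit_code, last.finished_at)
```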

As an experiment, I've increased the memory limit to 1.5 GB, so that we can confirm whether there is a correlation between the memory consumption/limit and the service restarts.

Another aspect to keep in mind, as already stated in #147, is that we have a number of levers at our disposal to customize kiwix-serve behavior and control its memory consumption. None of them has been customized for dev.library. I still consider that doing small experiments with these values would greatly help us understand and properly tune kiwix-serve behavior.

@benoit74 (Collaborator)

I've pushed the memory graph to a dashboard dedicated to the dev library. I hope we will add more metrics to this dashboard over time.

https://kiwixorg.grafana.net/d/fdlyk9cwqr8xsb/dev-library?orgId=1

@rgaudin (Member) commented May 17, 2024

So while we all agree we could probably prune most of the ZIMs in this dev library, I don't consider this the right approach yet.

My suggestion is linked to kiwix/libkiwix#760. We believe that some incorrect (how?) ZIMs trigger crashes. Since we did not remove ZIMs from there, the culprits from that time are probably still present.
I am curious to know whether removing them would reduce the number of crashes/restarts.

We want to investigate those crashes, but it's unrealistic: the library is huge, we have no idea which ZIMs cause issues, and the kiwix-serve logs are unusable, because of their formatting, because there is multi-user traffic at all times, and because it doesn't log properly when this happens.

The periodic restarts might be RAM-related; we'll see if the graph repeats, but around 1.5 GB 👍

@benoit74 (Collaborator)

Increasing the available memory might then help with the restarts, but make the situation worse regarding crashes ^^

If we confirm we still have crashes, I would suggest simply trashing most of the dev library in a one-shot manual action (a sketch of the move step follows the list):

  • I list the ZIMs present today
  • we pin the few ZIMs which are necessary to keep
  • we move everything else to a quarantine zone for 3 months, in case we realize we forgot to pin some valuable ZIMs
  • 3 months later, we delete the quarantine zone
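
For illustration, a minimal sketch of the "move to quarantine" step, assuming the dev ZIMs sit in a flat directory on the storage server (all paths and the pinned list below are hypothetical):

```python
# Minimal sketch of the quarantine move; paths and the pinned set are hypothetical.
from pathlib import Path
import shutil

LIBRARY_DIR = Path("/data/zim/dev")                # hypothetical location of the dev library ZIMs
QUARANTINE_DIR = Path("/data/zim/dev-quarantine")  # hypothetical quarantine zone
PINNED = {"tests_en_sample.zim"}                   # hypothetical set of ZIMs to keep

QUARANTINE_DIR.mkdir(parents=True, exist_ok=True)
for zim in sorted(LIBRARY_DIR.glob("*.zim")):
    if zim.name in PINNED:
        continue
    # shutil.move works across filesystems, unlike Path.rename
    shutil.move(str(zim), str(QUARANTINE_DIR / zim.name))
    print(f"quarantined {zim.name}")
```

Deleting the quarantine zone 3 months later is then a single manual removal, once we are sure nothing valuable was left behind.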

@kelson42 (Contributor, Author) commented May 17, 2024

The current situation is rather a good opportunity to learn what is going wrong.

I really agree with this. What approach could we take to better identify reproduction steps for the crash scenarios?

To me, there is a good chance we have a problem around Kiwix server management; see also kiwix/libkiwix#1025.

@benoit74 (Collaborator)

The experiment's conclusion seems quite clear: when we add more RAM, the dev server restarts way less often.

[graph: kiwix-serve memory usage after the limit increase]

I increased the allocated RAM even further, to 2.5 GB, which seems to be sufficient for 24 hours of activity (the dev server always restarts at 4am UTC to apply the nightly build). I'm not saying this is the proper long-term solution, but it might allow us to confirm whether we still suffer from crashes, and when.

What approach could we take to better identify reproduction steps for the crash scenarios?

I don't know

@rgaudin (Member) commented May 21, 2024

Fortunately, there's a lot of RAM to spare on the storage server.

@benoit74 (Collaborator)

Fortunately, there's a lot of RAM to spare on the storage server.

Yep, and I'm quite sure I will soon start to experiment with kiwix-serve environment variables to reduce this RAM usage to a way more sustainable level 🤓

@rgaudin (Member) commented May 21, 2024

Yes, as discussed separately; it's really important that those switches are properly documented so we can also leverage them on the Hotspot.

@kelson42 (Contributor, Author)

See also #170

@kelson42 (Contributor, Author)

@rgaudin @benoit74 I believe we might run a performance-push task force around kiwix-serve to tackle these kinds of problems. It might actually be a hackathon topic.

@rgaudin (Member) commented May 27, 2024

[screenshot taken 2024-05-27 at 08:08:13]

Twice this week, the GH action that runs at 8am UTC failed: on May 25th and on May 27th. In both cases I get a Read timed out (5s) on the test, but the service is running, has not restarted, and is not close to the RAM limit on the graph.
Testing on some random ZIM/content soon after one failure worked OK. Maybe some requests from the tests (all are catalog-related) are difficult to answer within 5s under certain circumstances…
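
For context, a rough sketch of reproducing such a request outside of the GH action, with the same 5-second read timeout (the exact endpoint and parameters queried by the test are assumptions; /catalog/v2/entries is the OPDS catalog endpoint served by kiwix-serve):

```python
# Minimal sketch: time a catalog request against the dev library with a 5s read timeout.
# The endpoint and query parameters are assumptions; adjust to match the actual test.
import time
import requests

URL = "https://dev.library.kiwix.org/catalog/v2/entries"

start = time.monotonic()
try:
    resp = requests.get(URL, params={"count": 50}, timeout=(5, 5))  # (connect, read) timeouts
    print(f"{resp.status_code} in {time.monotonic() - start:.2f}s, {len(resp.content)} bytes")
except requests.exceptions.ReadTimeout:
    print(f"read timed out after {time.monotonic() - start:.2f}s")
```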

@benoit74 (Collaborator)

As discussed yesterday, we all agree it is now time to move to a plan B, but this plan is still unclear.

From my perspective, the experience with dev.library.kiwix.org is way better than before the RAM increase, but it is still not satisfactory, i.e. there are still some slowdowns.

After some thought, I wonder whether these slowdowns are not simply linked to I/O issues on the disk. On production, these issues could be hidden by the Varnish cache, which is expected to be especially efficient on the catalog and hence wouldn't trigger problems in the 8am UTC tests.

How easy would it be to implement a Varnish cache in front of dev.library.kiwix.org as well? It looks pretty straightforward to me, and even if it is clearly a flight forward (treating the symptom rather than the cause), it would help confirm that the problem is most probably I/O related.
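
As a cheap data point before adding any cache, one could compare a cold request with an immediate repeat of the same request; if the repeat is much faster, per-request disk I/O (which a front cache such as Varnish would hide after the first hit) becomes a more plausible culprit. A minimal sketch, using the catalog root as a stand-in URL:

```python
# Minimal sketch: compare a cold request with an immediate repeat of the same URL.
# The URL is a stand-in; any content or catalog path on the dev library would do.
import time
import requests

URL = "https://dev.library.kiwix.org/catalog/v2/root.xml"

def timed_get(url: str) -> float:
    start = time.monotonic()
    requests.get(url, timeout=30)
    return time.monotonic() - start

cold = timed_get(URL)
warm = timed_get(URL)
print(f"cold: {cold:.2f}s, repeat: {warm:.2f}s")
```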

@rgaudin (Member) commented May 28, 2024

it would help confirm that the problem is most probably I/O related.

Absolutely not. It will hide everything but the first request to a resource. It would be a good measure to improve the service for users, but it will not help (on the contrary) with finding the actual cause(s) behind this.

I am still awaiting an update clarifying the role of dev.library. We had a lot of discussions about this when we started it but it seems to have shifted.

Currently this is an internal testing tool:

  • for the ZIM content team to validate their WIP recipes
  • to test the nightly kiwix-serve, via many eyes, in a live scenario.

I understand we are now sending users/clients links to dev.library. That's the role of a staging library.

Do we want prod/staging/dev? Or just prod/staging?

@kelson42 (Contributor, Author)

Currently this is an internal testing tool:

Yes, and we see that it might quickly become challenging to change its scope. Therefore I have opened a dedicated issue to think about our requirements out of the box. See #199.

@rgaudin (Member) commented Jun 11, 2024

Still fails every day (timeout).

[screenshot taken 2024-06-11 at 08:06:27]

Apparently not related to resources (the restart at 04:00 is the usual nightly one).

[screenshot taken 2024-06-11 at 08:08:32]
