Could Not Expand Globs - Context Cancelled #479
Comments
@bom-d-van Any chance you could shed some light on this? :) |
it's most likely hitting a timeout. how many whisper files do you have for the server? are you able to |
@bom-d-van It's not exhausting IOps, CPU or RAM. There are 30+ TB of whisper files, so a lot! Any suggestions for improving performance? |
@gkramer the problem with the no-index-at-all approach is that you rely on the OS to be able to give you the list of files fast. However, depending on your filesystem and other factors That is also the reason why @bom-d-van asked to try to measure how fast |
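One rough way to measure how quickly the OS can list the whisper tree is to time a plain directory walk. This is only a sketch: the function name and the data path are assumptions for illustration, not anything shipped with go-carbon, and a warm page cache will make repeated runs look faster than a cold one.

```python
import os
import time


def time_whisper_listing(root):
    """Walk `root` and return (number of .wsp files, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        count += sum(1 for name in filenames if name.endswith(".wsp"))
    return count, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical path; replace it with your own whisper data directory.
    files, seconds = time_whisper_listing("/var/lib/graphite/whisper")
    print(f"{files} whisper files listed in {seconds:.2f}s")
```

If the walk alone takes longer than the configured request timeout, un-indexed glob expansion is unlikely to finish in time either.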
@Civil Thank you for the response. So the most efficient approach is to add more RAM [in order to enable indexing] to the system? (I find myself wondering about the impact of using swap, if only to validate the indexing approach temporarily - this is the primary service running on the box) side points:
|
@Civil @bom-d-van The problem is back, and indexing is turned on - both! Any ideas? Should I look to upgrade the disks to host-based SSD? It wouldn't appear as if this is the bottleneck, as IOps never spike above 4k, and the allocation is 8k. I'm also seeing upstream issues with carbonapi, but I'd prefer to resolve the go-carbon issues first. Do you know if the developers have stress-tested the daemon for parallel queries - as in aggregating 50+ wsp files? I'd really like to get this issue resolved and behind me :) Thanks for all your advice and insight to date! |
hi, can you share the config for go-carbon for us to have more context?
50+ metrics should not be a problem in my experience. But usually the servers that I have seen running go-carbon have 96-128GB of memory and 16+ CPU cores, so it feels like you might have to either scale out or scale up. What are the timeouts below that you have for carbonserver?
|
go-carbon.conf
carbonapi.conf
|
try the configuration below to see if it helps. Maybe you can also generate some Go profiling results like cpu, heap, and goroutines if you are still seeing problems:
that said, the chance is high that you might need to scale the servers for your load. |
I think the key to understanding what the bottleneck is in your case is gathering metrics and some debug information from your server. For example, if you suspect that it's I/O (indeed, whisper doesn't like slow I/O and thrives on SSDs; that's unfortunately by design), I would collect some basic ones, like I/O performance, iowait time, etc. Check the CPU usage with a breakdown by User, I/O and Sys of course; that would also give you some clues. And in case Sys usage is somewhat high, you can try to record a perf dump and maybe look at the flamegraphs based on it (https://www.brendangregg.com/flamegraphs.html). Doing so should show you where the problem is. For example, if you see high I/O and in perf the top time-consuming syscalls are related to disk (stat*, read*, write*, ...), then that would likely be your bottleneck. |
@bom-d-van I've run queries against the box whilst running iostats, and iowait barely ticks above 7%. carbonapi also seems to flatline. I see intermittent success in graph creation, but don't fully understand why it works at some times and not at others. I've also verified FD and filemax limits, and all seem fine. I'd be happy to scale the box, but I'm having trouble identifying what's under sufficient load to justify the work... |
Yeah, that sounds strange. Can you try answering the following questions and share the results if possible? As Vladimir shared above, it's more of a generic debugging or root-cause-finding process.
Off-topic: We should create some sharable/open-source-able Grafana (and friends) dashboard formats for common metrics exported by different systems and tools. This way we don't have to re-create dashboards for every system, and we can all speak the same language and see and talk about the same thing. |
It seems that you are only trying to read one metric in this API call, and it fails immediately
|
@bom-d-van I think important part here is that it's But that potentially could be that request was stuck in OS TCP Socket backlog for too long (for example) and by the time it can be accepted by go-carbon it was already too late. Based on that I would suggest to try increasing the backlog. That would be And you can check the backlog for the socket by using If that won't help - it would be important to understand where timeout actually came from and why. Key might be to find in |
@Civil these timeouts are for initiating http.Server here. Based on my understanding, it should be a pure Go std library/runtime thing. Is there any TCP magic by which the kernel can tell user space that a request should be timed out?
@Civil yep, I think the chance is low that it's a data/whisper file issue, but it's good to confirm it. @gkramer you can also try using bpftrace or just hacking go-carbon from this place where the error is reported, just to see why the request is failing. Just to double check, is it always the same request failing, or different requests failing at different times? |
@bom-d-van it depends on what you do: you can pass a deadline from upstream and reuse it for a specific request, for example. So it can be somewhat implied. Also, nothing in the code forbids you from redefining the timeout. As I've said, there is some chance that if you run out of the backlog kernel buffer (see sysctl above), you might have too many requests enqueued for too long. That might mean that by the time Go has a chance to process the connection, the deadline has already passed and the connection was closed by upstream, which would immediately result in the context being canceled. Increasing the backlog won't fix the underlying slowness but might reduce the number of timeouts. That is a runtime sysctl, so increasing the value should be a relatively safe and easy test. Oh, and because of that you might have some errors in the logs that are just red herrings and steer you away from the actual problematic queries. |
Just to explore the idea of stuck TCP connections, @gkramer can you also try to find and share the runtime/latency/timeout for the failed requests in the carbonapi log? If it's indeed caused by TCP being stuck, we should be able to see the runtime being above 120s in your example. In Go net/http, a read timeout would start counting when it tries to read the header. What's more, in your carbonapi config, you have |
@bom-d-van if they are stuck - it can be that runtime will be 0.0 from go-carbon's point of view, but it will be However the problem is that if some of the concurrency settings in carbonapi are misconfigured, the effect will be the same, but they'll be stuck waiting for an available slot inside carbonapi's code. So it would be relatively hard to distinguish, unless you check |
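The queueing effect described above can be illustrated with a toy model. This is a simplified simulation under stated assumptions (a single FIFO queue, a fixed service time, deadlines set by the upstream caller); `drain_backlog` is a hypothetical name and none of this is go-carbon's or the kernel's actual code.

```python
from collections import deque


def drain_backlog(requests, service_time, now=0.0):
    """Process queued requests in FIFO order.

    `requests` is a list of (name, deadline) pairs, where `deadline` is the
    absolute time at which the upstream caller gives up. Returns a list of
    (name, outcome) pairs.
    """
    queue = deque(requests)
    results = []
    while queue:
        name, deadline = queue.popleft()
        if now >= deadline:
            # The caller already gave up while the request sat in the queue,
            # so the server sees a canceled context almost immediately.
            results.append((name, "context canceled"))
            continue
        now += service_time  # time spent actually serving this request
        results.append((name, "ok"))
    return results
```

In this model the failing request did no work of its own; it only waited behind slower ones. That would be consistent with a near-zero `runtime_seconds` on the failing entry.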
OK, so apologies for merging information from go-graphite and carbonapi, but I'm going to put everything here: Go-Graphite:
CarbonApi:
GrafanaFE:
Carbon-relay-ng (CRNG):
Re: General:
|
So I cranked up the timeline to 30 days, and that totally annihilated the machine - Iowait shoots up to 70%, RES/VIRT mem shoots past 120GB+, and the kernel then terminates go-graphite. So... brings me back to my original question:
|
@Civil @bom-d-van It does seem to me that go-graphite could benefit from optimisations re RAM and indexing...
Interested to hear your thinking, and greatly value all your assistance to date!!! |
Have you tried my recommendation config here: #479 (comment)
It's not really about how big your whisper files are in total. It's more about how many unique metrics you have for the server or the system. And it seems you haven't figured it out yet? There are metrics reported by go-carbon ( For general scaling question:
I'm not certain about this. You might have to benchmark it for your production load because whisper schemas and write loads vary.
I usually look at the
For this, we would appreciate it if you could file a bug report in the carbonapi repo. As for the logs, they're not very helpful because you didn't retrieve the ones that are connected using the
|
one more small tip: using the code-quoting markdown syntax to format your logs, config files, and code would make the comment easier to read. (I have tweaked your comment above). |
Yes, this was enabled immediately after your suggestion. It doesn't seem to have made a difference.
[2022-08-07T09:47:48.319Z] INFO [stat] collect {"endpoint": "local", "metric": "carbon.agents.stats01.carbonserver.metrics_known", "value": 29247434} Find: 29257864; yes, number of metrics are growing daily. This is the first time find has completed successfully in a while.
... which is above my threshold of 1m. What is the appropriate thing to do in this regard... increase cache (seems unreasonable considering how much RAM is consumed), reduce cache size (I assume this is to the detriment of performance, but will it stabilise memory consumption?), split out instance into multiple systems to better utilise zipper - and if this is the case, how should I calculate the number of instances -- or put differently, how many metrics per instance should I be aiming for? |
Apologies, I'll aim to properly brace logs in future. |
Yep, your server is certainly under heavy load as it's already dropping data based on the
Have you considered removing obsolete metrics if they no longer receive updates? Booking.com production would remove metrics that aren't updated for 3-30 days. It's fairly easy to achieve with a
With 128GB of RAM, almost 30m metrics per instance, and the timeout issue that you are having, in my experience you would have to scale out or consider reducing the load, by removing old/stale metrics, producing fewer new metrics, or both.
It's hard for us to give you a definitive number; you would have to experiment based on your production load. As a simple start, if you are running just one server, maybe consider making it 2 or 3 servers. I would recommend you seek inspiration in the Google SRE books. You should create some relatively sophisticated Grafana dashboards using the go-carbon metrics and system metrics, so that you know what your server looks like now and what it looks like after expansion. Also, from this query example, it seems you are producing metrics with UUIDs or similar. This would certainly generate lots of metrics, and if it's something like a k8s pod id, then after the pod is removed, the metrics would remain. If that's the case, you would certainly need to manually remove the obsolete metrics after some time.
|
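The stale-metric cleanup suggested above (dropping whisper files that have not been updated for some number of days) could be sketched like this. It assumes that file modification time is a fair proxy for "last update received"; the function name is illustrative. It only lists candidates and deletes nothing, so the result can be reviewed first.

```python
import os
import time


def find_stale_whisper_files(root, max_age_days, now=None):
    """Return sorted paths of .wsp files under `root` whose mtime is older
    than `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `find_stale_whisper_files(root, 30)` on the data directory gives a review list before anything is removed.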
@bom-d-van We are absolutely removing old metrics, but not quite as aggressively as you mention... i.e. after months, as opposed to <1M. We're currently in discussion with the team regarding retaining 30 days max of metrics, which will help in a material way, but I suspect that we'll still see issues due to the number of metrics - albeit not the depth. I'm also trying to motivate for splitting out the GG daemon on a per service/k8s basis, but I don't want to arrive at the same point we're at now in N months - I suspect that taking this route without having a better feel for what to expect out of the daemon (per IOps/GB RAM/GHz) up front may cause problems down the line. |
You can consider enabling the quota sub-system in go-carbon to produce per-namespace usage metrics: #420 This was the way that we proposed for my ex-employer to achieve multi-tenancy. With the usage and quota metrics, you can know and control how many resources a prefix/namespace/pattern consumes. And when a namespace grows too big, you can relocate it to its own dedicated go-carbon instances or cluster. However, the quota sub-system itself also produces something like 16 metrics per namespace, so it's not free itself. It's a good idea to have a dedicated instance/cluster to save go-carbon metrics. That said, it's probably better to try it out after you have resolved the scaling challenges for your instances.
It's a never-ending struggle if your company continues growing. That's why we got paid. ;) Also it's a common SRE/devops practice to have capacity predictions from time to time and expand or shrink the cluster, or throttle and reduce the usage.
For whisper-based Graphite storage systems, the capacity limits vary with schemas and loads; for example, a minutely metric is much less expensive than a secondly one. But you can use the |
also, this value is relatively low, and it might be too easy to saturate the cache (ingestion queue), which leads to data loss on ingestion. You might want to consider going all the way to 200m or more for a server with 128GB of RAM. |
@bom-d-van @Civil I've since done a major cleanup, and we've seen significant improvements in performance. Some of the steps taken:
We've since seen 'carbonserver.metrics_known' fall from > 29.9M to 4M. I've also bumped max-size to 50m for now, and will keep an eye on cache size over the next week. [An architecture built around business needs, rather than nice-to-have seems to have significantly simplified the project!] Will keep you guys updated, but we're now in a far better position thanks to all your help. Thank you both! |
Error in Log:
[2022-07-18T20:59:06.441Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=protobuf&from=1658156341&target=MyTarget%24JmxTimer.update_session_exception_none.999thPercentile&until=1658177941", "peer": "10.128.27.189:50684", "carbonapi_uuid": "e5737414-2d9a-4d85-b89c-a253b13380dc", "format": "carbonapi_v2_pb", "targets": ["MyTarget$JmxTimer.update_session_exception_none.999thPercentile"], "runtime_seconds": 0.000098906, "reason": "failed to read data", "http_code": 400, "error": "could not expand globs - context canceled"}
I'm also seeing 'find failed' with the following reasons:
"reason": "Internal error while processing request", "error": "could not expand globs - context canceled", "http_code": 500
"reason": "Internal error while processing request", "http_code": 500
It should be noted that some of these queries are trying to merge 80+ graphs at runtime, which may be contributing to the issue.
Please also note that we've turned off indexing, as we have >30TB of whisper data which results in enormous amounts of RAM utilisation.
Any assistance in resolving these issues would be really appreciated!
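To see which failures dominate, the log entries quoted above can be tallied by HTTP code and error string. This is a minimal sketch that assumes each log line ends with a single JSON object, as in the `fetch failed` entry above; the helper names are illustrative and not part of go-carbon.

```python
import json
from collections import Counter


def parse_log_fields(line):
    """Return the trailing JSON object of a log line as a dict, or None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def tally_errors(lines):
    """Count log entries by (http_code, error) for lines with an error field."""
    counts = Counter()
    for line in lines:
        fields = parse_log_fields(line)
        if fields and "error" in fields:
            counts[(fields.get("http_code"), fields["error"])] += 1
    return counts
```

Comparing the `runtime_seconds` values of the failing entries in the same pass can help separate requests that failed immediately from ones that ran into a real timeout.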