Slow Loki label dropdown loading on Grafana #554

Open
qqu0127 opened this issue Aug 28, 2024 · 10 comments

@qqu0127

qqu0127 commented Aug 28, 2024

Hi,

Background:
I deployed qryn with a ClickHouse proxy. The ingestion load is about 2 MB/s. In my setup, qryn is added as a Loki data source in Grafana, and that's where we run most of our log queries.

Problem:
Using the Grafana UI, I notice it takes a very long time for the label dropdown options to load. Specifically, the first label search works fine; it seems both the label names and values are pre-fetched or indexed. However, for the second and subsequent label filters it takes a long time just to load the label options. It times out and causes an OOM in the qryn app.
Filtering on a second or further label is basically not working for me. I know I can write the Loki query directly instead of using the dropdowns, but that's simply not a good option.

My findings so far:
I found this query in the qryn log:

WITH sel_a AS (
    select samples.string as string, samples.fingerprint as fingerprint, samples.timestamp_ns as timestamp_ns
    from qryn_cluster.samples_v3_dist as samples
    where ((samples.timestamp_ns between 1724798773369000000 and 1724809573369000000) and (samples.type in (0,0)))
      and (samples.fingerprint IN (
          select sel_1.fingerprint from (
              select fingerprint from qryn_cluster.time_series_gin
              where ((key = 'namespace') and (val = 'my_namespace'))
          ) as sel_1))
    order by timestamp_ns desc
    limit 10
)
select JSONExtractKeysAndValues(time_series.labels, 'String') as labels, sel_a.*
from sel_a
left any join qryn_cluster.time_series_dist AS time_series on sel_a.fingerprint = time_series.fingerprint
order by labels desc, timestamp_ns desc

I'm not a SQL expert, but this query looks inefficient to me. Does anyone have a clue?
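One way to confirm where the time goes is ClickHouse's own query log. A minimal diagnostic sketch, assuming access to system.query_log on the node that served the query (on a cluster this table is per-node); the filter on time_series_gin is only illustrative:

-- Slowest recent queries touching the qryn label tables,
-- with duration, memory and rows read (standard ClickHouse system table).
SELECT
    event_time,
    query_duration_ms,
    formatReadableSize(memory_usage) AS memory,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%time_series_gin%'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;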

@lmangani
Collaborator

lmangani commented Aug 28, 2024

Hello @qqu0127 and thanks for the report. We need further details to assist.

  • How many labels/fingerprints do you have?
  • What version of Grafana are you using?

@qqu0127
Author

qqu0127 commented Aug 28, 2024

Hello @qqu0127 and thanks for the report. We need further details to assist.

  • How many labels/fingerprints do you have?
  • What version of Grafana are you using?

Thanks @lmangani for the reply!
There are about 30 labels; I'm not sure about fingerprints, is there an easy way to get that?
Grafana version is 10.4.2.

@qqu0127
Author

qqu0127 commented Aug 29, 2024

And to add more context, we compared it with Grafana Loki side by side, with the same ingestion load and similar storage. Grafana Loki handles the label filtering well, in a reasonable amount of time. It could be due to the metadata indexing it has.

@lmangani
Collaborator

lmangani commented Aug 29, 2024

@qqu0127 thanks for the detail, but unless you have an incredible amount of data, the expected performance should be the same as or better than Loki in most setups. Perhaps the issue is with ClickHouse. What's the load/performance, and how many resources have you allocated? If you're using a cluster, have you configured the cluster settings in qryn?
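For the load question, a couple of minimal sketches against standard ClickHouse system tables (nothing qryn-specific is assumed) show what is currently running and roughly how much memory the server uses:

-- Queries currently executing, longest-running first
SELECT
    elapsed,
    formatReadableSize(memory_usage) AS memory,
    substring(query, 1, 100) AS query_head
FROM system.processes
ORDER BY elapsed DESC;

-- Rough view of server memory (values are mostly bytes)
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE '%Memory%'
ORDER BY metric;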

@akvlad
Collaborator

akvlad commented Aug 29, 2024

Hello @qqu0127
Please run the following requests and share the results in order to find the overall statistics of your setup:

SELECT uniq(fingerprint) from qryn_cluster.time_series_dist

select uniq(fingerprint) as uniq from qryn_cluster.time_series_dist ARRAY JOIN JSONExtractKeys(labels) as label group by label order by uniq desc limit 1;

Thanks in advance.

@qqu0127
Author

qqu0127 commented Aug 30, 2024

Thanks. I think I might know the cause. Some of my labels have a large value cardinality, like "session_id" and "trace_id", which are effectively unbounded. Unfortunately, I can't run the numbers now because the old data was purged before I saw the message. I cleaned up some labels and am testing with that.
The ClickHouse setup shouldn't be a problem; it's a 3-node cluster with modern hardware. Besides, log query speed is fine with LogQL; it's just the label loading.
I suspect that when the label dropdown is clicked, it actually fetches label values from all time instead of within the given time range.
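To put numbers behind the cardinality suspicion, a minimal sketch that ranks label keys by how many distinct values they carry, using the same time_series_gin table and the key/val columns that appear in the logged query above (anything beyond those columns is an assumption about the schema):

-- Label keys ranked by value cardinality; unbounded labels such as
-- session_id or trace_id should surface at the top.
SELECT key, uniq(val) AS value_cardinality
FROM qryn_cluster.time_series_gin
GROUP BY key
ORDER BY value_cardinality DESC
LIMIT 10;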

@akvlad
Collaborator

akvlad commented Aug 30, 2024

@qqu0127 please share the results of the requests so we can help to fix your case.

The behavior you describe is one of the design assumptions. The minimum time range qryn respects for labels, series and values requests is 1 day. If you request series for different time ranges within the same day, the result will be the same.

@akvlad
Collaborator

akvlad commented Aug 30, 2024

@qqu0127 after a set of improvements and benchmarks I can draw some conclusions.

GZIP compression of the output:

The main bottleneck is the gzip compressor.
After a set of upgrades, the resulting gzipping speed reaches 2 MB/sec/core.

A request for 10M series (1.2 GB of response body):
Without GZIP:

$ time curl -vv 'http://localhost:3100/loki/api/v1/series?match%5B%5D=%7Bsender%3D%22logtest%22%7D&start=1725020556910000000&end=1725024156910000000' >/dev/null

100 1210M    0 1210M    0     0  70.7M      0 --:--:--  0:00:17 --:--:-- 74.7M
* Connection #0 to host localhost left intact

real	0m17,104s
user	0m1,170s
sys	0m1,041s

With GZIP:

$ time curl -vv -H 'Accept-Encoding: gzip, deflate, br, zstd' 'http://localhost:3100/loki/api/v1/series?match%5B%5D=%7Bsender%3D%22logtest%22%7D&start=1725020556910000000&end=1725024156910000000' >/dev/null

100 62.6M    0 62.6M    0     0  2000k      0 --:--:--  0:00:32 --:--:-- 2023k
* Connection #0 to host localhost left intact

real	0m32,134s
user	0m0,062s
sys	0m0,150s

Grafana restrictions

After a set of updates, qryn no longer fails with an OOM on a request for 1.2 GB of series.
1.2 GB of series makes the Grafana UI hang for several seconds.
Then the visible output is limited to 10,000 label values.

So I believe it's not worth having more than 10-20K series, as they will not be reviewable anyway.

Qryn series hard limit

Since v3.2.30, qryn has an env var configuration ADVANCED_SERIES_REQUEST_LIMIT.
It is a hard limit on the number of series in the /series response. Feel free to use it if the cardinality gets out of control.

Again, I don't see any reason to set it to more than 20,000 series, as the Grafana UI doesn't show more than that anyway.

Feel free to download 3.2.30 and try the upgrades and the new functionality.

@qqu0127
Author

qqu0127 commented Sep 4, 2024

Hello @qqu0127 Please run the following requests and share the results in order to find the overall statistics of your setup:

SELECT uniq(fingerprint) from qryn_cluster.time_series_dist

select uniq(fingerprint) as uniq from qryn_cluster.time_series_dist ARRAY JOIN JSONExtractKeys(labels) as label group by label order by uniq desc limit 1;

Thanks in advance.

Hi @akvlad, sorry it took me some time to get back. As I said, the previous data that caused the issue is gone. I ran the queries on my current cluster anyway; the results are:

SELECT uniq(fingerprint) from qryn_cluster.time_series_dist
1307

select uniq(fingerprint) as uniq from qryn_cluster.time_series_dist ARRAY JOIN JSONExtractKeys(labels) as label group by label order by uniq desc limit 1;
1306

The latency and OOM issue is resolved by removing some high-cardinality labels.

@lmangani
Collaborator

lmangani commented Sep 4, 2024

Thanks for the update @qqu0127

The latency and OOM issue is resolved by removing some high-cardinality labels.

We definitely want to make this work with or without high cardinality, so if you get a chance to try again, let us know.
