
Create new metrics earlier #888

Open
piotr1212 wants to merge 6 commits into master

Conversation

piotr1212
Member

In large Graphite installations the queues can get really long; it can take
an hour for Graphite to write all metrics in the queue. New db files are only
created when the metric is written, which can take too long. This separates the
creation of metrics from writing data to them and moves the creation to an earlier moment.

Whenever a new metric is received, its name is pushed to a new_metric list. The
first step in the writer loop is to check whether any new metrics have been
received and create them if they don't exist on disk yet. After the creation the
writer continues as usual with writing metrics from the queue, but it no longer
checks whether the file already exists, so that the check does not happen twice
and impact IO. If the file does not exist at this point, it is logged.

Fixes: graphite-project/graphite-web#629
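
A minimal, self-contained sketch of the idea (illustrative only, not the actual patch; SketchCache, writer_pass and the database object are hypothetical stand-ins for carbon's MetricCache, writer loop and state.database):

class SketchCache(dict):
  def __init__(self):
    super().__init__()
    self.new_metrics = []  # metric names seen for the first time

  def store(self, metric, timestamp, value):
    if metric not in self:  # name not in the cache dict: remember it as new
      self.new_metrics.append(metric)
    self.setdefault(metric, {})[timestamp] = value

def writer_pass(cache, database):
  # 1) create db files for newly seen metrics so graphite-web can find them early
  while cache.new_metrics:
    metric = cache.new_metrics.pop(0)
    if not database.exists(metric):
      database.create(metric)
  # 2) drain and write datapoints as usual, without re-checking exists()
  for metric in list(cache):
    datapoints = sorted(cache.pop(metric).items())
    try:
      database.write(metric, datapoints)
    except FileNotFoundError:
      # only expected when creation was skipped, e.g. rate limited
      print("db file for %s is missing, dropping %d datapoints" % (metric, len(datapoints)))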

@piotr1212
Member Author

Not sure if I missed something, but I think this would fix the issues in graphite-project/graphite-web#629 with a much simpler solution than everything discussed there.

I see I broke Python2, not sure if we still care?

    (metric, datapoints) = cache.drain_metric()
    if metric is None:
      # end the loop
      break
Member

This will only create new files as long as there are also updates to process, is that what we want?

Member Author

Not sure if I am following; new files are created first, on every loop, and the break comes later.

Notice that the diff is kinda screwed, the part with the break is old code.

Contributor

imagine if there is more to get from pop_new_metric() than there is to get from drain_metric(): then the loop would do create-new, drain-old, create-new, drain-old, alternating, and stop when drain-old was done, before all new metrics were handled ...

but that seems impossible, there can't be more new metrics than total cached metrics, right ... but I understand how one might be unsure, when looking at this code in isolation.

Here's another question: is it possible for drain_metric() to return a new metric before pop_new_metric()? Imagine there are two new cached metrics and nothing else, A and B. A is received first, but B gets more datapoints. pop_new_metric() returns A, then drain_metric() returns B (then pop_new_metric returns B on the next iteration). The FileNotFound will be hit below, and B datapoints will be lost, I think, needlessly.

I have an alternate design idea: keep the original logic here in this loop, and instead modify drain_metric() to preferentially return new metrics.

Member Author

> but that seems impossible, there can't be more new metrics than total cached metrics, right ... but I understand how one might be unsure, when looking at this code in isolation.

Yup, that should be impossible.

> Here's another question: is it possible for drain_metric() to return a new metric before pop_new_metric()? Imagine there are two new cached metrics and nothing else, A and B. A is received first, but B gets more datapoints. pop_new_metric() returns A, then drain_metric() returns B (then pop_new_metric returns B on the next iteration). The FileNotFound will be hit below, and B datapoints will be lost, I think, needlessly.

Yes, that is possible when the MAX_CREATES_PER_MINUTE limiter is hit. This is on purpose; currently the metrics get dropped as well, see the old XXX TODO part.
I'm wondering whether it wouldn't be better to drain the new_metrics first before continuing with drain_metric().

> I have an alternate design idea: keep the original logic here in this loop, and instead modify drain_metric() to preferentially return new metrics.

That is exactly what the naive strategy does. That one is not efficient, as it does not write the metric with the most datapoints first, resulting in small writes.

The cache doesn't know whether a metric is really new (as in: not on disk); it can only see that it is not in the cache dict. I didn't want to add IO to the cache thread to check whether it is on disk. No particular reason, it just felt wrong to mix the two. Maybe it would be better?

Contributor

Hmm you're exactly right ... my idea was equivalent to the existing naive strategy ...

That is an intriguing idea, doing the exists() check just before adding to new_metrics (only if the metric is not in the cache). That would give the best signal, but might be too expensive, as you worried.

I don't have any better ideas, maybe your current method will work well in practice :)

@@ -236,6 +237,10 @@ def get_datapoints(self, metric):
"""Return a list of currently cached datapoints sorted by timestamp"""
return sorted(self.get(metric, {}).items(), key=by_timestamp)

def pop_new_metric(self):
# return first seen metric
return self.new_metrics.pop(0)
Contributor

this can raise "IndexError: pop from empty list"
(which does not appear to be handled where this is called)

Member Author

Thanks, it was caught later on and not logged, so it got lost. I changed the code so that exceptions in the writer are logged, and handled this case.
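
For reference, a defensive variant along those lines might look like this (a sketch only, not necessarily the exact commit; two-space indentation as in the carbon code base):

  def pop_new_metric(self):
    # return the first seen new metric, or None if none are pending
    try:
      return self.new_metrics.pop(0)
    except IndexError:
      return None

The writer can then treat None as "nothing left to create" instead of hitting an unhandled IndexError.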

handle indexexception
log exceptions in writer
@ploxiln
Contributor

ploxiln commented Apr 21, 2020

Some more thoughts, hopefully helpful:

Currently (without this pull-request), state.database.exists(metric) is called for each metric as it is removed from the cache by cache.drain_metric(). When more datapoints for the metric are reported, it will be added to the cache again, and the exists() check will run again when it is removed.

If you instead called state.database.exists(metric) for each metric when it is first added to the cache, that would result in the same overall number of exists() checks. Each metric is added, remains in cache for some time, then is removed (and written). Adds and removes are symmetric, in total count. But they would be distributed in time differently (may or may not be more bursty).

Regarding the file-create-rate-limit: with the status-quo (check-on-remove), when the limit is hit, all cached data for the affected metric is lost. With the alternative (create-on-add), you can skip adding the new metric if the file-create-rate-limit is hit, and only lose the last bit of data submitted, and cause it to check again when the next bit of data is submitted. This loses as little data as possible, while honoring the file-create-rate-limit, but doing more exists() checks (maybe a lot more if the limit is hit a lot). I think exists() checks should be relatively cheap.

Finally, if the whisper file is deleted while the metric is already in the cache (by the user or some separate cleanup job), then in the create-on-add design, you lose all the cached data when it later fails to write it out. This seems like a very unlikely corner case.

So I think it could be a good idea to move the exists() check and create attempt, to just when the metric is being added to the cache. Or, instead of just one exists()/create before each drain_metric(), do all available new_metrics in a loop, before moving on to drain_metric(), and if new-file-create fails, remove that metric from the cache so it doesn't collect more before being lost.
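
A rough sketch of that last variant, with the dependencies passed in explicitly (database, create_bucket and create_file are hypothetical stand-ins for carbon's state.database, CREATE_BUCKET and the existing archive-config/create code in the writer):

def create_new_metric_files(cache, database, create_bucket, create_file, log):
  # Create db files for every pending new metric before draining datapoints.
  while cache.new_metrics:
    metric = cache.new_metrics.pop(0)
    if database.exists(metric):
      continue
    if create_bucket and not create_bucket.drain(1):
      # Out of create tokens: forget this metric's cached data now, so it
      # does not keep collecting points that would be dropped later anyway.
      cache.pop(metric, None)
      log("create rate limit hit, dropping cached data for %s" % metric)
      continue
    create_file(metric)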

@piotr1212
Member Author

Yes, basically what I was thinking. Already have a patch to loop through all new_metrics before moving on to drain_metric().

@piotr1212
Member Author

last commit needs testing. I'm quite busy at the moment, don't know when I will get around to it.

@lgtm-com

lgtm-com bot commented Apr 22, 2020

This pull request introduces 1 alert when merging b9dbd1b into 42c1242 - view on LGTM.com

new alerts:

  • 1 for Unused import

    # return self.new_metrics.pop(0)
    #except IndexError:
    #  return None
    return next(self.get_new_metrics())
Contributor

looks like these methods are not used? the writer currently just does:

for new_metric in cache.new_metrics:

@ploxiln
Contributor

ploxiln commented Apr 22, 2020

I have a more minimal version of this idea, with a couple of tweaks, possibly useful:

--- a/lib/carbon/cache.py
+++ b/lib/carbon/cache.py
@@ -189,6 +189,7 @@ class _MetricCache(defaultdict):
   def __init__(self, strategy=None):
     self.lock = threading.Lock()
     self.size = 0
+    self.new_metrics = []
     self.strategy = None
     if strategy:
       self.strategy = strategy(self)
@@ -253,6 +254,8 @@ class _MetricCache(defaultdict):
           log.msg("MetricCache is full: self.size=%d" % self.size)
           events.cacheFull()
         else:
+          if not self[metric]:
+            self.new_metrics.append(metric)
           self.size += 1
           self[metric][timestamp] = value
           if self.strategy:
diff --git a/lib/carbon/writer.py b/lib/carbon/writer.py
index 7b63cba..f46421b 100644
--- a/lib/carbon/writer.py
+++ b/lib/carbon/writer.py
@@ -95,24 +95,16 @@ def writeCachedDataPoints():
 
   cache = MetricCache()
   while cache:
-    (metric, datapoints) = cache.drain_metric()
-    if metric is None:
-      # end the loop
-      break
-
-    dbFileExists = state.database.exists(metric)
-
-    if not dbFileExists:
-      if CREATE_BUCKET and not CREATE_BUCKET.drain(1):
-        # If our tokenbucket doesn't have enough tokens available to create a new metric
-        # file then we'll just drop the metric on the ground and move on to the next
-        # metric.
-        # XXX This behavior should probably be configurable to no tdrop metrics
-        # when rate limitng unless our cache is too big or some other legit
-        # reason.
-        instrumentation.increment('droppedCreates')
+    # first create new metrics files so graphite-web knows they exist
+    while cache.new_metrics:
+      metric = cache.new_metrics.pop()
+      if state.database.exists(metric):
         continue
 
+      if CREATE_BUCKET:
+        CREATE_BUCKET.drain(1, blocking=True)
+        # log waitTime ?
+
       archiveConfig = None
       xFilesFactor, aggregationMethod = None, None
 
@@ -149,6 +141,12 @@ def writeCachedDataPoints():
         instrumentation.increment('errors')
         continue
 
+    # now drain and persist some data
+    (metric, datapoints) = cache.drain_metric()
+    if metric is None:
+      # end the loop
+      break
+
     # If we've got a rate limit configured lets makes sure we enforce it
     waitTime = 0
     if UPDATE_BUCKET:

@piotr1212
Member Author

Yeah, basically that, but don't block on the create_bucket limiter; just continue with writing datapoints. We don't want to hang the whole writer.
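
One possible non-blocking take on the create loop from the diff above (a sketch; names as in the earlier sketch, and whether to re-queue or drop the name when out of create tokens is a design choice, re-queueing is shown here):

def create_new_metrics_nonblocking(cache, database, create_bucket, create_file):
  # Never wait on the create token bucket; if tokens run out, put the name
  # back and carry on writing datapoints instead of stalling the writer.
  while cache.new_metrics:
    metric = cache.new_metrics.pop()
    if database.exists(metric):
      continue
    if create_bucket and not create_bucket.drain(1):
      cache.new_metrics.append(metric)  # retry on the next writer pass
      break
    create_file(metric)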

@stale

stale bot commented Jun 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 26, 2020
@stale stale bot removed the stale label Jun 29, 2020
bucko909 added a commit to bucko909/carbon that referenced this pull request Aug 21, 2024
Based on a patch at graphite-project#888
by ploxiln; I've fixed the blocking issue and also made it drop metrics
only if it comes to write them after continually failing to have enough
create tokens.

If carbon is configured with a create rate of 0, this deque will grow
indefinitely. In pretty much all other circumstances, behaviour should
be pretty similar.
Development

Successfully merging this pull request may close these issues.

Metric queries return null data until whisper file exists
3 participants