Create new metrics earlier #888
base: master
Conversation
In large Graphite installations the queues can get really long. It can take an hour for Graphite to write all metrics in the queue. New db files are created when the metric is written, which can take too long. This separates the creation of metrics from writing data to them and moves the creation to an earlier moment. Whenever a new metric is received, its name is pushed to a new_metric list. The first step in the writer loop is to check whether new metrics have been received and create them if they don't exist on disk yet. After the creation the writer continues as usual with writing metrics from the queue, but it does not check if the file already exists, to prevent the check from occurring twice and impacting IO. If the file does not exist at this point it is logged. Fixes: graphite-project/graphite-web#629
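For illustration only, here is a self-contained toy sketch of the mechanism described above (class names and the file handling are made up, this is not the actual patch): the cache remembers first-seen metric names, and the writer creates their files before draining datapoints.

import os

class ToyMetricCache(dict):
  """Toy stand-in for carbon's metric cache, showing the new-metrics list."""

  def __init__(self):
    super().__init__()
    self.new_metrics = []  # metric names seen for the first time

  def store(self, metric, timestamp, value):
    if metric not in self:
      self.new_metrics.append(metric)  # remember it for early file creation
    self.setdefault(metric, {})[timestamp] = value

def writer_pass(cache, data_dir):
  # step 1: create files for brand-new metrics right away, so readers can
  # find them long before their datapoints reach the front of the queue
  while cache.new_metrics:
    metric = cache.new_metrics.pop(0)
    path = os.path.join(data_dir, metric.replace('.', os.sep) + '.wsp')
    if not os.path.exists(path):
      os.makedirs(os.path.dirname(path), exist_ok=True)
      open(path, 'wb').close()  # stand-in for whisper.create()
  # step 2: drain and write datapoints as before, without re-checking exists();
  # a missing file at this point would only be logged
  for metric in list(cache):
    datapoints = cache.pop(metric)  # write these to the metric's file (omitted)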
Not sure if I missed something, but I think this would fix the issues in graphite-project/graphite-web#629 with a much simpler solution than everything discussed. I see I broke Python2, not sure if we still care?
(metric, datapoints) = cache.drain_metric()
if metric is None:
  # end the loop
  break
This will only create new files as long as there are also updates to process, is that what we want?
Not sure if I am following; new files are created first, on every loop, and the break comes later.
Notice that the diff is kinda screwed up; the part with the break is old code.
Imagine if there is more to get from pop_new_metric() than there is to get from drain_metric(): then the loop would do create-new, drain-old, create-new, drain-old, alternating, and stop when drain-old was done, before all new metrics were handled ...
but that seems impossible, there can't be more new metrics than total cached metrics, right ... but I understand how one might be unsure, when looking at this code in isolation.
Here's another question: is it possible for drain_metric() to return a new metric before pop_new_metric()? Imagine there are two new cached metrics and nothing else, A and B. A is received first, but B gets more datapoints. pop_new_metric() returns A, then drain_metric() returns B (then pop_new_metric returns B on the next iteration). The FileNotFound will be hit below, and B datapoints will be lost, I think, needlessly.
I have an alternate design idea: keep the original logic here in this loop, and instead modify drain_metric() to preferentially return new metrics.
but that seems impossible, there can't be more new metrics than total cached metrics, right ... but I understand how one might be unsure, when looking at this code in isolation.
Yup, that should be impossible.
Here's another question: is it possible for drain_metric() to return a new metric before pop_new_metric()? Imagine there are two new cached metrics and nothing else, A and B. A is received first, but B gets more datapoints. pop_new_metric() returns A, then drain_metric() returns B (then pop_new_metric returns B on the next iteration). The FileNotFound will be hit below, and B datapoints will be lost, I think, needlessly.
Yes, that is possible when the MAX_CREATES_PER_MINUTE limiter is hit. This is on purpose; currently the metrics get dropped as well, see the old XXX TODO part.
I'm wondering if it wouldn't be better to drain the new_metrics first before continuing with drain_metric().
I have an alternate design idea: keep the original logic here in this loop, and instead modify drain_metric() to preferentially return new metrics.
That is exactly what the naive strategy does. That one is not efficient, as it does not write the metric with the most datapoints, resulting in small writes.
The cache doesn't know if a metric is really new (as in, not on disk); it can only see that it is not in the cache dict. I didn't want to add IO stuff to the cache thread to check if it is on disk; no particular reason, it just felt bad to mix it. Maybe it would be better?
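Roughly, the difference between the two drain orders, as a toy sketch (these are not carbon's actual strategy classes, just an illustration on a dict of metric name to datapoints):

def choose_metric_naive(cache):
  # first-come, first-served: new metrics get flushed quickly,
  # but each whisper update may only write a handful of points
  return next(iter(cache))

def choose_metric_max(cache):
  # largest series first: each update writes many points at once,
  # which is why the naive order is considered inefficient above
  return max(cache, key=lambda m: len(cache[m]))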
Hmm you're exactly right ... my idea was equivalent to the existing naive strategy ...
That is an intriguing idea, doing the exists() check just before adding to new_metrics (only if the metric is not in the cache). That would give the best signal, but might be too expensive, as you worried.
I don't have any better ideas, maybe your current method will work well in practice :)
lib/carbon/cache.py
Outdated
@@ -236,6 +237,10 @@ def get_datapoints(self, metric):
    """Return a list of currently cached datapoints sorted by timestamp"""
    return sorted(self.get(metric, {}).items(), key=by_timestamp)

  def pop_new_metric(self):
    # return first seen metric
    return self.new_metrics.pop(0)
this can raise "IndexError: pop from empty list"
(which does not appear to be handled where this is called)
Thanks, it was caught later on and not logged so it got lost. Changed the code so we log exceptions and handled this.
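For example, one way to make the pop safe would be something like this (a sketch only, not necessarily the exact committed fix):

def pop_new_metric(self):
  # return the first-seen metric name, or None if nothing is pending
  try:
    return self.new_metrics.pop(0)
  except IndexError:
    return None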
handle indexexception log exceptions in writer
Some more thoughts, hopefully helpful:

Currently (without this pull-request), the exists() check only happens when the writer drains a metric from the cache. If you instead called exists() when the metric is first added to the cache, the cache would know whether the metric is genuinely new on disk.

Regarding the file-create-rate-limit: with the status-quo (check-on-remove), when the limit is hit, all cached data for the affected metric is lost. With the alternative (create-on-add), you can skip adding the new metric if the file-create-rate-limit is hit, and only lose the last bit of data submitted, and cause it to check again when the next bit of data is submitted. This loses as little data as possible, while honoring the file-create-rate-limit, but does more exists() calls.

Finally, if the whisper file is deleted while the metric is already in the cache (by the user or some separate cleanup job), then in the create-on-add design, you lose all the cached data when it later fails to write it out. This seems like a very unlikely corner case.

So I think it could be a good idea to move the exists() check and create attempt to just when the metric is being added to the cache. Or, instead of just one exists()/create before each drain_metric(), handle all pending new metrics first.
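To make that create-on-add trade-off concrete, a toy, runnable sketch (all names here are hypothetical stand-ins, not carbon APIs): when the token bucket is empty, only the single incoming datapoint is dropped, and the next datapoint for the same metric triggers another create attempt.

class TokenBucket:
  """Minimal stand-in for carbon's create-rate token bucket."""
  def __init__(self, tokens):
    self.tokens = tokens

  def try_drain(self, n=1):
    if self.tokens >= n:
      self.tokens -= n
      return True
    return False

def add_datapoint(cache, on_disk, bucket, metric, timestamp, value):
  """Create-on-add: attempt the file create when a metric is first cached."""
  if metric not in cache and metric not in on_disk:
    if not bucket.try_drain():
      # rate limit hit: drop just this datapoint and try again next time
      return False
    on_disk.add(metric)  # stand-in for actually creating the whisper file
  cache.setdefault(metric, {})[timestamp] = value
  return True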
Yes, basically what I was thinking. Already have a patch to loop through all new_metrics.
Last commit needs testing. I'm quite busy at the moment, don't know when I will get around to it.
This pull request introduces 1 alert when merging b9dbd1b into 42c1242 - view on LGTM.com

new alerts:
lib/carbon/cache.py
Outdated
    #  return self.new_metrics.pop(0)
    #except IndexError:
    #  return None
    return next(self.get_new_metrics())
looks like these methods are not used? the writer currently just does:
for new_metric in cache.new_metrics:
I have a more minimal version of this idea, with a couple of tweaks, possibly useful:

--- a/lib/carbon/cache.py
+++ b/lib/carbon/cache.py
@@ -189,6 +189,7 @@ class _MetricCache(defaultdict):
   def __init__(self, strategy=None):
     self.lock = threading.Lock()
     self.size = 0
+    self.new_metrics = []
     self.strategy = None
     if strategy:
       self.strategy = strategy(self)
@@ -253,6 +254,8 @@ class _MetricCache(defaultdict):
       log.msg("MetricCache is full: self.size=%d" % self.size)
       events.cacheFull()
     else:
+      if not self[metric]:
+        self.new_metrics.append(metric)
       self.size += 1
       self[metric][timestamp] = value
     if self.strategy:
diff --git a/lib/carbon/writer.py b/lib/carbon/writer.py
index 7b63cba..f46421b 100644
--- a/lib/carbon/writer.py
+++ b/lib/carbon/writer.py
@@ -95,24 +95,16 @@ def writeCachedDataPoints():
   cache = MetricCache()
   while cache:
-    (metric, datapoints) = cache.drain_metric()
-    if metric is None:
-      # end the loop
-      break
-
-    dbFileExists = state.database.exists(metric)
-
-    if not dbFileExists:
-      if CREATE_BUCKET and not CREATE_BUCKET.drain(1):
-        # If our tokenbucket doesn't have enough tokens available to create a new metric
-        # file then we'll just drop the metric on the ground and move on to the next
-        # metric.
-        # XXX This behavior should probably be configurable to no tdrop metrics
-        # when rate limitng unless our cache is too big or some other legit
-        # reason.
-        instrumentation.increment('droppedCreates')
+    # first create new metrics files so graphite-web knows they exist
+    while cache.new_metrics:
+      metric = cache.new_metrics.pop()
+      if state.database.exists(metric):
        continue
+      if CREATE_BUCKET:
+        CREATE_BUCKET.drain(1, blocking=True)
+        # log waitTime ?
+
      archiveConfig = None
      xFilesFactor, aggregationMethod = None, None
@@ -149,6 +141,12 @@ def writeCachedDataPoints():
        instrumentation.increment('errors')
        continue
+    # now drain and persist some data
+    (metric, datapoints) = cache.drain_metric()
+    if metric is None:
+      # end the loop
+      break
+
     # If we've got a rate limit configured lets makes sure we enforce it
     waitTime = 0
     if UPDATE_BUCKET:
yeah, basically that but don't block on the create_bucket limiter, just continue with writing datapoints. We don't want to hang the whole writer.
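In other words, one possible shape for that inner loop, as a sketch only (re-appending keeps the metric pending for the next writer pass; the PR may instead simply drop it):

# first create new metric files, but never block the writer on the limiter
while cache.new_metrics:
  metric = cache.new_metrics.pop()
  if state.database.exists(metric):
    continue
  if CREATE_BUCKET and not CREATE_BUCKET.drain(1):
    cache.new_metrics.append(metric)  # out of create tokens: retry next pass
    break                             # go write datapoints instead of waiting
  # ... resolve the storage schema and create the file, as in the diff above ...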
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Based on a patch at graphite-project#888 by ploxiln; I've fixed the blocking issue and also made it drop metrics only if it comes to write them after continually failing to have enough create tokens. If carbon is configured with a create rate of 0, this deque will grow indefinitely. In pretty much all other circumstances, behaviour should be pretty similar.