metrics.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Metrics - OpenTSDB - A Distributed, Scalable Monitoring System</title>
<link rel="stylesheet" href="css/style.css" type="text/css" media="screen"/>
<!--[if lte IE 8]>
<link rel="stylesheet" href="css/ie.css" type="text/css" media="screen"/>
<![endif]-->
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-18339382-1']);
  _gaq.push(['_setDomainName', 'none']);
  _gaq.push(['_setAllowLinker', true]);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>
</head>

<body>
<header><div id="headerbar">
  <h1><a href="index.html">OpenTSDB</a></h1>
  <nav><ul id="navbar">
    <li><a href="overview.html">Overview</a></li>
    <li><a href="getting-started.html">Getting Started</a></li>
    <li><a href="manual.html">Manual</a></li>
    <li><a href="faq.html">FAQ</a></li>
  </ul></nav>
</div></header>

<!--[if lte IE 8]>
<div class="iesucks">Warning: You're using an unsupported, archaic browser.
Get a better, modern browsing experience with
<a href="http://www.google.com/chrome">Chrome</a> or
<a href="http://www.mozilla.com/firefox">Firefox</a>.</div>
<![endif]-->
<section id="content">
<section id="metrics">
<h2>Metrics</h2>

<h3>Understanding metrics and time series in OpenTSDB</h3>

OpenTSDB is a time series database.  A time series is a series of datapoints
of some particular metric over time.  Each time series consists of a metric
plus zero or more tags associated with this metric (we'll cover tags in a
bit).  A metric is any particular piece of data (like, hits to apache) that
you wish to track over time.

<p>
OpenTSDB is also a data plotting system.  OpenTSDB plots things a bit
differently than other systems.  We'll discuss plotting in more detail
below, but for now it's important to know that for OpenTSDB, the basis of
any given plot is the metric.  It takes that metric, finds all of the time
series for the time range you select, aggregates those times series together
(e.g. by summing them up) and plots the result.  The plotting mechanism is
very flexible and powerful and you can do much more than this, but for now
let's talk about the key to the time series, which is the metric.

<p>
In OpenTSDB, a metric is named with a string, like <code>http.hits</code>.
To be able to store all the different values for all the places where this
metric exists, you tag the data with one or more tags when you send them to
the TSD.  TSD stores the timestamp, the value, and the tags.  When you want
to retrieve this data, TSD retrieves all of the values for the data range
you supply, optionally with a tag filter you supply, aggregates all these
values together how you want, and plots a graph of this value over time.

<p>
There's a bunch of things in here that we've introduced so far.  To help you
understand how things work, I'll start with a typical example.  Let's say
you have a bunch of web servers and you want to track two things: hits to
the web server and load average of the system.  Let's make up metric names
to express this.  For load average, let's call it
<code>proc.loadavg.1min</code> (since on Linux you can easily get this data
by reading <code>/proc/loadavg</code>).  For many web servers, there is a
way to ask the web server for a counter expressing the number of hits to the
server since it started.  This is a convenient counter upon which to use for
a metric we'll call <code>http.hits</code>.  I chose these two examples for
two reasons.  One, we'll get to see how OpenTSDB easily handles both
counters (values that increase over time, except when they get reset by a
restart/reboot or overflow) and how it handles normal values that go up and
down, like load average.  A great advantage of OpenTSDB is that you don't
need to do any rate calculation of your counters.  It will do it all for
you.  The second reason is that we can also show you how you can plot two
different metrics with different scales on the same graph, which is a great
way to correlate different metrics.

<h4>Your first datapoints</h4>

Without going into too much detail on how collectors send data to the TSD
<!--(this is covered in XXX)-->, you write a collector that periodically
sends the current value of these datapoints for each server to the TSD.  So
the TSD can aggregate the data from multiple hosts, you tag each value with
a "host" tag.  So, if you have web servers A, B, C, etc, they each
periodically send something like this to the TSD:

<div class="code">put http.hits 1234567890 34877 host=A
put proc.loadavg.1min 1234567890 1.35 host=A
</div>

Here "1234567890" is the current epoch time (<code>date +%s</code>) in
seconds (OpenTSDB does resolution down to the second only).  The next number
is the value of the metric at this time.  This is data from host A, so it's
tagged with <code>host=A</code>.  Data from host B would be tagged with
<code>host=B</code>, and so forth.  Over time, you'll get a bunch of time
series stored in OpenTSDB.


<h4>Your first plot</h4>

Now, let's revisit what we talked about here at the beginning.  A time
series is a series of datapoints of some particular metric (and its tags)
over time.  For this example, each host is sending two time series to the
TSD.  If you had 3 boxes each sending these two time series, TSD would be
collecting and storing 6 time series.  Now that you have the data, let's
start plotting.
<p>
To plot HTTP hits, you just go to the UI and enter <code>http.hits</code> as
your metric name, and enter the time range.  Check the "Rate" button since
this particular metric is a rate counter, and voil&agrave;, you have a plot
of the rate of HTTP hits to your web servers over time.

<h4>Aggregators</h4>

The default for the UI is to aggregate each time series for each host by
adding them together (<code>sum</code>).  What this means is, TSD is taking
the three time series with this metric (host=A, B and C) and adding their
values together to come up with the total hits by all web servers at a given
time <!--XXX add link to paper on how this algorithmically is done-->.  Note
you don't need to send your datapoints at exactly the same time, the TSD
will figure it out.  So, if each of your hosts was serving 1000 hits per
second each at some point in time, the graph would show 3000.  What if you
wanted to show about how many hits each web server was serving?  Two ways.
If you just care about the average that each web server was serving, just
change the Aggregator method from <code>sum</code> to <code>avg</code>.  You
can also try the others (<code>max</code>, <code>min</code>) to see the
maximum or minimum value.  More aggregation functions are in the works
(percentiles, etc.).  This is done on a per-interval basis <!--XXX again
reference algorithm-->, so if at some point in time one of your webservers
was serving 50 QPS and the others were serving 100 and later a different
webserver was serving 50 QPS and the others were serving 100, for these two
points the Min would be 50.  In other words it doesn't figure out which time
series was the total minimum and just show you that host plot.  The other
way to see how many hits each web server is serving?  This is where we look
at the tag fields.

<h4>Downsampling</h4>

To reduce the number of datapoints returned, you can specify a downsampling
interval and method, such as <code>1h-avg</code> or <code>1d-sum</code>.  This
is also useful (such as when using the <code>max</code> and <code>min</code>)
to find best and worst-case datapoints over a given period.  Downsampling is
most useful to make the graphing phase less intensive and more readable,
especially when graphing more datapoints than screen pixels.

<h4>Tag filters</h4>

In the UI you'll see that the TSD has filled one or more "Tags", the first
one is <code>host</code>.  What TSD is saying here that for this time range
it sees that the data was tagged with a host tag.  You can filter the graph
so that it just plots the value of one host.  If you fill in <code>A</code>
in the host row, you'll just plot the values over time of host A.  If you
want to give a list of hosts to plot, fill in the list of hosts separated by
the pipe symbol, e.g. <code>A|B</code>.  This will give you two plots
instead of one, one for A and one for B.   Finally, you can also specify the
special character <code>*</code>, which means to plot a line for every host.

<h4>Adding more metrics</h4>

So, now you have a plot of your web hits.  How does that correlate against
load average?  On this same graph, click the "+" tab to add a new metric to
this existing graph.  Enter <code>proc.loadavg.1min</code> as your metric
and click "Right Axis" so the Y axis is scaled separately and its labels on
the right.  Make sure "Rate" is unchecked, since load average is not a counter
metric.  Voil&agrave;!  Now you can see how changes in the rate of web hits
affects your system's load average.


<h4>Getting fancy</h4>

Imagine each if your servers actually ran two webservers, say, one for
static content and one for dynamic content.  Rather than create another
metric, just tag the <code>http.hits</code> metric with the server instance.
Have your collector send stuff like:

<p><div class="code">put http.hits 1234567890 34877 host=A webserver=static
put http.hits 1234567890 4357 host=A webserver=dynamic
put proc.loadavg.1min 1234567890 1.35 host=A
</div>

Why do this instead of creating another metric?  Well, what if sometimes you
care about plotting total HTTP hits and sometimes you care about breaking
out static vs. dynamic hits?  With a tag, it's easy.  With this new tag,
you'll see a <code>webserver</code> tag appear in the UI when plotting this
metric.  You can leave it blank and it will aggregate up both values into
one plot (according to your Aggregator setting) and you can see the total
hits, or you can do <code>webserver=*</code> to break out how much each of
your static and dynamic instances are collectively doing across your web
servers.  You can even go deeper and specify <code>webserver=*</code> and
<code>host=*</code> to see the full breakdown.


<h3>Guidelines when creating metrics</h3>

Right now, you cannot combine two metrics into one plot line.  This means
you want a metric to be the biggest possible aggregation point.  If you want
to drill down to specifics within a metric, use tags.

<h4>Tags vs. metrics</h4>

The metric should be a specific thing, like "Ethernet packets" but not be
broken out into a particular instance of a thing.  Generally you don't want
to collect a metric like <code>net.bytes.eth0</code>,
<code>net.bytes.eth1</code>, etc.  Collect <code>net.bytes</code> and tag
<code>eth0</code> datapoints with <code>iface=eth0</code>, etc.  Don't
bother creating separate "in" and "out" metrics, either.  Add the tag
<code>direction=in</code> or <code>direction=out</code>.  This way you can
easily see the total network activity for a given box without having to plot
a bunch of metrics.  This still gives you the flexibility to drill down and
just show activity for a particular interface, or just a particular
direction.

<h4>Counters and rates</h4>

If something is a counter, or is naturally something that is a rate, don't
convert it to a rate before sending it to the TSD.  There's two main reasons
for this.  First, doing your own rate calculation, reset/overflow handling,
etc. is silly, since TSD can do it for you.  You also don't have to worry
about getting the units-per-second calculation correct based on a slightly
inaccurate or changing sample interval.  Secondly, if something happens
where you lose a datapoint or more, if you are sending the current counter
value then you won't lose data, just resolution of that data.  The golden
rule in TSD is, if your source data is a counter (some counter out of
<code>/proc</code> or SNMP), keep it that way.  Don't convert it.  If you're
writing your own collector (say, one that counts how often a particular
error message appears in a <code>tail -f</code> of a log), don't reset your
counter every sample interval.  Let TSD to do the work for you.

<h4>Tags are your friend</h4>

In anything above a small environment, you probably have clusters or groups
of machines doing the same thing.  Over time these change, though.  That's
OK.  Just use a tag when you send the data to TSD to pass this cluster info
along.  Add something like <code>cluster=webserver</code> to all the
datapoints being sent from each of your webservers, and
<code>cluster=db</code> for all your databases, etc.

<p>
Now when you plot CPU activity for your webserver cluster, you see all of
them aggregated into one plot.  Then let's say you add a webserver or even
change it from a webserver to a database.  All you have to do is make sure
the right tag gets sent when its role changes, and now that box's CPU
activity gets counted toward the right cluster.  What's more, all of your
historical data is still correct!  This is the true power of OpenTSDB.  Not
only do you never lose resolution of your datapoints over time like
RRD-based systems, but historical data doesn't get lost as your boxes shift
around.  You also don't have to put a bunch of cluster or grouping awareness
logic into your dashboards.

<h4>Precisions on metrics and tags</h4>
The maximum number of tags allowed on a data point is defined by a constant
(<code>Const.MAX_NUM_TAGS</code>), which at time of writing is 8.  Metric
names, tag names and tag values have to be made of alpha numeric characters,
dash "-", underscore "_", period ".", and forward slash "/", as is enforced
by the package-private function <code>Tags.validateString</code>.

</section>
<footer>
&copy; 2010&ndash;2013 The OpenTSDB Authors.
</footer>
</section></body></html>