faq.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>FAQ - OpenTSDB - A Distributed, Scalable Monitoring System</title>
<link rel="stylesheet" href="css/style.css" type="text/css" media="screen"/>
<!--[if lte IE 8]>
<link rel="stylesheet" href="css/ie.css" type="text/css" media="screen"/>
<![endif]-->
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-18339382-1']);
  _gaq.push(['_setDomainName', 'none']);
  _gaq.push(['_setAllowLinker', true]);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>
</head>

<body>
<header><div id="headerbar">
  <h1><a href="index.html">OpenTSDB</a></h1>
  <nav><ul id="navbar">
    <li><a href="overview.html">Overview</a></li>
    <li><a href="getting-started.html">Getting Started</a></li>
    <li><a href="manual.html">Manual</a></li>
    <li><a href="faq.html">FAQ</a></li>
  </ul></nav>
</div></header>

<!--[if lte IE 8]>
<div class="iesucks">Warning: You're using an unsupported, archaic browser.
Get a better, modern browsing experience with
<a href="http://www.google.com/chrome">Chrome</a> or
<a href="http://www.mozilla.com/firefox">Firefox</a>.</div>
<![endif]-->
<section id="content">
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js" type="text/javascript"></script>
<script src="misc/toc.js" type="text/javascript"></script>
<section id="FAQ">
<div id="toc"></div>

<a name="scalability"></a><h2>Scalability</h2>
<h4>Can OpenTSDB scale to multiple data centers?</h4>
Yes.  It is recommended that you run one set of Time Series Daemons
(TSDs) per HBase cluster and one HBase cluster per physical datacenter.
It is not recommended to have HBase clusters spanning across data
centers.  Instead you can use
<a href="http://hbase.apache.org/replication.html">HBase replication</a>
to replicate tables across data centers.  HBase replication is still
considered
<a href="https://issues.apache.org/jira/browse/HBASE-1295">experimental</a>
and is being actively developed at StumbleUpon with help from the rest
of the HBase community.
<p>
Right now the TSD doesn't provide any assistance to use multiple HBase
clusters together at the same time, so you can't easily plot time series
that come from different data centers.  This will be fixed.

<h4>How much <em>write</em> throughput can I get with OpenTSDB?</h4>
It depends mostly on two things:
<ol><li>The size of your HBase cluster.</li><li>The CPUs you're using.</li></ol>
If your HBase cluster is reasonably sized, it's unlikely that OpenTSDB will
max it out as the TSDs tend to be CPU bound before that happens (unless you
run many TSDs).  A TSD can easily handle 2000 new data points per second per
core on an old dual-core Intel Xeon CPU from 2006.  More modern CPUs will get
you more throughput.

<h4>How much <em>read</em> throughput can I get with OpenTSDB?</h4>
This is not very well documented right now.  The answer certainly depends on
the size of the queries (e.g. generating graphs with many millions of data
points are more expensive than if they only have a few tens of thousands,
obviously).  The read path has plenty of room for optimizations and the TSD
only does some very basic caching right now.

<h4>What type of hardware should I run the TSDs on?</h4>
There are no strict requirements.  The recommended configuration, however, is
a 4-core machine with at least 4GB of RAM, and a <code>tmpfs</code> partition
for the cache directory used by the TSD.  Having more RAM helps the TSD ride
over transient HBase outages by allowing it to buffer more incoming data
before getting to the point where it must start discarding data.

<h4>How much disk space do I need?</h4>
The answer depends mostly on the average number of tags per data point.  At
StumbleUpon, we use 4.5 tags on average and our 100+ billion data points take
only just over a terabyte of disk space (pre-HDFS 3x replication).  We use
<a href="setup-hbase.html#lzo">LZO</a> which is
extremely recommended in a production setting.  In our case, each data point
ends up taking about 12 bytes of disk space (or actually 36 if you include the
3x replication factor of HDFS).  We also find that, on average, LZO is able to
achieve a compression factor of 4.2x on our table, but your mileage will vary.
Without LZO, a data point costs roughly: 16 bytes of HBase overhead, 3 bytes
for the metric, 4 bytes for the timestamp, 6 bytes per tag, 2 bytes of
OpenTSDB overhead, 8 bytes for the value.
<p>
We expect that those storage costs, even though they're already pretty low,
will come down significantly.  Right now all the values are stored on 8 bytes,
but the storage format was designed to support variable-length encoding.  This
simply hasn't been implemented yet.
<p>
In addition to variable-length encoding, a new backwards-compatible storage
format has been designed but not implemented yet either.   We expect that this
new format will reduce the storage requirements by 5x assuming that data
points come 15 seconds apart on average (our default interval at StumbleUpon).

<a name="reliability"></a><h2>Reliability</h2>
<h4>What are the Single Points of Failure of OpenTSDB?</h4>
OpenTSDB itself doesn't have any specific SPoF as there is no central
component and you can run multiple TSDs on different machines.  The TSDs
need HBase to run, and HBase doesn't have any SPoF<sup>*</sup> either as HBase
only really needs a ZooKeeper quorum to keep serving.  A ZooKeeper quorum is
typically made of 5 different machines, out of which you can afford to lose 2
before the system goes down.  Note that although HBase has a master, it is not
actually needed for HBase to keep serving.  Not having a master running will
prevent HBase from starting or recovering from machine failures but, in a
steady state, losing the master doesn't impede on HBase's ability to serve.
<p>
<small><sup>*</sup> Fine prints: if your HBase cluster is backed by HDFS,
which is most likely the case for production clusters at the time of this
writing, then you have a SPoF because of the NameNode of HDFS.  If you run
HBase on top of a <em>reliable</em> distributed filesystem, then you don't
have any SPoF.</small>

<h4>What are the failure modes of OpenTSDB?</h4>
The TSD eventually becomes unhealthy when HBase itself is down or broken for
an extended period of time.  Right now, the TSD doesn't handle very well
prolonged HBase outages and will discard incoming data points once its buffers
are full if it's unable to flush them to HBase.  This is going to be improved.
<p>
At StumbleUpon we've had a number of cases where a collector that runs on
hundreds of machines goes crazy and generate a DDoS on the TSDs.  The TSD
doesn't do a good job at handling such DDoS situations by penalizing offending
clients, so its performance will degrade once the machine it's running on is
unable to keep up with the load.

<h4>What is the recommended deployment for OpenTSDB?</h4>
We recommend that you run multiple TSDs.  At StumbleUpon, we found it useful
to dedicate one or more TSDs for read queries (human users using the web UI to
generate graphs or to view dashboards) and let other TSDs handle the write
queries (new data points coming in from production machines).  For the
"read-only" TSDs, we recommend Varnish for load balancing.
<a href="varnish.html">Read more about Varnish and TSDs</a>.

<h4>What data durability guarantees does OpenTSDB make?</h4>
By default the TSD buffers data points for about 1 second before persisting
them in HBase (configurable via the <code>--flush-interval</code> flag).  If
the TSD was to crash without getting a chance to run its shutdown hook, you
could lose up to 1 second worth of data points.  In practice we've found this
trade off to be acceptable given the performance benefits that deferred
flushes offer in terms of write throughput.  Once a data point has been stored
in HBase, data durability is guaranteed if you're running HBase on top of a
distributed filesystem that provides the necessary data durability guarantees.
<p>
If you use HDFS, we recommend that you run
<a href="http://www.cloudera.com/downloads/">Cloudera's Distribution for
Hadoop</a> (CDH), version 3 or above preferably, as this version comes with
all the necessary patches to make HDFS less unreliable and has better
performance.

<a name="datamodel"></a><h2>Data Model</h2>
<h4>How can I increment a counter in OpenTSDB?</h4>
OpenTSDB doesn't work with counters.  It simply records <code>(timestamp,
value)</code> pairs.  Data points are independent from each other.  Say you
want to keep track of clicks on an ad in OpenTSDB.  You wouldn't send a "+1"
to the TSD for every click.  Instead, if your application doesn't already
keeps track of click counts, you'd need to increment a counter for every click
and periodically send the value of that counter to the TSD.  You can later
query the TSD and ask for the rate of change of the counter, which will give
you clicks per second.

<h4>Can I store sub-second precision timestamps in OpenTSDB?</h4>
No.  Right now timestamps are encoded on 4 bytes so this is not possible.
Note that this is not typically needed for OpenTSDB's main use case, which
consists in monitoring large clusters of commodity machines.  If you think
you really need sub-second precision, please reach out to our mailing list
for advices:
<a href="http://groups.google.com/group/opentsdb">opentsdb<!--fuckSPAM-->@googlegroups.com</a>

<h4>Can I use another storage backend than HBase?</h4>
No.  OpenTSDB was designed specifically for a storage backend that follows
the <a href="http://labs.google.com/papers/bigtable.html">Bigtable</a> data
model (a distributed, strongly consistent, sorted multi-dimensional hash map).
At the time of writing, HBase is the only such system that's both open-source
and usable in production, so the code was written specifically for HBase.
Technically it would be feasible to port the code to other systems that follow
the Bigtable data model.  Systems that differ by not storing data in a sorted
fashion (such as distributed hash tables) or that do not offer a strong
consistency guarantee will simply not work with the current design.
<p>
edit (Q1'13): Due to popular demand, the next major version of OpenTSDB
is likely to contain an interface behind which calls to HBase will be
abstracted away, in order to allow integration with other storage backends.

<a name="misc"></a><h2>Misc</h2>
<h4>How do the TSDs handle DST changes or leap seconds?</h4>
The TSD doesn't assign timestamps to your data points, your collectors do.
It is strongly recommended that you use UNIX timestamps in your collectors,
so all your timestamps will be based on
<a href="http://en.wikipedia.org/wiki/Unix_epoch">Epoch</a>.  This way
you will not be affected by timezone adjustments or DST changes on your
machines.
<p>
The TSD always renders timestamps in local time, to make it easier for us
human to understand and correlate events based on the timezone we live in.
So you should to make sure you give the TSD the correct timezone setting
(e.g. via the <a href="cli.html#Overriding_the_timezone_of_the_TSD"><code>TZ</code> environment variable</a>).
When the TSD starts,
it computes its offset from UTC and will then keep that offset forever.
In case of a DST change, for instance, it would then appear that the TSD
is 1 hour behind.  There are plans to periodically re-compute the offset
from UTC to avoid that situation, but right now you have to restart the
TSD in order to adjust the offset.  Note that this doesn't prevent the TSD
from working properly, it only affects anything that parses dates from local
time or renders them in local time.  Dashboards and alerting systems should
use relative time (e.g. "1d ago") and should thus be unaffected.
<p>
When <a href="http://en.wikipedia.org/wiki/Leap_second">leap</a> seconds
occur, UNIX timestamps go back by one second.  The TSD should handle this
situation gracefully (although this hasn't been tested yet).  Unless you're
collecting data every second, you won't notice anything except that the
interval between the two data points where the leap second occurred is one
second less than it should have been.  If you do collect data every second,
the second data point that attempts to overwrite the previous one during the
leap second will be discarded with an error message.

<h4>The graphs are ugly, can they be made prettier?</h4>
Ugliness is a subjective thing :)
<p>
There are a lot of knobs that aren't exposed yet that would allow the TSD to
generate nicer, antialiased, smoothed graphs.  It's just a matter of exposing
those Gnuplot knobs.  Also, recent versions of Gnuplot can generate graphs in
HTML5 canvas.  We plan to use this to build pretty graphs you can interact
with from your web browser.
<p>
Please contribute to help make the UI sexier.

<h4>Can I use OpenTSDB to generate graphs for my customers / end-users?</h4>
Yes, but you have to be careful with that.  OpenTSDB was written for internal
use only, to help engineers and operations staff understand and manage large
computer systems.  It hasn't been through any security review.
<p>
We don't recommend that you give direct access to the TSD to untrusted users.
If you really want to leverage the TSD's graphing features, we recommend that
you put the TSD behind a secured HTTP proxy that only allows specific requests
to go through.  Alternatively, you could use the TSD to periodically
pre-generate a fixed set of graphs and serve them as static images to your
customers.

<h4>Why does OpenTSDB return more data than I asked for in my query?</h4>
All queries specify a start time and an end time (if the end time isn't
specified, it is assumed to be "now").  OpenTSDB's goal is to plot a sensible
graph covering that time span.  However it needs to retrieve data before and
after the times you actually specified, in order to know how to properly
compute the values near the "edges" of the graph.  Because having extra values
past the times actually requested is required to draw accurate graphs,
OpenTSDB also returns the extra data based on the assumption that if you want
to plot your own graphs or make your own processing, you will also need the
extra data to get the correct behavior near the edges.
<p>
The amount of extra data that OpenTSDB attempts to retrieve is proportional to
the time span covered by your query.  It is a known usability issue that
sometimes it will return too much excessive data, when less data is actually
required to correctly plot the edges of the graph.  This is being fixed.

<h4>I don't understand the data points returned for my query</h4>
Sometimes the results to a query don't match people's expectations.  This is
often because it's not necesssarily quite obvious what steps are involved in
a query, why OpenTSDB uses interpolation, when do aggregators kick in.
There is now a dedicated page to explain this:
<a href="query-execution.html">OpenTSDB query execution</a>.

<a name="meta"></a><h2>Meta</h2>

<h4>Why was OpenTSDB written in Java?</h4>
Mostly because OpenTSDB lives around the HBase community, which is a Java
community.  OpenTSDB also started by using HBase's library, which only
exists in Java.  Eventually, however, OpenTSDB started to use another
alternative library to access HBase
(<a href="http://github.com/OpenTSDB/asynchbase">asynchbase</a>)
but sadly this one too is in Java.

<h4>Has OpenTSDB anything to do with OpenBSD?</h4>
While the author of OpenTSDB admires the work done on OpenBSD, the fact that
the name of projects are so close is just a coincidence.  "TSDB" alone was
too ambiguous, and the author miserably failed to come up with a better name.

<h4>Who is behind OpenTSDB?</h4>
OpenTSDB was originally designed and implemented at
<a href="http://www.stumbleupon.com">StumbleUpon</a> by Beno&icirc;t Sigoure.
Berk D. Demir contributed ideas and feedback during the early design stages.
<a href="https://github.com/OpenTSDB/tcollector"><code>tcollector</code></a>
was designed and implemented by Mark Smith and Dave
Barr, with contributions from Beno&icirc;t Sigoure.

<h4>How to contribute to OpenTSDB?</h4>
The easiest way is to <a href="http://help.github.com/fork-a-repo/">fork</a>
the project on <a href="https://github.com/OpenTSDB/opentsdb">GitHub</a>.
Make whatever changes you want to your own fork, then send a
<a href="http://help.github.com/send-pull-requests/">pull request</a>.
You can also send your patches to the
<a href="http://groups.google.com/group/opentsdb">mailing list</a>.
Be prepared to go through a couple iterations as the code is being reviewed
before getting accepted in the main repository.  If you are familiar with
how the Linux kernel is developped, then this is pretty similar.

<h4>Who commits to OpenTSDB?</h4>
Anyone can commit to OpenTSDB, provided that the changes are accepted
in the main repository after getting reviewed.  There is no notion of
"commit access", or no list of committers.

<h4>Why does OpenTSDB use the LGPL?</h4>
One of the most frequent "holy war" that plague open-source communities is
that of what licenses to use, which ones are better or "more free" than others.
OpenTSDB uses the <a href="http://www.gnu.org/licenses/lgpl.html">GNU LGPLv2.1+</a>
for maximum compatibility with its dependencies and other licenses, and
because its author thinks that the LGPL strikes the right balance between the
goals of free software and the legal restrictions often present in corporate
environments.
<p>
Let's stress the following points:
<ul>
<li>The LGPL is <em>not</em> the GPL.  Although based on the same text, the
way it extends the GPL has significant consequences.  Do not confuse the two.</li>
<li>The <a href="http://www.gnu.org/licenses/lgpl-java.html">LGPL is perfectly
compatible with Java</a>.  The myth that the LGPL does not work as intended
with Java is, well, just a myth, albeit a widespread one.</li>
<li>The LGPL allows you to use the code in proprietary software, provided that
you don't <em>redistribute a modified version</em> of the LGPL'ed code.</li>
<li>If you want to redistribute a modified version of the code, then your
changes must be released under the LGPL.</li>
<li>The LGPL is perfectly compatible with the
<a href="http://www.apache.org/licenses/LICENSE-2.0">ASF2</a> license.
Many people are misled to believe that there is an incompatibility because the
Apache Software Foundation (ASF) decided to not allow inclusion of LGPL'ed
code in its own projects.  This choice only applies to the projects managed by
the ASF itself and doesn't stem from any license incompability.</li>
</ul>
With this out of the way, we hope that those afraid of the 3 letters "GPL"
will acknowledge the importance of using the LGPL in OpenTSDB and will
overcome their fear of the license.
<br/>
<small>Disclaimer: This page doesn't provide any formal legal advice.
Information given here is given in good faith.  If you have any doubt, talk to
a lawyer first.  In the text above "LGPL" or "GPL" refers to the version 2.1 of
the license, or (at your option) any later version.</small>

<h4>Who supports OpenTSDB?</h4>
<a href="http://www.stumbleupon.jobs">StumbleUpon</a> supported the initial
development of OpenTSDB as well as its open-source release.
<p>
YourKit is kindly supporting open source projects with its full-featured Java
Profiler.  YourKit, LLC is the creator of innovative and intelligent tools for
profiling Java and .NET applications. Take a look at YourKit's leading
software products:
<a href="http://www.yourkit.com/java/profiler/index.jsp">YourKit Java Profiler</a>
and
<a href="http://www.yourkit.com/.net/profiler/index.jsp">YourKit .NET Profiler</a>.


</section>
<footer>
&copy; 2010&ndash;2013 The OpenTSDB Authors.
</footer>
</section></body></html>