Our blog hasn’t focused much on the geeky side of (mt), even though we’re a technology company. We’ve learned a lot over the years operating and monitoring hundreds of thousands of hosted services — sometimes learning the ‘hard way’ more than we would have liked — and we want to share some of that knowledge.

We’re living in what many are calling “the age of data”. It’s possible to collect so much information now, increasingly easily and cheaply, that we’re able to answer questions that we couldn’t even imagine asking a few years ago. It’s such a big topic in Open Source that there is a conference dedicated to it: the O’Reilly Strata Conference.

I’m going to share some of the tools and approaches we’ve found helpful — both what data to collect, and what to do with it.

Signal vs. Noise


The basic rule of thumb when monitoring large systems and infrastructure is to “gather everything.” Any data point that can be gathered, should be gathered. Managing that information will be cheaper than realizing you do not have the data when it’s needed along the way. However, that is also a lot of data.

(ve) & (dv) Real-time & Historical Server Status

Imagine we’re monitoring a system using collectd, a fantastic tool which is used behind the scenes for our Real-time & Historical Stats feature for (dv) 4.0 & (ve). When we enable the mysql plugin, it tracks over 50 data points. On a typical server we may run 8 instances of mysqld, so there are more than 400 data points to track. Add in disk system monitoring (including latency per drive, health of the whole RAID, write cache performance, etc.), a few more daemons, network stats, memory use, CPU, load, temperatures… It can total thousands of data points on a single server.

Expand that to hundreds of racks, full of servers. Then add in monitoring for the systems and software that run between all those servers (clustering software, networking hardware and software, load balancers…). It gets pretty intimidating. And that is just servers and the network! There are also hundreds of things you can track about the applications that run on those servers. You can track how they’re all performing, not just from inside the data centers, but from places all around the world.

You’re a human being. Your time is precious, and there’s a pretty small limit on the amount of information you can make sense of at one time. The raw data isn’t useful on its own. We have it, but without some way of filtering and transforming the data, it won’t help much. It is information, but most of it is “noise.”

Not only is the volume of the data an issue, the data itself can often raise more questions than it answers. For example: the load average on a server is 10.

  • Is that too high? If not, how high would be too high?
  • What’s causing it?
  • How long has it been that way?

“Signal” is what we want. Signal is actionable data. Signal is data given context and made meaningful. It is data which, as much as possible, leads to actions instead of more questions. Unlike the load average example, signal can sometimes just be a threshold. “Is the disk failing or not?” and “Is the CPU at capacity or not?” become useful questions. This kind of signal brings to our attention only the values that need action.
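
To make the threshold idea concrete, here is a minimal Python sketch. The metric names and limits are made up for illustration (they are not values we use in production), and a real system would load its rules from configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Threshold:
    """A hypothetical rule: the metric becomes signal at or above `limit`."""
    metric: str
    limit: float
    message: str


def to_signal(metric: str, value: float, rules: List[Threshold]) -> Optional[str]:
    """Return an actionable message if a rule for this metric is breached."""
    for rule in rules:
        if rule.metric == metric and value >= rule.limit:
            return f"{rule.message} ({metric}={value}, limit={rule.limit})"
    return None  # within limits: this data point stays noise


# Illustrative limits only; real values depend on the hardware and workload.
RULES = [
    Threshold("cpu.percent_busy", 95.0, "CPU is at capacity"),
    Threshold("disk.reallocated_sectors", 1.0, "Disk may be failing"),
]

print(to_signal("cpu.percent_busy", 97.5, RULES))
print(to_signal("cpu.percent_busy", 40.0, RULES))
```

The first call prints a message someone can act on; the second prints `None`, which is exactly the data nobody needs to look at.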

Start With the Users


The most powerful idea we’ve found, however, is to start from the user’s experience. Forget load averages or CPU usage. How are users experiencing your system? Is it up, down, fast or slow?

One way we do this for our web hosting products is with a very simple Apache module: mod_log_firstbyte. It tracks the time it takes to start sending data back to the user, which is a large part of their experience with the service. This tells us how much time it took to load up the app, hit the database if it needed to, read files off the storage systems, etc.

That’s a great thing to have at the top of the list. “Are we satisfying you right now or not?” It’s a simple question, and it’s answered using real data, not a sample application. It means we know if there is a problem somewhere in the infrastructure, even if nothing else is setting off alarm bells.
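
As a rough illustration of turning those log entries into an answer, the Python sketch below assumes the first-byte time is logged in microseconds as the last field of each access-log line. That layout is an assumption for the example; where the value actually appears depends on your LogFormat. The half-second limit is also just a placeholder.

```python
from typing import Iterable, Optional


def first_byte_seconds(log_line: str) -> Optional[float]:
    """Extract the time-to-first-byte from one access-log line.

    Assumption: the value is the last whitespace-separated field,
    in microseconds. Lines that do not fit return None.
    """
    fields = log_line.split()
    if not fields:
        return None
    try:
        return int(fields[-1]) / 1_000_000
    except ValueError:
        return None


def fraction_slow(lines: Iterable[str], limit_seconds: float = 0.5) -> float:
    """Fraction of parseable requests whose first byte took longer than the limit."""
    times = [t for t in map(first_byte_seconds, lines) if t is not None]
    if not times:
        return 0.0
    return sum(t > limit_seconds for t in times) / len(times)


# Made-up sample lines in the assumed layout.
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2011:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 120000',
    '203.0.113.8 - - [10/Oct/2011:13:55:37 -0700] "GET /blog HTTP/1.1" 200 5120 750000',
    'malformed line',
]

print(fraction_slow(SAMPLE))
```

One number like this, tracked over time, is a reasonable stand-in for “are we satisfying you right now or not?”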

From there, it is straightforward to get better and smarter about the things which tend to impact that most important metric.
So… how do you get started?

The Data Stack

There are hundreds of tools and projects which can help us out, open and closed, and it seems as if there are new entries every few days. I recently gave a presentation about this topic and highlighted a few, including collectd, graphite, OpenTSDB, reconnoiter, esper, and protovis.

These tools and projects tend to provide similar kinds of services. It’s helpful to think about the “data stack” as:

  • Collect
  • Transport
  • Process
  • Store
  • Present

Collect is how we get the data in the first place. Running collectd on a server is a good way to do it, as is rsyslog/syslog-ng, or something which polls SNMP data.

Transport is how data is actually shipped around. The syslog family has long been “the” way to do this, but now there are a lot of alternatives built on things like web protocols, multicast/unicast/broadcast mechanisms, or something custom, like flume from Cloudera.

Processing is where so much of the interesting work is happening these days. This might be something simple like aggregation (taking a lot of data points and merging them using something like an average or a percentile) but it’s increasingly moving toward event correlation (such as with Esper) or machine learning (Apache Mahout). This layer is where more sophisticated “signal generation” can happen. Imagine getting an alert that reads something like “The SLA on our web service is critical, because the storage backend is running 4x slower than normal, because 3 of the disks have an error rate above normal, because the temperature in the server is 15 degrees higher than it should be.”
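
The simple end of this layer, aggregation, fits in a few lines. The Python sketch below uses the nearest-rank definition of a percentile, which is one common choice among several, and the sample latencies are invented for the example.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples to aggregate")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(values):
    """Collapse many raw data points into a handful of summary numbers."""
    return {
        "count": len(values),
        "avg": mean(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
    }


# Made-up response times in milliseconds, with one slow outlier.
samples_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]
summary = aggregate(samples_ms)
print(summary)
```

With these samples the average is 37 ms even though the typical request (the 50th percentile) took 13 ms, and the 95th percentile surfaces the 250 ms outlier. That gap is why it helps to keep percentiles next to averages.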

Down another road are the incredibly powerful analysis tools, with R leading the pack.

Storage is another area that’s exploding with new tools and approaches. Historically, your choices were to spread your data around a lot of servers, or use a centralized (expensive) SAN or NAS. Things like Hadoop’s HDFS (which enables OpenTSDB) let you get the best of both worlds (cost, availability, and scalability) while generally making the data more accessible at the same time.

Presentation isn’t any less dynamic or impressive. With the increasing power of the browser to make great UIs and graphics, you’ve got incredible control over how (and what) you display, and over making it interactive. Two similar projects that both enable amazing things are D3 and Protovis. Check out the examples for a taste of what’s possible.

A lot of tools and services are capable of providing some or all of these services.

One we use a lot is an open source project from Orbitz called Graphite. It handles processing, storage, and presentation really elegantly, and it performs amazingly well on modest gear. It is easy to add new things to monitor, and you can quickly craft up a new view on your data by just clicking around. In other words, real humans, not just developers and admins, can use it. The latest release even includes a dashboard builder so you can share commonly useful views. It’s also easy to use with automation: it’s got a simple url syntax that lets you get data out in a raw format for further analysis in something like R.
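
Here is a small Python sketch of that url syntax. It assumes a Graphite version whose render URL accepts `format=raw` (check the documentation for the release you run), and the host name and metric path are placeholders, not real ones of ours. Each line of raw output looks like `name,start,end,step|v1,v2,...`, which the second function parses.

```python
from urllib.parse import urlencode


def render_url(base, target, frm="-1h", fmt="raw"):
    """Build a Graphite render URL that returns data instead of a graph."""
    query = urlencode({"target": target, "from": frm, "format": fmt})
    return f"{base}/render?{query}"


def parse_raw(line):
    """Parse one line of raw output: name,start,end,step|v1,v2,..."""
    header, _, data = line.strip().partition("|")
    # The name may itself contain commas (e.g. function calls),
    # so split the three numeric fields off from the right.
    name, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in data.split(",")]
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "step": int(step),
        "values": values,
    }


# Placeholder host and metric path, for illustration only.
url = render_url("http://graphite.example.com", "web.firstbyte.avg")
print(url)

series = parse_raw("web.firstbyte.avg,1311836008,1311836012,1|1.0,2.0,None,5.0")
print(series["values"])
```

Fetching the URL (with `urllib.request`, curl, or straight from R) is left out so the sketch runs without a Graphite server.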

The Data Stack at work

Let’s walk through an example end-to-end use case: our internal (gs) Grid Server metrics.

Collection, Transport and Storage

We use Perl for a lot of our core plumbing, so we have a daemon which extracts SNMP data from the servers and devices and sends it to Graphite.
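
Our daemon is written in Perl, but the last step (handing a value to Graphite) is simple enough to sketch in Python. Graphite’s carbon listener accepts plain-text lines of the form `metric.path value timestamp`, by default on TCP port 2003. The host name and metric path below are placeholders, and the SNMP polling itself is omitted.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """One line of Graphite's plaintext protocol: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines, host="graphite.example.com", port=2003):
    """Send already-formatted lines to a carbon listener.

    The host is a placeholder; 2003 is carbon's default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Format only; call send_metrics([line]) once a real host is configured.
line = format_metric("servers.web01.load.shortterm", 0.42, 1300000000)
print(line, end="")
```

Because the protocol is just text over a socket, adding a new thing to monitor usually means little more than picking a metric path.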

Raw log data (including the mod_log_firstbyte-decorated Apache logs) is also collected with syslog-ng.

Processing (Round 1)

Syslog-ng has a great feature: the program destination. This lets you filter, then feed the interesting entries to the STDIN of a custom script. This script can analyze the incoming log data in soft real-time.
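
A script on the receiving end of a program destination is just a loop over lines. The Python sketch below (ours are Perl) watches the share of 5xx responses over a sliding window and reports when it first crosses a limit. The window size, the limit, and the assumption that the status code follows the quoted request are all illustrative. Under syslog-ng you would pass `sys.stdin` as `lines`; a list is used here so the example runs on its own.

```python
import re
from collections import deque

# Assumes the status code directly follows the quoted request string,
# as in Apache's common log format.
STATUS_RE = re.compile(r'" (\d{3}) ')


def watch(lines, window=100, max_error_rate=0.05):
    """Yield a message each time the 5xx rate over the last `window`
    requests rises above `max_error_rate`."""
    recent = deque(maxlen=window)
    alerting = False
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not an access-log line we understand
        recent.append(match.group(1).startswith("5"))
        rate = sum(recent) / len(recent)
        if rate > max_error_rate and not alerting:
            yield f"error rate {rate:.0%} over last {len(recent)} requests"
        alerting = rate > max_error_rate


OK = '203.0.113.7 - - [10/Oct/2011:13:55:36 -0700] "GET / HTTP/1.1" 200 512'
BAD = '203.0.113.7 - - [10/Oct/2011:13:55:37 -0700] "GET / HTTP/1.1" 503 0'

alerts = list(watch([OK] * 9 + [BAD] + [OK] * 5, window=10))
print(alerts)
```

The `alerting` flag means one message per incident rather than one per log line, which matters if the messages are forwarded to an alerting system.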

For example, from the web logs, we…

  • Send alerts to Nagios when the error rate is high
  • Send the end-to-end backend performance (mod_log_firstbyte) information to graphite
  • Track the top 100 consumers of
    • CPU
    • Filesystem I/O
    • hits/sec
    • bandwidth

…for quick troubleshooting. Other logs are watched for a variety of things, such as automatically detecting and blocking brute forcing attempts.
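
Tracking top consumers is mostly a matter of keeping running totals. A minimal Python sketch, assuming the log lines have already been parsed into (site, amount) pairs; the site names and byte counts are invented:

```python
from collections import Counter


def top_consumers(records, n=100):
    """Sum a resource per site and return the n largest totals.

    `records` is an iterable of (site, amount) pairs, for example
    bytes sent per request. The field choice is illustrative.
    """
    totals = Counter()
    for site, amount in records:
        totals[site] += amount
    return totals.most_common(n)


records = [
    ("a.example", 500),
    ("b.example", 1200),
    ("a.example", 900),
    ("c.example", 100),
]

print(top_consumers(records, n=2))
```

The same function works for CPU, filesystem I/O, or hits/sec; only the amount that gets fed in changes.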

Were we to rebuild this system today, it would probably use Esper, but the Perl parsing/processing scripts work incredibly well. They process about 5k lines/second on the average logging server, and use < 2% CPU and 100MB RAM doing so. They fall under the “ain’t broke/don’t fix” protection clause.

Sometimes it makes sense to process at the edge. It’s much cheaper to create this kind of signal from a raw stream of data, sending just the signal on, rather than sending the raw data upstream for post-processing. Some of those decisions become tradeoffs about how much you need to know and how soon. It may make sense to keep the full, raw data around in HDFS for overnight processing, but also get a nice realer-time, less complex analysis as well.

Processing (Round 2) and Presentation

While all the data would be browsable in Graphite, it would be pretty hard for people to find what they needed. This is where Protovis comes in handy, as it lets us make custom displays, taking averages and 95th percentiles and displaying them in one unified screen. (Suitable for throwing up on a big monitor or projector, too.)

What’s also great about this particular display is that it helps explain what’s going on and correlate causes with effects. If performance is ok, we know everything else is pretty much fine. If it starts to slip just 10% above normal, it’s obvious. It’s also generally obvious what other part of the backend system is causing that, because that’s going up (or down) abnormally as well.
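
Our display does this comparison visually, but the underlying check can be sketched numerically. The Python below is not how the display is built; it only illustrates “which metrics are more than 10% away from normal right now?” The metric names and baseline values are invented.

```python
def deviation(current, baseline):
    """Fractional change relative to baseline (0.10 means 10% above)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def abnormal(metrics, baselines, tolerance=0.10):
    """Return {name: deviation} for every metric that is more than
    `tolerance` away from its baseline, in either direction."""
    flagged = {}
    for name, value in metrics.items():
        change = deviation(value, baselines[name])
        if abs(change) > tolerance:
            flagged[name] = round(change, 2)
    return flagged


# Invented "normal" values and a current snapshot.
baselines = {"firstbyte_ms": 200.0, "storage_latency_ms": 8.0, "db_queries_per_sec": 1500.0}
now = {"firstbyte_ms": 230.0, "storage_latency_ms": 12.0, "db_queries_per_sec": 1510.0}

print(abnormal(now, baselines))
```

In this made-up snapshot the user-facing metric is 15% above normal and storage latency is 50% above normal, while the database looks ordinary, so storage is the first place to look.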

This isn’t all the way to the machine learning level, where you’re notified what the root cause of a problem is, but it’s effectively close. Issues that used to require watching 8 different browser tabs and shelling into 4 different systems to correlate can now be identified literally at a glance. This kind of view helps leverage the 75% of your brain that’s dedicated to processing visual information.

Conclusion

As I may have said a few times, this is a field that is developing rapidly in just about every way possible. By focusing on what really matters to your users, and getting the data in front of your team in a way that lets them understand and react quickly, you can make incredible advancements in the quality of what you’re doing. The rich world of the open source data stack is making it possible at a fraction of the commercial cost or in-house development overhead.

We’re going to be talking about some of the changes we’ve made “under the hood” of the Grid Server, and it’s great to be able to tell those stories with data. Keep your eyes open for that!

About The Author


Joshua Barratt joined (mt) Media Temple in 2005, where he currently serves as CTO.

His world is a blend of Computer Science, Software Development, UNIX Administration, and the people and processes that tie it all together. He’s built everything from large clustered systems to embedded real-time motion control for special effects, and almost everything in between.

His current (healthy) obsession is with bringing Lean ideas to development and operations, championing and creating tools which enable everyone to make rapid, high quality decisions.