[my][home][toon]
previously known as [cold][wet][durham], [dirty][grimy][london],[busy][shiny][toon],[frantic][crowded][south]

Open source monitoring tools

November 29th 2009 in interweb, technology

If you run a bunch of Linux boxes, or even just the one to be honest, there are a few tools which you probably know about by now, but which will help you keep on top of your system.  Linux has a lovely habit of just ploughing along quite happily for years (whilst not Linux, I’ve logged into Solaris boxes and seen uptimes of 7 years before now: 7 years since the last reboot!).

However, long running and low admin overhead means that when things do go wrong it’s often due to something which you might have noticed if you were in and about the system more often.  I use the following tools, ordered here by “response” time.

Munin

Munin is a system resource monitor with memory.  Running top, iostat, memstat etc. will tell you what’s going on now (and sar will tell you what’s been going on), but Munin does all this, and with pretty pictures.  For a bunch of common system resources it tracks two graphs with different time scales, e.g. disc usage by day, and disc usage by week.

What’s the point in 2 graphs?  It lets you spot trends over time more easily.  Disc space is a great example – it creeps up and up, but it’s only when it runs out completely that you tend to notice.
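To make the "creeping up" idea concrete, here is a minimal Python sketch of the sort of trend-spotting the graphs let you do by eye. It is not anything Munin itself runs; the function name days_until_full and the sample numbers are made up for illustration. It fits a straight line through a handful of daily disc usage readings and estimates how many days remain before the disc is full.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disc is full from (day, used_gb) samples.

    Uses an ordinary least-squares straight-line fit. Returns None when
    there are too few samples or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n

    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        return None  # all samples taken on the same day

    # Slope of the fitted line: growth in GB per day.
    growth_per_day = sum(
        (day - mean_day) * (used - mean_used) for day, used in samples
    ) / spread
    if growth_per_day <= 0:
        return None  # flat or shrinking, nothing to worry about

    _, latest_used = samples[-1]
    return (capacity_gb - latest_used) / growth_per_day


# Hypothetical readings: 2 GB of growth per day on a 100 GB disc.
readings = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(days_until_full(readings, capacity_gb=100.0))
```

A straight line is a crude model (real usage jumps about), but it is roughly what your eye does when you look at the weekly graph.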

I stick my nose into Munin a couple of times a week on both my work and personal systems.  It lets me get the gist of what’s going on, and hopefully helps me be a bit more proactive, rather than just reactive, about system problems.

Nagios

Nagios is extremely well known as a reactive monitoring solution.  It’s been around a while, has plugins to monitor just about everything, and like Munin, is installable as a node or a client.  Your node will collate all the info from the clients you install, meaning you only need to check one place.

Nagios will tell you when things are going wrong, either based on “up or down-ness” or triggers (e.g. disc space has dropped under a certain level).  It supports groups for support users, messaging different people based on rota etc.  It’s built to support huge infrastructures, and does.
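As a rough illustration of a trigger, here is a small Python sketch of a disc space check in the style of a Nagios plugin. The exit codes follow the usual plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL); the function names and the 20%/10% thresholds are my own assumptions for the example, not Nagios defaults.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, status_line).

    The thresholds are illustrative values, not Nagios defaults.
    """
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free"
    return OK, f"DISK OK - {free_pct:.1f}% free"


def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
    """Measure free space on the filesystem holding `path` and classify it."""
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    return classify_free_space(free_pct, warn_pct, crit_pct)
```

To use it as an actual check you would print the status line and exit with the code, e.g. call check_disk("/"), print the message, and pass the code to sys.exit. Nagios reads the exit code to decide whether anyone needs to be woken up.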

It is, however, not brilliantly easy to install and get working.  But because it is based on a template system, once you’re up and running, if you’ve been sensible about the templates, you can deploy it to hundreds of machines almost as easily as to one.

Splunk

Munin helps you prevent things going wrong, Nagios lets you know when they’ve gone wrong, and Splunk lets you work out what the hell happened (and sometimes lets you spot the problems beforehand).  I’ve used Splunk on personal boxes, though not at work (yet). Splunk collates log files and lets you organise them chronologically (putting it simply).

Why’s that useful?  Well, you might know that everything went to hell in a handbasket at 6pm, but where do you start to track down what was going on?  Splunk will let you spot that at ten to six Apache was flaking out, that at quarter to six MySQL was grumbling too, and that at half five you started getting lots of network traffic with dropped packets etc.  It lets you spot the chain of events, a seriously difficult thing to do when you might have 10 log files.  If Apache fell over at 6:10pm, you might not think to look in your syslog at five.  Splunk will lead you in the right direction.
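The core idea of lining up several logs on one timeline can be sketched in a few lines of Python. This is only a toy to show the principle, not how Splunk works internally. It assumes every line starts with a timestamp like "2009-11-29 17:30:00" and that each log is already in time order; real Apache and syslog formats differ, and the log names and messages below are invented.

```python
import heapq
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format, 19 characters long


def parse_line(source, line):
    """Split a log line into (timestamp, source, message)."""
    stamp = datetime.strptime(line[:19], STAMP_FORMAT)
    return stamp, source, line[20:].rstrip("\n")


def parsed(source, lines):
    """Lazily parse every line of one log."""
    for line in lines:
        yield parse_line(source, line)


def merge_logs(logs):
    """Merge several time-ordered logs into one chronological list.

    `logs` maps a source name to an iterable of lines.
    """
    streams = [parsed(source, lines) for source, lines in logs.items()]
    return list(heapq.merge(*streams))


# Invented sample data echoing the example above.
logs = {
    "apache": ["2009-11-29 17:50:00 worker timeout"],
    "mysql": ["2009-11-29 17:45:00 slow query"],
    "syslog": ["2009-11-29 17:30:00 dropped packets"],
}
for stamp, source, message in merge_logs(logs):
    print(stamp, source, message)
```

Read top to bottom, the merged output shows the syslog entry first, then MySQL, then Apache, which is the chain of events you were after.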

There are loads of other things out there that are amazing too, but these are a good start.  If you’re running Oracle databases, have a look at Big Brother for example.
