Munin is a great tool. If you can script it, you can monitor it with munin. It also comes with a great webinterfacefrontendthing that allows you to dig deep into the history of what you've been monitoring. Unfortunately, munin is slow: it takes a snapshot once every five minutes, and does not look at systems in between. If you have a short load spike that lasts just a few seconds, chances are pretty high that munin will miss it.

By the time munin tells you that your Kerberos KDCs are all down, you've probably had each of your users call you several times to tell you that they can't log in. You could use nagios or one of its brethren, but it takes about a minute before such tools will notice these things, too.

Maybe use CollectD then? Rather than check once every several minutes, CollectD will collect information every few seconds. Unfortunately, due to the performance requirements to accomplish that (without causing undue server load), writing scripts for CollectD is not as easy as it is for Munin. In addition, webinterfacefrontendthings aren't really part of the CollectD code (there are several, but most that I've looked at are lacking in some respect), so if you're using CollectD, you're usually missing out on the history-digging part.

And collectd doesn't do the nagios thing of actually telling you when things go down.

So what if you could see it when things go bad?

At one customer, I came in contact with Frank, who wrote ExtreMon, an amazing tool that allows you to visualize the CollectD output as things are happening, in a full-screen, fully customizable visualization of the data. The problem is that ExtreMon is rather... complex to set up. When I tried to talk Frank into helping me get things set up for myself so I could play with it, I got a reply along the lines of...

well, extremon requires a lot of work right now... I really want to fix foo and bar and quux before I start documenting things. Oh, and there's also that part which is a dead end, really. Ask me in a few months?

which is fair enough (I can't argue with some things being suboptimal), but the code exists, and (as I can see every day at $CUSTOMER) actually works. So I decided to just figure it out by myself. After all, it's free software, so if it doesn't work I can just read the censored code.

As the manual explains, ExtreMon is a plugin-based system; plugins can add information to the "coven", read information from it, or both. A typical setup will run several of them; e.g., you'd have the from_collectd plugin (which parses the binary network protocol used by collectd) to get raw data into the coven; you'd run several aggregator plugins (which take that raw data and interpret it, allowing you to express things along the lines of "if the system's load gets above X, set load.status to warning"); and you'd run at least one output plugin so that you can actually see the damn data somewhere.
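
To make the aggregator idea a bit more concrete, here is a minimal sketch of the kind of threshold logic such a plugin might apply. It is plain Python and deliberately does not use ExtreMon's actual plugin API (which isn't shown in this post); the function names, the label layout and the threshold values are all made up for illustration.

```python
# Hypothetical sketch: this is NOT ExtreMon's real plugin API, only the
# "if load gets above X, set load.status to warning" idea in plain Python.

def load_status(load, warning_above=4.0, alert_above=8.0):
    """Map a load average to a status string (thresholds are assumptions)."""
    if load > alert_above:
        return "alert"
    if load > warning_above:
        return "warning"
    return "ok"


def aggregate(records, prefix="com.example.myhost"):
    """Derive status records from raw label -> value records.

    The label layout (prefix + ".load.shortterm") is an assumption made
    for this example, not something taken from the ExtreMon manual.
    """
    derived = {}
    raw_label = prefix + ".load.shortterm"
    if raw_label in records:
        load = float(records[raw_label])
        derived[prefix + ".load.status"] = load_status(load)
    return derived
```

Calling aggregate({"com.example.myhost.load.shortterm": "5.2"}) yields a single derived record that sets com.example.myhost.load.status to "warning". A real aggregator would read its input from the coven and write its result back; how that is wired up is what the manual is for.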

While setting up ExtreMon as is isn't as easy as one would like, I did manage to get it to work. Here's what I had to do.

You will need:

  • A monitor with a FullHD (or better) resolution. Currently, the display frontend of ExtreMon assumes it has a FullHD display at all times. Even if you have a lower resolution. Or a higher one.
  • Python3
  • OpenJDK 6 (or better)

First, we clone the ExtreMon git repository:

git clone https://github.com/m4rienf/ExtreMon.git extremon
cd extremon

There's a README there which explains the bare necessities on getting the coven to work. Read it. Do what it says. It's not wrong. It's not entirely complete, though; it fails to mention that you need to

  • Install CollectD (which is required for its types.db)
  • Configure CollectD to have a line like Hostname "com.example.myhost" rather than the (usual) FQDNLookup true. This is because ExtreMon uses the Java-style reversed hostname, rather than the internet-style FQDN.
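
If the reversed-hostname convention is unfamiliar, the following small sketch shows the transformation. The function names are my own, and myhost.example.com is just a placeholder.

```python
def reverse_hostname(fqdn):
    """Turn an internet-style FQDN into a Java-style reversed name."""
    # Drop surrounding whitespace and a trailing root dot, then reverse
    # the order of the dot-separated labels.
    labels = [label for label in fqdn.strip().rstrip(".").split(".") if label]
    return ".".join(reversed(labels))


def collectd_hostname_line(fqdn):
    """Format the Hostname directive to put in collectd.conf."""
    return 'Hostname "{}"'.format(reverse_hostname(fqdn))


print(collectd_hostname_line("myhost.example.com"))
```

For myhost.example.com this prints the Hostname "com.example.myhost" line mentioned above.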

Make sure the dump.py script outputs something from collectd. You'll know when it shows something not containing "plugin" or "plugins" in the name. If it doesn't, fiddle with the #x3. lines at the top of the from_collectd file until it does. Note that ExtreMon uses inotify to detect whether a plugin has been added to or modified in its plugins directory; so you don't need to do anything special when updating things.
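
If the output scrolls by too fast to eyeball, a filter along the following lines can help. This is only a sketch: it assumes that dump.py prints one name=value record per line, which is my guess at the format and not something documented in this post, so adjust the split to whatever you actually see.

```python
def looks_like_collectd_data(line):
    """True when the record name does not mention "plugin" or "plugins".

    Assumption: each line looks like "name=value". Only the name (the
    part before the first "=") is checked.
    """
    name = line.split("=", 1)[0].strip()
    # "plugin" as a substring also covers "plugins".
    return bool(name) and "plugin" not in name


def filter_lines(lines):
    """Keep only the lines that look like data coming from collectd."""
    return [line.rstrip("\n") for line in lines if looks_like_collectd_data(line)]
```

Feeding the lines printed by dump.py through filter_lines() should leave you with an empty list until collectd data starts arriving.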

Next, we build the java libraries (which we'll need for the display thing later on):

cd java/extremon
mvn install
cd ../client/
mvn install

This will download half the Internet, build some java sources, and drop the precompiled .jar files in your $HOME/.m2/repository.

We'll now build the display frontend. This is maintained in a separate repository:

cd ../..
git clone https://github.com/m4rienf/ExtreMon-Display.git display
cd display
mvn install

This will download the other half of the Internet, and then fail, because Frank forgot to add a few repositories. Patch (and pull request) on GitHub.

With that patch, it will build, but things will still fail when trying to sign a .jar file. I know of four ways to fix that particular problem:

  1. Add your passphrase for your java keystore, in cleartext, to the pom.xml file. This is a terrible idea.
  2. Pass your passphrase to maven, in cleartext, by using some command line flags. This is not much better.
  3. Ensure you use the maven-jarsigner-plugin 1.3.something or above, and figure out how the maven encrypted passphrase store thing works. I failed at that.
  4. Give up on trying to have maven sign your jar file, and do it manually. It's not that hard, after all.

If you're going with 1 through 3, you're on your own. For the last option, however, here's what you do. First, you need a key:

keytool -genkeypair -alias extremontest

After you enter all the information that keytool asks for, it will generate a self-signed code signing certificate called extremontest, valid for 90 days (the keytool default, unless you pass -validity with a number of days). Producing a code signing certificate with longer validity and/or one which is signed by an actual CA is left as an exercise to the reader.

Now, we will sign the .jar file:

jarsigner target/extremon-console-1.0-SNAPSHOT.jar extremontest

There. Who needs help from the internet to sign a .jar file? Well, apart from this blog post, of course.

You will now want to copy your freshly-signed .jar file to a location served by HTTPS. Yes, HTTPS, not HTTP; ExtreMon-Display will fail on plain HTTP sites.

Download this SVG file, and open it in an editor. Find all references to be.grep as well as those to barbershop and replace them with your own prefix and hostname. Store it along with the .jar file in a useful directory.
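
If you'd rather not do the search-and-replace by hand, something like the following sketch will do it. The function names are mine, com.example and myhost are placeholders for your own prefix and hostname, and the sample id in the usage note is made up; do check the result by hand afterwards, since a blind textual replace may also touch things you did not intend to change.

```python
def localize_svg(svg_text, prefix, hostname,
                 old_prefix="be.grep", old_hostname="barbershop"):
    """Replace the example prefix and hostname with your own.

    This is a plain textual replacement; it knows nothing about SVG.
    """
    return svg_text.replace(old_prefix, prefix).replace(old_hostname, hostname)


def localize_svg_file(source, destination, prefix, hostname):
    """Read source, apply localize_svg(), and write the result to destination."""
    with open(source, encoding="utf-8") as infile:
        text = infile.read()
    with open(destination, "w", encoding="utf-8") as outfile:
        outfile.write(localize_svg(text, prefix, hostname))
```

For example, localize_svg('<rect id="be.grep.barbershop.cpu.0.idle"/>', "com.example", "myhost") returns '<rect id="com.example.myhost.cpu.0.idle"/>'.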

Download this JNLP file, and store it in the same location (or you might want to actually open it with "javaws" to see the very basic animated idleness of my system). Open it in an editor, and replace any references to barbershop.grep.be with the location where you've stored your signed .jar file.

Add the chalice_in_http plugin from the plugins directory. Make sure to configure it correctly (by way of its first few comment lines) so that its input and output filters are set up right.

Add the configuration snippet in section 2.1.3 of the manual (or something functionally equivalent) to your webserver's configuration. Make sure to have authentication—chalice_in_http is an input mechanism.

Add the chalice_out_http plugin from the plugins directory. Make sure to configure it correctly (by way of its first few comment lines) so that its input and output filters are set up right.

Add the configuration snippet in section 2.2.1 of the manual (or something functionally equivalent) to your webserver's configuration. Authentication isn't strictly required for the output plugin, but you might wish for it anyway if you care whether the whole internet can see your monitoring.

Now run javaws https://url/x3console.jnlp to start ExtreMon-Display.

At this point, I got stuck for several hours. Whenever I tried to run x3mon, this java webstart thing would tell me simply that things failed. When clicking on the "Details" button, I would find an error message along the lines of "Could not connect (name must not be null)". It would appear that the Java people believe this to be a proper error message for a fairly large number of constraints, all of which are slightly related to TLS connectivity. No, it's not the keystore. No, it's not an API issue, either. Or any of the loads of other rabbit holes that I dug myself into.

Instead, you should simply make sure you have Server Name Indication enabled. If you don't, the defaults in Java will cause it to refuse to even try to talk to your webserver.

The ExtreMon github repository comes with a bunch of extra plugins; some are special-case for the place where I first learned about it (and should therefore probably be considered "examples"), others are general-purpose plugins which implement things like "is the system load within reasonable limits". Be sure to check them out.

Note also that while you'll probably be getting most of your data from CollectD, you don't actually need to do that; you can write your own plugins, completely bypassing collectd. Indeed, the from_collectd thing we talked about earlier is, simply, also a plugin. At $CUSTOMER, for instance, we have one plugin which simply downloads a file every so often and checks it against a checksum, to verify that a particular piece of nonlinear software hasn't gone astray yet again. That doesn't need collectd.
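
The checking part of such a plugin could look roughly like the sketch below. To be clear, this is not the plugin we run at $CUSTOMER and it does not use ExtreMon's plugin API; the function names, the choice of SHA-256 and the "ok"/"alert" status strings are assumptions for the sake of the example.

```python
import hashlib
import urllib.request


def sha256_hex(data):
    """Return the SHA-256 digest of data (bytes) as a hex string."""
    return hashlib.sha256(data).hexdigest()


def file_is_intact(data, expected_sha256):
    """Compare the digest of data against the expected hex digest."""
    return sha256_hex(data) == expected_sha256.strip().lower()


def fetch(url, timeout=10):
    """Download url and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def check(url, expected_sha256):
    """Download a file and report "ok" or "alert" (status names are made up)."""
    try:
        data = fetch(url)
    except OSError:
        # Covers connection failures, timeouts and HTTP errors.
        return "alert"
    return "ok" if file_is_intact(data, expected_sha256) else "alert"
```

A real plugin would call check() every so often and put the result into the coven under a label of its own.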

The example above will get you a small white bar, the width of which is defined by the cpu "idle" statistic, as reported by CollectD. You probably want more. The manual (chapter 4, specifically) explains how to do that.

Unfortunately, in order for things to work right, you need to pretty much manually create an SVG file with a fairly strict structure. This is the one thing which Frank tells me is a dead end and needs to be pretty much rewritten. If you don't feel like spending several days manually drawing a schematic representation of your network, you probably want to wait until Frank's finished. If you don't mind, or if you're like me and you're impatient, you'll be happy to know that you can use inkscape to make the SVG file. You'll just have to use the dialog behind Ctrl+Shift+X (the XML editor). A lot.

Once you've done that though, you can see when your server is down. Like, now. Before your customers call you.