Monday, June 2, 2008

SysMon: System Monitoring Framework

My old team used to be responsible for looking after a fairly complex IT system that supported a wide range of activities in the company: everything from surveys, logistics, marine, drilling operations, emergency response, pipeline operations and maintenance to things like planning and permitting. The low-res screen capture above may give you an idea of its physical and logical complexity.

The system comprised a large number of Windows, Unix and Linux servers communicating over wired LAN, microwave WAN (for remote sites), GSM/GPRS cellular networks, the internet, satellite links and radio telemetry. In some cases it also made use of unusual protocol stacks such as bi-directional multicasting (for "battle-net"-like technology) and video streaming. The production environment was also replicated, to varying extents, in development, testing, emergency response fall-back (ER) and disaster recovery (DR) environments.

I won't bore you with the details, but the system integrated spatial and non-spatial data, weather station sensors, vehicle, vessel and helicopter tracking, planning, real-time positioning, real-time subsidence monitoring, GPS reference stations and more, so it had quite a large real-time or time-critical component.

We therefore needed some means of periodically monitoring the performance and up-time of the systems and services. We looked at various existing solutions, including Nagios and Big Brother, but in the end we built our own simple, extensible solution in VB6 because we had some specific logical tests that we wanted to run.

The great thing about the solution we came up with was its flexibility. Most tests could be handled straight "out of the box", but custom tests (in the form of scripts and plugins) could be run as well. You could even aggregate a number of sub-tests into overview tests and perform logical query tests on databases.
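To give a flavour of the sub-test / overview-test idea, here is a rough sketch. It is written in Python rather than the original VB6, and every name in it (`Test`, `OverviewTest`, `failing_plugin`) is made up for illustration; it is not the actual SysMon object model. It assumes a test is just something that returns pass or fail, and that an overview test passes only when all of its sub-tests pass.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Test:
    """A single check: any callable that returns True (pass) or False (fail)."""
    name: str
    check: Callable[[], bool]

    def run(self) -> bool:
        # Assumption: a plugin that raises an error counts as a failed test.
        try:
            return bool(self.check())
        except Exception:
            return False


@dataclass
class OverviewTest:
    """Aggregates sub-tests; passes only if every sub-test passes."""
    name: str
    sub_tests: List[Test] = field(default_factory=list)

    def run(self) -> bool:
        # Run every sub-test (no short-circuit) so each one is evaluated.
        results = [t.run() for t in self.sub_tests]
        return all(results)


def failing_plugin() -> bool:
    # Hypothetical stand-in for a custom script that cannot reach its service.
    raise RuntimeError("service unreachable")


# Hypothetical stand-in for a logical query test, e.g. "recent rows exist".
db = Test("database row count", lambda: 3 > 0)
web = Test("web service", failing_plugin)
overview = OverviewTest("tracking system", [db, web])

print(overview.run())  # False, because the web service sub-test fails
```

Treating a crashing plugin as a failure rather than letting it stop the run is a design choice in this sketch, not necessarily how the original behaved.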

Guys in the team could be notified of failures via email or SMS, allowing them to respond rapidly to problems, and at the end of every month we could produce a graphical report for our clients showing system up-time and performance.
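As an illustration of the kind of figure such a monthly report might contain, the Python sketch below turns a list of pass/fail check results into an up-time percentage. It assumes checks run at regular intervals, so the fraction of passed checks approximates up-time; the function name `uptime_percent` and the sample data are hypothetical and not taken from SysMon.

```python
from typing import Sequence


def uptime_percent(results: Sequence[bool]) -> float:
    """Percentage of checks that passed, as a rough proxy for up-time."""
    if not results:
        raise ValueError("no check results to report on")
    passed = sum(1 for ok in results if ok)
    return 100.0 * passed / len(results)


# Made-up sample: four periodic checks, one of which failed.
sample = [True, True, False, True]
print(uptime_percent(sample))  # 75.0
```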

I've included the user guides here so you get the idea:
User Guide | Object Model | Service Engines

One day I'll get around to an open-source .NET version. Stay tuned!