Would you use a nagios based monitoring service?
ffeingol
05-01-03, 12:26 PM
We are currently using Nagios to monitor our servers and a few other servers. We are considering turning this into a service and I'd like to see if there is any interest in it.
If you're not familiar with Nagios, it's an Open Source monitoring tool. It can check virtually anything on your server that is listening on a port (i.e. web, smtp, ssh etc.) as well as monitoring load, partition free space etc. (if you are willing to run a small daemon on your box). There is also a web interface to check on status, add comments, turn off checks etc.
So my basic question is:
If this was offered as a very low priced service, would you use it?
What would you be willing to pay?
We are also considering a "branded" solution to monitor all the servers for hosts that offer dedicated/colocated boxes. The "branded" solution would allow you to have a url like "monitor.you-domain.tld" so your clients do not know that you are using a service.
Thanks for any/all input,
Frank
Chicken
05-01-03, 03:56 PM
Are you thinking of offering this to hosts to check their servers, or to end-users to check their sites, or both? Seems like it mostly checks servers? Far as I can tell, the main problem with any monitoring service (not problem, but weakness rather), is that it has to be run on an extremely redundant network(s). In other words, the monitoring service is only as reliable as the connection to the things that are being monitored. Hope that makes sense (?)
thebyp2
05-01-03, 04:18 PM
so what you're saying there chicken is that the service has to be on a redundant network because if the redundant network was not in place its very redundancy would negate the need for a service?
Chicken
05-01-03, 04:42 PM
Well, what comes to mind are the worthless monitoring services that notify users that their site/server is down, due to network problems between the monitoring server and the site/server being monitored. That needs to be limited as much as possible. One route may be having problems, but that doesn't necessarily mean the site/server is 'down' and the user should be notified.
DizixCom
05-01-03, 04:58 PM
I agree with Chicken. If it isn't a distributed monitor then it isn't worth a dime.
I have a completed design for a distributed system that is partially implemented, but since I'll likely never finish it (I have a tendency to start large projects and never complete them, you should see my house...) I'll explain partly how it works.
So a quick overview. It's based on a chain of command, with three levels of ancestry. The director (level 1) is the management node that takes reports from various field offices (level 2), which take reports from field agents (level 3). The field agents are really just scripts or applications that perform very specific tasks and report the results to their field office, immediately. The field office journals these reports and submits them to the director on a scheduled interval. Since the reports are journaled they will always eventually make it to the director, who then decides whether or not they are stale and takes appropriate action. When the director receives several similar reports from different field offices an event is triggered and another component not in the chain of command kicks in. We call this component the task response force. Based on the information (intelligence?) gathered from the director, the task response force will, well, respond using either reactive or proactive measures, depending on the incident.
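As a rough illustration of the director's correlation step only, here is a minimal Python sketch. It is not the design's actual code: the `Director` class, the `quorum` and `max_age` parameters, and the office and target names are all hypothetical. It assumes "several similar reports" means the same problem on the same target reported by at least `quorum` distinct field offices, and that "stale" means older than `max_age` seconds.

```python
from collections import defaultdict
import time


class Director:
    """Hypothetical level-1 node: correlates reports from field offices."""

    def __init__(self, quorum=2, max_age=300.0):
        self.quorum = quorum      # distinct offices needed to trigger an event
        self.max_age = max_age    # seconds after which a report counts as stale
        # (target, problem) -> {office name: timestamp of its latest report}
        self._seen = defaultdict(dict)

    def receive(self, office, target, problem, timestamp, now=None):
        """Record one journaled report; return True if an event should fire."""
        now = time.time() if now is None else now
        if now - timestamp > self.max_age:
            return False  # journaled report arrived too late to act on

        key = (target, problem)
        self._seen[key][office] = timestamp

        # Keep only reports that are still fresh, then count distinct offices.
        fresh = {o: t for o, t in self._seen[key].items()
                 if now - t <= self.max_age}
        self._seen[key] = fresh
        return len(fresh) >= self.quorum


director = Director(quorum=2)
director.receive("office-a", "web1", "http down", timestamp=1000, now=1010)
if director.receive("office-b", "web1", "http down", timestamp=1008, now=1020):
    print("event: hand web1 / http down to the task response force")
```

Requiring agreement from more than one office is what filters out the single-route network problems mentioned earlier in the thread: one office losing its path to a server never reaches the quorum by itself.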
To define a proactive measure, consider that one of the field agent tasks (FATs) is to watch server logs for possible intrusions. If a report is filed with 100% confidence, the action may be to isolate that server and prevent any further connections by bringing up a firewall requiring physical access for recovery. This could be very dangerous and cause a good deal of downtime, but it might also save your neck.
A reactive measure may simply issue a warning and page the appropriate individual, send an email, whatever.
That's it in a nutshell. I have a lot of the field agents active right now, but they don't report to anyone and they aren't intelligent enough to score a report, so there is no redundancy or confidence level determined. It's ok for my in-house use, but the real deal is where I'd like to end up some day.
If someone with a lot of time on their hands beats me to it, just give me free monitoring for life for the motivation and I'll be happy. :)
ffeingol
05-01-03, 07:55 PM
Chicken et al.,
This service would be more targeted at hosts than at end users. That part is basically a cost thing. After cc fees there would just not be any margin to monitor one website for an end user.
The target market would be more the smaller hosts (although we'd be happy to do large hosts).
At least for now this would not be a redundant/correlated service. Yes, there will be some false alarms. We have chosen to put the monitoring box in a very good data center with excellent network providers.
We are currently monitoring about 35 boxes and 140 services across those boxes. From the feedback I've received from the people that we monitor, we are not sending out a lot of false alarms.
Please keep the comments coming.
If you'd be interested in a month's trial, just drop me a PM and we'll work on getting it set up.
Frank
Are you talking full SNMP monitoring, or just the basics (HTTP, SMTP, PING, POP3, etc)?
ffeingol
05-02-03, 03:45 AM
allan,
It is not SNMP monitoring. But it's not just basic port checking either.
Each of the checks in Nagios is "smart". The web one, for example, issues a GET request and checks that a 200 status code is returned.
For the more "advanced" checks, you have to run a small daemon on your box. That allows Nagios to submit the checks to run directly on your box. There is quite a bit of security built into this daemon (via a config file). It will only talk to the monitoring server (via an IP check) and it will only execute checks that are in its config file.
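To make the "smart" web check concrete, here is a minimal Python sketch of the idea: issue a GET and treat anything other than a 200 as a failure. This is not Nagios's own plugin code; the function names `http_status` and `check_web` and the returned labels are made up for illustration.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Issue a GET request; return the HTTP status code, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None


def check_web(url, fetch=http_status):
    """Return (state, detail); `fetch` can be swapped out for testing."""
    status = fetch(url)
    if status == 200:
        return ("OK", "HTTP 200")
    if status is None:
        return ("CRITICAL", "no response")
    return ("CRITICAL", f"HTTP {status}")
```

A plain port check would report success as long as something accepts the connection; this kind of check also fails when the web server is up but answering with an error page.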
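The two safeguards described here (only answer the monitoring server, only run checks listed in the config) can be sketched in a few lines of Python. This is an illustration of the idea, not the daemon's real code or config format: `ALLOWED_HOSTS`, `COMMANDS`, `handle_request`, the IP address and the commands are all hypothetical. Status 3 is used for refusals because the Nagios plugin convention treats 3 as UNKNOWN.

```python
import subprocess

# Hypothetical config: the only peer allowed to ask for checks.
ALLOWED_HOSTS = {"192.0.2.10"}

# Hypothetical config: check name -> fixed command line. The caller can only
# pick a name from this table; it never supplies the command itself.
COMMANDS = {
    "check_load": ["uptime"],
    "check_disk": ["df", "-P", "/"],
}

UNKNOWN = 3  # Nagios plugin convention for "could not determine state"


def handle_request(peer_ip, check_name, run=subprocess.run):
    """Run a configured check for an allowed peer; return (status, output)."""
    if peer_ip not in ALLOWED_HOSTS:
        return (UNKNOWN, "refused: peer is not the monitoring server")

    argv = COMMANDS.get(check_name)
    if argv is None:
        return (UNKNOWN, f"refused: check {check_name!r} is not configured")

    # argv is a list, so no shell is involved and nothing gets interpolated.
    result = run(argv, capture_output=True, text=True, timeout=30)
    return (result.returncode, result.stdout.strip())
```

Because the monitoring server can only name an entry in the table, a compromised or spoofed request cannot make the box run an arbitrary command.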
If you're willing to run this daemon, then Nagios can monitor your server load, swap space, partition free space, mysql and a bunch of other things.
Another difference with Nagios is that it defines warning and critical levels. Using the web check for example, you can say that a warning level is a 2 second response but critical is 10 seconds. So if the response to the check is < 2 seconds, it's ok. 2 seconds to < 10 seconds is a warning and 10 seconds or more is critical. So you'll get notices that services are running, but running very slow.
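The warning/critical split for a 2 second and 10 second threshold can be written out as a small Python sketch. The function name `classify_response_time` is made up; the numeric states follow the Nagios plugin convention of 0 for OK, 1 for WARNING and 2 for CRITICAL. Treating exactly 2 or exactly 10 seconds as belonging to the higher level is an assumption of this sketch.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin return-code convention


def classify_response_time(seconds, warn=2.0, crit=10.0):
    """Map a measured response time onto OK / WARNING / CRITICAL."""
    if seconds >= crit:
        return CRITICAL
    if seconds >= warn:
        return WARNING
    return OK


for elapsed in (0.4, 3.5, 12.0):
    print(elapsed, classify_response_time(elapsed))
```

With these thresholds a page that still loads, but takes 3.5 seconds, produces a warning instead of passing silently.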
I hope this helps a bit more.
Frank