Network Monitoring
We have designed a custom monitoring system that watches over our network 24/7 to ensure that things run smoothly and that any problems are promptly attended to. It consists of several components that work together as follows:
- A dedicated monitoring server on our network performs the following checks at least once every 5 minutes, and sends a text message to at least two administrators (by pager, cell phone, or PDA) if anything is wrong:
  - Our connections to the outside world are tested by attempting to contact various external networks around the world.
  - All servers, routers, switches, and other networked devices are "pinged" to see if they are on-line.
  - All servers are checked to see if their CPU load has risen above a certain acceptable level (which is hand-tuned for each server).
  - All disk drives in all servers are checked to see if they are close to being full.
  - File servers perform various tests to ensure that disk drives and RAIDs have not failed.
  - All services on all servers are checked to see if they are running. That is, the monitoring system will connect to the SMTP, POP, NNTP, HTTP, SQL, DNS, and other server processes to make sure they are working.
  - Modem lines are checked to see if they are approaching the maximum capacity of dialup users.
  - Multiple temperature sensors in the machine room are read to guard against climate control failure.
  - "All's well" messages are sent to the administrators at certain times of the day to ensure that the monitoring system is working.
- An on-site backup server checks the primary monitoring server every 5 minutes to make sure it is still operating properly and that any text messages are actually being sent. The backup server can send text messages via its own hardware and telephone line.
- Off-site servers located right in the homes of the administrators perform a subset of these tests from outside our network, also sending text messages if any of them fail.
- Every time a server is rebooted for any reason, a text message is sent to the administrators. Should a machine ever restart on its own due to a crash, we'll know about it immediately.
- Individual servers perform self checks every 5 minutes to ensure that any required services are still running, and restart them if they have stopped. They also test the temperature of their CPUs, the speed of their fans, etc.
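
For illustration only, one pass of the check-and-alert cycle described above might be sketched in Python roughly as follows. This is not a listing of the actual system: every function name, threshold, and default here is an assumption made for the example, and `notify` is a hypothetical stand-in for whatever sends the text messages.

```python
import os
import shutil
import socket
from typing import Callable, Dict, List

# Hypothetical constant: the 5-minute cycle mentioned in the text.
CHECK_INTERVAL_SECONDS = 5 * 60


def service_is_up(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (e.g. SMTP on 25)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure, ...
        return False


def disk_nearly_full(path: str, max_used_fraction: float = 0.90) -> bool:
    """Return True if the filesystem holding `path` is at or above the limit."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # pseudo-filesystems can report zero size
        return False
    return usage.used / usage.total >= max_used_fraction


def load_too_high(limit: float) -> bool:
    """Return True if the 1-minute load average exceeds a per-server limit."""
    one_minute, _, _ = os.getloadavg()  # available on Unix-like systems only
    return one_minute > limit


def heartbeat_is_stale(last_heartbeat: float, now: float,
                       grace: float = 2 * CHECK_INTERVAL_SECONDS) -> bool:
    """Backup-server view: has the primary been silent for too long?"""
    return now - last_heartbeat > grace


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check; a check returns True when something is wrong."""
    failures = []
    for name, check in checks.items():
        try:
            if check():
                failures.append(name)
        except Exception as exc:  # a check that crashes is itself a problem
            failures.append(f"{name} (check error: {exc})")
    return failures


def monitor_once(checks: Dict[str, Callable[[], bool]],
                 notify: Callable[[str], None]) -> List[str]:
    """Run all checks once and send a single alert if any of them failed."""
    failures = run_checks(checks)
    if failures:
        notify("PROBLEM: " + "; ".join(failures))
    return failures
```

A caller would build a dictionary of named checks (for example `{"mail SMTP down": lambda: not service_is_up("mail.example.com", 25)}`, where the host name is a placeholder) and call `monitor_once` every `CHECK_INTERVAL_SECONDS`. Treating a crashed check as a failure means a broken probe raises an alert instead of silently reporting that all is well.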