I’ve been looking for a nice, simple method for verifying that all key services and dæmons are running on my UNIX servers. It’s pretty rare for a service to die, but when it does, I want it restarted as soon as possible and I want to be notified about it! I’ve looked at process supervisors like daemontools and runit that are designed to run continuously and monitor the dæmons they start, but they tend to require a little more effort to maintain than I like. Normally, the sysadmin has to write a start-up script that not only sets up the environment and starts the dæmon, but also somehow coerces the dæmon not to do its normal double-fork and disassociate from its parent process and controlling terminal. I mainly just want a way to check for a process that’s not running and run the appropriate command to restart it.
CFEngine 3 seems to be a little closer to what I want, and after reading through the excellent Learning CFEngine 3 book from O’Reilly, I think I’ve finally figured out the right recipe. All I want to do is specify a process to look for and, if that process is not running, a command to run that will restart it. I would also like a report whenever a process needs to be restarted, since that normally represents an abnormal event. Here’s the basic configuration I have to monitor a few services:
    any::
      "services[ssh][name]"       string => "OpenSSH";
      "services[ssh][process]"    string => "/usr/sbin/sshd";
      "services[ssh][restart]"    string => "/usr/sbin/invoke-rc.d ssh restart";

    web|dns::
      "services[ldap][name]"      string => "OpenLDAP";
      "services[ldap][process]"   string => "/usr/sbin/slapd";
      "services[ldap][restart]"   string => "/usr/sbin/invoke-rc.d slapd restart";

      "services[bind][name]"      string => "BIND";
      "services[bind][process]"   string => "/usr/sbin/named";
      "services[bind][restart]"   string => "/usr/sbin/invoke-rc.d bind9 restart";

    web::
      "services[apache][name]"    string => "Apache";
      "services[apache][process]" string => "/usr/sbin/apache2";
      "services[apache][restart]" string => "/usr/sbin/invoke-rc.d apache2 restart";
And that’s it! The above says that everyone must be running OpenSSH, servers web and dns must be running LDAP and BIND, and server web must be running Apache. It also gives the name of the process to look for and the command necessary to restart the process if it’s not running. I just repeat the same three lines for each service that I want to monitor and place the correct classes in front to select which servers will run those services. Here’s the full file to run this:
    body common control
    {
      bundlesequence => { "services" };
      inputs         => { "cfengine_stdlib.cf" };
    }

    bundle agent services
    {
      vars:
        any::
          "services[ssh][name]"       string => "OpenSSH";
          "services[ssh][process]"    string => "/usr/sbin/sshd";
          "services[ssh][restart]"    string => "/usr/sbin/invoke-rc.d ssh restart";

        web|dns::
          "services[ldap][name]"      string => "OpenLDAP";
          "services[ldap][process]"   string => "/usr/sbin/slapd";
          "services[ldap][restart]"   string => "/usr/sbin/invoke-rc.d slapd restart";

          "services[bind][name]"      string => "BIND";
          "services[bind][process]"   string => "/usr/sbin/named";
          "services[bind][restart]"   string => "/usr/sbin/invoke-rc.d bind9 restart";

        web::
          "services[apache][name]"    string => "Apache";
          "services[apache][process]" string => "/usr/sbin/apache2";
          "services[apache][restart]" string => "/usr/sbin/invoke-rc.d apache2 restart";

        any::
          "services" slist => getindices("services");

      processes:
          "$(services[$(services)][process])"
            restart_class => "service_$(services)_restart";

      commands:
          "$(services[$(services)][restart])"
            classes    => if_notkept("service_$(services)_failed"),
            ifvarclass => "service_$(services)_restart";

      reports:
          "$(services[$(services)][name]) is not running, restarting..."
            ifvarclass => "service_$(services)_restart";

          "$(services[$(services)][name]) failed to start!"
            ifvarclass => "service_$(services)_failed";
    }
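Adding another service follows the same three-line pattern. As a sketch, here is what monitoring cron on every host might look like. The binary path and init script name are assumptions based on a Debian-style layout and are not part of my configuration above, so check them against your own system:

```cfengine3
# Hypothetical extra entry for the vars: section, placed with the other entries.
# The path /usr/sbin/cron and the init script name "cron" are assumptions.
any::
  "services[cron][name]"    string => "cron";
  "services[cron][process]" string => "/usr/sbin/cron";
  "services[cron][restart]" string => "/usr/sbin/invoke-rc.d cron restart";
```

Nothing else should need to change, since the key cron is picked up by getindices along with the others.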
All the configuration of each service takes place entirely in the vars: section. The processes: section takes care of searching for each process from the services array and declaring a class, named after the service's key, if the process was not found. The commands: section runs the appropriate command to restart the service if the corresponding class was set by processes:, and sets a second class if that command fails. Normally, CFEngine is silent, but the reports: section will generate output that goes into an email and log file if a process needed to be restarted or if there were any errors restarting it.
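To make the iteration concrete, this is roughly what the three sections amount to for the ssh entry alone once CFEngine has expanded $(services). It's an illustrative unrolling, not something that needs to be written by hand. The bundle name check_ssh is made up, and I've used plain class expressions in place of ifvarclass because the class names are fixed here:

```cfengine3
# Illustrative only: hand-expanded equivalent for the "ssh" key.
# Assumes cfengine_stdlib.cf is loaded, since if_notkept comes from it.
bundle agent check_ssh
{
  processes:
      # Define service_ssh_restart if no matching process is found.
      "/usr/sbin/sshd"
        restart_class => "service_ssh_restart";

  commands:
    service_ssh_restart::
      # Run the restart command; define service_ssh_failed if it does not succeed.
      "/usr/sbin/invoke-rc.d ssh restart"
        classes => if_notkept("service_ssh_failed");

  reports:
    service_ssh_restart::
      "OpenSSH is not running, restarting...";

    service_ssh_failed::
      "OpenSSH failed to start!";
}
```

The generic version produces one such set of promises per key in the services array, which is why adding a service only touches vars:.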
This solution isn’t perfect. Unlike a real process supervisor, CFEngine does not get immediate notification when a process dies, much less learn the cause of death, but it does offer a reasonable response time. My example also doesn’t do any explicit rate limiting or cap the number of attempts to restart a dæmon, although CFEngine does have some amount of built-in rate limiting. Overall, I’ve found this solution simple to maintain and scale, and it’s now live and running on my servers.
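If the built-in limiting turns out not to be enough, one option I haven't tested in production is to attach an action body with ifelapsed to the restart command, so that a given restart is not attempted again until some number of minutes has passed. This is a sketch under that assumption: the body name restart_limit and the 10-minute value are mine, chosen purely for illustration.

```cfengine3
# Hypothetical body: the name restart_limit is illustrative.
body action restart_limit(minutes)
{
  # Minimum number of minutes before this promise may be evaluated again.
  ifelapsed => "$(minutes)";
}

# Sketch of the commands: section from the services bundle with the body attached.
commands:
    "$(services[$(services)][restart])"
      classes    => if_notkept("service_$(services)_failed"),
      action     => restart_limit("10"),
      ifvarclass => "service_$(services)_restart";
```

This only spaces out restart attempts. It does not cap the total number of attempts, so a dæmon that never comes up would still be retried indefinitely.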