I have a Linux production server running a number of background daemon processes that I want to monitor. Unfortunately some of these daemons are third-party, so I have no access to their source, and many of them listen on their own FIFO file for control events. In particular, these daemons require graceful shutdowns to ensure overall system integrity, so kill -9 <pid> is not an option.
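To make the FIFO part concrete, here is a minimal sketch in Python of how a control event could be sent to such a daemon without the sender hanging when the daemon is already dead. The function name, the FIFO path and the "stop" event are all made up for illustration; the real daemons each have their own (third-party) control protocol. It relies on the fact that on Linux, opening a FIFO write-only with O_NONBLOCK fails with ENXIO when nothing has it open for reading.

```python
import errno
import os

def send_control_event(fifo_path: str, event: bytes) -> bool:
    """Write one control event to a daemon's FIFO (hypothetical helper).

    Returns False if no process currently has the FIFO open for reading,
    which usually means the daemon is not running.
    """
    try:
        # O_NONBLOCK makes open() fail with ENXIO instead of blocking
        # forever when there is no reader on the other end.
        fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as exc:
        if exc.errno == errno.ENXIO:
            return False
        raise
    try:
        os.write(fd, event)
    finally:
        os.close(fd)
    return True
```

Something like send_control_event("/path/to/daemon.fifo", b"stop\n") would then request a graceful shutdown, assuming the daemon understands that (hypothetical) event.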
Although this environment is very stable, I've had issues when adding new functionality or processes (or when the other environments it interacts with change) that caused one or two of these daemons to crash, which then cascaded into issues with other related daemons. I therefore needed a way to monitor these processes with a watchdog, but had my fingers burnt quite badly after trying daemon tools. Had it not been for the hard drive clone I made, my fingers would have been burnt right off :)
After searching the web for issues relating to daemon tools I came across the following link and think I may be a victim of this in a similar way:
Have any of you experienced this, and if so, what are the best options to take? So far I have started writing my own watchdog utility, which seems to work fine on a dev machine, but I would like some feedback/input from this community before I dare put it live. Is there a preferred place to upload the source code for this tiny watchdog utility so that other members of this site can scrutinize/assess it?
This utility also seems to overcome the double forking issues that other daemon handlers don't seem to deal with, but I could just be naive while testing my own utility.
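To illustrate what I mean by the double forking issue: once a daemon double-forks, it is no longer a child of whatever started it, so the starter cannot waitpid() on it and has to check on it by pid instead. A minimal sketch in Python of such a liveness check using signal 0 (this is not the actual wdog code, and the function name is made up for illustration):

```python
import os

def pid_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (hypothetical helper)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 delivers nothing; the kernel only performs the
        # existence and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # ESRCH: no such process
    except PermissionError:
        return True    # EPERM: process exists but is owned by another user
    return True
```

One caveat with this approach is that pids get reused, so a check like this can report a completely different process as alive; comparing something extra, such as the process start time, would be needed to be sure.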
Although more work can be done on this utility, such as adding a per-process retry interval and limit as well as the ability to start a specific process individually, the utility currently allows an instance of itself to be started to watch the watchdog, and it supports stopping/killing specific processes.
usage: wdog <-000_setup>
       wdog <-000_run> [-cold]=def or [-hot]
       wdog <watch_list_file> [watch_list_file_othr] [-cold] or [-hot]=def
       wdog <-stop_svc> [!] or [*] or [process_name] or [pid=x]
       wdog <-kill> [!] or [*] or [process_name] or [pid=x]
where <watch_list_file> is path of file with list of processes to be watched
  and [watch_list_file_othr] is path to other wdog's <watch_list_file>
  and <-000_setup> is switch for creating 000 setup
  and <-000_run> is switch for running wdog as 000 instance
  and [-cold] is switch for starting wdog in cold mode
  and [-hot] is switch for starting wdog in hot mode
      cold mode => remove tmp data so as NOT to use it.
      hot mode  => keep tmp data so as to use it.
  and where [x] == process id
  and [*] == all processes
  and [!] == all processes including wdog processes.
I made a few changes to this utility to add a start_svc option, which can be used to start a specific process if it has been stopped/killed. I also added fields to the [csv] file format for a retry interval (the length of time in milliseconds between attempts to start the process), a retry limit (the maximum number of times to attempt a restart of the process) and a run flag (1 ==> process should be run, 0 ==> don't run process).
usage: wdog <-000_setup>
       wdog <-000_run> [-cold]=def or [-hot]
       wdog <watch_list_file> [watch_list_file_othr] [-cold] or [-hot]=def
       wdog <-start_svc> [!] or [*] or [process_name]
       wdog <-stop_svc> [!] or [*] or [process_name] or [pid=x]
       wdog <-kill> [!] or [*] or [process_name] or [pid=x]
where <watch_list_file> is path of file with list of processes to be watched
  and [watch_list_file_othr] is path to other wdog's <watch_list_file>
  and <-000_setup> is switch for creating 000 setup
  and <-000_run> is switch for running wdog as 000 instance
  and [-cold] is switch for starting wdog in cold mode
  and [-hot] is switch for starting wdog in hot mode
      cold mode => remove tmp data so as NOT to use it.
      hot mode  => keep tmp data so as to use it.
  and where [x] == process id
  and [*] == all processes
  and [!] == all processes including wdog processes.
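To make the retry fields described above more concrete, here is a rough sketch in Python of how a retry interval, retry limit and run flag could drive a restart loop. This is not the wdog source. The column order (name, command, retry interval in ms, retry limit, run flag), the names WatchEntry, parse_watch_list and start_with_retries, and the '#' comment convention are all assumptions made up for illustration.

```python
import csv
import io
import shlex
import subprocess
import time
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WatchEntry:
    name: str
    command: str
    retry_interval_ms: int   # pause between start attempts
    retry_limit: int         # maximum number of start attempts
    run: bool                # run flag: True for 1, False for 0

def parse_watch_list(text: str) -> List[WatchEntry]:
    """Parse watch-list rows in the assumed 5-column layout."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        name, command, interval, limit, run = (f.strip() for f in row)
        entries.append(
            WatchEntry(name, command, int(interval), int(limit), run == "1")
        )
    return entries

def start_with_retries(entry: WatchEntry,
                       sleep=time.sleep) -> Optional[subprocess.Popen]:
    """Try to start the process, honouring run flag, interval and limit."""
    if not entry.run:
        return None  # run flag 0: leave the process alone
    for attempt in range(1, entry.retry_limit + 1):
        try:
            return subprocess.Popen(shlex.split(entry.command))
        except OSError:
            # Could not exec the command; wait before the next attempt,
            # but do not sleep after the final one.
            if attempt < entry.retry_limit:
                sleep(entry.retry_interval_ms / 1000.0)
    return None  # retry limit reached
```

The sleep function is passed in as a parameter only so the loop can be exercised without actually waiting. Note that this sketch only retries when the command cannot be launched at all; a process that starts and then dies straight away would need the watch loop itself to count restarts against the limit.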