Normally a Linux daemon should run flawlessly, and if it doesn't, you should try to fix the problem at its root. Sometimes, however, that is not possible: either you cannot find the cause or you don't have the time to look for it, yet the service has to keep working despite the crashes. I ran into this situation with an Exim setup and eventually came up with a small script that I run from crontab every five minutes to make sure Exim keeps running.
For a quick start, here's the script:
#!/bin/sh
COMMAND="/etc/init.d/exim restart"
DAEMON=exim3
PORT=25
OUTPUT=/var/log/exim/exim_errors.log
CDATE=`date "+%Y-%m-%d %H:%M:%S"`
# Note: PATH must be set so start-stop-daemon is included!
# Otherwise the exim start-stop script won't find it. :-o
ENV="env -i LANG=C PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin"
PROC_FAILURE=0
TCP_FAILURE=0
# Check 1: is a process named ${DAEMON} in the process list?
if ! (ps -A | grep -i ${DAEMON} > /dev/null 2>&1); then
    echo "${CDATE}: the ${DAEMON} process is not running!" >> ${OUTPUT}
    PROC_FAILURE=1
fi
# Check 2: is anything listening on TCP port ${PORT}?
# Note: netstat's stderr is sent to the null device because of a netstat bug
# (it sometimes prints "warning, got bogus <msg> line" messages on stderr).
if ! (netstat --listening --numeric-ports --numeric-hosts --tcp 2> /dev/null | grep -iE "[^0-9][0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:${PORT}[^0-9]" > /dev/null 2>&1); then
    echo "${CDATE}: nothing is listening on TCP port ${PORT}!" >> ${OUTPUT}
    TCP_FAILURE=1
fi
(test ${PROC_FAILURE} -gt 0 || test ${TCP_FAILURE} -gt 0) && ${ENV} ${COMMAND} >> ${OUTPUT} 2>&1
exit 0
Save this script in a file, e.g. "${HOME}/bin/exim_monitor.sh", and make it executable. The variables at the start of the script should be adjusted to your environment. Then add the script to your crontab.
For example, running the following command as root appends a line to root's crontab that schedules the monitoring script to run every five minutes:
(crontab -l 2> /dev/null; echo '0,5,10,15,20,25,30,35,40,45,50,55 * * * * ${HOME}/bin/exim_monitor.sh') | crontab -
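One caveat with the command above is that running it twice adds the same line twice. The following is a minimal sketch of a guard against that, not part of the original setup: the function name merge_cron_line is made up for illustration. It reads the existing crontab from standard input and prints it back, appending the new line only if an identical line is not already there.

```shell
#!/bin/sh
# merge_cron_line (hypothetical helper): print the crontab read from stdin,
# followed by the given line unless an identical line is already present.
merge_cron_line() {
    line=$1
    existing=$(cat)
    # Echo the current crontab back unchanged (nothing to echo if it is empty).
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F: fixed string, -x: match the whole line, -q: no output.
    printf '%s\n' "$existing" | grep -Fxq -- "$line" || printf '%s\n' "$line"
}
```

It would be used in place of the subshell, for example:
crontab -l 2> /dev/null | merge_cron_line '0,5,10,15,20,25,30,35,40,45,50,55 * * * * ${HOME}/bin/exim_monitor.sh' | crontab -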
The script does the following:
- It checks for a running process named "${DAEMON}" (specified in the script) and checks whether some process is listening on TCP port "${PORT}". If a check fails, the script logs the failure with a timestamp to the file "${OUTPUT}"; successful checks are not logged.
- If either check fails, the script tries to restart the daemon by running "${COMMAND}" and appends that command's output to "${OUTPUT}" as well.
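The port check hinges on a single regular expression, so it can help to try that expression in isolation. The following is a minimal sketch, not part of the original script: the function name port_in_listing is made up for illustration, and it reads a netstat-style listing from standard input so the pattern can be exercised without touching the live system.

```shell
#!/bin/sh
# port_in_listing (hypothetical helper): succeed if the netstat-style listing
# on stdin contains an IPv4 address followed by ":<port>".
port_in_listing() {
    port=$1
    # Same pattern as in the monitoring script. The trailing [^0-9] keeps
    # port 25 from matching longer port numbers such as 2525 or 250.
    grep -Eq "[^0-9][0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:${port}[^0-9]"
}
```

On a live system it would be fed from netstat, for example:
netstat --listening --numeric-ports --numeric-hosts --tcp 2> /dev/null | port_in_listing 25 || echo "nothing is listening on port 25"
Note that the pattern only matches dotted IPv4 addresses, so a socket that netstat shows only as ":::25" would be reported as not listening.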
Comments
simple
echo quit | telnet localhost smtp | egrep -q '^220 ' || /etc/init.d/exim restart
Not tested, though ))
and if telnet is not installed? ;-)
# echo quit | telnet localhost smtp
sh: telnet: command not found
postfix
Postfix has a master daemon that restarts each of its subsystems after they were used a configured number of times.
i'm not the sysadmin
However, I must admit that your telnet-based solution is the simplest and most elegant. (It works a bit differently from mine, but it achieves the goal I was aiming for.)