Monitoring and automatic restart of Exim after a crash

Normally a Linux daemon should run flawlessly, and if it doesn't, you should try to fix the problem at its root. Sometimes that isn't possible, though: either you cannot find the cause or you don't have the time to look for it, yet the service has to stay up despite the crashes. I ran into this situation with an Exim setup and eventually came up with a small script, run from crontab every five minutes, that makes sure Exim keeps running.

For a quick start, here's the script:
#!/bin/sh

COMMAND="/etc/init.d/exim restart"
DAEMON=exim3
PORT=25
OUTPUT=/var/log/exim/exim_errors.log
CDATE=`date "+%Y-%m-%d %H:%M:%S"`
# Note: PATH must include the directory that contains start-stop-daemon!
#       Otherwise the exim start-stop script won't find it. :-o
ENV="env -i LANG=C PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/bin"

PROC_FAILURE=0
TCP_FAILURE=0

if ! (ps -A | grep -i ${DAEMON} > /dev/null 2>&1); then
  echo "${CDATE}: the ${DAEMON} process is not running!" >> ${OUTPUT}
  PROC_FAILURE=1
fi

# Note: stderr of netstat is redirected to the null device due to a bug in netstat
# (it sometimes prints "warning, got bogus <msg> line" messages on stderr)
if ! (netstat --listening --numeric-ports --numeric-hosts --tcp 2> /dev/null | grep -iE "[^0-9][0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:${PORT}[^0-9]" > /dev/null 2>&1); then
  echo "${CDATE}: nothing is listening on TCP port ${PORT}!" >> ${OUTPUT}
  TCP_FAILURE=1
fi

# Restart the daemon if either check failed; the restart output goes to the log too.
if [ ${PROC_FAILURE} -gt 0 ] || [ ${TCP_FAILURE} -gt 0 ]; then
  ${ENV} ${COMMAND} >> ${OUTPUT} 2>&1
fi

exit 0

Save this script in a file, e.g. "${HOME}/bin/exim_monitor.sh", and make it executable (chmod +x). There are some variables at the start of the script that you should adjust to your environment. Then add the script to your crontab.
For example, running the following command as root adds a new line to root's crontab that schedules the monitoring script to run every five minutes:
(crontab -l 2> /dev/null; echo '0,5,10,15,20,25,30,35,40,45,50,55 * * * * ${HOME}/bin/exim_monitor.sh') | crontab -
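
Note that the crontab command above appends its line every time it is run, so running it twice leaves two identical entries. A minimal sketch of a way around that, assuming a hypothetical helper named with_cron_entry (it is not part of the original script): it reads the current crontab on stdin and appends the entry only if the exact same line is not there yet. The "*/5" step shorthand in the usage comment is an assumption as well; Vixie-style cron implementations accept it, but check your cron's documentation.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original article.
# Reads an existing crontab on stdin and prints it with ENTRY appended,
# unless an identical line is already present.
with_cron_entry() {
  entry=$1
  existing=$(cat)

  # Pass the existing lines through unchanged (skip if the crontab is empty).
  if [ -n "$existing" ]; then
    printf '%s\n' "$existing"
  fi

  # -x matches whole lines only, -F treats the entry as a fixed string,
  # so the "*" characters in the schedule are not read as a pattern.
  if ! printf '%s\n' "$existing" | grep -qxF -- "$entry"; then
    printf '%s\n' "$entry"
  fi
}

# Usage (as root), kept as a comment so that sourcing this file changes nothing:
#   (crontab -l 2> /dev/null) \
#     | with_cron_entry '*/5 * * * * ${HOME}/bin/exim_monitor.sh' \
#     | crontab -
```

Because the helper only transforms text, you can pipe its output to a pager first and inspect the result before handing it to "crontab -".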

The script does the following:
  1. It checks for a running process named "${DAEMON}" (specified in the script) and checks whether some process is listening on TCP port "${PORT}". Each failed check is logged to the file "${OUTPUT}"; successful checks are not logged.
  2. If either of the above checks failed, the script tries to restart the daemon by running "${COMMAND}" and appends the output of that command to "${OUTPUT}" as well.
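
One caveat about the port check in step 1: the grep pattern in the script only matches local addresses in dotted IPv4 form (such as 0.0.0.0:25), so a daemon that listens only on an IPv6 address (shown by netstat as, e.g., :::25) would be reported as a failure. The following is a sketch of a variant that looks at the port part of the local address only. The function name listening_on_port is hypothetical, and the sketch assumes the usual Linux netstat column layout, where the local address is the fourth field.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
# Reads "netstat --listening --numeric-ports --numeric-hosts --tcp" style
# output on stdin and succeeds if some local address ends in ":PORT".
listening_on_port() {
  port=$1
  awk -v port="$port" '
    # Only look at tcp/tcp6 lines; this skips the header lines.
    $1 ~ /^tcp/ {
      # Field 4 is the local address, e.g. 0.0.0.0:25 or :::25.
      # The port is whatever follows the last colon.
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage, kept as a comment; stderr is discarded for the same reason as in
# the original script:
#   if ! netstat --listening --numeric-ports --numeric-hosts --tcp 2> /dev/null \
#        | listening_on_port 25; then
#     echo "nothing is listening on TCP port 25!"
#   fi
```

Comparing the port as a separate field also avoids the need for the [^0-9] guards around it: port 2525 is never mistaken for port 25.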

Comments

simple

echo quit | telnet localhost smtp | egrep -q '^220 ' || /etc/init.d/exim restart

Not tested, though. :))

and if telnet is not installed? ;-)


# echo quit | telnet localhost smtp
sh: telnet: command not found

:)

postfix

On second thought: use Postfix.

Postfix has a master daemon that restarts each of its subsystems after they have been used a configured number of times.

i'm not the sysadmin

Actually I'm not the one who maintains the server ... I'm a "user" here. :) I just came up with an idea for keeping Exim working, but I won't spend who knows how many hours reading Postfix docs to migrate from Exim. :->

However, I must admit: your telnet-based solution is the simplest and most elegant. :) (It works a bit differently from mine, but it meets the goal I was aiming at.)