So at work they implemented some 'SMS Squelching' methods that hook into a 'mail > sms' gnokii script to 'squelch' the megaloads of SMSs that can come through from Nagios all at once. It's a bunch of Perl, very much specific to working with a serial-attached Nokia and gnokii. Essentially, whenever an SMS got queued to mail2smsgnokii/gsm's spool, its timestamp was compared with that of the last SMS that got sent, and if the gap between the two was inside a threshold (say 30 minutes), the SMS would not be sent.
This is intended for cases where, say, a server comes under intense load but is still pingable, so the HOST check does not switch to a DOWN state. However, all the services running on the host become unresponsive (NRPE checks timing out due to the load). Suddenly you end up with 15 or more SMSs, one for each service, when really only one was needed to alert you to an issue. It could be any service, so it's not an easy thing to handle with the inbuilt service dependencies (I guess, but I haven't ever really tried to work out the check dependencies in Nagios).
We recently provisioned a Linode as a third-party vector that is designed to mimic a 'regular user' in that there is no deeper layer through VPN tunnels and NRPE etc to perform more low-level diagnostics of servers from Nagios. The Linode Nagios is there to just ping and hit HTTP etc. For one or two sites, there are regex scripts that look for keywords on certain pages. You can read about how I wrote those scripts in this earlier post.
Since this Nagios server is not in our regular Xen farm, we can't attach a Nokia to it. Instead we got some Clickatell credits and to hell with it (yes, if the network is down on the Linode, it won't be able to push out SMS, but it is not the Linode's network we care about, it's the network elsewhere that we're asking it to check)
Clickatell provide a standard API for pushing out SMS via a number of means, one of which is a plain HTTP request that can be performed with programs like curl. I've implemented Nagios setups with Clickatell like this before, but this time I was required to re-implement the 'SMS Squelching' technology that was done with gnokii/Perl elsewhere.
Here's the bash script I wrote to handle it:
#!/bin/bash
#
# Sends an sms out via whatever means (written for use with clickatell api on toot)
#
# Get the mobile number from the first argument
mobile=$1
# Get the message from the second argument
message=$2
# Admin e-mail
admin_email='wakeme @ up . xyz'
# Touch a file with a timestamp of 30 minutes ago
touch -d '-30 minutes' /tmp/timestamp_$mobile
# Squelch message
function send_squelch {
echo -e "SMS to ${mobile} was squelched.\n\n\nMessage follows:\n\n\n${message}" | mail -s 'SMS SQUELCHED' "$admin_email"
}
# Add some sms sending methods
function send_clickatell {
# Let curl URL-encode the number and message so spaces and special characters survive
curl -G "http://api.clickatell.com/http/sendmsg?user=xxxxxxxxx&password=xxxxxxx&api_id=xxxxx" --data-urlencode "to=$mobile" --data-urlencode "text=$message"
}
# Send only if the 30-minutes-ago marker file is newer than the last notification attempt
if [ "/tmp/timestamp_$mobile" -nt "/tmp/last_notification_$mobile" ]; then
send_clickatell
else
send_squelch
fi
# reset the timestamp of the last_notification state
touch /tmp/last_notification_$mobile
So the logic here is that a dummy file gets touched with a timestamp of 30 minutes ago. The file is then compared with a 'last_notification' file's timestamp. If the timestamp file is newer than the last notification file, the SMS is sent. If not, an e-mail is sent to the admins telling them an SMS was meant to be sent but got squelched, with the information in it, so that we still get the record of it but not the SMS.
Then the last notification file is touched again, changing its timestamp, so that even SMS that get squelched reset the notification time. Next time an SMS comes around, the timestamp will get compared with the last time an SMS was attempted, whether or not it actually went out. If 30 minutes have not passed since that last attempt, they'll continue to be squelched.
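To make the effect of that unconditional reset easier to see, here is a minimal sketch of the same decision using plain epoch seconds instead of file timestamps. The `should_send` helper, the 1800-second default and the sample times are all made up for illustration; the real script works on files as shown above.

```shell
#!/bin/bash
# Hypothetical helper: decide whether an SMS should go out.
# $1 = epoch seconds of the last notification attempt (0 if none)
# $2 = current epoch seconds
# $3 = squelch window in seconds (assumed default: 1800 = 30 minutes)
should_send() {
    local last=$1 now=$2 window=${3:-1800}
    if [ $(( now - last )) -gt "$window" ]; then
        echo send
    else
        echo squelch
    fi
}

# Walk through a burst of alerts; every attempt resets the clock,
# mirroring the unconditional touch at the end of the script.
last=0
for now in 10000 10600 12000 14000; do
    echo "t=$now: $(should_send "$last" "$now")"
    last=$now
done
```

Note that the alert at t=12000 is squelched even though more than 30 minutes have passed since the SMS at t=10000, because the squelched attempt at t=10600 reset the clock.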
I put the Clickatell-specific stuff into a function that I can then call in the conditional, so that if I need to implement a similar thing elsewhere with a different provider, I can just add another function for that provider and change the function call in the conditional. No further modifications required. In fact I can already think of ways to abstract it even more, but I only wrote this just now :)
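One way that further abstraction might look, purely as a sketch: pick the backend from a variable rather than editing the conditional. The `SMS_PROVIDER` variable, the `send_sms` wrapper and the `example` provider are hypothetical, and the backends here are stubs that only print what they would do.

```shell
#!/bin/bash
# Hypothetical setting; Clickatell is the only backend the script really has.
provider=${SMS_PROVIDER:-clickatell}

# Stub backends: each one only prints what it would send.
send_clickatell() { echo "clickatell -> $1: $2"; }
send_example()    { echo "example -> $1: $2"; }   # made-up second provider

# Dispatch to the configured backend, failing loudly on an unknown name.
send_sms() {
    case "$provider" in
        clickatell) send_clickatell "$1" "$2" ;;
        example)    send_example "$1" "$2" ;;
        *) echo "unknown SMS provider: $provider" >&2; return 1 ;;
    esac
}

send_sms "0123456789" "Alert : web01 : DOWN"
```

Adding a provider then means adding one function and one `case` branch, and the squelch conditional only ever calls `send_sms`.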
Note I append the mobile number to the files, so that if there are, say, 4 people getting the SMS, there isn't a case of one person getting the SMS but the other three having theirs squelched. In other words, this check is performed per sysadmin.
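The per-recipient behaviour can be tried out safely with a sketch like the one below, which keeps its state in a throwaway directory rather than /tmp and sends nothing. The `check_recipient` helper and the phone numbers are invented for the example, `touch -d` with a relative date assumes GNU touch, and the explicit "file does not exist yet" test is my addition so the first alert is sent regardless of how the shell's `-nt` treats a missing file.

```shell
#!/bin/bash
# Sketch only: state lives in a temporary directory instead of /tmp.
state_dir=$(mktemp -d)

# Hypothetical helper: prints "send" or "squelch" for one recipient,
# then records the attempt, as the script above does.
check_recipient() {
    mobile=$1
    stamp="$state_dir/timestamp_$mobile"
    last="$state_dir/last_notification_$mobile"
    touch -d '30 minutes ago' "$stamp"      # GNU touch syntax
    if [ ! -e "$last" ] || [ "$stamp" -nt "$last" ]; then
        echo send
    else
        echo squelch
    fi
    touch "$last"
}

echo "first admin, first alert:  $(check_recipient 15550001)"
echo "first admin, second alert: $(check_recipient 15550001)"
echo "second admin, first alert: $(check_recipient 15550002)"
rm -rf "$state_dir"
```

Because the file names carry the number, the second admin's first alert is judged only against that admin's own history.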
In Nagios, we define the host and service commands to work with this file above (placed in /usr/local/bin/sendsms) in commands.cfg like so:
# 'notify-by-sms' host command definition
define command{
command_name host-notify-by-sms
command_line /usr/local/bin/sendsms $CONTACTPAGER$ "Alert : $HOSTNAME$ : $HOSTSTATE$ : $HOSTOUTPUT$" > /dev/null
}
# 'notify-by-sms' service command definition
define command{
command_name service-notify-by-sms
command_line /usr/local/bin/sendsms $CONTACTPAGER$ "Alert : $SERVICEDESC$ for $HOSTNAME$ : $SERVICESTATE$ : $SERVICEOUTPUT$" > /dev/null
}
And obviously these commands are defined per (host|service)_notification_commands in the contact definitions.
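For completeness, a contact definition wired to these commands might look roughly like the sketch below. The contact name, alias, pager number, notification options and the `24x7` time period are placeholders I've assumed for illustration, not values from our config; only the two command names come from the definitions above.

```
# Example contact (all values except the command names are placeholders)
define contact{
contact_name sysadmin1
alias Example Sysadmin
host_notification_period 24x7
service_notification_period 24x7
host_notification_options d,u,r
service_notification_options w,u,c,r
host_notification_commands host-notify-by-sms
service_notification_commands service-notify-by-sms
pager 15550001
}
```

The `pager` value is what Nagios substitutes for `$CONTACTPAGER$`, so it becomes the first argument to /usr/local/bin/sendsms.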
Let me state that I'm not altogether a fan of the squelching idea, as I find it dangerous (if one false positive alert comes through that can actually be ignored, but quickly following it is a REAL problem on a different server that is absolutely critical, you won't get the SMS, so you'd better have one eye watching your e-mail at all times!). But if you were wondering how to set up something like this that isn't using your own homegrown sms gateway, this is one way.