How to add a timeout to a Zabbix alert trigger
I’ve replaced Cacti and Nagios with Zabbix to monitor the company infrastructure. Nagios is nice, but there are some things you can only monitor using agents, as Linux’s SNMP support is just too fragile. It’s also nice having monitoring and charting built into one easy-to-configure web application, as opposed to drowning in config files.
My one pain point was state flapping. A service can go unresponsive for a fraction of a second under heavy load and come right back again. All too frequently my inbox would get swamped with problem…ok…problem…ok…problem…ok messages. By the time I’d logged in, everything had settled down. It’s not a perfect situation, but it doesn’t warrant getting out of bed to repair either. Nagios has a flapping detection function but Zabbix doesn’t.
What I’ve found is that the AVG function can be used to average the status values (“1” or “0”) over a given period, so that an alert only triggers when that average drops below the alert value. For example:
{server01:net.tcp.port[, 80].avg(30)}<1
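will only trigger a second alert if the web server stays responsive for at least 30 seconds between outages, because the trigger stays in the problem state until every check in the last 30 seconds has succeeded. Also: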
{server01:net.tcp.port[, 80].avg(30)}=0
will only trigger an alert if the service stays down for more than 30 seconds at a time. Further, you can combine the two in:
{server01:net.tcp.port[, 80].avg(60)}<0.5
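to only trigger an alert if the service stays down for more than 30 seconds, and only trigger an OK once it has stayed back up for 30 seconds. The average over the 60-second window crosses 0.5 only after roughly half the window has changed state.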
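To see how the three expressions behave on a flapping service, here is a rough Python sketch that replays a list of check results through a sliding average. It is not Zabbix code. It assumes one check every 10 seconds (so avg(30) is modelled as the last 3 samples and avg(60) as the last 6), and the names simulate and count_alerts, along with the sample data, are made up for illustration.

```python
from collections import deque


def simulate(samples, window, is_problem):
    """Return the trigger state ("OK" or "PROBLEM") after each check.

    samples:    check results, 1 = up, 0 = down, one per poll
    window:     how many recent samples are averaged (stand-in for avg(N))
    is_problem: predicate applied to the windowed average
    """
    recent = deque(maxlen=window)  # old samples fall out automatically
    states = []
    for value in samples:
        recent.append(value)
        average = sum(recent) / len(recent)
        states.append("PROBLEM" if is_problem(average) else "OK")
    return states


def count_alerts(states):
    """Count OK -> PROBLEM transitions, i.e. problem mails sent."""
    alerts = 0
    previous = "OK"
    for state in states:
        if state == "PROBLEM" and previous == "OK":
            alerts += 1
        previous = state
    return alerts


# Hypothetical data, assuming one check every 10 seconds.
flapping = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
outage = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

expressions = {
    "no averaging": (1, lambda avg: avg < 1),
    "avg(30)<1": (3, lambda avg: avg < 1),
    "avg(30)=0": (3, lambda avg: avg == 0),
    "avg(60)<0.5": (6, lambda avg: avg < 0.5),
}

for name, (window, is_problem) in expressions.items():
    flap_alerts = count_alerts(simulate(flapping, window, is_problem))
    outage_alerts = count_alerts(simulate(outage, window, is_problem))
    print(f"{name}: flapping={flap_alerts} outage={outage_alerts}")
```

With this made-up flapping sequence, the unaveraged check raises three alerts, avg(30)<1 collapses them into one, and the other two expressions raise none, while the 40-second outage still produces exactly one alert under every expression.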
About this entry
You’re currently reading “How to add a timeout to a Zabbix alert trigger,” an entry on gary’s web sofa
- Published:
- 5.31.10 / 3pm
- Category:
- Internet, Linux, Software, Technology