How to configure 1 minute polling on Observium

1-minute polling for Observium is quite an interesting and useful feature, if you ask me. Especially on high-capacity interfaces such as 25/40/100 Gbit, there are small differences that can be interesting to spot, and you just can’t see those with 5-minute polling. Other interesting use cases are the uplinks and core links in your network.

By default Observium is not set up to do this, and it requires a couple of changes. Luckily it’s not that hard: the configuration settings for the RRDs are a bit hidden, but they are set up in such a way that you can actually change them.

Before we get into the details of how to actually do this, there are a couple of things you’ll need (and want) to know:

Need to know:

  • These changes are all or nothing: everything will be polled on a 1-minute interval, that is sensors, ports, CPUs, memory, everything. It is not possible to poll just a couple of things every minute.
  • You NEED to do this with an empty RRD directory / new installation, as the RRD files need to be recreated with the proper ‘time brackets’ configured in them.
  • I have not tested it on an existing installation, so I don’t know what will happen. You could lose your data, or at least corrupt it. I simply don’t know enough about RRDtool to say.
  • Make sure all your devices finish their polling within 1 minute.
  • It’s NOT, and I repeat NOT, an officially supported thing. Observium (Adam Armstrong) recommends against it.
  • There will be a 5-fold increase in load on your server.
  • Your disk I/O will increase 5 times as well. Keep that in mind if you are, for example, running your RRD directory on an SSD, as it will wear out 5 times as fast.

Configuration
Ok, let’s see how to configure this:

In the default configuration that is deployed when you download Observium, the RRD settings are configured for:

  • 7 days of 5 minutes
  • 62 days of 30 minutes
  • 120 days of 2 hours
  • 4 years of 1 day

Now these settings are perfectly acceptable for me, except the first one of course. We’ll want to change that to 1 minute.

You’ll need to make changes in 2 places:

  • Observium configuration file
  • Crontab

The default settings are stored in the file /opt/observium/includes/defaults.inc.php, where you can look them up.

It looks like this:


// Default Poller Interval (in seconds)
$config['rrd']['step'] = 300;
// 7 days of 5 min 62 days of 30 min 120 days of 2 hour 4 years of 1 day
$config['rrd']['rra'] = "RRA:AVERAGE:0.5:1:2016 RRA:AVERAGE:0.5:6:2976 RRA:AVERAGE:0.5:24:1440 RRA:AVERAGE:0.5:288:1440 ";
$config['rrd']['rra'] .= " RRA:MIN:0.5:6:1440 RRA:MIN:0.5:96:360 RRA:MIN:0.5:288:1440 ";
$config['rrd']['rra'] .= " RRA:MAX:0.5:6:1440 RRA:MAX:0.5:96:360 RRA:MAX:0.5:288:1440 ";
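To see how the comment maps onto the numbers, here is a small Python sketch. It assumes the usual RRDtool reading of an RRA definition (RRA:CF:xff:steps:rows): one row covers step × steps seconds, and the archive keeps that many rows. The function name describe_rras is made up for this illustration and only the AVERAGE archives are shown.

```python
import re

DEFAULT_STEP = 300  # seconds, the value of $config['rrd']['step']
DEFAULT_RRA = (
    "RRA:AVERAGE:0.5:1:2016 RRA:AVERAGE:0.5:6:2976 "
    "RRA:AVERAGE:0.5:24:1440 RRA:AVERAGE:0.5:288:1440"
)


def describe_rras(step, rra_string):
    """Return (cf, resolution in seconds, retention in days) per RRA."""
    result = []
    for cf, steps, rows in re.findall(r"RRA:(\w+):[\d.]+:(\d+):(\d+)", rra_string):
        resolution = step * int(steps)               # seconds covered by one row
        retention = resolution * int(rows) / 86400   # days covered by all rows
        result.append((cf, resolution, retention))
    return result


for cf, resolution, days in describe_rras(DEFAULT_STEP, DEFAULT_RRA):
    print(f"{cf}: one row per {resolution // 60} min, kept for {days:g} days")
```

The last archive works out to 1440 days, which is the "4 years of 1 day" from the comment, rounded up.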

Now you could modify the RRD settings in this file, but that doesn’t really help, because your changes will be lost when the default settings get replaced during an upgrade of your Observium instance. So make the change in the file /opt/observium/config.php instead.

Copy that piece of text into config.php and change 2 things in it:

  • the rrd step size from 300 to 60
  • the first RRA value from RRA:AVERAGE:0.5:1:2016 to RRA:AVERAGE:0.5:1:10080
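One thing worth checking before you recreate your RRDs: as far as I understand RRDtool, the steps value in every RRA is counted in units of the step size, so the archives you leave untouched also shrink by a factor of 5. The Python sketch below is not part of the original procedure. It shows what the edited AVERAGE line covers at a 60 second step, and what it would look like if the remaining archives were scaled to keep their 30 minute / 2 hour / 1 day rows. The helper names coverage and keep_resolution are made up, and the same reasoning would apply to the MIN and MAX lines.

```python
import re

NEW_STEP = 60  # seconds
EDITED = (
    "RRA:AVERAGE:0.5:1:10080 RRA:AVERAGE:0.5:6:2976 "
    "RRA:AVERAGE:0.5:24:1440 RRA:AVERAGE:0.5:288:1440"
)


def coverage(step, rra_string):
    """Return (resolution in minutes, retention in days) for each RRA."""
    out = []
    for steps, rows in re.findall(r"RRA:\w+:[\d.]+:(\d+):(\d+)", rra_string):
        seconds_per_row = step * int(steps)
        out.append((seconds_per_row / 60, seconds_per_row * int(rows) / 86400))
    return out


def keep_resolution(rra_string, factor):
    """Multiply steps-per-row of every RRA except the first (already edited)."""
    parts = []
    for index, rra in enumerate(rra_string.split()):
        name, cf, xff, steps, rows = rra.split(":")
        if index > 0:
            steps = str(int(steps) * factor)
        parts.append(":".join([name, cf, xff, steps, rows]))
    return " ".join(parts)


print(coverage(NEW_STEP, EDITED))          # only the first two changes applied
scaled = keep_resolution(EDITED, 300 // NEW_STEP)
print(scaled)
print(coverage(NEW_STEP, scaled))          # original resolutions restored
```

With only the two changes applied, the second archive holds 6-minute rows for about 12 days instead of 30-minute rows for 62 days. Whether that matters depends on how much history you want to keep.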

Once this has been changed, change your cronjob (/etc/cron.d/observium) to run poller-wrapper / poller every minute instead of every 5 minutes.

That should be it! Enjoy your 1 minute resolution in Observium!


Observium SYSLOG alerting

I maintain an updated version on GitHub.

Observium has had built-in syslog alerting since release r7970. This text explains what it does, how it works, and how you configure it.

Something that was missing in the entity-based alerting system is now fixed with this brand new feature of Observium.

You can now perform realtime alerting, whereas the conventional entity-based alerting only triggers every 5 minutes, because it is poller based.

This new feature is not going to replace the existing entity alerting, as it serves a different purpose: it allows you to catch issues from your devices that are not possible to catch with the existing entity-based alerting system. Examples are duplicate IP/MAC addresses, or OSPF messages.

Syslog alerting allows you to send out notifications from syslog messages that are produced by your devices. You can match on those with regular expressions.

Syslog alerting in Observium integrates perfectly with the existing contact system, so it allows you to notify via the usual channels: e-mail, Slack, PagerDuty, XMPP, webhook, etc.

For a complete overview of transport methods, see: Alerting Transports

How to set it up

First make sure you have configured syslog to integrate with Observium. The documentation for doing this can be found here: Syslog Integration

If you are running r7970 or later you will find 2 new entries in the global menu:

  • Syslog Alerts
  • Syslog Rules

screenshot1

Let’s start by creating a useful syslog alert rule that triggers an alert when a duplicate MAC address is found on a Cisco device:

  • First click on “Syslog Rules” in the global menu
  • Then click on “Add Syslog Rule”

screenshot2

You will then be presented with the following screen, where you have to configure the details of the syslog alert rule:

  • Rule Name: This defines the name of the rule; it is just an administrative reference
  • Message: This is the message that will end up in the notification that is sent out
  • Regular Expression: This is where you configure the actual rule to match syslog content against

screenshot3

Syslog Rules are built using standard PCRE regular expressions.

There are many online resources to help you learn and test regular expressions. Good resources include regex101.com, Debuggex Cheatsheet, regexr.com and Tutorials Point. There are many other sites with examples which can be found online.

A simple rule to match the word “duplicate” could look like:

/duplicate/

A more complex rule to match SSH authentication failures from PAM for the users root or adama might look like:

/pam.+\(sshd:auth\).+failure.+user\=(root|adama)/
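If you want to try the two rules above before putting them into Observium, a quick Python sketch like the one below will do. Keep in mind the assumptions: Python’s re module stands in for PCRE here (close enough for these two patterns, but not identical in general), the helper rule_matches is made up, and the syslog lines are invented samples, not real device output.

```python
import re


def rule_matches(rule, message):
    """Apply a /pattern/ style rule to one syslog message.

    Only the surrounding slashes are stripped; modifiers after the
    closing slash are not handled in this sketch.
    """
    pattern = rule.strip()
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        pattern = pattern[1:-1]
    return re.search(pattern, message) is not None


SIMPLE = "/duplicate/"
PAM = r"/pam.+\(sshd:auth\).+failure.+user\=(root|adama)/"

# Invented sample messages, for illustration only.
samples = [
    "duplicate address 192.0.2.1 seen on Vlan10",
    "Duplicate address 192.0.2.1 seen on Vlan10",
    "pam_unix(sshd:auth): authentication failure; ruser= rhost=192.0.2.10  user=root",
    "pam_unix(sshd:auth): authentication failure; ruser= rhost=192.0.2.10  user=bob",
]

for message in samples:
    print(rule_matches(SIMPLE, message), rule_matches(PAM, message), message)
```

Note that the simple rule is case sensitive as written, so a message starting with “Duplicate” would not match it in this sketch.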

Useful syslog alerts

Here are a couple of alerts you could implement which come in pretty handy:

screenshot4

Sending out notifications

To actually send out notifications, you will have to associate the syslog alert rule with the contact. To do this, edit the contact that you have configured and add the syslog rule association:
screenshot5

Select a syslog alert rule from the drop-down and click “+ Associate”. Once you have done this, the association is complete.

If you associate it to an email contact, the notification will look like this:
screenshot5


Custom SNMP OIDs in Observium, how cool is that! Since release 7175 of Observium Professional, it is possible to have custom SNMP OIDs monitored and graphed. This is quite handy, as writing device support for devices that are not recognized by Observium can be hard: you need to dive into PHP and Observium’s codebase, and the code you turn out might not be as beautiful as the Observium dev team would like to see, so it won’t be merged.

Now some people can live perfectly well with a device being added and showing up as a “Generic Device”, with all the basic functionality of the standard MIBs. This works to a certain extent, of course. But what if you just want to add something weird (SNR or RF levels, for example) that is not supported by Observium out of the box, whether it be a standard graph or an entity type that is not supported? This is now possible!

So here is how it works:

Click on the “Custom OID” menu in the global menu (the one with the globe), and you’ll get an overview of the custom OIDs configured by you.

custom oid menu

Then click on the “Add Custom OID” option in the top right corner to get the input fields:

custom oid input fields

There are a couple of fields that you have to fill in:

Numeric OID: This is the OID in SNMP that you want to monitor
Text OID: Right now this doesn’t do much, but it’s advisable to use the MIB-translated name for this
Value type: Specify whether the value you are retrieving is a GAUGE or a COUNTER (for information on the difference between the two, I advise you to read the RRD documentation)
Description: This is what the name of the entry is in the overview
Display Units: This is the text that will be displayed in the legend of the graph that’s being created
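For the Value type field, the difference comes down to this: a GAUGE is stored as the value you read, while a COUNTER only ever goes up and is turned into a rate from the difference between two samples. The Python sketch below illustrates the idea. It is a simplification of what RRDtool does, and the function names and the 32-bit wrap assumption are mine.

```python
def gauge_value(sample):
    """A GAUGE is stored as-is (for example a temperature or an SNR reading)."""
    return float(sample)


def counter_rate(previous, current, seconds, bits=32):
    """Per-second rate between two COUNTER samples.

    If the counter wrapped around between the two polls, the difference
    is negative, so one full counter range is added back. A 32-bit
    counter is assumed unless told otherwise.
    """
    delta = current - previous
    if delta < 0:
        delta += 2 ** bits
    return delta / seconds


print(gauge_value(37))                        # stored as 37.0
print(counter_rate(1000, 4000, 60))           # 3000 units in 60 s
print(counter_rate(2 ** 32 - 100, 200, 60))   # counter wrapped between polls
```

So if the OID you are adding is something like an octet or packet count, COUNTER is probably what you want; for a level or a percentage, GAUGE.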

After you’re done filling in these fields, submit the form, and you’ll be returned to the overview of custom OIDs. At this point it isn’t doing anything yet.
What you have to do is associate the custom OID with a device of your choosing. This association is pretty nifty, as it allows you to write a custom OID only once and then associate it with an unlimited number of devices:

associate custom oids

Once this is done and the correct OID is being polled, you should be able to see graphs in the Custom Graphs part of the device after about 15 minutes of polling:

graphs


This page is outdated as of Feb 26 2015, as state sensors have been rewritten. I keep an updated document at:
github.com/mgmoerman/docs/blob/master/observium-alert-checkers.md

Observium straight out of the SVN repository (if you bought the subscription) doesn’t come with alert checkers, which is unfortunate, as you need to figure out how the alerting system works by trial and error. The goal of this blog post is to give some examples of generic alert checkers, and to provide some more explanation of Metrics & Attributes and some of the values that go with them. This document is of course not complete, and can always be improved. Please give me feedback to improve it.

Observium has a very powerful way of using entity types & check conditions to do alerting. But you do need to know how this is implemented.

There is some documentation on the Observium site itself, which is useful to read.

Creating an alert checker

Let’s go through the steps that are involved to actually create/add an alert checker in Observium

Entity type

First of all, when you create an alert, you’ll need to pick the ‘entity’ type that you are building the alert for. An entity type is nothing more than a “thing” for which you would like to see alerts.

These are the ones that are available as of 12/12/2014:

  • Device
  • Memory
  • Storage
  • Processor
  • BGP Peer
  • Netscaler vServer
  • Netscaler Service
  • Toner
  • Port
  • Sensor

They kind of speak for themselves: if you want alerts on things that go on with ports, pick Port; if you want something that has to do with a sensor, pick Sensor. Device is a very generic one, and will just give you status on whether the device is up or down, its uptime, and the response time for ping/SNMP. The entity type Device has nothing to do with the ports or sensors on the device itself; for alerting on those, pick Port or Sensor.

Alert Checker details

Once you have picked the entity type, there are a couple more things that need to be filled in, but these are simple: pick a name for the alert, and pick a message you want to be included when an alert is sent out.

Use Alert Delay to set the number of poller runs that the condition of your alert checker should persist before it actually starts alerting. This is useful when, for example, you’re creating a check for processor usage but don’t want to be alerted on every CPU spike. If you set a delay of, say, 2, it’ll take 2 poller runs before it actually alerts (provided the condition you are checking for hasn’t changed, of course).
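The idea behind Alert Delay can be sketched in a few lines of Python. This is my reading of the behaviour, not Observium’s actual code: I assume the alert fires once the condition has held for more than the configured number of consecutive poller runs, and that any passing run resets the count. The exact off-by-one may differ in Observium, and the function name alert_states is made up.

```python
def alert_states(condition_results, delay):
    """Return, per poller run, whether the checker would be alerting.

    condition_results: one boolean per poller run, True when the
    checker condition is met (i.e. something is wrong).
    delay: number of poller runs the condition has to persist first.
    """
    consecutive = 0
    states = []
    for failed in condition_results:
        consecutive = consecutive + 1 if failed else 0  # reset on recovery
        states.append(consecutive > delay)
    return states


# A short CPU spike (2 runs) stays quiet with a delay of 2,
# a longer one starts alerting on the third run.
print(alert_states([True, True, False, True, True, True], delay=2))
```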

The Send Recovery button is self-explanatory, and the Severity is currently not in use.

Checker Conditions

Then we come to the Checker Conditions. This is where you actually implement the check for a specific entity.

It’s important to know what Metrics & Attributes are; see the overview below for a complete list of Metrics & Attributes.

When filling in the fields for Checker Conditions, you use the Metrics mentioned in this page.

Each condition is a single-line entry. You can put as many in there as you want, but you usually have one to check a single condition, or two, for example to check an upper and a lower limit. Use the boolean to choose whether ANY or ALL of these conditions must match.

A single line consists of three values:

  • the actual metric
  • a “test” (le, ge, lt, gt, ne, match and notmatch)
  • a value
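To make the metric / test / value idea concrete, here is a rough Python sketch of how such lines could be evaluated against one entity. It is an illustration only, not how Observium implements it. The assumptions: the entity is a plain dict of metric values, match and notmatch use shell-style wildcards, a value starting with @ refers to another field of the same entity (as in @sensor_limit in the examples further down), and the longer test names used in those examples (equals, notequals, greater, less) are aliases.

```python
import operator
from fnmatch import fnmatchcase

COMPARE = {
    "le": operator.le, "ge": operator.ge, "lt": operator.lt, "gt": operator.gt,
    "ne": operator.ne, "notequals": operator.ne, "equals": operator.eq,
    "greater": operator.gt, "less": operator.lt,
}


def as_number(value):
    """Return value as a float, or None when it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def check_line(entity, line):
    """Evaluate one 'metric test value' line against an entity dict."""
    metric, test, expected = line.split(None, 2)
    actual = entity.get(metric)
    if expected.startswith("@"):            # reference to another field
        expected = entity.get(expected[1:])
    if test in ("match", "notmatch"):
        hit = fnmatchcase(str(actual), str(expected))
        return hit if test == "match" else not hit
    a, e = as_number(actual), as_number(expected)
    if a is not None and e is not None:     # compare numerically when possible
        return COMPARE[test](a, e)
    return COMPARE[test](str(actual), str(expected))


def check_conditions(entity, lines, boolean="ANY"):
    """Combine the per-line results with ANY or ALL."""
    results = [check_line(entity, line) for line in lines]
    return all(results) if boolean == "ALL" else any(results)


port = {"ifAdminStatus": "up", "ifOperStatus": "down"}
print(check_conditions(port, ["ifAdminStatus equals up",
                              "ifOperStatus notequals up"], "ALL"))
```

The port in this example is administratively up but operationally down, so both lines are true and the ALL check matches.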

Associations

In these input fields you create the first association rule, in other words, which subset of the entity type you selected needs alerting based on the conditions specified in the previous pane. When initially creating an alert checker, only 1 association rule is allowed. Once the checker is added, you can add more association rules to it.

These association rules are made up of a “device association” and an “entity association”. In the first input field you do your device matching, based on the attributes for devices. In the second input field you do your entity matching, using the attributes for the entity type you want to associate it with (this can of course be different from the condition you’re checking for).

This works in sort of the same way as the Checker Conditions. It uses the same line method (metric, test, value), with some exceptions:

  • instead of using metrics, you’ll be using attributes
  • you can’t use a device attribute twice in the same association rule, so for example multiple “hostname match bla” statements within the same association rule won’t work
  • for a single device association line, you can have multiple entity association lines

That last exception allows for more specific filtering. For example, you may want to match against all sensor classes (sensor_class) of type “state”, but when that nets you too many results, you can add a match on the description (sensor_descr). Or you may want to match all ports of type (ifType) ethernetCsmacd, but only the ones with a specific description (ifAlias).

Example alerts

If you scrolled down here to just copy/paste some alert-checkers, perfectly fine, but don’t complain if they don’t work, PLEASE read how these work above.

The following is a set of very useful alert checkers:

(Multiple check conditions are separated by “;” here; enter each one on its own line.)

Alert | Entity type | Check Conditions | Check Conditions boolean | Device match | Entity match
Device down | Device | device_status equals 0 | ANY | * | *
Processor usage is above 80% | Processor | processor_usage greater 80 | ALL | * | processor_descr match processor
Memory usage is above 70% | Memory | mempool_perc greater 70 | ALL | * | *
State sensor is in ALERT state! | Sensor | sensor_event equals alert | ANY | * | sensor_class equals state
Fanspeed is above or under threshold | Sensor | sensor_value greater @sensor_limit; sensor_value less @sensor_limit_low | ANY | * | sensor_class equals fanspeed
Temperature is higher than 50 degrees | Sensor | sensor_value gt 50 | ANY | * | sensor_class equals temperature
Traffic exceeds 85% | Port | ifInOctets_perc ge 85; ifOutOctets_perc ge 85 | ANY | * | ifType equals ethernetCsmacd
BGP Session down | BGP Peer | bgpPeerState notequals established | ANY | * | bgpPeerRemoteAs equals 41552
Storage exceeds 85% of disk capacity | Storage | storage_perc ge 85 | ANY | * | storage_type equals hrStorageFixedDisk
Port has encountered errors or discards | Port | ifInErrors_rate gt 1; ifOutErrors_rate gt 1 | ANY | * | ifType equals ethernetCsmacd
Port is enabled, but operationally down | Port | ifAdminStatus equals up; ifOperStatus notequals up | ALL | * | ifType equals ethernetCsmacd

Per-entity overview of Attributes, Metrics and their values (if any)

Device

Metrics Values
device_status 0 = down, 1 = up
device_status_type reason for down, ‘snmp’/’ping’
device_ping response in ms
device_snmp response in ms
device_uptime in seconds
device_duration_poll in seconds
Attributes Values
hostname Self explanatory, this is the hostname for the device
os cisco, asa, junos, linux, printer, generic, etc. For an up-to-date list see /opt/observium/includes/definitions/os.inc.php
type network,server,workstation,storage,voip,firewall
sysName Derived through SNMP
sysDescr Derived through SNMP
sysContact Derived through SNMP
hardware Derived through SNMP
serial Derived through SNMP

Port

Metrics Values
ifInOctets_rate & ifOutOctets_rate number
ifInOctets_perc & ifOutOctets_perc 0-100 percentage
ifInUcastPkts_rate & ifOutUcastPkts_rate number
ifInErrors_rate & ifOutErrors_rate number
rx_ave_pktsize & tx_ave_pktsize
ifOperStatus up/down
ifAdminStatus up/down
ifSpeed interface speed derived through SNMP in mbit
ifMtu number
ifDuplex full/half
Attributes Values
ifSpeed interface speed in a mbit number
ifAlias the interface description
ifDescr Location of the interface, (blade, slot, etc)
ifName
ifType name of the interface type as described by IANA, see www.iana.org/assignments/ianaiftype-mib/ianaiftype-mib
ifPhyAddress MAC address of the interface
port_descr_type
port_descr_descr
port_descr_speed
port_descr_circuit
port_descr_notes

Memory

Metrics Values
mempool_free
mempool_perc 0-100 percentage
mempool_used
Attributes Values
mempool_descr
mempool_mib
mempool_index

Processor

Metrics Values
processor_usage 0-100 percentage
Attributes Values
processor_descr
processor_type
processor_oid

Storage

Metrics Values
storage_free
storage_perc 0-100 percentage
storage_used
Attributes Values
storage_descr
storage_type
storage_mib
storage_index

BGP Peer

Metrics Values
bgpPeerState established
bgpPeerAdminStatus
bgpPeerFsmEstablishedTime
Attributes Values
as_text
bgpPeerRemoteAs
bgpPeerRemoteAddr
bgpPeerLocalAddr
bgpPeerIdentifier

Sensor

Metrics Values
sensor_value number
sensor_event up, warning, alert, down
Attributes Values
sensor_descr
sensor_class voltage, current, power, frequency, humidity, fanspeed, temperature, dbm, state
sensor_type
sensor_index
poller_type possible types: snmp, agent, ipmi

Toner

Metrics Values
toner_current
Attributes Values
toner_descr

Netscaler vServers

Metrics Values
vsvr_state
vsvr_bps_in
vsvr_bps_out
Attributes Values
vsvr_name this matches vsvr_fullname, except when that is longer than 32 chars, in which case it becomes a random string
vsvr_fullname
vsvr_label
vsvr_ip
vsvr_ipv6
vsvr_port
vsvr_type
vsvr_entitytype

Netscaler Services

Metrics Values
svc_state
svc_bps_in
svc_bps_out
Attributes Values
svc_name this matches vsvr_fullname, except when that is longer than 32 chars, in which case it becomes a random string
svc_fullname
svc_label
svc_ip
svc_port
svc_type

Routing issue root cause analysis using the NLNOG/RING

I wrote this bit of text a couple of months ago, and I thought I’d publish it here as well.

To whom it may concern,

Several weeks ago we (eBay Classifieds Group) encountered an issue with some customers from Denmark (more precisely, TDC customers) having trouble reaching websites in the eBay Classifieds Group network. The issue showed up as slow websites and packet loss towards our network. This lasted for some time, but it wasn’t escalated to me in time, so by the time it was, the issue was already gone.

Now I hadn’t been following the connected member list for the NLNOG/RING project, but Job Snijders pointed out to me that TDC does have a RING node!

That weekend, while we had some drinks, Job showed me his cool new tool on the ring. It was even better than what I had pitched to him some moons ago: not just latency monitoring, but the NLNOG/RING project also keeps track of the number of hops and keeps archives of traceroutes. And it presents it all in a very nice interface.

First, I looked up the actual issue at hand:


From 13:40 there is increased jitter and packet loss visible!

So, let’s check out that cool graph that displays number of hops history:


And we see an increase in the number of hops. Now let’s take a look at the actual ‘traceroute’ history:

amp.ring.nlnog.net/trace_detail.php?src=ring-ebayclassifiedsgroup01&dst=ring-tdc01&date=2012-08-01

Take a look at 13:45 on August 1st. Hey… why has the traceroute from ECG towards TDC changed to going over the AMS-IX platform instead of the usual Level3 path? We see the real cause at 14:00: hop number 3 has become Novatel in Bulgaria. Now a quick search in my mailbox reveals this:

Conclusion:

It seems Novatel was connected to the AMS-IX the day before. My idea is that they accidentally leaked their NTT routes via the AMS-IX route servers, and had their NTT link congested by doing so.

If you have any questions regarding the use of the tool, or questions about this article, don’t hesitate to contact me: Maarten Moerman