Sophie: nagios-check_mk-1.0.37-2mdv2010.0 noarch

nagios-check_mk-1.0.37-2mdv2010.0.noarch.rpm

     Monitoring with check_mk

Mathias Kettner, January 22nd, 2009
mk@mathias-kettner.de
www.mathias-kettner.de
----------------------------------------

1. The basic principle of check_mk

The classical approach for integrating service and host checks into
Nagios is by specifying small external programs ("plugins") to be run
by Nagios at periodic points of time. In case of OS monitoring the
plugin contacts a remote daemon running on the target machine in order
to fetch one item of information. This could be information about
one specific partition, a network interface or simply the current
memory utilisation. The most common daemon for monitoring of Linux und
UNIX is the NRPE (Nagios Remote Plugin Executor). Variants are SNMP or
SSH.

check_mk takes a different approach - with some crucial advantages as
we will see later. The basic idea of check_mk is to fetch *all*
information about a target host at once. For each host to be monitored
check_mk is called by Nagios only once per time period (e.g. once per
minute). It contacts a small daemon called "mknagios" on the target
machine, which outputs all relevant information about the host -
regardless of which items are actually being monitored. That daemon
does not get any parameters and does not need to be configured.

check_mk now processes that information and extracts all items that
have been configured for monitoring and checks them against configured
levels (the configuration is done in one file main.mk on the
Nagios server). It then sends the check results (OK/WARNING/CRITICAL
plus performance data, if any) to Nagios via passive service
checks. Nagios just processes these passive service checks just like
active ones.

One of the main advantages of this approach as shown so far is a
massive cut down of needed CPU ressources - both on the monitoring
host and on the target host. Consider an average number of 100
different services per host. In the classic approach with NRPE each
check period 100 seperate processes have to be started on the Nagios
host, 100 TCP-connections to be built up and down and 100 plugins have
to be executed locally on the target machine. check_mk in contrary
just needs one single plugin to be started, one single TCP connection
to be built up and one plugin to be run locally on the target host.
The estimated speed up is at least a factor of 5 to 10.

This allows us to implement a much larger number of checks per time
period on the monitoring server and at the same time save CPU
ressources on the target machines.


2. Implemention of the plugin and the agent

The plugin check_mk is installed only on the monitoring server itself
and implemented with Python. The local agent on the target hosts is
implemented as shell script and run via inetd or xinetd. That way not
binary code is needed. Currently Linux and Solaris are
supported. Porting the script to other Unices is straight forward.
The linux version of the script has successfully been tested and used
on SLES 9 and SLES 10, both on 32 and 64 bit architecture.



3. Automated inventory

check_mk provides further advantages beyond the actual monitoring.
One key feature is the automated inventory function. It scans one,
several or all hosts for items not yet monitored and automatically
configures them for being monitored. This way new partitions, database
instances, network interfaces, multipath devices, raid arrays and
other items will automatically added to the monitoring and cannot be
forgotten. Also adding one new host or even a list of hosts to the
monitoring environment is very easy. check_mk can do this because of
the nature of the mknagios-agent, which always sends a complete list
of all items of the host the are relevant for monitoring.


4. Automatic creation of Nagios configuration

For all items checked by check_mk the Nagios configuration files
are created automatically on demand. Nagios parameters can
be customized manually by making use of the Nagios template
mechanism.


5. Direct RRD updates

check_mk supports graphical statistics over all checks that produce
values that can be measured. It does this by using PNP4Nagios and
round robin databases. For sake of performance check_mk does *not*
hand over the data for the RRDs to Nagios but directly enters the values
into the RRDs itself. This saves a significant amount of CPU ressources.


6. Service aggregations

check_mk lets you configure "service aggregations". If you use that
feature then for each host H there will be created a host H-summary.
Groups of services in H will be aggregated to one single service in
H-summary. 

This way it is for example possible to group all services beloning to
a specific database instance into one single aggregated service for
that instance in H-summary. The aggregated service is considered to be
CRITICAL, if at least one of the underlying services on H is
CRITICAL. That way it is much easier to get an overview over the
overall state of the hosts. The aggregated services also can be used
for notifications.


7. Clusters

If you monitor services running on HA-clusters you have the problem
that the monitoring server does not know which of the physical nodes a
service is currently running on. If the cluster does have a service ip
address then you can use that for monitoring. Some types of clusters
do not have a service address, however. 

check_mk supports monitoring such clusters by fetching the data from
all pysical nodes and checking if the specific service is running on
at least one of them. In Nagios such clusters appear as artificial
hosts with no ip address.


8. SNMP monitoring

check_mk also supports the "fetch all at once" principle for
monitoring of networking devices via SNMP - with all the advantages
explained above.  This includes an automatic inventory of Ethernet and
FibreChannel switches for used and unused ports.


9. Integration into Nagios

From Nagios' point of view check_mk is just one plugin next to others.
It is straight forward to mix check_mk based and classical checks on
one Nagios host.