Essential Monitoring checks

Published: 20-03-2018 | Author: Remy van Elst


In this article I'll provide a list of checks I consider essential for monitoring and why they are useful. They cover different levels, ranging from your application (health checks), to the operating system (disk usage, load) and hardware (iDRAC, disks, power). Use it as a starting point when setting up your monitoring.

These checks can be set up in many different monitoring systems and with many different technical solutions. One solution might be Nagios/Icinga2 with nrpe/nsca, one might be PRTG with SNMP, or one might be Pingdom. This list is meant for you to look at and think 'this could be useful in our setup', then implement it in your own system.

If you have no monitoring at all, I recommend looking into Icinga2 or Nagios, but that's just personal preference. Read an install guide and check out all my monitoring articles here.


This is quite a lengthy article, so feel free to skip directly to the sections relevant to you.

What to monitor and when to alert?

The short answer is: monitor as much as possible, escalate as few times as possible.

To expand on that, from the perspective of a systems administrator, more metrics and checks are better, just like more logging and more documentation. Checks and monitoring on every level and aspect of your environment help to diagnose issues early on and make sure they don't happen again.

Monitoring helps when doing a post-mortem. To be able to see what failed, why it failed and what other things were affected is the holy grail during incident management. It saves time for you as a sysadmin, and thus downtime for your users. Not only because you can pinpoint problems faster since you don't have to dig everywhere, but also because you can focus on actual work instead of firefighting because everything is always broken.

You however don't want to be alerted for every issue. Many checks are allowed to fail or reach a certain threshold before things get critical. The environments I build are always redundant and highly available, so I don't want to be awakened because one disk somewhere is nearly full. I don't even want to be awakened when two of the three datacenters / Amazon regions are burning down, as long as the application / service is still working. The next business day I do want to look at my dashboards and see that something went wrong, but as long as the actual service was working, don't bother me. That's why I build highly available systems, so that any component can fail without impact.

I do want to know when issues arise, so email alerts for all notifications are fine. I can work on those alerts when I have the time, and implement structural fixes instead of hotfixes.

Whenever a more critical component fails, I'd like to get a text message / Pushover. One example is when a database reports one or more cluster members have failed, or when a cluster file system (like DRBD or GlusterFS) is in a degraded state. That is a failure that the cluster should be able to tolerate, but does require fixing. I can still decide when I fix that, but some alerts have a bit more urgency than just an email.

When the application or service provided to the users fails, I want to be called. When I'm called it's urgent and on fire, and must be fixed now. In the article I'll go into the different levels of alerting per check.

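Summarizing the above, I like to use three levels of alerting: email for every notification, a text message / Pushover when a more critical component fails, and a phone call when the application or service itself fails.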

You can delegate different types of alerting to more junior members of your team. By doing that, they learn about the environment, you have less work to do and your documentation is checked as well.

By decreasing the number of alerts that require direct action, you also create a calmer work environment, resulting in happier sysadmins.
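
The three levels of alerting described in this section could be sketched as a small routing function. This is only an illustration: the `Channel` names and the two boolean flags are assumptions made for this example, not part of any monitoring system.

```python
from enum import Enum


class Channel(Enum):
    """Hypothetical alert channels, ordered from quiet to loud."""
    EMAIL = "email"
    TEXT = "text message / Pushover"
    PHONE = "phone call"


def route_alert(service_down: bool, critical_component: bool) -> Channel:
    """Pick the loudest channel an alert deserves."""
    if service_down:
        # The service to the users is failing: wake someone up.
        return Channel.PHONE
    if critical_component:
        # The cluster tolerates it, but it needs fixing soon.
        return Channel.TEXT
    # Everything else can wait for the next business day.
    return Channel.EMAIL


print(route_alert(service_down=False, critical_component=False).value)
print(route_alert(service_down=False, critical_component=True).value)
print(route_alert(service_down=True, critical_component=True).value)
```

In a real setup this decision usually lives in the notification or escalation rules of the monitoring system itself.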

Metrics

Nagios and many other monitoring systems support saving the history of alerts and metrics. Nagios calls this performance data. Using a third party tool like nagiosgraph or, more modern, Grafana, you can turn these metrics into graphs and dashboards.

Here is an example of the perfdata from the check_http plugin sent to statsd and graphed via Grafana:

In this graph of the past 6 months, you can see 1 extra VIP (virtual IP, the highly available IP of the loadbalancer) being added and this specific cluster having performance issues for about 10 days, as the response time goes from an average of 200 ms to 800 ms.

Metrics and easy to view graphs like these allow you to quickly get an overview and the historical state of your checks. They help to track down 'hunches', but can also confirm that everything has been running smoothly for the past months.

Graphs are especially useful for latency checks, like disk performance, load or response times.
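
To get performance data into a graphing tool it first has to be parsed. Below is a minimal sketch of a parser for the common `label=value[UOM];warn;crit;min;max` perfdata layout. The function name and the sample string are made up for illustration, and only the value and unit are extracted.

```python
import re
import shlex

# A number followed by an optional unit of measurement, e.g. "0.001s" or "11595B".
_VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(.*)$")


def parse_perfdata(perfdata: str) -> dict:
    """Parse space separated perfdata items into {label: {value, unit}}."""
    metrics = {}
    for item in shlex.split(perfdata):  # shlex keeps 'quoted labels' together
        label, _, rest = item.rpartition("=")
        fields = rest.split(";")  # value;warn;crit;min;max
        match = _VALUE_RE.match(fields[0])
        if not label or not match:
            continue  # skip anything that is not a metric
        metrics[label] = {"value": float(match.group(1)), "unit": match.group(2)}
    return metrics


# Illustrative sample in the style of check_http perfdata.
sample = "time=0.001s;;;0.000000 size=11595B;;;0"
print(parse_perfdata(sample))
```

The resulting dictionary can then be handed to whatever client your metrics backend uses.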

Application

By application I mean the software you run on top of your server (or server cluster). This can be a website or a CMS like WordPress, but also an application service like JBoss, custom software or whatever. It is the thing that provides the service to your users, or part of the service (like an API gateway such as Tyk).

Health check

The most important check of them all is the application health check. This is a check that your application itself handles and, based on the result, in most cases answers with an HTTP 200 OK or an HTTP 500 Internal Server Error. The check, which you will probably have to write yourself, checks all the essential things the application needs to work correctly.

In the case of an ERP system it could be the local database, the vendor stock management API and the payment API.

In the case of a media (video) system this could be the backend which houses all the actual content, the DRM service and the customer database.

This check is, in my opinion, one of the few checks that should alert you by phone (or something that wakes you up at night). If this check fails, so will the application and thus the service to your users. If you've correctly set up your highly available redundant cluster, any one part of it can fail without alerting you, but if the application health check fails, you should be woken up. Using the other monitoring you have (which only alerts via email, for example), you're able to quickly find and fix the issue at hand.
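
As a rough sketch of what such a health check could look like inside the application: run one small probe per dependency and answer 200 only when all of them pass. The probe names below follow the ERP example and are placeholders; real probes would actually query the database and call the APIs.

```python
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return 200 only if all of them pass."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing probe counts as a failure
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 500
    return status, results


def broken_payment_api() -> bool:
    """Stand-in for a probe whose dependency is unreachable."""
    raise ConnectionError("payment API unreachable")


# Hypothetical dependencies of the ERP example.
checks = {
    "database": lambda: True,
    "stock_api": lambda: True,
    "payment_api": broken_payment_api,
}
print(health_status(checks))
```

Returning the per-dependency results in the response body makes it much easier to see which part failed when the monitoring system reports the 500.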

HTTP/HTTPS check

In the case of a web application, a check on the HTTP(S) status code. 200 means all is well, anything else probably means something is wrong.

Example output from Nagios:

HTTP OK: HTTP/1.1 200 OK - 11595 bytes in 0.001 second response time

If this check is triggered, your webserver has an issue. Check the logs for more information. Examples I've seen are the application server (like PHP, mod_python or Passenger) not working correctly, or someone making a mistake in a .htaccess file or nginx config.

Brute force login

If your application deals with logins and user accounts, it is great to know when a brute force attack is going on. You must also have brute force protection; read the OWASP Top 10 2017 for more tips on that. Even with protection in place, a high load due to brute force attempts (logins can't be cached and thus hit your database every time) is not a nice thing to have.

Just like the health check, this is one of those checks that you will probably have to write yourself. Combine it with something like fail2ban for automatic blocking after a set number of failed attempts.
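
One way to write such a check yourself is to count failed logins per source address in a sliding time window. The sketch below assumes your application can report each failed attempt with a timestamp; the class name and the thresholds are arbitrary example values.

```python
from collections import defaultdict, deque


class FailedLoginTracker:
    """Flag source addresses with too many failed logins in a time window."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 60.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # source IP -> attempt timestamps

    def record_failure(self, source_ip: str, timestamp: float) -> bool:
        """Record a failed login; return True if the IP is over the limit."""
        attempts = self._failures[source_ip]
        attempts.append(timestamp)
        # Drop attempts that have fallen out of the sliding window.
        while attempts and attempts[0] <= timestamp - self.window_seconds:
            attempts.popleft()
        return len(attempts) > self.max_failures


tracker = FailedLoginTracker(max_failures=3, window_seconds=60)
for second in range(5):
    # 203.0.113.7 is a documentation address, used here as a fake attacker.
    print(second, tracker.record_failure("203.0.113.7", float(second)))
```

The boolean result can drive both the alert and the input for a blocking tool.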

API test

If your application is dependent on external APIs, it's a good idea to check the endpoints. Try to make a check that mimics something your application would do. Alert when it fails: your external provider might have an issue, or might have changed its API. This will help you diagnose issues quickly, before your application fails.

This check could be integrated in the application health check. I know of a television station which checks all their external content provider APIs this way, so they know when the providers fail to meet their SLAs.

Certificate expiry

Again just for web applications, check when your certificates expire. Example output from Nagios:

OK - Certificate 'www.raymii.nl' will expire on Fri 29 Jun 2018 12:59:00 AM CEST.

Let this check alert a month in advance to cover for vacation or validation issues with the certificate. If you use Let's Encrypt, make it 7 days or the like, since that's more automatic.

If you have an Extended Validation certificate, let it alert 2 to 3 months ahead to cover any validation issues.
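
The different lead times can be expressed as a single threshold parameter. The sketch below assumes the expiry date has already been read from the certificate; the function name and output wording are made up and do not mirror any existing plugin.

```python
from datetime import datetime, timedelta, timezone


def certificate_status(not_after: datetime, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate by the time left until it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "CRITICAL - certificate has expired"
    if remaining <= timedelta(days=warn_days):
        return f"WARNING - certificate expires in {remaining.days} days"
    return f"OK - certificate expires in {remaining.days} days"


# Example dates, chosen for illustration only.
expiry = datetime(2018, 6, 29, tzinfo=timezone.utc)
today = datetime(2018, 6, 10, tzinfo=timezone.utc)

print(certificate_status(expiry, today))               # default: a month ahead
print(certificate_status(expiry, today, warn_days=7))  # Let's Encrypt style
print(certificate_status(expiry, today, warn_days=90)) # Extended Validation style
```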

Application functional tests

This one falls in the same category as the health check. You can have many of these with lower alerting levels. Such a check exercises one thing your application does. For a webshop, this could be placing an order and checking if the confirmation email is received. And of course removing that order afterwards.

For a certificate authority, this could be requesting a certificate and checking the serial and expiry date of the given certificate.

For a cloud provider, this could be creating a VM, attaching an extra disk, checking if the extra disk works, if the IP pings, and then removing the VM. This way multiple parts of the system are touched in the same way users would touch them. Not all problems are caught this way. For example, if you have multiple hypervisors, one could have an issue while the check creates the VM on a hypervisor without issues, but that should be caught by your detailed monitoring of that specific hypervisor.

Here is some example output of one of the checks my current employer runs on their hypervisor platform:

OK: All tasks completed: Created instance with uuid $uuid: Pinging the server on ip $ipaddr successfull for 20 times after 34 attempts: SSH connection to $ipaddr successfull: Created volume with uuid $uuid: Attached volume successfully: Volume mounted in instance successfully: Deleted instance: Deleted volume

When the check fails, for example when the volume service (cinder) scheduler had an issue:

ERROR: Created instance with uuid $uuid: Pinging the server on ip $ipaddr successfull for 20 times after 34 attempts: SSH connection to $ipaddr successfull: Created volume with uuid $uuid: Task attachVolume failed: Timeout waiting for $uuid to transition to in-use: Deleted instance: Deleted volume

Another functional check we have is how many usable public IPs are left. If that falls below a certain number, that can be the cause of a non-working service (as in, your VM has no IP):

OK: 8924 Usable Public IP's

When a certain threshold is reached:

WARNING: 800 Usable Public IP's

That way we know when it's time to add a new subnet to Neutron.

A functional check can also include a latency check, for example, if you run a file service, that download speeds on files do not drop below X. In that case, the definition of a working service requires speed X, and the functional test can help you with that.

As you can see, functional checks can be as wide and varied as you like. They are valuable because they 'hit' multiple levels of your stack and can help diagnose problems earlier. They are not the holy grail, because they are generic. In the case of the VM creation check, we can see that somewhere in the volume part there is a failure. Dedicated monitoring of that service showed that the scheduler had stopped, but if we didn't have that monitoring, finding the actual cause would be difficult. That is not to say that it is not a valuable check, but it should not be the only one to rely on.

Services

By services I mean applications providing a service on top of the operating system. For example, Active Directory, MSSQL, your database, web server, load balancer or cluster software. Anything that is not directly the hardware or operating system, but also not your application or service itself.

The lines are a bit blurry: is a process check OS or service? I list those under OS, but they could just as well fall under services. If your service is providing a database, then that would fall under 'Application' above, but if the database is a component of your stack, it falls under services.

I'll list a few of the checks I often use below, but of course you are not limited to just those. As with this entire article, take what you need and adapt it to your environment.

haproxy backends and frontends

My loadbalancer of choice, because of the features, stability and monitoring / statistics. Read this article for a nice overview of why haproxy is better most of the time.

In haproxy you define frontends and backends. A frontend can be the point of entry for your website, the backends can be multiple webservers. The frontend can also be a TCP port and the backend a collection of redis/mysql servers. This check, applicable to all load balancers, shows you which frontends and backends are down (as reported by haproxy):

Check haproxy OK - checked 10 proxies.

Failure:

Check haproxy CRITICAL - server: appserver:app1 is MAINT (check status: layer 4 check OK): server: mysql:mysql_master is DOWN (check status: connection error): server: mysql:mysql_backup is DOWN (check status: connection error): BACKEND: mysql is DOWN:

This check is not useful when you have an active / passive setup, because then it will always alert since the passive systems are down. A two node redis cluster with haproxy in front as the entry point will have one active (MASTER) node and one down (SLAVE) node. The redis sentinels do a haproxy API call to switch the backend when required, but two redis servers are never up at the same time (the slave is read-only until promoted to master, then the old master drops all its data and resyncs with the new master, as a slave).

I've written a wrapper around this check that only escalates an alert when all backends are down. Otherwise alerting is done via email, and only for DOWN and not for MAINTENANCE, since maintenance is deliberate.
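
The idea behind such a wrapper can be sketched as follows. This is not the wrapper mentioned above, just one possible reading of it: a backend is escalated only when every server in it that is not in maintenance is DOWN. The input is assumed to be a list of (backend, server, status) tuples already extracted from the haproxy statistics.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def backends_to_escalate(servers: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Return backends in which every non-maintenance server is DOWN."""
    statuses = defaultdict(list)
    for backend, _server, status in servers:
        if status == "MAINT":
            continue  # maintenance is deliberate, never escalate on it
        statuses[backend].append(status)
    return sorted(
        backend
        for backend, seen in statuses.items()
        if all(status == "DOWN" for status in seen)
    )


# Example data loosely based on the failure output shown above.
servers = [
    ("appserver", "app1", "MAINT"),
    ("appserver", "app2", "UP"),
    ("mysql", "mysql_master", "DOWN"),
    ("mysql", "mysql_backup", "DOWN"),
    ("redis", "redis1", "UP"),
    ("redis", "redis2", "DOWN"),  # passive node, expected to be down
]
print(backends_to_escalate(servers))
```

With this rule the active / passive redis pair stays quiet as long as one node is up, while the mysql backend with both servers down is escalated.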

DNS

If you run your own local resolvers, your servers are dependent on them, so they must return correct results. Monitor not only the service itself, but also whether it returns correct results. Using this Nagios check, we specify the hostname and the IPs expected in return. This check works with all record types (A, AAAA, MX, TXT etc.) and can thus also be used to check if your MX records are still there and such. Very useful in an environment where multiple people do DNS changes. Sometimes errors are made, like forgetting that BIND uses ; as a comment character instead of #, and you don't want to have two hours without DNS. Response time can also be alerted on. Example output:

DNS OK: 0.007 seconds response time. identity.stack.cloudvps.com returns 89.31.101.74, 89.31.101.76

Alerting can be email only. It can happen that the alert is a false positive, because of actual DNS changes.

MySQL, Galera (or Postgres)

The database of choice for many people, as a single server, with master-slave or master-master replication, or as a full-blown Galera cluster. To monitor it, I often create a separate user (named monitoring). Example output of the Nagios plugin is listed below:

Uptime: 1381613 Threads: 4 Questions: 4459389 Slow queries: 16 Opens: 12935 Flush tables: 1 Open tables: 400 Queries per second avg: 3.227

This check put in a graph looks like this. Useful to have when a new version of your application is deployed and there is a tenfold increase in queries and cache misses.

If you run a replication or cluster setup, it is important to check that as well. The replication check I use looks like this:

OK - Waiting for master to send event, replicating host 10.0.0.100:3306

Failure:

CRITICAL - Replication problem: Slave IO not running!

For Galera I use this check, which alerts if there are problems with the number of nodes in the cluster.

In my use cases, I don't let this check escalate, since the application health check catches this as well and that check already escalates.

Corosync

Corosync (and pacemaker) is the cluster resource manager. This service ensures that clustered resources and resource groups only run on the correct nodes and handles failover when required. It is one of the most important parts of a cluster setup. It can cluster almost anything, but I most often use it for DRBD (filesystem), Postgres (database) and highly available OpenStack instances with Masakari.

We have different checks on corosync. The generic one checks the status of all rings, whether stonith is enabled, whether maintenance mode is disabled and whether there are no failed actions:

OK - ring 0: Ok, ring 1: Ok, failed: Ok, stonith: Ok, maintenance: Ok
OK:     status  = ring 0 active with no faults
OK:     status  = ring 1 active with no faults
OK: CRM master
OK: No failed actions present...
OK: Stonith is enabled...
OK: Maintenance Mode is inactive...

It alerts with a warning when maintenance mode is enabled:

WARN - ring 0: Ok, ring 1: Ok, failed: Ok, stonith: WARN, maintenance: WARN
OK:     status  = ring 0 active with no faults
OK:     status  = ring 1 active with no faults
OK: No failed actions present...
WARNING: Stonith is disabled...
WARNING: Maintenance Mode is active...

When the cluster starts failing, it will escalate:

CRIT - ring 0: Ok, ring 1: Ok, failed: CRIT, stonith: CRIT, maintenance: CRIT
OK:     status  = ring 0 active with no faults
OK:     status  = ring 1 active with no faults
CRITICAL: could not connect to CRM...
CRITICAL: could not connect to CRM...
CRITICAL: could not connect to CRM...

One extra check I enable specifically to catch maintenance errors is a check on moved resources:

OK: Manual move is inactive...

With crm resource move, if you leave out the expiry time, the resource will never run on that node until it is cleared/unmoved (manpage). If you forget to clear a resource and the other nodes fail, the cluster will not move it there even when this node is healthy, resulting in an outage. This check reminds you that you have uncleared resources, so that when you've finished your maintenance you don't forget to clear the resource.

Application server (php-fpm, passenger, unicorn etc)

When running web applications written in a scripting language like Python (Django), Ruby (Rails) or PHP, there is often an application server involved. Often this is php-fpm for PHP or Passenger for Ruby on Rails and Python. For Passenger there is a nice check here that reports user data as well.

If for whatever reason your application server fails, your application health check will go off and it will be escalated. If you have a loadbalanced setup, the other servers will take over and your application health check should not alert or escalate, but you still want to know that this one server is having problems. Most often, when this check goes off the haproxy check described above will go off as well.

GlusterFS

GlusterFS is a clustered file system. It allows multiple servers to share the same filesystem via NFS, SMB or its own filesystem module. GlusterFS is useful when your application is not cluster-ready or is heavily dependent on local files, or when you have to provide highly available file sharing services. It is a tad slower than NFS with DRBD, but provides more features (like bitrot detection, geo-replication and advanced cluster controls).

This plugin for recent versions of GlusterFS (> 3.7) and this plugin for older versions are both good. The number of bricks, daemon status, volume status, disk space and healing status are checked.

I don't let this check escalate, since the application health check will also escalate when its filesystem disappears.

nfs/drbd

NFS and DRBD, combined with Corosync, is in my opinion the only fast shared filesystem. It's simple, stable and many operating systems talk NFS. DRBD is 'RAID 1' over the network. A three node setup has always been a bit more work to set up, but with version 9 this is now very easy and built in.

The check checks if the device is connected and in sync/consistent. Output:

DRBD OK: Device 0 Connected UpToDate

When doing a resync, for example after a resize:

DRBD WARNING: Device 0 SyncSource UpToDate

Failure:

DRBD CRITICAL: Device 0 SyncTarget Inconsistent
DRBD CRITICAL: Device 0 WFConnection UpToDate

I don't let this check escalate, since the application health check will also escalate when its filesystem disappears.

Redis

Redis is a fast, in-memory key-value store. It can be clustered, but only 1 node can be the master at any given time. The docs on clustering are excellent. Slaves are read-only, and Sentinels monitor the state of the cluster and can trigger failovers.

My monitoring of redis is limited to this check on the active master and a process check of redis and redis-sentinel on the slaves and sentinels.

Quorum

Any cluster setup has to have a quorum component. Simply explained, the quorum is the number of nodes required to keep the cluster running. If the quorum is not met, the cluster stops functioning to prevent split-brain situations.

The number of voting nodes should be odd, with a minimum of three. With majority voting, a five node cluster has a quorum of three and a three node cluster has a quorum of two. Two node clusters cannot have a meaningful quorum, since if one node fails the other is unable to verify whether it has failed itself or the other node has failed. You can disable the quorum and run a two node cluster, but that carries a high risk of split brain and related issues.

Do note that the above is very much a simplified explanation. Microsoft has a nice article covering the quorum in a Hyper-V failover cluster setup.

If you run any type of cluster, be sure to monitor the quorum, in whichever way possible.
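
As a small worked example, here is the arithmetic under a plain majority-vote model. That model is an assumption made for this sketch; real cluster stacks add tie-breakers, witnesses and weighted votes, so treat it as an illustration only.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True if the reachable nodes still form a majority."""
    return reachable_nodes >= quorum_size(total_nodes)


for nodes in (2, 3, 5):
    tolerated = nodes - quorum_size(nodes)
    print(f"{nodes} nodes: majority is {quorum_size(nodes)}, "
          f"tolerates {tolerated} failed node(s)")
```

Under this model a two node cluster tolerates no failures at all, which is why an odd number of at least three nodes is the usual advice.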

Ceph

Ceph is the best thing for highly available storage since sliced bread. Wikipedia has the best description, so I'm not going to cover it here.

The checks I run on a Ceph cluster are checks of the different components (mon, mds and osd). The OSD check covers basically all the disks underneath Ceph:

Ceph OSD down: 0

Failure:

Ceph OSD down: 8

On the storage servers themselves there also runs a check on the actual OSD devices. When all are available, the check outputs:

DISK OK

Including performance data on how much space is used per OSD:

Label                       Value       Max         Warning     Critical
/var/lib/ceph/osd/ceph-5    640.77 GiB  3.63 TiB    2.72 TiB    3.08 TiB
/var/lib/ceph/osd/ceph-76   602.44 GiB  3.63 TiB    2.72 TiB    3.08 TiB

When a disk fails:

DISK CRITICAL - /var/lib/ceph/osd/ceph-86 is not accessible: Input/output error

The ceph cluster health is also checked. If all is well, not much output:

HEALTH_OK

When a disk has failed, the health check outputs more information that is used in debugging:

HEALTH_WARN 239 pgs degraded: 41 pgs stuck unclean: 239 pgs undersized: 100 requests are blocked > 32 sec: 1 osds have slow requests: recovery 337342/53695956 objects degraded (0.628%): 1/213 in osds are down
pg 2.9dd is stuck unclean for 302.205938, current state active+undersized+degraded, last acting [189, 187]
pg 2.8d5 is stuck unclean for 304.844369, current state active+undersized+degraded, last acting [178, 35]
pg 2.840 is stuck unclean for 304.032332, current state active+undersized+degraded, last acting [189, 214]
pg 2.70b is stuck unclean for 317.764231, current state active+undersized+degraded, last acting [179, 178]
pg 2.6dd is stuck unclean for 331.966147, current state active+undersized+degraded, last acting [197, 8]
pg 2.73b is stuck unclean for 303.225730, current state active+undersized+degraded, last acting [187, 178]
pg 2.5f2 is stuck unclean for 307.264681, current state active+undersized+degraded, last acting [175, 173]
pg 2.5a3 is stuck unclean for 322.817106, current state active+undersized+d

When a disk is replaced and the cluster is rebalancing:

HEALTH_WARN 87 pgs degraded: 1 pgs recovering: 86 pgs recovery_wait: 87 pgs stuck unclean: recovery 445/53359560 objects degraded (0.001%)
pg 2.8d5 is stuck unclean for 421.504447, current state active+recovery_wait+degraded, last acting [168, 178, 35]
HEALTH_WARN 51 requests are blocked > 32 sec: 2 osds have slow requests
30 ops are blocked > 65.536 sec on osd.170
7 ops are blocked > 32.768 sec on osd.170
8 ops are blocked > 65.536 sec on osd.190
6 ops are blocked > 32.768 sec on osd.190
2 osds have slow requests

Since Ceph runs the block devices that other servers use to provide storage and services, if there are major failures, more checks will trigger. If one disk or one storage node fails, Ceph handles it without problems, so no escalation.

Combined with the hardware monitoring (further down in the article), I do know when a disk fails or has bad blocks and thus needs replacement.

Operating System

The operating system is the part that runs your applications and services. It can be Windows, Linux, BSD or anything else; everything can be monitored. The checks below are geared towards Linux and BSD, but the concepts are applicable to other operating systems as well.

Load

Linux load average should be no higher than the number of CPU cores in your system. Load is not the same as CPU usage; this article explains in detail what Linux load average means. The 'number of CPU cores' rule of thumb is what I use for alerting on the 15 minute load average.

When all is well:

OK - load average: 1.51, 1.47, 1.47

Too much on a 2 core system (15 min avg):

CRITICAL - load average: 10.19, 5.10, 3.49

I don't escalate this check, since the application health check and other checks (loadbalancer) will go off as well.
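
The 'number of CPU cores' rule of thumb is easy to express in code. The sketch below compares the 15 minute load average with the core count; the function names and message wording are made up for this example and do not mirror the output of the plugin shown above.

```python
import os


def load_status(load15: float, cpu_cores: int) -> str:
    """Apply the rule of thumb: 15 minute load should not exceed the cores."""
    if load15 > cpu_cores:
        return f"CRITICAL - 15 min load {load15:.2f} exceeds {cpu_cores} cores"
    return f"OK - 15 min load {load15:.2f} within {cpu_cores} cores"


def current_load_status() -> str:
    """Check the machine this runs on (Unix only, uses os.getloadavg)."""
    _load1, _load5, load15 = os.getloadavg()
    return load_status(load15, os.cpu_count() or 1)


# The 15 minute values from the two example outputs above, on 2 cores.
print(load_status(1.47, 2))
print(load_status(3.49, 2))
```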

CPU usage

The percentage of CPU usage and the 7 other states are useful indicators of arising problems. If your machine is using 100% CPU all the time, something is wrong. Incidental bursts are often not a problem, since the capacity is meant to be used. The performance data of this check is useful, for example in virtual environments to catch CPU steal (a hypervisor that is overloaded). This article explains the CPU statistics in detail.

Perfdata captured:

Label   Value
id      61.4%   # CPU idle time
usage   38%     # usage %
us      27.6%   # user space
sy      7.1%    # kernel
si      3.4%    # software interrupts
wa      0.5%    # idle while waiting on I/O
ni      0%      # niced processes
hi      0%      # hardware interrupts
st      0%      # CPU steal, how long the virtual CPU has spent waiting for the hypervisor to service another virtual CPU running on a different virtual machine.

I only alert on high wait or CPU steal on virtual machines, since that is an indication of something wrong with the hypervisor or storage. No escalation, since other checks will escalate if needed.

RAM and swap

High memory usage on your server, just like CPU usage, can be an indication of things going wrong. It is not a reason to alert just on this check, but this is a nice one to graph. Memory leaks are easy to find that way.

Example from a hypervisor that is not yet in use:

Memory: OK Total: 773557 MB - Used: 11197 MB - 1% used

When doing a load test on a hypervisor:

Memory: CRITICAL Total: 386939 MB - Used: 383085 MB - 99% used!

Swap, if enabled, is also checked. When all is well:

OK - Memoryusage within acceptable levels

If too much swap is used:

CRITICAL - Memoryusage above critical value of 90%

Generally you don't want your swap to be used, especially on hardware with lots of RAM. On VMs with a limited amount of RAM, swap can be useful, but be sure to set the vm.swappiness sysctl to something low.

Ping

Ping can be twofold. Most of the time you do a ping check from your monitoring server to the instance you are monitoring. This is good for measuring latency and checking if the host is up or not:

PING OK - Packet loss = 0%, RTA = 3.41 ms

Issues arise when there is lots of latency or packet loss:

PING CRITICAL - Packet loss = 100%
PING CRITICAL - Packet loss = 80%, RTA = 5.80 ms
PING CRITICAL - Packet loss = 0%, RTA = 211.72 ms

Ping checks are used by default in Nagios to check if a host is up or not. If a ping check fails I let that send a text message / Pushover. Not critical, but it still requires more attention than just an email. Do make sure to rate limit your texts, otherwise a datacenter failure results in hundreds of texts.

Ping can also be used inside of a cluster. Most of my highly available cluster hosts have a ping check to the load balancer and router of that cluster, to measure internal network latency. Those checks do not alert, but are just graphed into dashboards to get an overview of latency.

TCP Sockets

Linux TCP network connections go through different states. The most often found states are LISTEN, CLOSED and ESTABLISHED, but while connections are being opened and closed several other states can appear. It's not bad to have connections in those states, but having lots of connections in one particular state can be a sign of congestion or of something that requires performance tuning.

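TCP Socket state flow

When a connection is opened, the client sends a SYN and enters SYN-SENT, the server answers with a SYN+ACK and enters SYN-RECEIVED, and after the client's ACK both sides are ESTABLISHED.

When the client wants to terminate the connection, it sends a FIN and enters FIN-WAIT-1. The server acknowledges it and enters CLOSE-WAIT, after which the client moves to FIN-WAIT-2.

The server then has to terminate the connection as well: it sends its own FIN and enters LAST-ACK. The client acknowledges it and enters TIME-WAIT, and the server moves to CLOSED.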

There's also a three-way handshake alternative available, which occurs when the server sends its ACK+FIN in one single packet (the server goes from ESTABLISHED to LAST-ACK in a single step).

This plugin for Nagios works great for monitoring the different socket states. Busy loadbalancers have a higher limit, and alerting is not enabled. This is just one of those checks that helps to get insight into your environment. For example, I found out that one client was doing over 5000 redis connections for a single page load due to a wrong configuration in their application.

This is a nice read on tweaking and tuning servers for high performance, and this article explains how Linux network states work in more detail.

With ss you can list all TCP sockets and, using shell tools, sort and count them by state.

This is a busy server which runs haproxy to redirect everything to redis. Redis is currently also running on that machine.

ss -t -a | awk '{print $1" "$5}' | sort | uniq -c | sort
    [...]
     42 ESTAB 10.0.0.21:6379
     45 ESTAB 127.0.0.1:6378
   2585 TIME-WAIT 10.0.0.102:mysql
  20679 TIME-WAIT 127.0.0.1:6378
  20686 TIME-WAIT 10.0.0.21:6379

ss -t -a | awk '{print $1}' | sort | uniq -c | sort
      1 SYN-SENT
      1 State
     21 LISTEN
    237 ESTAB
  42850 TIME-WAIT
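
The same per-state count can be done in a script when the numbers need to go to a graphing system. The sketch below assumes the text output of `ss -t -a` is already available as a string; the sample output is invented for illustration and shortened.

```python
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Count sockets per TCP state from the text output of `ss -t -a`."""
    states = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":
            continue  # skip blank lines and the header row
        states[fields[0]] += 1  # the first column is the socket state
    return states


# Invented sample output, not taken from a real server.
sample = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    0.0.0.0:ssh          0.0.0.0:*
ESTAB      0      0      10.0.0.21:6379       10.0.0.5:51234
TIME-WAIT  0      0      127.0.0.1:6378       127.0.0.1:40112
TIME-WAIT  0      0      127.0.0.1:6378       127.0.0.1:40114
"""

for state, count in count_tcp_states(sample).most_common():
    print(f"{count:7d} {state}")
```

Unlike the shell pipeline, this skips the header row, so there is no stray 'State' entry in the result.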

Logged in Users

When you use a configuration management system, you rarely have to log in to a server yourself to install or update stuff. Only when troubleshooting an issue, but for that you should have your central log cluster. I'm not a big fan of this check, but there are situations where it is useful: a jumphost, VPN server, remote desktop machine, etc. Example check output:

USERS OK - 0 users currently logged in

The default limit is set to 6, but I have it set to 2. No alerting.

Conntrack

On hypervisors, loadbalancers and routers this is a useful check to have. The iptables firewall keeps track of connections in the conntrack table. If it's full, new connections experience issues.

Example output:

OK conntrack table is 0% full and timeout is 14400

Example output when issues arise:

WARN conntrack table is 14% full and timeout is 14400
INFO: tcp connections for src 192.81.222.236 reached CRIT 271464 > 262144
INFO: tcp connections for dst 10.200.10.9 reached WARN 137531 > 65536
INFO: tcp connections for dst 172.32.173.163 reached WARN 137442 > 65536

More info on conntrack can be found here. This is the check I use.

On dedicated firewall appliances this kind of check can also be useful (Fortigate, Sophos, pfSense etc).

Disk space and inodes

One of the more basic checks, because servers act weird when their disks are full:

DISK OK - free space: / 23150 MB (60% inode=93%): /home/web/domains 30471 MB (74% inode=99%): /glusterfs/xvda3 30472 MB (74% inode=99%):

Warning:

DISK WARNING - free space: / 35606 MB (93% inode=94%): /mnt/export 49504 MB (10% inode=97%):

Make sure to tweak the alert value to your specific setup. By default it's 10%, but on a 4 TB disk that is 400 GB, which can be way too early to alert. I generally set it to 10 GB.

Do note that this check can also go off when you've run out of inodes. Then you have too many files, probably a boatload of small ones (cache, anyone?). There can be lots of disk space left while the inodes (df -i) are all used up. This check checks both space and inodes.
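
A check with an absolute free space threshold plus an inode threshold could be sketched like this. The 10 GiB default follows the value mentioned above (read as GiB here), while the 5% inode threshold and all names are assumptions made for this example.

```python
import os

GIB = 1024 ** 3


def disk_status(free_bytes: int, free_inodes: int, total_inodes: int,
                min_free_bytes: int = 10 * GIB,
                min_free_inode_pct: float = 5.0) -> str:
    """Warn when either free space or free inodes drop below a threshold."""
    problems = []
    if free_bytes < min_free_bytes:
        problems.append(f"only {free_bytes / GIB:.1f} GiB free")
    if total_inodes > 0:  # some filesystems report no inode totals
        inode_pct = 100.0 * free_inodes / total_inodes
        if inode_pct < min_free_inode_pct:
            problems.append(f"only {inode_pct:.1f}% inodes free")
    if problems:
        return "DISK WARNING - " + ", ".join(problems)
    return "DISK OK"


def check_path(path: str) -> str:
    """Check a mounted filesystem (Unix only, uses os.statvfs)."""
    stats = os.statvfs(path)
    return disk_status(stats.f_bavail * stats.f_frsize,
                       stats.f_favail, stats.f_files)


# Plenty of space left, but no inodes: this still has to warn.
print(disk_status(free_bytes=400 * GIB, free_inodes=0, total_inodes=1000))
```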

Disk stats, IO wait and IOPS

check_diskstat, check_iostat and check_iowait are, just like the connection sockets checks, useful to graph and get insight into the performance of your environment. Good to kick your VM provider if their platform has issues with IO performance, especially when you're able to show what stats are normal.

check_diskstat:

OK - summary: 0 io/s, read 0 sectors (0kB/s), write 816 sectors (1kB/s), queue size 0 in 300 seconds

check_iostat:

OK - I/O stats: Transfers/Sec=649.9 Read Requests/Sec=5.7 Write Requests/Sec=644.2 KBytes Read/Sec=264.4 KBytes_Written/Sec=8341.75

check_iowait:

OK - Wait Time Stats: Avg I/O Wait Time (ms)=10.05 Avg Read Wait Time (ms)=11.27 Avg Write Wait Time (ms)=0.50 Avg Service Wait Time (ms)=0.54 Avg CPU Utilization=1.16

No escalating, just graphing.

Process checks

The total number of processes on a machine varies in my experience. A busy machine can have over a thousand running processes, sometimes more depending on the hardware. The average VM has around 300-600 running. Another check to give you more insight into the environment:

Processes running: 10 : Number of context switches last second: 56749
TOTAL PROCS OK: 629 processes

Specific processes

For essential programs on a system, for example apache or nginx on a webserver and mysql on a database server, you want to have a simple check to see if the process is running. This check will not give you any insight into whether the process actually works, functionality wise, it could be as dead as a doornail, but it will make sure you know when it is not running.

PROCS OK: 3 processes with command name 'keepalived'

Zombie processes

Zooming in further on process checks, zombie processes are things you don't want on your system. Check for them:

PROCS OK: 0 processes with STATE = Z

Alert, but do not escalate. Zombie processes are often caused by disk IO issues or failure, so your other checks will go off as well.
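For the curious, counting processes in state Z can be sketched by reading /proc on Linux. This is an illustration under assumptions, not how check_procs does it; the function names are made up for this example.

```python
import glob

# Standard Nagios-style plugin exit codes.
OK, WARNING = 0, 1


def state_from_stat(stat_line):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the last ')' before reading the state.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def zombie_status(stat_lines):
    """Return (exit_code, message) for a list of /proc/<pid>/stat contents."""
    zombies = sum(1 for line in stat_lines if state_from_stat(line) == "Z")
    label, code = ("WARNING", WARNING) if zombies else ("OK", OK)
    return code, f"PROCS {label}: {zombies} processes with STATE = Z"


def read_stat_lines(proc_root="/proc"):
    """Collect the stat file of every process that is currently listed."""
    lines = []
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as handle:
                lines.append(handle.read())
        except OSError:
            # The process exited between listing and reading.
            continue
    return lines
```

`zombie_status(read_stat_lines())` produces a status line in the same shape as the example output above.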

Time (and NTP)

Correct time on your server is important for correct functionality and logging. Make sure to always use NTP for correct timekeeping, plus a check to show when you are drifting (time difference):

OK - NTPd Health is 100% with 3 peer(s).
Thresholds:  Health (60%, 40%): Peers (2, 1)
------------------------------------------------------
Received 100% of the traffic from +192.170.92.1
Received 100% of the traffic from +192.1.254.130
Received 100% of the traffic from *192.1.254.50

If you use TOTP in your application, a time difference of more than 30 seconds can already be an issue, since the generated codes are invalid and your users are not able to log in.

If you work with short-lived certificates, time difference can also cause huge problems, since valid certificates can be considered invalid due to servers not having the correct time.
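The 30 seconds come from the TOTP time step. Codes are derived from a counter that is the Unix time divided by the step size (30 seconds by default in RFC 6238), so two clocks that fall into different steps produce different codes. A minimal sketch of just that arithmetic, with hypothetical function names:

```python
def totp_counter(unix_time, step=30):
    """Time-step counter that TOTP feeds into the HMAC (RFC 6238)."""
    return int(unix_time) // step


def same_window(server_time, client_time, step=30):
    """True when both clocks fall into the same TOTP time step."""
    return totp_counter(server_time, step) == totp_counter(client_time, step)
```

A drift of 30 seconds or more always lands in a different step, and a smaller drift can as well when it crosses a step boundary. Many server implementations accept a code from an adjacent step to soften this, but that depends on the application, so do not count on it.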

Uptime

System uptime, it's not a contest on who has the largest. Highly available system components can all be brought down to be updated or checked without problems for the service provided. Regular updates, newer kernels, disk checks (fsck) and general cruft cleanup are reasons to reboot machines every once in a while. If a machine reboots itself without being instructed, that can be an issue, a sign of hardware failure or a disruption at your VM provider. Check the uptime:

System Uptime - up 458 days, 5 Hours, 43 Minutes

Alert when less than 1 hour or more than 100 days. After a hundred days it's time to apply the updates and do a reboot. Perfect task for your intern or junior member.
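The two limits above can be sketched as follows. The function names are assumptions, and reading /proc/uptime only works on Linux; the message format imitates the example output.

```python
# Standard Nagios-style plugin exit codes.
OK, WARNING = 0, 1


def format_uptime(seconds):
    """Render seconds as days, hours and minutes."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    return f"up {days} days, {hours} Hours, {rest // 60} Minutes"


def uptime_status(seconds, min_seconds=3600, max_days=100):
    """Warn after an unexpected recent reboot or when a reboot is overdue."""
    message = f"System Uptime - {format_uptime(seconds)}"
    if seconds < min_seconds or seconds > max_days * 86400:
        return WARNING, message
    return OK, message


def read_uptime(path="/proc/uptime"):
    """The first field of /proc/uptime is the uptime in seconds."""
    with open(path) as handle:
        return float(handle.read().split()[0])
```

With these limits the 458 day machine from the example output would be in warning state, as would a machine that rebooted ten minutes ago.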

Cronjobs

Cronjobs can be an important part of your system. Maybe you use one to run your billing, to run a queue of some sort or to do backups. Whatever the cronjob is, it can be important to monitor if it ran correctly. There are no ready-made plugins to do this, but what I use is a combination of a logfile and the check_file_age plugin. Let's say I have a cronjob that runs at midnight and takes about two hours. The cronjob must log to a file. I set up check_file_age to alert if that logfile is older than an hour, with a check_period just after the cronjob is finished. That way, the check runs after the cronjob should have finished and only alerts if the logfile has not been updated.
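The logic behind the file age check is small enough to sketch. This is an illustration with made-up function names and message format, not the check_file_age plugin itself; the one hour limit is the value from the example above.

```python
import os
import time

# Standard Nagios-style plugin exit codes.
OK, CRITICAL = 0, 2


def file_age_status(mtime, now, max_age_seconds=3600):
    """Compare a modification time against the current time."""
    age = int(now - mtime)
    if age > max_age_seconds:
        return CRITICAL, f"CRITICAL - logfile is {age} seconds old"
    return OK, f"OK - logfile is {age} seconds old"


def check_logfile(path, max_age_seconds=3600):
    """Check the age of a logfile on disk."""
    return file_age_status(os.path.getmtime(path), time.time(), max_age_seconds)
```

For the midnight job that takes two hours, a check window starting shortly after 02:00 means a healthy run has written to the logfile within the last hour. If the job did not run, or hung early, the logfile is older than that and the check alerts.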

Log parser

Let's say you have an application that logs when something goes wrong, and that something requires action or is a prelude to a larger failure. You might want to monitor that logfile and alert if that line shows up in there.

One of my use cases is to check for authentication failures in an internal application. Not regular authentication where someone enters a password, but iSCSI authentication. One of the pieces of software that places config files sometimes has an error where the passwords are incorrect. The manufacturer is unable to fix this bug so we monitor the iSCSI servers:

OK - There are only 0 instances of "\*\*\*ERROR\*\*\* auth failed" in the last 60 minutes - Warning threshold is 10
OK - There are only 0 instances of "Listen queue overflow" in the last 60 minutes - Warning threshold is 1

Any one of those lines in the log means trouble. Not directly, but every two hours when the software restarts.

This plugin is the one I use.
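To show the idea behind such a check, here is a minimal sketch that counts matching lines within a time window. It is not the plugin linked above. It assumes every log line starts with an ISO 8601 timestamp followed by a space, which your logs may not do, and the function names are made up.

```python
from datetime import datetime, timedelta

# Standard Nagios-style plugin exit codes.
OK, WARNING = 0, 1


def count_recent_matches(lines, pattern, now, window_minutes=60):
    """Count lines containing pattern that are newer than the window.

    Assumes each line looks like '2018-03-20T11:30:00 message text'.
    Lines without a parsable timestamp are skipped.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    count = 0
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue
        if when >= cutoff and pattern in rest:
            count += 1
    return count


def log_status(count, pattern, warn_threshold, window_minutes=60):
    """Build a status line; reaching the threshold counts as a warning."""
    label, code = ("WARNING", WARNING) if count >= warn_threshold else ("OK", OK)
    return code, (f'{label} - There are {count} instances of "{pattern}" '
                  f"in the last {window_minutes} minutes - "
                  f"Warning threshold is {warn_threshold}")
```

With a threshold of 1, as for the listen queue example, a single matching line in the last hour is enough to alert.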

Hardware

When you manage physical hardware it can be easy to forget that hardware requires special checks, especially when you are used to just checking virtual machines. But hardware has many extra quirks. I often say, as a joke, hardware is stupid, just go to the cloud. Hardware requires maintenance, but with warranty and a reputable provider that is not an issue at all. Even better if you made your applications redundant, then hardware failure is no problem.

Where with VPSes you just check the OS and software side of things, with server hardware you also check, well, the hardware. So think RAM (the actual DIMMs), disks (bad blocks), RAID config (physical disks, controllers, virtual disks, battery), network cards (uplink, bonding), temperature (CPU, disks), power (redundant PSU status, power usage).

If you have special hardware like network devices, KVM switches, ATSes (automatic transfer switch, to make a device with only a single PSU power redundant) or APCs (remote controlled power bars) you also want to check those.

This overview lists examples of all the hardware checks. As with the entire article, take from it whatever you like and integrate it in your setup.

iDrac/ILO/BMC/ipmi

Many terms for the same thing: the out-of-band access to your server. Sometimes with a different NIC, sometimes shared on the internal NIC, sometimes with a license (Dell/HP) or with all the features (Supermicro). It provides remote access to the server when it's off (or on), often with console access and other utilities like firmware upgrades and hardware status.

Dell has the OpenManage tools, for Supermicro you can use ipmitool. The Dell iDrac can also be monitored via the XML API on the iDrac webserver itself.

Both HP and Dell support SNMP for their OOB management.

One way or another, it can provide insight into the server. I use it to monitor RAM DIMMs, PSU status and usage, RAID status and, in the case of Dell, the generic omreport chassis output.

Whenever something breaks, we create an RMA request and get it covered under the warranty.

The main reason to monitor your out-of-band access is that you want to make sure it works, because when you need it (there is a problem with the hardware itself), you don't want to find out that it has been unavailable or not working.

In my case I've written custom checks for the OOB hardware checks. Some parse omreport commands, others parse web pages or XML files. Some use SNMP.

This is an example of the SNMP Dell DIMM check:

Memory 1 (DIMM Socket A1) 32.0 GB/2400 MHz: ENABLED/OK [26, Hynix Semiconductor, S/N: 2B425762]
Memory 2 (DIMM Socket A2) 32.0 GB/2400 MHz: ENABLED/NONCRITICAL [26, Hynix Semiconductor, S/N: 2B425743]
Memory 3 (DIMM Socket A3) 32.0 GB/2400 MHz: ENABLED/OK [26, Hynix Semiconductor, S/N: 2B425579]

DIMM A2 needs to be replaced.

The general omreport check:

OK - Fans: Ok, Intrusion: Ok, Memory: Ok, Processors: Ok, Temperatures: Ok, Voltages: Ok, Hardware_log: Ok, Batteries: Ok, Power Supplies: Ok, Power Management: Ok

If it goes off, check the part that is not Ok and investigate further.

Ports

The main ports I monitor are network uplinks, both RJ-45 and fiber. Some talk ethernet, some talk Fabric, but all of them must be up, often in the correct VLAN. Both on the server and on the router/switch side.

On the switch/router side you can check port status, but also the error counts on a port. If those are rising, it's probably time to replace your optic or SFP module.

If you are in an office situation with port security enabled, which basically means only select MAC addresses can connect to a port, or with 802.1X enabled (authentication to get on the network), monitoring that is a big help. If a port suddenly gets blocked you want to go and find out why (intruder or human mistake). Maybe your developer set up VirtualBox with his VM in bridged mode. Or someone forgot their 802.1X password, or it expired.

Or maybe there are intruders and you catch them before you hit the news.

It can also be good to monitor traffic flow and alert if a port is doing much more traffic than it regularly does. That can be a hassle to set up correctly, since it will give lots of false positives. But there are cases where it is useful. Perhaps to find an employee leaving their computer on over the weekend to pirate the latest movie.

Generally there are checks available for major-brand switches and routers. A tool like Observium can be a great addition to your general monitoring setup.

On the server level you can check if your network bond is still functioning:

OK: bond0 - Bonding Mode: IEEE 802.3ad Dynamic link aggregation - enp341s0f1 (a9:36:9f:0e:d6:5a/up/10000Mbps) - enp342s0f0 (a9:36:9f:0e:d6:30/up/10000Mbps) - enp341s0f0 (a9:36:9f:0e:d6:58/up/10000Mbps) - enp342s0f1 (a9:36:9f:0e:d6:32/up/10000Mbps) - eth1 (24:6e:96:7c:f1:58/up/10000Mbps) - eth2 (24:6e:96:7c:f1:5a/up/10000Mbps)

Failure might indicate a NIC problem or an issue with the switch.
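On Linux the bonding driver exposes the state of each member interface in /proc/net/bonding/bond0 (for a bond named bond0). A minimal sketch of a check on that file could look like this. It is not the plugin that produced the output above, the function names are assumptions, and it only looks at the "Slave Interface" and "MII Status" lines.

```python
# Standard Nagios-style plugin exit codes.
OK, CRITICAL = 0, 2


def parse_bond_slaves(text):
    """Map each slave interface to its MII status.

    The bond-level 'MII Status' line appears before the first
    'Slave Interface' line and is ignored here.
    """
    slaves = {}
    current = None
    for raw in text.splitlines():
        key, _, value = raw.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            slaves[current] = value
            current = None
    return slaves


def bond_status(slaves):
    """Critical when no slaves are found or any slave is not up."""
    if not slaves:
        return CRITICAL, "CRITICAL: no slave interfaces found"
    down = sorted(name for name, status in slaves.items() if status != "up")
    if down:
        return CRITICAL, "CRITICAL: slaves down: " + ", ".join(down)
    return OK, "OK: all slaves up: " + ", ".join(sorted(slaves))


def read_bond(name="bond0"):
    """Read the bonding driver status file for one bond."""
    with open(f"/proc/net/bonding/{name}") as handle:
        return handle.read()
```

`bond_status(parse_bond_slaves(read_bond()))` gives the result for bond0. Speed and MAC address, as shown in the example output, are in the same file and could be parsed the same way.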

Disks, RAID, ZFS and controllers

Storage is stupid because it breaks often. If you've got over 10,000 spinning disks then one breaks at least once a day. Not a problem in my case since everything is redundant, both RAID as well as Ceph, and we've got a dedicated RMA guy who replaces them. There are a few checks I like to have on my disks and RAID sets. Some are mentioned above in the Ceph section already.

RAID array load

Some storage vendors report array load. We've got dedicated arrays for swap space and regular disks. The swap volume alerts are an indication that a VPS is swapping heavily:

CRITICAL: vpsvg-935=20 vpsvg-937=60 vpsvg-936=35 vpsswap-967=98(>95)

Failed disks

Either a disk is suddenly out of the RAID array, Ceph detects an error or the disk reports bad blocks. All reasons to replace it. Here are different checks we have for failed disks. First from an HP machine:

RAID ERROR - Arrays: OK:1 Bad:1 - Disks: OK:13 Bad:0
RAID WARNING - HP Smart Array Recovering:  Smart Array P420i in Slot 0 (Embedded) array A logicaldrive 1 (136.7 GB, RAID 1, OK) array B logicaldrive 2 (93.1 GB, RAID 1, Interim Recovery Mode)

Dell, disk 14 has Bad blocks:

WARNING - ID=0:1:0 Status=Ok, ID=0:1:1 Status=Ok, ID=0:1:2 Status=Ok, ID=0:1:3 Status=Ok, ID=0:1:4 Status=Ok, ID=0:1:5 Status=Ok, ID=0:1:6 Status=Ok, ID=0:1:7 Status=Ok, ID=0:1:8 Status=Ok, ID=0:1:9 Status=Ok, ID=0:1:10 Status=Ok, ID=0:1:11 Status=Ok, ID=0:1:12 Status=Ok, ID=0:1:13 Status=Ok, ID=0:1:14 Status=Non-Critical, ID=0:1:15 Status=Ok, ID=0:1:16 Status=Ok, ID=0:1:17 Status=Ok,

The controllers and battery:

vdisk OK: Controller0=Ok/Ready [ Battery0=Ok/Ready Vdisk0=Ok/Ready 0 [ 0:1:0=Ok/Online 0:1:1=Ok/Online 0:1:2=Ok/Online 0:1:3=Ok/Online 0:1:4=Ok/Online 0:1:5=Ok/Online 0:1:6=Ok/Online 0:1:7=Ok/Online ] ]

If this check goes to Noncritical then you need to upgrade the firmware. OpenManage detects old firmware and alerts.

OK - Controller  PERC H730P Mini state is Ready and Controller status is Ok

Temperature and fans

Disk temperature should not be too high, and neither should CPU and general system temperature. Monitoring these values allows you to detect errors in the cooling system of your datacenter or fan failure. SNMP is often used for these checks.

System Board Inlet Temp: 22.0 C ENABLED/OK
System Board Exhaust Temp: 38.0 C ENABLED/OK
CPU1 Temp: 46.0 C ENABLED/OK
CPU2 Temp: 44.0 C ENABLED/OK

Fan status and speed:

System Board Fan1A: 8120 RPM - ENABLED/OK
System Board Fan2A: 8120 RPM - ENABLED/OK
System Board Fan3A: 8240 RPM - ENABLED/OK

This model of server has fans that can go up to 9000 RPM. My check is set to alert when it's OVER 9000!

We have powerbars in our datacenter racks, and we use those to monitor the ambient temperature in the datacenter. When it goes over 25 degrees we let all alerts go off and escalate, since we had a major incident once where the cooling system in the datacenter failed. Using SNMP we monitor the APC power bar for the temperature.

SNMP OK - Temperature: 231 tenths of degrees celcius

Even if someone works on the rack or leaves the doors open, the temperature stays stable under 25 degrees.
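The power bar reports the value in tenths of a degree, so 231 means 23.1 degrees Celsius. A small sketch of the threshold logic, leaving out the SNMP query itself; the function name and message format are assumptions, the 25 degree limit is the one mentioned above.

```python
# Standard Nagios-style plugin exit codes.
OK, CRITICAL = 0, 2


def ambient_status(tenths_celsius, max_celsius=25):
    """Evaluate a reading given in tenths of a degree Celsius."""
    reading = f"{tenths_celsius / 10:.1f} C"
    # Compare in tenths to keep the comparison on integers.
    if tenths_celsius > max_celsius * 10:
        return CRITICAL, f"SNMP CRITICAL - Temperature: {reading}"
    return OK, f"SNMP OK - Temperature: {reading}"
```

The reading from the example output, 231, stays OK. A value of 256 (25.6 degrees) would be critical and escalated.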

Power redundancy and status

A rack often has a maximum amount of power you can draw, for example 16 A or 32 A. Using this check you can monitor that a server doesn't go above a certain threshold that would make you use too much power.

OK - Power Consumption in under the warning level psu statusPU 1: ENABLED/OK, RedundancyStatus: FULL, SystemBoard Pwr Consumption: 210 W

We also check the power redundancy status. In Dell servers it requires a configuration setting to have the power supplies redundant. Not sure why, but sometimes this changes (after a firmware update) and you don't want to have an outage because one feed went down.

One last thing we check is the CMOS battery. Also unsure why, but these things break and can cause strange issues. If this check alerts, we send an RMA and replace the battery.

System Board CMOS Battery: ENABLED/OK [PRESENCEDETECTED]