Server Monitoring Best Practices

Why You Need Server Monitoring

Monitoring plays a crucial role in any IT environment. It notifies you when things go wrong and provides a general idea of the health and capacity of your infrastructure. But monitoring can also play a more proactive role. If done correctly, monitoring can find the root cause of an issue and help you avoid issues before they impact customers. How and why you monitor will depend on your infrastructure and company’s needs, but in an increasingly dispersed world, it’s more important than ever.

Nearly every IT environment needs server monitoring. It’s the most common and established type of monitoring, so tried and true methods of server monitoring are widely available and adequate.

But looking forward, technological changes such as hybrid/cloud-first environments, containerization, or highly distributed systems all need monitoring tools built for the modern era. And modern monitoring solutions include many interesting features. In this post, you’ll learn what modern server monitoring looks like and the benefits it brings.

Server Monitoring Basics

Before we dive into modern server monitoring best practices, we should ask ourselves if server monitoring is still needed nowadays. With many companies moving to the cloud, containers, or serverless environments, does it still make sense to monitor plain old servers? Well, yes, it does.

Moving to the cloud doesn’t change much: you don’t own the servers, but you still use them from your cloud provider. And after moving to containers, you still need to deal with servers at the end of the day, since containers usually run on servers. There are instances where the containers run on servers you don’t have access to, but in most cases, you’ll still have the underlying machines.

Long story short, there are still plenty of reasons to think about server monitoring.

Typical Metrics

Typically, server monitoring includes data like server resource consumption, uptime, response time or latency, and a few others. No matter what’s running on particular servers, you want to know a few things.

First, and usually most important, is the server running or not?
Second, what is the resource consumption? You want to know if the server is under- or over-utilized, so you can downgrade or upgrade accordingly.
Third, you want to get a few different metrics like network saturation, disk throughput, and swap usage to understand if the server is slowing down your application.

The Catch

So far, so good. However, if you take a typical, old-school approach to server monitoring, you’ll get several different metrics and usually some predefined alerts. But it will just be raw data. You’ll need to figure out which metrics are important, which you care about, and what alerts are important to you.

For example, if you run a customer-facing application on a particular server, you don’t want it to run at 100% CPU usage because that will probably increase the latency for users. But if you have a farm of big data servers, it’s normal for them to constantly run at 100% CPU.

In another example, you may think getting an alert every time a server stops responding is a good idea. Well, once again, it depends. If you don’t expect servers to restart on their own, then yes, you’d want an alert. But if you have a component designed to automatically restart servers when security patches are available, and your application is prepared for such restarts, then you don’t need an alert. It would only create noise without adding any value.
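The restart case can be sketched as a small decision function. This is a minimal sketch, not the behavior of any particular monitoring tool: the function name, its parameters, and the 300-second grace period are all hypothetical values you would tune to how long your automated restarts normally take.

```python
def should_alert(down_seconds: float,
                 auto_restart_enabled: bool,
                 expected_restart_seconds: float = 300.0) -> bool:
    """Decide whether an unresponsive server deserves an alert.

    expected_restart_seconds is an assumed grace period, not a standard value.
    """
    if not auto_restart_enabled:
        # Nobody expects this server to go away on its own: alert right away.
        return True
    # Restarts are expected, so stay quiet unless the outage outlasts
    # the time a normal automated restart should take.
    return down_seconds > expected_restart_seconds
```

With this rule, a patched server that comes back within the grace period stays silent, while one that never returns still raises an alert.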

Best Practices

Understand the Importance of Advanced Metrics

As we mentioned before, when it comes to server monitoring, most monitoring tools will produce a handful of basic metrics by default. These can include things like CPU, RAM, and disk usage, as well as uptime. However, there are a few less-common metrics that can provide important information not easily discernible from basic metrics.

Consider the following examples:

High CPU usage can be due to an application doing what it’s supposed to do. Or, it can be due to poor disk performance, resulting in the CPU spending time waiting for data.
The amount of sent and received bytes per second is usually shown by default. But this only tells you how much data is being sent; it won’t tell you if there are any problems with the data.

More advanced metrics like packet retransmissions or latency will give you a more complete picture. So, while basic metrics are useful, you shouldn’t rely entirely on them. Instead, determine what really matters to you and add a few advanced metrics to your server monitoring.
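To illustrate the CPU example above, here is a minimal sketch that separates time the CPU spent doing work from time it spent waiting on disk I/O. It assumes a Linux server, where the first line of /proc/stat holds cumulative counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The function name and the returned keys are made up for this example.

```python
def cpu_breakdown(before: str, after: str) -> dict:
    """Return the share of CPU time spent busy vs. waiting on I/O.

    Each argument is the aggregate "cpu" line from /proc/stat (Linux),
    sampled at two different moments. Assumes at least eight counters
    per line, which is the case on modern kernels.
    """
    def fields(line):
        # Skip the leading "cpu" label and keep the first eight counters.
        return [int(x) for x in line.split()[1:9]]

    delta = [b - a for a, b in zip(fields(before), fields(after))]
    total = sum(delta)
    if total == 0:
        return {"busy_pct": 0.0, "iowait_pct": 0.0}
    user, nice, system, idle, iowait, irq, softirq, steal = delta
    busy = user + nice + system + irq + softirq + steal
    return {
        "busy_pct": 100.0 * busy / total,
        "iowait_pct": 100.0 * iowait / total,
    }
```

On a real server, you could read the two samples with `open("/proc/stat").readline()` a few seconds apart. A high `iowait_pct` next to a modest `busy_pct` points at storage rather than at the application.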

Check Your Servers From More Than One Place

A practice common in application monitoring is also a good practice for server monitoring: having a backup solution. It’s common to think about server monitoring as “Install tool x, configure some dashboards, and you’re done.” But if you run important workloads on your servers and want to avoid downtime at all costs, consider a backup monitoring solution that is separate from the monitoring included with your hybrid or cloud infrastructure.

Your primary monitoring solution can fail, and if it crashes, you won’t be notified that something is wrong with your servers. A second monitoring solution from a third party doesn’t need to be complex or expensive. It can be a simple script that only checks a few basic things, such as whether the servers are running. The point is to have a backup, so you can sleep peacefully.
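As a rough idea of how small such a backup check can be, the sketch below only tests whether each server accepts a TCP connection. The function names are hypothetical, and the hosts and ports in the usage note are placeholders. A real backup check should run from a machine outside your primary monitoring setup.

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

def find_unreachable(servers, timeout: float = 3.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [(h, p) for h, p in servers if not is_reachable(h, p, timeout)]
```

For example, `find_unreachable([("app1.example.com", 22), ("app2.example.com", 443)])` returns the servers that failed the check, which a scheduled job could then send to whatever notification channel you trust.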

Create Custom Checks

Many monitoring tools give you plenty of information right out of the box. It’s easy to think you don’t need to do much because everything is preconfigured. And while some of the information is indeed fine out of the box, each company has its own specifications and custom components. Again, high CPU usage can be good or bad depending on what the server is used for. Always make sure you’re monitoring what’s important for you. If you don’t get that with the standard configurations, then you need to create custom checks.
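A custom check can be as simple as making a threshold depend on what the server is for. The following is a minimal sketch under assumed values: the role names, the thresholds, and the `cpu_check` function are hypothetical and only illustrate the idea.

```python
# Hypothetical per-role CPU alert thresholds, in percent.
# None means "never alert on CPU for this role".
CPU_ALERT_THRESHOLDS = {
    "customer_facing": 80.0,  # latency-sensitive, so alert early
    "batch": None,            # expected to run flat out
}

def cpu_check(role: str, cpu_pct: float, default_threshold: float = 90.0) -> str:
    """Return "alert" or "ok" for a CPU reading, based on the server role."""
    threshold = CPU_ALERT_THRESHOLDS.get(role, default_threshold)
    if threshold is None:
        return "ok"
    return "alert" if cpu_pct >= threshold else "ok"
```

Here a customer-facing server at 100% CPU raises an alert, while a batch server at the same reading does not.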

Understand How Your Monitoring Solution Works

Now, this is something many people don’t think about, but understanding your monitoring solution is also important. Do you know if your monitoring tool pulls data from your servers, or do the servers push data to your monitoring tool? Does your monitoring require a lot of CPU or RAM? What happens when data gets lost due to network issues? Will your monitoring tool guess the values or will it report errors? Can it handle spikes in traffic? How does it scale? Knowing how your monitoring tool works can help you avoid situations where it crashes or fails to alert you when needed.
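One way to find out how your tool treats lost data is to look for gaps yourself. The sketch below is a hypothetical helper, not a feature of any specific tool. It assumes you can export sample timestamps in seconds and that you know the expected collection interval. The 1.5 tolerance factor is an arbitrary choice.

```python
def find_gaps(timestamps, interval: float, tolerance: float = 1.5):
    """Return (start, end) pairs where consecutive samples are too far apart.

    timestamps: sorted list of sample times in seconds.
    interval: expected seconds between samples.
    tolerance: how many intervals may pass before it counts as a gap.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval * tolerance:
            gaps.append((prev, cur))
    return gaps
```

If a dashboard shows a smooth line across a period where this function reports a gap, the tool is filling in values rather than reporting the loss.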

Write Runbooks for Common Issues

Typically, when a monitoring system alerts you about a problem, you have the following options: fix it ad hoc, decide the issue is “not important/rare,” or decide it’s not up to you to fix it (for example, it’s something your cloud provider needs to address). Sometimes you know how to fix the problem temporarily, but doing so doesn’t address the root cause. For example, your application crashes and you don’t know why, so you restart it and it works fine.

Although it’s not commonly practiced, in such instances it’s important to write runbooks. Something simple like restarting an application may not be so simple, as some applications need all their components to be restarted in a specific order. Don’t make other engineers rediscover the workaround every time. Time is money, and the sooner the issue is fixed (either permanently or temporarily), the better for your users. Writing runbooks for common issues helps your teams resolve problems faster.
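A runbook can also capture the restart order in a form a script can use. This is a minimal sketch using the `graphlib` module from the Python standard library (Python 3.9 or later). The component names and their dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical components: each key lists what must be running before it starts.
DEPENDS_ON = {
    "database": [],
    "cache": [],
    "api": ["database", "cache"],
    "web": ["api"],
}

def restart_order(depends_on):
    """Return components in an order that starts dependencies first."""
    return list(TopologicalSorter(depends_on).static_order())
```

Calling `restart_order(DEPENDS_ON)` yields a list in which `database` and `cache` come before `api`, and `api` comes before `web`. If the dependencies form a cycle, `graphlib` raises `CycleError`, which is itself worth knowing before an incident.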

Summary

You may think that, in a new containerized world, server monitoring plays a less important role. And while application and cloud monitoring are, in fact, more valuable these days, that doesn’t mean you should keep your server monitoring old-school. Like all technology, server monitoring evolves and has its own best practices.

In this post, you learned how to improve your server monitoring.

If you want to put these best practices into action, try SolarWinds® Observability, a modern monitoring tool designed to adapt to your infrastructure and understand the different types of workloads that may be running on your servers.

This post was written by Dawid Ziolkowski. Dawid has 10 years of experience, first as a Network/System Engineer, then in DevOps, and most recently as a Cloud Native Engineer. He’s worked for an IT outsourcing company, a research institute, a telco, a hosting company, and a consultancy company, so he’s gathered a lot of knowledge from different perspectives. Nowadays he’s helping companies move to the cloud and/or redesign their infrastructure for a more Cloud Native approach.