Guide to Infrastructure Monitoring

By Staff Contributor on March 24, 2023

What Is Infrastructure Monitoring?

More and more development happens in the cloud these days, and any new application you build today will likely be a cloud-native one.

Though cloud-native development undeniably offers many advantages and flexibilities, these environments are also prone to complexities (such as containerization) and high resource utilization. This brings us to monitoring. You’ve got to keep the entire infrastructure under control, and this requires comprehensive monitoring.

In this post, I’ll introduce infrastructure monitoring and provide some pointers on how to get started with it.

What Is the Goal of Infrastructure Monitoring?

As with every type of monitoring, the purpose of infrastructure monitoring is to understand the state of your system in more depth. You want to know what’s happening at every layer and how the system is performing. Additionally, you want to understand which kinds of errors are occurring and whether you’re nearing capacity.

In the past, monitoring infrastructure—the things running underneath the application—was the domain of SysAdmins managing on-premises servers. Nowadays, the situation is much more dynamic. Even the simplest application relies on a multitude of different pieces: there’s managed infrastructure and elastic scaling, not to mention short-lived elements like lambdas. Because of this, managing all these moving pieces has to be complemented with a modern monitoring approach.

Different Layers of Abstraction

Infrastructure spans several layers, each requiring its own style of monitoring. Let’s go over some of them and explore which metrics merit tracking.

Low-Level Infrastructure

Let’s start from the bottom. The most basic infrastructure includes things like storage or raw instances. You might know this approach as infrastructure as a service (IaaS). It’s comparable with the infrastructure in an on-premises setup.

The monitoring for this layer focuses on more fundamental metrics, like memory usage or disk usage. To clarify, there are two relevant perspectives to consider:

  • Individual numbers for single instances
  • An aggregation over a complete fleet of servers showing general trends
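To make the two perspectives concrete, here is a minimal sketch in Python. The instance names and memory-usage samples are made up for illustration; in a real setup they would come from your cloud provider's metrics.

```python
from statistics import mean

# Hypothetical memory-usage samples (percent) per instance; in practice these
# would come from your cloud provider's metrics.
samples = {
    "web-1": [62.0, 64.5, 63.0],
    "web-2": [71.0, 90.5, 88.0],
    "web-3": [40.0, 42.5, 41.0],
}


def per_instance_average(data):
    """Perspective 1: the mean usage for each individual instance."""
    return {name: mean(values) for name, values in data.items()}


def fleet_summary(data):
    """Perspective 2: an aggregate over the whole fleet."""
    averages = per_instance_average(data)
    busiest = max(averages, key=averages.get)
    return {
        "fleet_mean": mean(averages.values()),
        "busiest_instance": busiest,
        "busiest_mean": averages[busiest],
    }


if __name__ == "__main__":
    print(per_instance_average(samples))
    print(fleet_summary(samples))
```

The fleet view hides the fact that one instance is running hot, while the per-instance view hides the general trend. That's why you want both.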

Orchestration Platforms

Many people prefer to operate higher-level abstractions instead of raw servers. In this case, you’re treating the cloud as a platform as a service (PaaS).

Container technologies such as Docker are the most widely adopted solution in this space. If you’re using containers, it pretty much means Kubernetes. Kubernetes is a big beast, one you can’t treat as a black box, lest you risk losing the overview of what’s going on in your clusters.

What do you monitor here? You can treat the worker nodes as raw instances and apply the metrics from before. On top of this, there are higher-level concepts. I’m talking about abstractions like pods, services, or deployments. Additionally, you want an aggregated view of the state of all the deployments in the cluster. Imagine the interactions between different deployments cause issues. This is something you can only diagnose if you have an aggregated picture.
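As a rough illustration of what an aggregated picture could look like, the sketch below summarizes a handful of deployments. The `DeploymentStatus` shape is a simplified assumption of mine, not the Kubernetes API; a real cluster exposes far more detail.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    # Hypothetical, simplified view of a deployment for this sketch only.
    name: str
    desired_replicas: int
    ready_replicas: int


def cluster_overview(deployments):
    """Summarize all deployments and list the ones that are not fully ready."""
    degraded = [d.name for d in deployments if d.ready_replicas < d.desired_replicas]
    desired = sum(d.desired_replicas for d in deployments)
    ready = sum(d.ready_replicas for d in deployments)
    return {
        "deployments": len(deployments),
        "ready_ratio": ready / desired if desired else 1.0,
        "degraded": degraded,
    }


if __name__ == "__main__":
    statuses = [
        DeploymentStatus("api", desired_replicas=3, ready_replicas=3),
        DeploymentStatus("worker", desired_replicas=4, ready_replicas=2),
        DeploymentStatus("cron", desired_replicas=1, ready_replicas=1),
    ]
    print(cluster_overview(statuses))
```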

Other Infrastructure

A modern cloud provider offers so many services it’s easy to lose track of everything running. There are managed resources like load balancers and computing primitives like lambdas.

Even if you’re going for the simplest architecture possible, I can promise you’re still going to end up provisioning some additional infrastructure. It bears repeating: infrastructure monitoring is about not letting these small pieces fall through the cracks.

These managed services tend to expose metrics on their own. As an example, the number of invocations is extremely relevant for a lambda.
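Here's a small sketch of how you might summarize invocation counts once you have them. The function names and per-minute datapoints are invented for the example.

```python
from collections import defaultdict

# Hypothetical per-minute datapoints as (function_name, invocations) pairs.
datapoints = [
    ("resize-image", 120),
    ("send-email", 15),
    ("resize-image", 480),
    ("send-email", 12),
    ("resize-image", 130),
]


def invocation_stats(points):
    """Return total and peak per-minute invocations for each function."""
    stats = defaultdict(lambda: {"total": 0, "peak": 0})
    for name, count in points:
        stats[name]["total"] += count
        stats[name]["peak"] = max(stats[name]["peak"], count)
    return dict(stats)


if __name__ == "__main__":
    for name, values in invocation_stats(datapoints).items():
        print(name, values)
```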

Native Monitoring vs. a Monitoring Solution

Cloud infrastructure comes with monitoring included. For instance, AWS has CloudWatch. You get a lot of mileage out of it. However, exporting these metrics to a centralized monitoring solution has additional advantages.

First, collecting every type of monitoring in the same place simplifies the operational burden for people who want to visualize metrics. Let’s say you consider developers, operators, and business stakeholders as interested parties. In this case, a unified tool will be more familiar and probably easier to maintain.

Second, tracing becomes much easier. You can follow the request’s flow throughout the entire journey. You’re able to jump between infrastructure, application monitoring, and maybe even a user-oriented view. If you’re trying to give your teams more end-to-end ownership, this is an invaluable aid. 

Though it’s possible to integrate different monitoring solutions to do the same, it takes significant effort. It’s easy to underestimate the convenience of a one-stop solution. This is why you might want to use a dedicated tool like SolarWinds Observability, which is designed to prevent a lot of headaches and let you focus on your core business.

Getting Started

“I’m sold!” you might scream. “But I don’t know where to start.” Let’s talk briefly about this. Setting up infrastructure monitoring takes three steps:

  • Collecting the data. You delegate this part to your cloud provider, which already has solutions to capture all the events related to its offerings.
  • Aggregating it. This is about connecting your monitoring provider to the native sources so you ingest all the metrics available.
  • Creating visualizations. Once the data is available, you use the regular tooling your monitoring solution offers to draw insights.
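The three steps above can be sketched as a tiny pipeline. Everything here is a stand-in: the records are hard-coded where a real setup would read from the provider's native sources, and the "visualization" is just a text bar chart.

```python
def collect():
    """Step 1: stand-in for data the cloud provider already captures."""
    # Hypothetical records, hard-coded for the sake of the example.
    return [
        {"source": "load-balancer", "metric": "requests", "value": 1200},
        {"source": "load-balancer", "metric": "requests", "value": 1350},
        {"source": "queue", "metric": "messages", "value": 40},
    ]


def aggregate(records):
    """Step 2: bring every source into one place, keyed by source and metric."""
    totals = {}
    for record in records:
        key = (record["source"], record["metric"])
        totals[key] = totals.get(key, 0) + record["value"]
    return totals


def visualize(totals, width=20):
    """Step 3: render a crude text bar chart, scaled to the largest value."""
    if not totals:
        return ""
    largest = max(totals.values())
    lines = []
    for (source, metric), value in sorted(totals.items()):
        # Scale each bar to the largest value; never draw fewer than one mark.
        bar = "#" * max(1, round(width * value / largest)) if largest else ""
        lines.append(f"{source}/{metric:<10} {bar} {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(visualize(aggregate(collect())))
```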

Above all, don’t succumb to the temptation of doing this by clicking around on a bunch of different UIs. There’s a better alternative.

Use a Code-Driven Approach

This is a crucial point. Handling infrastructure works much better if you leverage the principles behind infrastructure as code (IaC). In the end, it’s all about versioned, repeatable changes reflected as code.

Monitoring as code is a natural evolution of this concept. For instance, Terraform has providers for many monitoring solutions. This is an excellent starting point to automate your monitoring setup. All the steps I mentioned above are prime candidates for automation.
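To show the idea without tying it to a particular tool, here is a minimal sketch in Python rather than Terraform. The `Monitor` fields are hypothetical and don't match any vendor's schema. The point is only that monitors live in a file you can review, version, and render the same way every time.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Monitor:
    # All field names here are hypothetical, not any vendor's schema.
    name: str
    metric: str
    threshold: float
    comparison: str  # either ">" or "<"


MONITORS = [
    Monitor("high-memory", "memory_used_percent", 90.0, ">"),
    Monitor("low-disk", "disk_free_percent", 10.0, "<"),
]


def render(monitors):
    """Serialize monitors deterministically so diffs in version control stay small."""
    names = [m.name for m in monitors]
    if len(names) != len(set(names)):
        raise ValueError("monitor names must be unique")
    ordered = sorted(monitors, key=lambda m: m.name)
    return json.dumps([asdict(m) for m in ordered], indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(MONITORS))
```

Because the output is sorted and stable, a change to a threshold shows up as a one-line diff in review.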

As a bonus, your code serves as a hedge. Let’s say your stakeholders are concerned about vendor lock-in. If your setup revolves around versioned code, it’s much easier to conceive migrating to a different monitoring tool or a cloud provider. Avoid building an ill-advised abstraction layer when you don’t need it.

The ultimate goal is to programmatically augment your infrastructure with bespoke monitoring capable of visualizing the results in a convenient way.

Let’s Not Forget the Alerting Part

Monitoring starts with collecting metrics about a system. Then, you typically bring these metrics together in dashboards. It’s a good start, but it’s not enough.

Highly available systems need reactive alerts capable of notifying operators so they can act swiftly whenever a problem manifests. This applies to infrastructure as well. Infrastructure fails.

Converting monitoring into alerts is a natural way of keeping one single source of truth. And it reduces duplication, too.
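One way to picture the single source of truth: a single rule definition feeds both the dashboard and the alert check. The sketch below assumes a made-up `Rule` shape and hard-coded latest values; it's an illustration, not how any particular tool does it.

```python
import operator
from dataclasses import dataclass

COMPARATORS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule shape used only for this sketch.
    metric: str
    comparison: str
    threshold: float


def dashboard_label(rule):
    """Text a dashboard panel could show for the same rule."""
    return f"{rule.metric} (alert when {rule.comparison} {rule.threshold})"


def firing_alerts(rules, latest_values):
    """Return the rules whose latest value breaches the threshold."""
    firing = []
    for rule in rules:
        value = latest_values.get(rule.metric)
        if value is None:
            continue  # no data yet; a real system might alert on this too
        if COMPARATORS[rule.comparison](value, rule.threshold):
            firing.append(rule)
    return firing


if __name__ == "__main__":
    rules = [
        Rule("memory_used_percent", ">", 90.0),
        Rule("disk_free_percent", "<", 10.0),
    ]
    latest = {"memory_used_percent": 95.0, "disk_free_percent": 40.0}
    for rule in firing_alerts(rules, latest):
        print("ALERT:", dashboard_label(rule))
```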

Try It Out

Infrastructure is getting more and more complex as highly available, globally distributed deployments become the norm. This increase in complexity requires you to understand the systems you build more thoroughly. Thus, high-quality monitoring is paramount.

Monitoring systems covering the broadest possible spectrum make life easier for the people interested in having a comprehensive view of a system. Though cloud providers have monitoring offerings, a centralized tool makes sense if you want to combine these metrics with other types of monitoring. If you wish to get a deeper understanding of your system, give SolarWinds Observability a try.

This post was written by Mario Fernandez. Mario develops software for a living—then he goes home and continues thinking about software because he just can’t get enough. He’s passionate about tools and practices, such as continuous delivery. He’s also been involved in front-end, back-end, and infrastructure projects.
