MTTR: Definition and More
Efficient incident management starts with tracking and improving the right metrics of success. There are quite a few of them (MTTR, RTO, etc.), so it’s important to know which ones align with your business’s objectives for application performance.
In this post, we’ll cover the definition of MTTR, the benefits of tracking it, and how it’s calculated in practice. We’ll also look at how it differs from related metrics, how you can lower your MTTR, and how application performance monitoring (APM) tools can help.
What Is MTTR?
MTTR is a basic maintenance KPI. It indicates how long it takes, on average, to fix repairable items or bring systems back online.
MTTR stands for “mean time to repair” or “mean time to recovery.” Typically, MTTR includes not only the time for repair but also any time needed for testing. Only when systems are fully operational—or equipment is fully fixed—can you stop tracking the time.
Why Is MTTR Important?
MTTR is a sign of how efficient your organization is at diagnosing and responding to issues. Low MTTR values usually translate into better user experience, higher customer satisfaction, and improved business outcomes, since they mean less downtime. User experience is a good predictor of revenue for an organization, so it’s critical to maintain a fast and available website.
But there’s another reason why MTTR is so important. A low MTTR shows your organization’s incident response measures are working in a healthy and efficient way.
How Is MTTR Calculated?
As we’ve seen, the R in MTTR can stand for both “repair” and “recovery,” which means MTTR isn’t a single metric, but two. So, we’ll show you how to calculate both versions of MTTR, starting with mean time to repair.
Calculating Mean Time to Repair
MTTR, when R stands for Repair, typically applies to physical equipment that’s repairable. Calculating it consists of a few steps:
- Find out how much time the organization spent on repairs during a given period.
- Find out how many equipment repairs were done during the same period.
- Add up the total time and divide by the number of repairs.
Suppose over the course of a quarter, your organization has spent 13 hours fixing a device that malfunctioned twice. In this case, the mean time to repair corresponds to 6.5 hours for that specific piece of equipment.
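The steps above can be sketched in a few lines of Python. This is only an illustration: the function name `mean_time_to_repair` and its parameters are made up for this post, and the numbers are the ones from the example above.

```python
def mean_time_to_repair(total_repair_hours: float, repair_count: int) -> float:
    """Return the average number of hours spent per repair in a period."""
    if repair_count <= 0:
        raise ValueError("repair_count must be a positive number")
    return total_repair_hours / repair_count


# The example above: 13 hours of repair work across 2 repairs in a quarter.
print(mean_time_to_repair(13, 2))  # 6.5
```

The guard against a zero repair count matters in practice: a period with no repairs has no meaningful mean, so it's safer to fail loudly than to report zero.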
Calculating Mean Time to Recovery
Calculating mean time to recovery is also simple:
- Find out the total downtime over a certain period.
- Find out the number of incidents during that period.
- Divide the total downtime by the number of incidents.
Let’s say that over a certain time period, one of your organization’s APIs was down for a total of two hours across three incidents. Since two hours equals 120 minutes, the mean time to recovery here is 40 minutes for that specific API.
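If your incidents are logged with start and end timestamps, the same calculation can be done directly from the log. The sketch below assumes a hypothetical list of (outage start, service restored) pairs; the dates and the individual durations are invented so that they add up to the 120 minutes in the example.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage start, service restored) pairs.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),      # 30 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 50)),  # 50 minutes
    (datetime(2024, 3, 2, 22, 10), datetime(2024, 3, 2, 22, 50)),   # 40 minutes
]


def mean_time_to_recovery(incidents):
    """Divide the total downtime by the number of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_downtime = sum((end - start for start, end in incidents), timedelta())
    return total_downtime / len(incidents)


mttr = mean_time_to_recovery(incidents)
print(mttr.total_seconds() / 60)  # 40.0
```

Working with `timedelta` values rather than raw numbers avoids unit mix-ups between hours and minutes until the moment you report the result.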
A Brief Observation on Calculating MTTR
Being able to calculate this KPI is a positive sign in itself. It means your organization documents incidents, including the number and timestamp of occurrences, and carefully tracks downtime and equipment malfunctions.
What’s the Difference Between MTBF and MTTR?
MTBF is another important metric of success when it comes to incident response. It’s often confused with MTTR, but they’re still different metrics.
MTBF stands for “mean time between failures.” It measures how long, on average, a device or system runs before it fails. When it comes to MTBF, the higher its value, the better for your organization. MTBF is the opposite of MTTR in this regard.
Calculating MTBF consists of the following steps:
- Find out the total number of operational hours in a given period.
- Find out the number of failures that occurred over the same period.
- Divide the total number of operational hours by the number of failures.
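As a rough sketch of these steps, the function below divides operational hours by the number of failures. The function name and the figures (600 operational hours, 3 failures) are hypothetical and only serve as an illustration.

```python
def mean_time_between_failures(operational_hours: float, failure_count: int) -> float:
    """Return the average number of operational hours between failures."""
    if failure_count <= 0:
        raise ValueError("failure_count must be a positive number")
    return operational_hours / failure_count


# Hypothetical figures: 600 hours of operation with 3 failures in the period.
print(mean_time_between_failures(600, 3))  # 200.0
```

Note that the input is operational time, so any downtime in the period should already be excluded before calling the function.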
To sum it up: MTBF represents the reliability of systems and devices. On the other hand, MTTR indicates your organization’s efficiency in repairing said systems.
What’s the Difference Between RTO and MTTR?
RTO, or recovery time objective, is yet another metric of success related to fixing things.
RTO indicates the maximum tolerable amount of time a given device or system can be out of work. For instance, if you say the RTO for a given system is five hours, after that time, the outage or malfunction will start to significantly—or even catastrophically—harm the business.
RTO is an expectation set in advance, whereas MTTR is calculated after the fact. MTTR should always be well below the RTO for every critical system.
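One way to keep an eye on this is to compare each system's measured MTTR against its RTO. The sketch below is only an illustration: the system names, the figures, and the `safety_factor` threshold are all assumptions, since how far below the RTO your MTTR needs to be is a decision for your own organization.

```python
def systems_at_risk(mttr_hours, rto_hours, safety_factor=0.5):
    """Return the systems whose MTTR exceeds safety_factor * RTO, sorted by name.

    safety_factor is an assumed threshold, not an industry standard.
    """
    return sorted(
        name
        for name, mttr in mttr_hours.items()
        if mttr > rto_hours[name] * safety_factor
    )


# Hypothetical measurements and objectives, in hours.
mttr_hours = {"checkout-api": 0.67, "billing-db": 3.5}
rto_hours = {"checkout-api": 5, "billing-db": 5}

at_risk = systems_at_risk(mttr_hours, rto_hours)
print(at_risk)  # ['billing-db']
```

Here `billing-db` is flagged because its 3.5-hour MTTR is more than half of its five-hour RTO, which leaves little room for a slower-than-average recovery.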
How to Keep Your MTTR Low
- Document incidents. You can’t accurately track MTTR if you don’t document every incident. This includes performing a postmortem after every outage or device failure. Documenting incidents can help you find their root cause, which can lead to less time to detect the incident the next time—or even the chance to prevent it altogether.
- Measure. You can’t improve what you don’t measure, so the next big step for lowering your MTTR is keeping track of the trends in performance of your application.
- Adopt modern engineering practices. The best way to improve MTTR is simply getting better at detecting and fixing issues and outages. Make sure your organization knows and follows best practices for infrastructure management and monitoring, along with other engineering best practices.
- Get alerted. You can’t fix issues you don’t know about. So, don’t wait for customers or other teams to tell you there’s a problem—be the first to know with personalized alerts that link to documentation of how to fix the specific problem. You can set up these alerts with an APM tool.
- Adopt the correct tools. Education and best practices are essential, but so are adequate tools. If you want to be able to learn about issues as soon as they happen, you need an infrastructure health monitoring solution such as SolarWinds® Observability.
Know Your MTTR and Bring It Down
In this post, you’ve learned about one of the best-known metrics related to incident response: MTTR.
As you’ve seen, the R in MTTR can mean both “repair” and “recovery,” depending on the context. In the former version, the metric refers to repairable items; in the latter, it’s all about getting systems back online.
In both versions, MTTR is an essential KPI for organizations that want to ensure their incident response strategies work efficiently. They gain improved user and customer satisfaction, which in turn benefits the organization’s bottom line.
When you’re ready to adopt a full-stack observability solution, we invite you to take a look at SolarWinds Observability for infrastructure monitoring and APM. SolarWinds Observability helps you reduce MTTR by:
- Monitoring infrastructure and application metrics side-by-side, reducing the time it takes to identify what part of the stack is failing and helping you quickly get to the root cause.
- Quickly pinpointing issues and presenting the most likely cause of a performance problem with auto-instrumented root causes.
- Enabling maximum observability from metrics, to traces, and down to logs with cohesive end-to-end monitoring.
- Incorporating custom metrics to combine business metrics side-by-side with system metrics, allowing you to see and measure the impact infrastructure and application performance have on your business performance.
Get started with a free 30-day trial.
This post was written by Carlos Schults. Carlos is a consultant and software engineer with experience in desktop, web, and mobile development. Though his primary language is C#, he has experience with a number of languages and platforms. His main interests include automated testing, version control, and code quality.