Chances are good that you’ve encountered the “build versus buy” dilemma at some point in your professional career. With the increased popularity of hosted IT and developer-facing SaaS, the decision to outsource IT functions or build and maintain them in-house is more complicated than ever. Unfortunately, many decision makers still allow their decisions to be driven by cost, ego, or emotion rather than framing the decision in the proper context of how (or if) it provides value for their business. In this post I’d like to take a look at some of the factors that should play a part in how you decide which tools to include in your monitoring strategy.
Change is unavoidable. The software choices you make today will need to be reevaluated as your business grows (or shrinks, but let’s hope that’s not in your plans). Every software architecture reaches one or more inflection points as it evolves over time, and the average web company’s monitoring stack moves through a fairly predictable lifecycle of such transitions.
What causes us to reach these inflection points and transition from one stack to another? Certainly not (or at least not exclusively) a lower price tag, because there’s always going to be a significant switching cost associated with replacing existing systems. There are a variety of triggers that can be the driving force of change here:
- Performance limitations of the incumbent software
- Functionality desired in the challenger software
- Employee turnover reducing in-house expertise (the “bus factor” problem)
- Lack of visibility or influence with vendor roadmap
Generally speaking, startup businesses that recognize the need to monitor and trend their software stack gravitate towards small, tightly scoped products they’re already familiar with. Small engineering firms are more likely to use developer-focused services like Pingdom, New Relic, or Librato, while shops with an established Operations presence generally prefer open source offerings like Nagios and Graphite. At this stage in the growth cycle, most businesses recognize the urgency of focusing on their core competencies rather than reinventing the wheel on functionality that doesn’t provide an actual return on investment.
As the business grows, processes and philosophies become more formalized. The compulsion grows to build these monitoring systems in-house, especially as the company is financially and logistically able to attract, afford, and retain the personnel tasked with building and maintaining these systems. This is frequently driven by management’s desire to “own the data”, although it’s just as often a byproduct of employee ego and the NIH (Not Invented Here) syndrome.
At this point in the lifecycle of a monitoring architecture it’s crucially important to take an objective look at whether building these systems provides a competitive advantage for the business. Will these projects help you serve your customer base more effectively? Do they offer parity (or a competitive edge) over your competitors’ offerings? Are your time and resources better spent developing these systems than developing your primary product or service?
If you can’t answer these questions positively, you’ll be better served looking at your Buy options than attempting to build the systems yourself. Rest assured, if your business continues to grow, you will eventually reach an inflection point where building these systems becomes a necessity. Examples abound in our industry, such as Twitter’s Observability team and Netflix’s Insight Engineering group. But until you reach a scale where building is your only choice, it’s wise to regard building whole monitoring systems with skepticism and prudence.
Ten years ago, monitoring systems were largely independent, monolithic affairs. Any sort of modular design was constrained to the vendor that built the components (e.g., IBM, Microsoft, HP). Today, the impact of Service-Oriented Architecture is felt throughout open source and commercial monitoring software and services.
Functional responsibilities are now routinely handled by discrete, composable units: check and state engines such as Nagios and Pingdom; metrics collection agents including collectd and Diamond; aggregation handlers such as StatsD; time-series and visualization stacks including Graphite and Librato; and incident management services like PagerDuty and VictorOps.
Now that third-party integrations have become a competitive differentiator, developers are able to target stable interfaces and documented APIs. The end result is a huge win for customers; successful patterns and best practices have evolved, driving innovation further and faster than ever before.
One of the great benefits of this composable movement is that our monitoring architectures are increasingly flexible. Consumers are no longer tied to a single vertical stack or vendor. It’s commonplace to deploy an open source collection agent like collectd reporting measurements to a hosted graphing and alerting service like Librato, which in turn fires notifications to PagerDuty, before being escalated or relayed to the responsible teams’ ChatOps channels in Slack or HipChat.
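One reason this composability works is that the interfaces between components are so simple. As a concrete illustration, here’s a minimal sketch of emitting measurements to a StatsD-style aggregator using nothing but the standard library; the metric names are hypothetical, and the host/port default to StatsD’s conventional `127.0.0.1:8125`.

```python
import socket

def format_metric(name, value, metric_type="c"):
    """Render the StatsD wire format: "<name>:<value>|<type>".
    Type "c" is a counter, "g" a gauge, and "ms" a timer."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric over UDP; the aggregator rolls
    measurements up and flushes them to the time-series backend."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), (host, port))
    finally:
        sock.close()

# Count a login event and record a request timing (in milliseconds).
send_metric("app.logins", 1)
send_metric("app.request_time", 42, "ms")
```

Because the protocol is plain text over UDP, any component that speaks it can be swapped in or out, which is exactly what makes the mix-and-match architectures described above practical.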
This also means that our inflection points are no longer tied to a monolithic stack (or its weakest internal link). We can pick and choose individual components according to their performance, cost, or fit for the specific task required.
When to Build
The simple answer to the question “When should I build?” is: when it provides a competitive advantage to your business. This is somewhat of an abstract notion, but it’s an important litmus test, and one that is probably easier to answer than you might initially suspect. Each of the Build advantages listed below contribute to this overarching question, but it’s important to consider them as a whole rather than leaping to a decision based on one or two criteria.
Many companies start with the belief that their monitoring data is unique and proprietary. In most cases, this is simply false. We tend to blur the line between product- or customer-driven data, the kind usually reserved for detailed analysis, and the metrics and telemetry gathered from our systems and applications. These are vastly different types of data: analytical data tends to be irregular and most useful under a microscope. Time-series data, on the other hand, the sort we collect from our platforms and services, is generic, regular, often repetitive, and more valuable when considered in the aggregate. However, this is not always the case.
As time-series storage systems evolve, features such as dimensionality (aka tagging or labels) drive us to associate more contextual data with our metrics. This flexibility over traditional flat metric namespaces affords us the opportunity to describe our data in new and powerful ways. If you find yourself leaking proprietary information or attributes into your metrics, then building may be a safer alternative for you, at least until it becomes technically feasible to encrypt large volumes of metrics at scale (if it ever does).
Any sufficiently large IT architecture will reach a tipping point where it’s no longer feasible to outsource your monitoring workload. A bottleneck is reached: it could be that you have too many measurements for the provider to handle, or perhaps the vendor can’t scale their user authentication system to enterprise levels (e.g., lack of SAML or LDAP support). In this case you may not have an option but to find an alternate solution that meets your changing needs.
Ironically, most businesses will never reach this scale, although many prepare for it as if it were an inevitability. Companies like Google and Facebook, whose size and scale make in-house monitoring a necessity, are few and far between. In fact, unless you’re running tens of thousands of systems, approaching the scale of Netflix or Twitter, you probably haven’t reached the inflection point where scale becomes a significant consideration.
One caveat regarding scale: avoid the temptation to conflate technical scaling challenges with their escalating costs. There are tremendous costs associated with (and unique to) building your own monitoring architecture at scale. Beyond all of the development and operational costs, there are significant costs tied up in training, maintenance, enhancement, support, and security of your in-house systems.
Fit and Flexibility
Your software architecture may reach a state of complexity where building a tailored system is your most likely path to success. Although many software monitoring vendors are friendly to Cloud and containerization (e.g., Docker), they have to target the most common software stacks and deployment strategies. Building a monitoring stack in-house would allow you to optimize for your existing designs and respond quickly to changes in your environment. Note that even in these cases, it may make sense to take a hybrid approach, outsourcing the parts of the architecture where you have neither the time nor resources to design and build. Many of our customers at Librato leverage the service as a metrics platform, focusing their attention on areas that align closer with their core business proposition.
There are certainly disadvantages to both building and buying approaches, but the Build negatives tend to be easily underestimated (or ignored), especially when your Operations Engineers are already comfortable integrating and building internal software components.
Care and Feeding
Development and maintenance costs are merely the tip of the iceberg in terms of standing up a custom monitoring architecture. A variety of related processes will weigh heavily against your technical debt and need to be factored into your roadmap and cost analysis: design and specification, recruiting (talent), testing, refactors, deployment, training, enhancements and bug fixes, and of course, scaling to meet demand (assuming your project is successful enough to drive adoption).
In an ideal situation, you already have preexisting investments in hardware and infrastructure that can be leveraged. In the worst case, you’ll have to go through the specification and acquisition processes with your hardware vendors. Either way, you still have to manage the upgrade cycle as these assets reach end-of-life or come off lease. This is traditionally the most obvious benefit of outsourcing any IT service to a hosted provider, since the infrastructure acquisition and upgrade costs are absorbed (or at least very lightly amortized into the service cost) by the vendor.
In a job market where good talent is already a hot commodity and competition among employers is fierce, employee turnover is a real concern for businesses wanting to build out their monitoring systems. Losing an established leader among your development team not only places a significant burden on your remaining team and impacts your rollout schedule, but it’s that much more difficult to find engineers with a specialization or interest in monitoring services.
If you decide to push forward with a Build project, make sure that each team member involved is comfortable working on more than one component. It’s common for engineers to focus on a particular area of interest (e.g., “the time-series person”), leaving you subject to a very low “bus factor” on mission critical infrastructure.
When to Buy
As you’ve probably guessed by now, the easy answer to “When to buy?” is: when building isn’t an absolute necessity. Believe it or not, it took me years to acknowledge this; I’ve spent years in the open source and monitoring communities as a developer, maintainer, and (some seem to think) thought leader, so my career has been tightly coupled to my successes building and scaling large, custom monitoring systems.
Fortunately I’ve also met and become friends with countless monitoring software users and administrators through the Monitorama conference. These experiences have helped me to better understand the various use cases and architectures where companies are employing either Build or Buy strategies (or a hybrid of both). In hindsight, I was certainly guilty of building when buying might have been the more appropriate choice; hopefully this article will help others recognize the anti-patterns and criteria to make a more informed decision the next time around.
But just so we’re clear, outsourcing your monitoring stack isn’t the right choice (in many cases) because building is bad, but because there are a number of good reasons why Buy is often the saner option.
The competition among hosted Monitoring-as-a-Service providers is intense, driving innovation in breadth of features and overall usability. And because so many of these products focus on a particular segment of the composable monitoring system design (e.g., time-series monitoring versus incident management and response), iterations tend to be appropriately scoped according to customer demand.
One thing that businesses often fail to consider (heck, I’ve been guilty of this myself) when comparing Build and Buy is the domain expertise you gain when outsourcing your monitoring functions to a provider that deals with their particular service domain day in and day out. Because they’re building multi-tenant systems at scale, these vendors will have already encountered, gained an understanding of, and (hopefully) overcome many of the hurdles that you’re likely to hit at some point in your growth cycle.
This particular benefit also insulates you from the bus factor scenario, since you’re no longer having to staff up your own domain experts in-house.
To be fair, I could write entire books on comparing the costs of Build versus Buy. Regardless, the costs of subscription SaaS services are generally very straightforward and easy to understand. These expenses typically come out of your operating budgets (rather than capital expenditures), offering accounting advantages over building your own infrastructure and custom services from scratch.
Because the Buy option already exists, you don’t have to wait for use cases and workflow patterns to emerge. Support, developer, and user documentation are already formalized, making it possible for your entire team to ramp up immediately; when building internally, your teams may be blocked waiting for subject matter experts to emerge.
The user experience for outsourced services is almost universally better than corresponding OSS counterparts. Most vendors understand the need for proper designers when building these services for tech professionals, and a good design lead is often one of the first hires for any new product line.
Although outsourcing is a solid choice in many situations, the fact that many of these services are designed for the most common use cases can also mean that they aren’t optimized for situations or workflows unique to your business. Be wary of products designed for overly specific workflows, and of large monolithic suites intended to serve a generic audience.
Make sure your service providers have structured channels for product feedback and support. It’s inevitable that your business will change as it grows; you’ll want to partner with vendors that can grow with you. There are few things as frustrating as working with a “black box”, and it can certainly feel this way sometimes with uncooperative vendors. Whether it’s a lack of transparency into their roadmap, bad support, or just a poor attitude towards customer requests, even the best products can turn into a dead end when a vendor neglects (or takes advantage of) their customer relationship.
All bias aside, I’m very proud of our track record at Librato. We have a top-tier support team that works virtually around the clock to help customers leverage our platform and find solutions for their metrics and monitoring workload.
Unless you’ve determined that building your own tailored monitoring architecture provides your business with a competitive advantage, it’s more likely than not to distract you from your own core competencies. Most businesses in a period of growth (or contraction) will inevitably experience their individual inflection points. It’s my hope that this article will provide you with enough examples that you’ll be able to identify the signals and make a rational decision when that time comes.