
Rahul Chandel
Contributor

Beyond uptime: How we redefined observability to protect performance, profits and people

Opinion
Jul 15, 2025 | 7 mins
IT Leadership, ROI and Metrics, Staff Management

What’s the point of speed if your engineers burn out? True observability means optimizing for performance, profits — and the people behind it all.

Observer, observable, looking, watching, binoculars
Credit: JOKE_PHATRAPONG / Shutterstock

A few years ago, my team hit a monumental milestone: a 10% decrease in p99 latency on our core APIs! According to our dashboards, it had been executed flawlessly. However, another metric was telling a much scarier story: an increase of 40% in after-hours alerts for the team supporting the same APIs. 

We were making the system faster for customers but demonstrably worse for our engineers. The data was clear: our metrics were at war with each other. This discord forced us to ask a more sophisticated question: What value is a high-performance system if the human architecture sustaining it is brittle? That question was our starting point for a more holistic philosophy, one that balances not two but three pillars of value.

First leap: Linking technology to business 

Our story began in the same place as many others. We had too much data (CPU utilization, request latencies, availability trackers, error rates) but too little insight. My team and I had built a best-in-class monitoring and observability stack, yet we could not give meaningful answers to basic questions about business impact.

Our breaking point was a “minor” 300-millisecond slowdown of our product recommendation engine. It was technically within our SLOs, but when we checked the customer impact, it had cost us close to $30,000 in revenue over 48 hours. That experience pushed us to get serious about mapping technical performance to business KPIs. We met with marketing, sales and finance, and learned the terminology of conversion rates, customer lifetime value and paused conversions. We learned to look at our systems through the lens of how business value flows through them, instead of simply monitoring them.

It was our first evolution, and we were finally connecting the dots. 

And then came the breakthrough: the observability trifecta.

But over time, I grew increasingly concerned about a dimension we were not measuring. As we ratcheted up the pace of change to drive business outcomes, the systems were becoming more complex, and the cognitive load was weighing on my engineers. They were delivering new features faster, but at a growing cost in mental strain, burnout, and the time and effort needed to support those features. Admittedly, we were improving our business metrics, but we were doing it at the expense of our most valuable asset: our engineering talent.

We were optimizing the ‘how’ (system performance) and the ‘why’ (business outcomes), but we weren’t optimizing for the ‘who’ (our developers). That was the point at which we realized our true north was the observability trifecta.

We concluded that a sustainable, high-performing system takes a holistic perspective on three discrete but collectively interdependent pillars: 

  • System performance: The traditional technical metrics. Is the system fast, reliable and available? This is the baseline.
  • Business outcomes: The financial and customer-facing metrics. Is the system making money, improving conversion and delighting users? This is the purpose.
  • Developer experience (DX): The human-centric metrics. How easy is it to develop, test, deploy and operate a service? We started measuring metrics patterned on established industry frameworks: What is our lead time for changes? How much time do we spend on unplanned work and operational toil? Which systems generate the most on-call cognitive load and the most alerts? This pillar drives sustainability and speed of innovation.
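The DX questions above can be turned into numbers with very little machinery. The following is a minimal sketch, not our production tooling: the `Change` and `Alert` record types, the 09:00 to 18:00 weekday definition of business hours, and all function names are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class Change:
    """Hypothetical record of one change, from commit to production."""
    committed_at: datetime
    deployed_at: datetime


@dataclass(frozen=True)
class Alert:
    """Hypothetical record of one alert that paged the on-call engineer."""
    service: str
    fired_at: datetime


def median_lead_time_hours(changes):
    """Median time from commit to deploy, in hours."""
    hours = [
        (c.deployed_at - c.committed_at).total_seconds() / 3600
        for c in changes
    ]
    return median(hours)


def is_after_hours(ts, start_hour=9, end_hour=18):
    """True on weekends or outside the assumed 09:00-18:00 working window."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def after_hours_alerts_by_service(alerts):
    """Count after-hours alerts per service to find the noisiest systems."""
    return Counter(a.service for a in alerts if is_after_hours(a.fired_at))


def toil_share(planned_hours, unplanned_hours):
    """Fraction of total engineering time spent on unplanned work."""
    total = planned_hours + unplanned_hours
    return unplanned_hours / total if total else 0.0
```

Sorting the per-service counter with `most_common()` gives a first-pass ranking of which systems deserve attention from a DX point of view.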

Putting the trifecta into action 

By adopting this three-pillar view, we moved from reactive to proactive and strategic.

1. From business-aware SLOs to top-down BLOs 

We stopped writing technical SLOs first and then trying to justify their business implications. Instead, we collaborated with leadership to identify top-level business-level objectives (BLOs). With a defined BLO of “Improve new user sign-up success rate from 95% to 98%” as our north star, my teams could derive the technical SLOs needed for the authentication service, the database and the front-end client. The work was no longer based on bottom-up discovery but on top-down direction and purpose.
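One way to see how a BLO constrains the SLOs beneath it is a simple success-rate budget. The sketch below assumes that a sign-up succeeds only if every component in the path succeeds and that component failures are independent, so the composite rate is the product of the component rates. Both the model and the function names are illustrative assumptions, not a description of how our teams actually set their targets.

```python
import math


def per_component_target(blo_target, n_components):
    """Equal success-rate target for each of n serial components.

    Assumes independent failures, so composite = product of component rates.
    """
    if not 0 < blo_target <= 1:
        raise ValueError("blo_target must be in (0, 1]")
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    return blo_target ** (1 / n_components)


def composite_success(rates):
    """Composite success rate of components that must all succeed."""
    return math.prod(rates)


# Three components in the sign-up path: authentication service,
# database and front-end client.
target = per_component_target(0.98, 3)
print(f"Each component needs about {target:.2%} success")
print(f"Three components at 99% give {composite_success([0.99] * 3):.2%}")
```

Under these assumptions each component needs a success rate of roughly 99.3%, which shows why three services that each look healthy at 99% still miss a 98% business objective.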

2. Observability-driven product strategy 

The trifecta was a powerful input for our product strategy. In one of our reviews, we could see that a legacy payments service had poor DX metrics: high cognitive load, slow time to deploy and mediocre performance. However, the business metrics showed that it was tied to only a small part of our overall recurring revenue. Given this holistic view, we made the strategic decision not to invest in fixing it, but to actively migrate the handful of remaining customers to our modern platform before deprecating the legacy service. Without the DX pillar, we risked spending months trying to enhance a very low-impact system. With the trifecta, we freed up an entire team to work on high-value, revenue-generating product innovation.

3. Making impact real: The cost of delay dashboard 

To make this real for everyone, we worked with our data science team and developed a new kind of dashboard on Datadog and Amplitude. Along with latency metrics, we now display a real-time dollar amount, the “cost of delay” metric: for every 100 milliseconds of latency added to our process, the model estimates the revenue impact. Once an engineer sees that a small change in performance costs the company $150-$200 per hour, the urgency to fix it becomes both personal and shared across the organization.
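The arithmetic behind a cost-of-delay figure can be sketched in a few lines. The example below assumes a purely linear relationship between added latency and lost revenue, which is a simplification chosen for illustration and not the model our data science team built. It is calibrated with the recommendation-engine incident described earlier (about $30,000 lost over 48 hours for a 300-millisecond slowdown), and the function names are hypothetical.

```python
def calibrate_rate(lost_revenue, hours, added_latency_ms):
    """Dollars lost per hour for each 100 ms of added latency.

    Assumes revenue loss scales linearly with latency and time.
    """
    if hours <= 0 or added_latency_ms <= 0:
        raise ValueError("hours and added_latency_ms must be positive")
    return lost_revenue / hours / (added_latency_ms / 100)


def cost_of_delay_per_hour(added_latency_ms, rate_per_100ms_hour):
    """Estimated hourly revenue impact of a latency regression."""
    # Latency improvements (negative values) are treated as zero cost here.
    return max(added_latency_ms, 0) / 100 * rate_per_100ms_hour


# Calibrate from the incident: $30,000 over 48 hours at +300 ms.
rate = calibrate_rate(30_000, 48, 300)
print(f"Rate: ${rate:,.2f} per hour per 100 ms")
print(f"+100 ms regression: ${cost_of_delay_per_hour(100, rate):,.2f} per hour")
```

Under the linear assumption, the incident works out to $625 per hour, or a little over $200 per hour for each 100 milliseconds. A real model would likely be non-linear and vary by traffic and time of day.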

The future: Composable view of value 

As we move forward, we aim to build a truly composable view of value. We are building tools to help product managers and technical leads model the trade-offs across the three pillars before any code is written.

Do you want to introduce a feature that will increase conversions by 3%? Our models will show the expected impact on system load and the estimated increase in operational complexity for the team that owns the feature. This would shift observability from a rearview mirror into a predictive, strategic tool for the whole enterprise.

Profits, platforms AND people! 

While closing the initial gap between tech and business was an important step, it was not the end of the journey. The greatest shift was an acceptance of a holistic value perspective that placed our people on the same level of consideration as our profits and platforms. 

The most gratifying experience in my career came not from a dashboard but from a planning meeting that included our head of product, our VP of engineering and a lead architect, where they all used the same terminology. They weren’t debating features; they were weighing near-term revenue impact against system reliability and developer velocity. They were speaking a shared, three-dimensional language of value. That was the moment I knew we were no longer simply a platform team but a core business driver.

This article is published as part of the Foundry Expert Contributor Network.

Rahul Chandel is an engineering leader with more than 15 years of experience in software engineering, distributed systems, cloud computing, blockchain technologies, payment systems and large-scale trading platforms. He has led high-performing teams at Coinbase, Twilio and Citrix, driving innovation, scalability and operational excellence across mission-critical systems. Rahul is passionate about fostering innovation and designing systems that thrive under real-world pressure.
