
Friday, May 8, 2026

E = MC²: The Equation That Never Gets Old

On Measurement, Continuous Improvement, and Customer Focus — Then and Now

(A decade and a half ago, I wrote two linked blog posts on Operational Excellence; they are referenced at the bottom of this article. Reading them in the context of today's world, many of the principles remain the same, but their manifestations are different. This article is an attempt to revisit the idea of Operational Excellence in the era of AI and agents.)

There is a particular kind of excitement that technology companies are exceptionally good at, and a particular kind of discipline they are chronically bad at. The excitement is building. The discipline is running. Every new feature, every new product, every new platform gets showered with energy, talent, and attention. The unglamorous work of making sure it all actually works, consistently, reliably, at scale, day after day, gets left to whoever is available, measured by whatever is easy to measure, and improved only when something breaks badly enough to be embarrassing.

This is not a new observation. But it has become a vastly more consequential one. Because we are now deploying AI systems and autonomous agents into operational environments at a pace that far outstrips our willingness, or our ability, to govern them. And the cost of that gap is no longer measured in minor inefficiencies. It is measured in compounding, invisible failures, in decisions that are wrong by design, in resources consumed by systems nobody is watching, and in customers quietly harmed by processes nobody is truly accountable for.

The answer to this is not more technology. It is better operational discipline. And the framework for that discipline is simpler than most people think.

We call it E = MC²: Excellence, derived from a culture that Measures relentlessly, pursues Continuous improvement, and never loses sight of Customer focus. These three elements are not independent. They are a virtuous cycle, each one feeding the others, each one incomplete without the others. Understanding how they connect, and how to make them real, is the central challenge of operational management in any era. Including this one.

Why Measurement is Hard, Even for People Who Handle Data for a Living

There is a paradox at the heart of the IT and services industries. These are sectors whose entire value proposition rests on data, on capturing it, organising it, analysing it, and making it useful. And yet, in practice, their internal operational measurement discipline is often surprisingly immature. The processes that organisations build for their customers are rarely applied with equal rigour to their own operations.

The reasons are not mysterious. The glamour in these industries flows toward novelty, toward "cool functions," "exciting features," and "latest gadgets." The boring pursuit of efficiency gains simply does not compete for talent or attention. When a senior engineer has a choice between building something new and spending six weeks instrumenting something old to understand why it sometimes fails, the outcome is predictable. And so operational measurement tends to happen reactively, in response to a crisis, a customer complaint, or a regulator's inquiry, rather than as a continuous, proactive discipline.

To learn how to do this differently, it helps to look at industries that never had the luxury of treating operations as an afterthought.

The hazardous chemical process industry is an instructive model, and not an intuitive one. It has been around for centuries, long enough to have matured its operational practices through hard experience. Its product lines are largely commoditised, which means margins are thin and efficiency is not optional; it is existential. The consequences of process failures are sometimes fatal, which means the scrutiny, public, regulatory, and internal, is unrelenting. And its processes are integrated end-to-end, with limited visibility into what is actually happening inside the pipes at any given moment, which forces a culture of strong monitoring and control.

These are, in fact, exactly the conditions that characterise complex digital operations today. Thin margins. High stakes. Limited internal visibility. Regulatory scrutiny. The main difference is that the chemical industry has spent decades building the measurement culture to match these conditions, while the technology and services industries are still, in many cases, at the beginning of that journey.

From that more mature tradition, three elements of measurement discipline emerge as foundational.

The Three Pillars of Measurement

Flow Management: Count Every Transaction

The first pillar is what might be called micromanagement of the operation, not in the pejorative sense of hovering over people, but in the precise sense of tracking each input through each sub-process it was meant to traverse, confirming it arrived correctly and without error.

This sounds obvious. In practice, it is done poorly, or not at all, especially for processes that are still evolving. When a new system or workflow is still being refined, exceptions proliferate. And exceptions, in young computerised systems, have a dangerous tendency to become invisible, swallowed by automated retry mechanisms, silently skipped, or classified as edge cases that never quite make it onto anyone's priority list.

The consequences of poor flow management are almost always financial and reputational, and they tend to be discovered embarrassingly late. A large bank once sent letters to its credit card customers admitting that it had not been tracking transactions correctly, and asking recipients to settle on the basis of their own personal records. The transactions were not hidden. They were not stolen. They had simply not been tracked. The systems were running; the accounting was not. When providers of transaction billing solutions are brought into organisations for the first time, the revenue leakage they surface, from transactions that fell through the cracks of inadequately monitored processes, is routinely staggering.

These are not exotic failures. They are the entirely predictable consequence of building systems without building the measurement infrastructure to watch over them.
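
To make the idea concrete, here is a minimal sketch, in Python, of what per-stage flow accounting can look like: count what entered and exited each sub-process, and surface whatever silently disappeared in between. The stage names and record IDs are illustrative assumptions, not a prescription for any particular system.

```python
# Minimal sketch of per-stage flow accounting. Stage names and
# record IDs are illustrative; a real pipeline substitutes its own.
STAGES = ["received", "validated", "enriched", "posted"]

def reconcile(stage_logs: dict[str, set[str]]) -> dict[str, set[str]]:
    """Compare the IDs that entered each stage with the IDs that
    reached the next one; return whatever silently disappeared."""
    leaks = {}
    for upstream, downstream in zip(STAGES, STAGES[1:]):
        lost = stage_logs[upstream] - stage_logs[downstream]
        if lost:
            leaks[f"{upstream} -> {downstream}"] = lost
    return leaks

# Transaction T3 entered validation but never reached enrichment,
# exactly the kind of silent loss that ends up as revenue leakage.
logs = {
    "received":  {"T1", "T2", "T3"},
    "validated": {"T1", "T2", "T3"},
    "enriched":  {"T1", "T2"},
    "posted":    {"T1", "T2"},
}
print(reconcile(logs))  # {'validated -> enriched': {'T3'}}
```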

Capacity Management: Know Where the Bottlenecks Are Before They Happen

The second pillar is the macro view, tracking the capacity of processes, people, service providers, and machines in order to anticipate bottlenecks before they become crises. This requires establishing trend measures for each element and monitoring them continuously, not just periodically.

Capacity management is especially treacherous in computerised environments for a structural reason: shared resources. Network infrastructure, compute capacity, database connections: these are all consumed by multiple processes simultaneously, and the utilisation curve for each process grows differently. A system that appears to have adequate capacity for today's workload may have none for tomorrow's if the growth curves are not being watched and modelled.
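
As a minimal sketch of that growth-curve modelling, assuming weekly utilisation samples from a monitoring system (the numbers below are invented for illustration), the question shifts from "do we have capacity today?" to "when do we run out?":

```python
from datetime import date, timedelta
from statistics import linear_regression  # available from Python 3.10

# Invented weekly utilisation samples for one shared resource,
# expressed as fractions of total capacity.
weeks = [0, 1, 2, 3, 4, 5]
utilisation = [0.52, 0.55, 0.59, 0.62, 0.66, 0.70]

slope, intercept = linear_regression(weeks, utilisation)

# Project the trend forward to a 90% alert line.
weeks_to_alert = (0.90 - intercept) / slope
print(f"At the current trend, ~90% utilisation in {weeks_to_alert:.1f} weeks, "
      f"around {date.today() + timedelta(weeks=weeks_to_alert)}")
```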

Two particular categories of hidden capacity consumers deserve special attention, because they are pervasive and almost universally underestimated.

The first is queries. Every business generates a need for data extracts — for management reporting, regulatory compliance, customer service lookups, and ad hoc analysis. These queries consume the same production capacity as the operational processes. And they are disproportionately likely to be written inefficiently, because they are typically assigned to junior resources or business analysts who lack the training to optimise them, and because there is very little accountability for query performance until something breaks. A query that was meant to run once becomes a standard report. A standard report that runs nightly becomes a standard report that runs hourly. The cumulative resource consumption creeps upward invisibly until, one day, the system slows to a crawl during peak operational hours, and nobody can immediately explain why.

The second is design debt. For most software developers, the genuine satisfaction is in building features. Once a feature is live and functioning, interest moves on. The pressure to optimise, to refactor, to improve efficiency, runs directly against the incentive to ship the next thing. The result is that bespoke systems accumulate performance inefficiencies that are never addressed, not because fixing them is technically difficult, but because nobody is measuring the cost of leaving them in place, and nobody is accountable for the cumulative drag. In most organisations, there is scope for at least a hundred percent improvement in process efficiency simply by addressing the worst of these design inefficiencies, but only if someone is measuring for them.

Service Levels: Commit to the Customer, Then Track the Commitment

The third pillar is where measurement connects most directly to purpose. The most powerful mechanism for ensuring that measurement and improvement activity stays focused and meaningful is to define, publicly and clearly, what the organisation is actually committing to deliver to its customers.

There is an important distinction to draw here between a Service Level Agreement and what might be called a Customer Service Commitment. An SLA is a floor — a formal definition of the minimum below which the organisation will try not to fall. It is a legal and contractual instrument, and it tends to create a culture of adequacy: as long as we are above the floor, we are fine. A Customer Service Commitment is something different. It is a genuine aspiration — a statement of what the organisation sincerely believes it can and should deliver, at a level meaningfully above the minimum.

This distinction matters because people and systems tend to optimise for what they are measured against. An organisation that measures against its SLAs will manage its operations to the SLA threshold. An organisation that measures against its Customer Service Commitments will manage its operations to the standard it actually believes in.

The mechanics of tracking these commitments deserve specific attention. Time-series data, tracking key performance parameters not just at a point in time but continuously over time, is essential for detecting trends before they become crises. A single data point tells you where you are today. A trend tells you where you are going. And it is the trend that matters operationally, because by the time a single bad reading turns into an obvious crisis, the window for preventive action has usually closed.
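
A minimal sketch of the difference between threshold alerting and trend alerting, assuming daily response-time readings against a hypothetical four-hour commitment, might look like this:

```python
from statistics import mean

COMMITMENT_HOURS = 4.0  # hypothetical commitment, for illustration
WINDOW = 7              # compare week-over-week rolling averages

def trend_alerts(daily_hours: list[float]) -> list[str]:
    """Flag a worsening trend while every reading is still inside the
    commitment: the window in which prevention is still cheap."""
    alerts = []
    for i in range(2 * WINDOW, len(daily_hours) + 1):
        previous = mean(daily_hours[i - 2 * WINDOW : i - WINDOW])
        current = mean(daily_hours[i - WINDOW : i])
        # A threshold alert fires only when COMMITMENT_HOURS is
        # breached; a trend alert fires while there is still time.
        if current > previous * 1.15 and current < COMMITMENT_HOURS:
            alerts.append(
                f"day {i}: rolling average {current:.2f}h, up from "
                f"{previous:.2f}h; within commitment, but trending wrong"
            )
    return alerts
```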

It is also worth having the team that tracks customer commitments sit separately from the team responsible for operations. This is not about distrust. It is about the structural reality that an operations team under pressure will, understandably, interpret ambiguous data in the most favourable light available. A separate tracking function provides the independent visibility that makes measurement honest.

Continuous Improvement: From Counting to Acting

All of this measurement serves one purpose: enabling the organisation to improve, continuously, before it is forced to by failure.

This is more difficult than it sounds, because the culture required to use data for continuous improvement is fundamentally different from the culture most organisations actually have. In most places, data tracking reports are either compliance artifacts, produced to satisfy an audit or a boss, or post-mortem instruments, pulled out after something has gone wrong to explain what happened. Neither of these uses generates improvement. They generate paper trails.

The culture of continuous improvement requires something harder: the regular, disciplined use of data to find problems that have not yet caused visible failures. This means looking at trend shifts before they become obvious. It means investigating unusual volatility in metrics that are still technically within acceptable bounds. It means preferring prevention over heroism — which runs directly against the organisational instinct that rewards the person who fixed the crisis rather than the person who avoided it.

To make this a habit rather than an occasional initiative, it has to become a ritual. The cadence of reviewing operational data, identifying trends, assigning root cause investigations, and tracking improvement actions has to be embedded into the organisation's regular rhythm, not treated as an additional burden on top of "real work." When it is done well, it does not feel like overhead. It feels like the organisation learning from itself in real time.

The AI Era Changes the Stakes, Not the Principles

Everything described above was relevant in 2009. It is more relevant now by an order of magnitude.

The introduction of AI systems and autonomous agents into operational environments does not render these principles obsolete. It makes them urgent. Because AI introduces a new category of operational actor, one that is more capable, more opaque, and more consequential than anything that preceded it, into environments that, in many cases, barely had adequate measurement cultures to begin with.

The most important thing to understand about AI in operations is that it fails in ways that are qualitatively different from how conventional software fails. Traditional software fails visibly. A system crashes. A transaction errors out. A service goes down. These failures are, in their own way, manageable, because they announce themselves. AI fails silently. A model that has drifted from its training data continues to generate outputs that look confident and coherent, while producing decisions that are subtly, systematically wrong. A recommendation engine with a bias baked into its training data does not flag an anomaly; it just consistently disadvantages certain customers. A document processing agent that hallucinates does not throw an exception; it produces a confident, plausible, and incorrect result.

This is the flow management problem, rewritten for the age of AI. Every AI-powered process needs a systematic accounting not just of what it produces, but of the quality, reliability, and drift of those outputs over time. The input went in; the output came out, but was the agent's reasoning within acceptable bounds? Was its confidence calibrated? Were there exceptions that the system silently swallowed rather than escalating to a human? The revenue leakage and customer harm that flow from unmonitored AI processes make the untracked credit card transactions of an earlier era look quaint.
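
A minimal sketch of what that accounting can look like, assuming an agent that reports a confidence score with each output; the threshold, field names, and escalation mechanism are all hypothetical stand-ins, not any specific framework's API:

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.75  # hypothetical threshold; calibrate per process

@dataclass
class AgentResult:
    output: str
    confidence: float  # assumes the agent reports a usable score

def run_with_accounting(agent, task_id: str, payload: str,
                        audit_log: list) -> str:
    """Wrap an agent call so every invocation is counted, every
    output is recorded, and low-confidence results escalate to a
    human instead of passing silently downstream. `agent` is any
    callable returning an AgentResult."""
    result = agent(payload)
    audit_log.append({
        "task": task_id,
        "confidence": result.confidence,
        "escalated": result.confidence < CONFIDENCE_FLOOR,
    })
    if result.confidence < CONFIDENCE_FLOOR:
        # The flow-management move: surface the exception rather
        # than letting a retry loop swallow it.
        raise RuntimeError(f"{task_id}: below confidence floor; "
                           "route to human review")
    return result.output
```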

The capacity management problem is also fundamentally transformed. AI models are the most resource-intensive entities ever introduced into enterprise operations. A single large model inference can consume more compute than an entire legacy application stack, and when multiple agents run concurrently, as they increasingly do, in agentic architectures where AI systems orchestrate other AI systems, the shared infrastructure constraints become genuinely complex to manage. The hidden capacity consumers have multiplied: poorly designed prompts that generate verbose, expensive outputs; inefficient agent chains that make redundant calls; one-time AI automations that quietly become permanent fixtures eating into rate limits and GPU capacity. None of this shows up on a standard IT dashboard unless someone has specifically built the instrumentation to see it.
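
None of this is exotic to build. A minimal sketch of a per-caller usage ledger, with invented agent names and token counts, shows how the hidden consumers start to show up by name:

```python
from collections import defaultdict

class UsageLedger:
    """Per-caller accounting of model usage, so hidden consumers
    (the hourly report, the forgotten automation) show up by name."""

    def __init__(self) -> None:
        self.tokens: dict[str, int] = defaultdict(int)
        self.calls: dict[str, int] = defaultdict(int)

    def record(self, caller: str, prompt_tokens: int,
               completion_tokens: int) -> None:
        self.tokens[caller] += prompt_tokens + completion_tokens
        self.calls[caller] += 1

    def top_consumers(self, n: int = 5) -> list[tuple[str, int]]:
        return sorted(self.tokens.items(),
                      key=lambda kv: kv[1], reverse=True)[:n]

# Invented numbers: the "one-time" report agent quietly dominates.
ledger = UsageLedger()
ledger.record("nightly_report_agent", 1800, 950)
ledger.record("support_triage_agent", 400, 120)
ledger.record("nightly_report_agent", 1750, 900)
print(ledger.top_consumers())  # [('nightly_report_agent', 5400), ...]
```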

And the service levels question, always the most important one, has become the most morally loaded. When an AI agent makes a decision that affects a customer, about a loan, a medical triage, a service entitlement, a pricing offer, that customer has a right to understand it, challenge it, and have a human correct it. This is not only a regulatory requirement in an increasing number of jurisdictions. It is the operational definition of customer focus in a world where the agent, not the employee, is the primary interface. A Customer Service Commitment in the AI era must include commitments about explainability, human override, and recoverability, not just turnaround time and accuracy.

The Measurement Culture the AI Era Demands

Bringing this together, what does operational excellence actually look like for an organisation running AI at scale?

It looks like flow management that tracks not just whether transactions were processed, but whether the AI agents that touched those transactions acted within defined parameters, and that surfaces exceptions rather than silently absorbing them.

It looks like capacity management that instruments AI resource consumption with the same rigour that a hazardous chemical plant instruments its pressures and temperatures, understanding not just current utilisation, but growth trajectories, shared resource constraints, and the hidden consumers that creep up over time.

It looks like Customer Service Commitments that extend into the AI layer, that define not just what will be delivered, but how decisions will be explained, how errors will be corrected, and how human accountability will be maintained even where AI is the primary actor.

And it looks like an organisation where data is used not to satisfy bosses or produce compliance artifacts, but as a genuine tool for continuous improvement by everyone at every level. Where a shift in a trend line is treated as a signal worth investigating, not as noise to be explained away. Where prevention is valued as much as heroism. Where the excitement of building is matched, at last, by the discipline of running.

The Hardest Part Has Not Changed

In the end, the measurement framework, however well designed, is only as good as the culture that uses it. And culture is stubbornly human. The data is the easy part. The hard part is persuading organisations and the people within them to use data as a tool for honest self-improvement rather than as a performance to be staged for external audiences.

That challenge has not changed in sixteen years. It will not change in the next sixteen either. What changes is the cost of getting it wrong.

Give the people the facts, about their processes, their agents, their customers, their capacity, their failures, and their potential, and they will, if the culture is right, do the right thing.

That is still the bet. It is a harder bet to lose than it has ever been. But it is the only bet worth making.

"The customer does not care about your dashboard. They care about what happened to them. Those are not always the same thing."

Related Posts

