Three API metrics that saved us more times than I can count

As a backend PM, you own the product and provide whatever the frontend or the client needs for its use cases. In short, you own the APIs and their behavior. When I started as a PM, I was obsessed with metrics. I thought I should have numbers for everything: every feature, every use case.

I pushed the team hard to build more and more visualizations, until the dashboards got messy and difficult to follow.

At that point, I started looking for a better way to do it.

How could I see that something was not working as expected? I talked to our architect at the time, and he told me that a lot of things can go wrong in an API and most of them are hard to see directly, but three metrics will show you the symptoms so that you can act and troubleshoot the problem.

For each API, you should track requests per second, latency, and error ratio. Three metrics. That’s it. Let’s dive into each of them and why it matters.

Requests per second (RPS)

Imagine your API normally handles around 200 RPS on a Tuesday afternoon. That afternoon, a campaign goes viral on TikTok, and you suddenly see a spike in users. The servers aren’t autoscaling fast enough, requests start queuing, and the downstream booking service gets hammered with calls it wasn’t provisioned for. Chaos!

This happened to us a few years ago. The good thing was that we had an alert set to fire at around a 30% increase in traffic, which gave us time to do two things: validate that the traffic was legit (real customers entering the funnel) and autoscale our infrastructure to support the new load.

That alert let us capitalize on the moment and offer a decent experience. Without it, we would have been firefighting instead of watching the bookings come in.
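
To make that kind of alert concrete, here is a minimal Python sketch of the check behind it. The function name, the sample readings, and the way the baseline is chosen are all assumptions for illustration; in practice a rule like this usually lives in your monitoring tool rather than in application code.

```python
def is_traffic_spike(current_rps: float, baseline_rps: float, threshold: float = 0.30) -> bool:
    """Return True when current traffic exceeds the baseline by more than `threshold`.

    `baseline_rps` is assumed to be the usual RPS for this time window
    (for example, the same hour last week). 0.30 mirrors the 30% alert above.
    """
    if baseline_rps <= 0:
        # No usable baseline: treat any traffic at all as worth a look.
        return current_rps > 0
    increase = (current_rps - baseline_rps) / baseline_rps
    return increase > threshold


# Hypothetical readings against the 200 RPS baseline from the example above.
normal_afternoon = is_traffic_spike(current_rps=210, baseline_rps=200)  # 5% increase
viral_afternoon = is_traffic_spike(current_rps=320, baseline_rps=200)   # 60% increase
```

The important design choice is comparing against a baseline instead of a fixed number: 320 RPS might be an ordinary Friday evening and an emergency on a Tuesday afternoon.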

Latency

This one happened to us at checkout. Our get booking flow API had a p50 latency of 120ms and a p99 of 340ms — stable for weeks. One Thursday, it started creeping: p50 held at 130ms, but p99 jumped to 1,800ms. RPS looked normal, error rate was zero. Something was slow but not broken.

The team dug in and found a database index that had stopped being used after a schema migration. No errors were thrown. Users on slower connections were just waiting nearly two seconds for a page that used to load in under half a second. Latency caught it; RPS and error rate told us nothing.

Without latency as a tracked metric, that problem could have lived there for months.
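
If percentiles are new to you, this small Python sketch shows why p99 caught the problem when a simple average would have hidden it. It uses the nearest-rank method on a made-up list of response times; the sample data and the function name are assumptions, and real monitoring systems typically estimate percentiles from histograms instead of raw samples.

```python
import math
from statistics import mean


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [120] * 98 + [1800] * 2

p50 = percentile(latencies_ms, 50)  # the typical request
p99 = percentile(latencies_ms, 99)  # the slowest 1% of requests
average = mean(latencies_ms)        # barely moves despite the slow tail
```

With this sample, p50 stays at 120 ms and the average rises only to about 154 ms, while p99 lands on 1,800 ms. That is why it pays to alert on a high percentile and not only on the median or the mean.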

Error ratio

This is the most visible metric — it turns red when something is broken, and sometimes it’s not even your fault.

Everything looks normal, and then you start seeing the error ratio on your API creeping up. The alert triggers at 5%, and it’s still climbing. You ask the team, and nobody has released anything related to that API. You dig deeper and find that another team made a change that broke the contract, and your API is suddenly unable to respond correctly.

This happens constantly in large teams. Services evolve at different speeds, teams maintain separate versions, and when someone deprecates an endpoint or makes what looks like a minor change, a client that hasn’t been updated in months breaks on a random Tuesday with no warning.

That alert is what turns a silent incident into something you can actually act on before your customers feel it.
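
As a rough illustration, the error ratio is just failed responses divided by total responses over a time window. The Python sketch below assumes that only 5xx status codes count as errors and uses an invented window of responses; whether some 4xx codes should also count depends on your API, so treat that rule as a placeholder.

```python
def error_ratio(status_codes: list[int]) -> float:
    """Share of responses in the window that failed (assumed here to mean HTTP 5xx)."""
    if not status_codes:
        return 0.0  # no traffic in the window, so nothing to report
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


ALERT_THRESHOLD = 0.05  # the 5% alert level mentioned above

# Hypothetical one-minute window: 94 successful responses and 6 server errors.
window = [200] * 94 + [500] * 6
ratio = error_ratio(window)
should_alert = ratio > ALERT_THRESHOLD
```

Tracking a ratio instead of a raw error count keeps the alert meaningful at any traffic level: six errors out of a hundred requests is a problem, while six out of a million is noise.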

These problems are difficult to prevent — they can happen on any random day. But having these three metrics on a dashboard with proper alerts configured will surface the symptoms early and give you and your team the time to act.

You can always add more specific metrics for particular use cases, and sometimes you should. But RPS, latency, and error ratio have saved us more times than I can count. Start there. Everything else is optional.