Debugging at 2 AM


It was 2:17 AM when my phone buzzed. PagerDuty. The kind of notification that makes your stomach drop before your eyes even focus.

[CRITICAL] API response time > 30s
Affected: 47% of requests
Duration: 12 minutes and counting

I was awake in seconds.

The Scene of the Crime

Here's what I knew:

  • Deploy happened at 11 PM (a Friday deploy, yes, I know)
  • Metrics were fine until 1:45 AM
  • Database connections looked normal
  • CPU and memory: fine
  • Logs: suspiciously quiet

That last one was the red flag I missed for twenty minutes.

The Hunt

  [2:23 AM]

     ___________
    |  _______  |
    | |       | |    $ tail -f /var/log/api.log
    | | ERROR | |    
    | |_______| |    ... nothing ...
    |___________|    
        | |         $ grep -r "panic" .
       _|_|_        
                    ... nothing ...

I ssh'd into production. Tailed logs. Ran htop. Everything looked... normal. Too normal.

Then I checked the deployment diff:

- logger.Info("processing request", req.ID)
+ // logger.Info("processing request", req.ID)

Someone (me, three weeks ago) had commented out logging to "reduce noise" during a debugging session. And never uncommented it.

The logs were quiet because we weren't writing any.

The Real Bug

Twenty minutes later, I found it. Deep in a goroutine:

for {
    select {
    case msg := <-channel:
        process(msg)
    default:
        // "optimization" - don't block
    }
}

That default case was a CPU-melting busy loop. Without the logging, there was no indication of the thousands of empty iterations per second. The deploy just happened to coincide with higher traffic that pushed it over the edge.

The Fix

Two lines deleted:

case msg := <-channel:
    process(msg)
// removed the default case

Deploy. Watch. Breathe.

  [3:04 AM]

     Response time: 45ms
     Error rate: 0%
     
         \o/
          |
         / \

What I Learned

  1. Never deploy on Friday. I know. We all know. We do it anyway.

  2. Logs are not noise. They're breadcrumbs. Remove them and you're lost in the forest.

  3. "Optimizations" need comments. If you're doing something clever, future-you needs to know why.

  4. The bug is never where you think. I spent 20 minutes checking database connections when the problem was a two-line diff from three weeks ago.

The Postmortem

The next Monday, I wrote it up. Blameless. We added monitoring for log volume — if it drops below a threshold, that's now an alert too.

And I uncommented the logging.


It's 3:30 AM now. The incident is closed. Sleep will not come easy, but at least the API is fast again.