The Most Important Lesson Wasn't in the Code

It was in the system around it

Dec 13, 2025

I was watching a few recent tech talks and platforms updates, the kind you half-listen to while doing something else, including AWS re:Invent 2025 in Las Vegas, when something caught my attention.

Not a feature. Not an AI demo. Not a big launch.

It was how many decisions the platform had already made for engineers.

At first I thought, maybe this is just AWS being more opinionated. But the more I watched, the more it reminded me of something much closer to home.

Personal Story: A Small Team, A Recurring Problem

A while back, I was working with a small team, a couple of engineers, moving fast, juggling product and infra at the same time.

At one point, we had this annoying pattern.

Every few weeks, something would go wrong in production. Nothing dramatic. But the first 30 minutes were always chaos. Logs were incomplete. Metrics existed, but not where we expected. Alerts fired, but didn’t really help.

We always ended up asking each other “Do you remember how this service works?”

We did the usual tings, talked about it in a retro, agreed it was important, and wrote a short guideline. Everyone nodded.

And then, the same thing happened again a few weeks later.

That’s when it clicked. This wasn’t a people problem. It was a system problem.

When the Same Problem Shows Up Again

What finally caught my attention wasn’t a big outage. It was how familiar everything felt.

Different week, different person, same questions:

“Where do I see what’s actually happening?”
“Do we even have logs for this?”
“Which dashboard should I look at?”

The first time, I brushed it off. Stuff happens. The second time, I couldn’t unsee the pattern. That’s when I started using a very simple rule for myself:

If a problem shows up twice, stop fixing the incident. Look at the setup.

Not the architecture. Not the people.

The boring stuff everyone walks past.

In our case, the setup was a service template and a PR checklist. Observability was “recommended”, but nothing broke if you skipped it. And under pressure, guess what people did.

I probably would’ve done the same.

What I Pay Attention to Now

These days, when something keeps breaking, I don’t ask who owns it first.

I ask a different question:

What did our setup make easy?

If I have to explain the same thing twice, I look for where to bake it in.

If something is important but optional, I assume it’ll be skipped.

I’m not claiming this is the “right” way. I’m still learning. But fixing incidents feels busy. Fixing the setup actually changes behaviour.

If you’re designing systems or running a small team and this sounds familiar, I’d love to compare notes. Always curious how others shape their defaults.

Discussion about this post

Ready for more?