Most teams adopt tracing to chase slow requests. That works, but it misses the real problem. In real SaaS systems, the hardest bugs aren’t about speed — they’re about understanding why something happened at all. This article reframes distributed tracing as a tool for causality: a way to follow cause and effect across async jobs, retries, and side effects, so debugging stops being guesswork and starts making sense again.
Intro
Most teams turn on tracing because something is slow.
That’s reasonable.
It’s also the wrong reason.
If tracing only helped you find slow endpoints, it would barely be worth the effort. Metrics already do that better. Logs usually get you there faster anyway.
Tracing becomes valuable when your system stops making sense.
And that always happens before performance becomes the real problem.
Why debugging still hurts even though “we have observability”
This is the part nobody likes to admit.
You’ve got logs.
You’ve got metrics.
You’ve even got traces wired up.
And yet, when something breaks in production, you still hear:
- “I’m not sure how this happened.”
- “It looks like the job ran twice?”
- “This request shouldn’t have triggered that.”
- “The system did the right thing… I think?”
If this feels familiar, it’s not because your tools are bad.
It’s because you’re using tracing to answer the wrong question.
Most teams use tracing to ask:
“What was slow?”
But the question that actually matters is:
“What caused this?”
Why we always start by chasing the slowest thing
Open any tracing UI and look at how people interact with it.
They sort by duration.
They click the slowest trace.
They zoom in on the biggest span.
That behavior is trained into us.
Latency is concrete.
Latency is numeric.
Latency feels actionable.
But here’s the uncomfortable truth:
The slow thing is usually not the broken thing.
It’s often just where the damage shows up.
By the time you’re staring at a slow database query or a long-running job, the real mistake already happened somewhere else. Earlier. Quieter. Faster.
Tracing doesn’t help when you use it like a stopwatch.
It helps when you use it like a timeline.
What tracing actually gives you (when you stop abusing it)
Logs tell you what a component thought happened.
Metrics tell you what happened on average.
Tracing tells you what actually happened, in order.
That’s the difference.
A trace isn’t a performance artifact.
It’s a causal artifact.
It answers questions like:
- What triggered this?
- What depended on that?
- What happened before this decision?
- What kept going even after the original request ended?
If logs are witness statements and metrics are summaries, traces are the security camera footage.
Not everything.
Not perfect.
But close enough to reconstruct reality.
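Here is a minimal sketch of what “in order” means in practice, using the OpenTelemetry Python API. The span names are placeholders I made up, and it assumes a tracer provider and exporter are configured elsewhere:

```python
from opentelemetry import trace

tracer = trace.get_tracer("example-service")

# The nesting *is* the causality: each child span records what triggered it,
# and the trace preserves the order those things actually happened in.
with tracer.start_as_current_span("handle_request"):
    with tracer.start_as_current_span("validate_input"):
        ...  # placeholder for real work
    with tracer.start_as_current_span("write_record"):
        ...
    with tracer.start_as_current_span("enqueue_followup_job"):
        ...
```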
Async is where your mental model quietly falls apart
Synchronous code is comforting.
You can read it top to bottom.
You can step through it.
You can pretend you understand it.
Async work is where that illusion dies.
Background jobs.
Queues.
Retries.
Webhooks.
Schedulers.
These things don’t just run later.
They detach causality unless you explicitly preserve it.
Most systems don’t.
They enqueue a job.
They forget why.
They hope it works.
And when it doesn’t, you’re left debugging effects with no memory of their cause.
Example 1: The signup job that “worked”
The naive implementation
User signs up.
The API handler does a few things:
- Creates the user
- Creates a tenant
- Enqueues a SendWelcomeEmail job
Looks clean. Ships fast. Everyone’s happy.
The job runs in the background.
It loads the user.
It sends the email.
It logs “success”.
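Roughly, the naive version looks like this. Every name here (the `db`, `queue`, and `mailer` objects, the payload shape) is a hypothetical stand-in for illustration, not the original code:

```python
# Naive version: the job gets a user id and nothing else.
# Why it exists, and what it belongs to, never leaves the request handler.
def handle_signup(request, db, queue):
    user = db.create_user(request["email"])
    tenant = db.create_tenant(owner_id=user.id)   # may not be finished or committed yet
    queue.enqueue("SendWelcomeEmail", {"user_id": user.id})
    return {"status": "ok", "tenant_id": tenant.id}

def send_welcome_email_job(payload, db, mailer):
    user = db.load_user(payload["user_id"])
    mailer.send_welcome(user)   # logs "success" whether or not tenant setup finished
```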
The moment it broke
One day, support reports:
“Some users are getting emails with missing tenant info.”
We check the logs.
The job succeeded.
No errors.
No retries.
The API endpoint looks fine.
The job looks fine.
This is where people start saying things like:
“Maybe the database was slow?”
“Maybe it’s eventual consistency?”
“Works on my machine.”
None of those guesses explain anything.
The symptom developers noticed
Nothing obvious.
No slow spans.
No failures.
Just incorrect behavior that couldn’t be explained.
The system didn’t crash.
It didn’t alert.
It just did the wrong thing.
Those are the worst bugs.
What actually happened
The email job ran before the tenant setup finished.
Not because of a race condition in code.
Because causality wasn’t preserved.
The job didn’t know:
- Who triggered it
- What step of signup it belonged to
- Whether it was safe to run yet
It was just a job. Floating in space.
The fix
We stopped treating the job as “background”.
We treated it as a continuation.
Same trace.
Same context.
Same causal chain.
Once we did that, the trace made the bug painfully obvious.
The job span sat before the tenant creation span.
No guessing.
No log archaeology.
Just “oh… that’s wrong.”
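One way to wire that up, sketched with OpenTelemetry context propagation. The queue, db, and mailer objects are hypothetical; the point is that the trace context rides along inside the job payload:

```python
from opentelemetry import trace, propagate

tracer = trace.get_tracer("signup-service")

def enqueue_welcome_email(queue, user_id):
    # Inject the current trace context (a W3C traceparent header) into the
    # job payload, so the job remembers which request caused it.
    payload = {"user_id": user_id, "otel": {}}
    propagate.inject(payload["otel"])
    queue.enqueue("SendWelcomeEmail", payload)

def send_welcome_email_job(payload, db, mailer):
    # Restore that context: this span becomes part of the signup trace
    # instead of the root of a brand-new, disconnected one.
    ctx = propagate.extract(payload["otel"])
    with tracer.start_as_current_span("SendWelcomeEmail", context=ctx):
        user = db.load_user(payload["user_id"])
        mailer.send_welcome(user)
```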
Why tracing only HTTP endpoints is lying to yourself
A lot of teams say:
“We have tracing.”
What they mean is:
“We trace inbound HTTP requests.”
That’s not tracing.
That’s logging with a better UI.
Real systems don’t stop at the request boundary.
They spill into:
- Jobs
- Retries
- Schedulers
- Side effects
- External systems
If your traces end when the request returns, you’re only seeing the opening scene of the movie.
Everything interesting happens after.
Example 2: The retry that charged the card twice
The naive implementation
Payment flow looks reasonable:
- Charge the card
- Record the payment
- If anything fails, retry the job
- Add an idempotency key “just in case”
This pattern exists in thousands of codebases.
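A sketch of that shape, with hypothetical names for the gateway, the database, the key scheme, and the retry wrapper:

```python
# Naive version, matching the bullets above. All names are stand-ins.
def charge_job(gateway, db, order_id, amount):
    key = f"charge-{order_id}"                        # the "just in case" idempotency key
    charge = gateway.charge(amount=amount, idempotency_key=key)
    db.record_payment(order_id, charge.id)            # a failure here re-runs everything above

def run_with_retry(job, *args, attempts=3):
    # "If anything fails, retry the job" means: run the code again,
    # with no record of which step already succeeded.
    for attempt in range(attempts):
        try:
            return job(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
```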
The moment it broke
A customer gets charged twice.
Panic ensues.
Metrics look fine.
Latency is fine.
No spike in errors.
The payment provider says:
“You sent two charge requests.”
The team says:
“We only retry on failure.”
Both are technically correct.
The symptom developers noticed
Nothing was slow.
Nothing crashed.
Everything “worked”.
Except the money part.
Logs show:
- A retry happened
- The idempotency key was reused
- The job completed successfully
So… why the double charge?
What actually happened
The retry didn’t mean “retry the same intent”.
It meant “run the code again”.
Between the first charge and the record write, something failed.
The retry ran with:
- The same code
- The same idempotency key
- But no memory of why it was retrying
The causal chain was broken.
Tracing showed two separate payment attempts with no shared narrative.
The fix
We traced decisions, not just calls.
The retry span was linked to the original attempt.
The trace told a story:
First attempt started.
Charge succeeded.
Failure occurred after.
Retry continued the same intent.
Once you could see that, the fix was obvious.
You don’t retry code.
You retry state progression.
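Here is one way that can look, sketched with OpenTelemetry span links. The `find_charge` lookup and the key scheme are assumptions for illustration, not the original code; the point is that the retry carries the original attempt with it and records its decisions:

```python
from opentelemetry import trace

tracer = trace.get_tracer("payments")

def run_payment_attempt(gateway, db, order_id, amount, previous_attempt=None):
    # Link this attempt to the one it continues, so the trace shows a single
    # payment intent progressing instead of two unrelated charge calls.
    links = [trace.Link(previous_attempt)] if previous_attempt else []
    with tracer.start_as_current_span("payment.attempt", links=links) as span:
        span.set_attribute("payment.order_id", str(order_id))
        span.set_attribute("payment.is_retry", previous_attempt is not None)

        charge = db.find_charge(order_id)  # hypothetical: did an earlier attempt already charge?
        if charge is None:
            span.add_event("decision", {"action": "charge"})
            charge = gateway.charge(amount=amount, idempotency_key=f"charge-{order_id}")
        else:
            span.add_event("decision", {"action": "skip_charge", "reason": "already charged"})

        db.record_payment(order_id, charge.id)
        return span.get_span_context()  # hand this to the next attempt if this one fails
```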
Tracing without causality is just expensive logging
This is where I get opinionated.
If your traces:
- Are heavily sampled
- Only exist for HTTP
- Drop context at async boundaries
- Don’t survive retries
Then you don’t have tracing.
You have logs with trace IDs sprinkled on top.
Sampling is fine for performance analysis.
It’s terrible for explaining rare bugs.
The bugs you care about are the ones that happen once.
At 2am.
To one customer.
Those are the ones where causality matters most.
Example 3: The feature flag that split reality
The naive implementation
There’s a new sync pipeline behind a feature flag.
Worker code checks a boolean and chooses a path.
Simple.
Flexible.
Ship it.
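The shape of the naive version, with hypothetical placeholders for the flag client and the two code paths:

```python
def new_pipeline_sync(record):
    ...  # placeholder for the new code path

def legacy_sync(record):
    ...  # placeholder for the old code path

def sync_tenant(flags, db, tenant_id):
    # Naive version: the flag is re-evaluated inside the worker, per record.
    # If the flag flips mid-flight, one logical sync runs under two sets of rules.
    for record in db.records_for(tenant_id):
        if flags.is_enabled("new-sync-pipeline", tenant_id):
            new_pipeline_sync(record)
        else:
            legacy_sync(record)
```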
The moment it broke
Tenants start reporting inconsistent data.
Some records are updated.
Some aren’t.
Same tenant.
Same timeframe.
Nobody can reproduce it locally.
The symptom developers noticed
No errors.
No slow jobs.
No clear pattern.
Logs say the flag was on.
Other logs say it was off.
Everyone starts blaming caching.
What actually happened
The feature flag was evaluated inside the worker.
The flag changed mid-flight.
Different parts of the same logical operation ran under different rules.
From the system’s point of view, reality split.
The fix
We captured the flag value at the start.
We carried it through the trace.
The sync didn’t ask, “what’s the flag now?”
It knew, “this is the path I’m on.”
Once we did that, traces stopped being confusing.
And the bug stopped existing.
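A sketch of the fixed shape, reusing the placeholder paths from the earlier sketch. The flag client and queue are hypothetical; the decision is made once, carried in the payload, and recorded on the span:

```python
from opentelemetry import trace

tracer = trace.get_tracer("sync-worker")

def start_sync(flags, queue, tenant_id):
    # Decide the path once, up front, and make the decision part of the job.
    use_new_pipeline = flags.is_enabled("new-sync-pipeline", tenant_id)
    queue.enqueue("sync_tenant", {"tenant_id": tenant_id,
                                  "use_new_pipeline": use_new_pipeline})

def sync_tenant_job(payload, db):
    with tracer.start_as_current_span("sync_tenant") as span:
        # The decision travels with the work and shows up in the trace.
        span.set_attribute("feature_flag.new_sync_pipeline", payload["use_new_pipeline"])
        for record in db.records_for(payload["tenant_id"]):
            if payload["use_new_pipeline"]:
                new_pipeline_sync(record)
            else:
                legacy_sync(record)
```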
When effects outlive their causes, you lose the plot
This is the common thread in all of these.
Something happens.
Later, something else reacts.
But the reaction has no memory of the original cause.
At that point:
- Logs lie
- Metrics smooth things out
- Performance data looks normal
And you’re left with vibes.
Tracing is the only tool that can preserve the “why”.
But only if you treat it that way.
What changed when we stopped tracing speed and started tracing flow
The shift is subtle.
But once it happens, you can’t unsee it.
We stopped asking:
“How long did this take?”
We started asking:
“What caused this to run at all?”
We started:
- Carrying context everywhere
- Linking jobs to requests
- Treating retries as forks, not resets
- Recording decisions, not just calls
Debugging stopped feeling like archaeology.
You follow the trace.
You see the chain.
You understand the mistake.
It’s boring again.
That’s a good thing.
How this changes how you design systems
You start designing with continuity in mind.
APIs stop being “endpoints”.
They become transitions.
Jobs stop being “background”.
They become progress.
Retries stop being “try again”.
They become explicit branches.
Feature flags stop being magic.
They become part of the execution context.
None of this requires new tools.
It requires better judgment.
Performance is a symptom. Causality is the system.
Performance problems come and go.
Causality problems compound.
If you can’t explain why something happened, you don’t really have observability.
Tracing isn’t about speed.
It’s about truth.
Once you internalize that, everything else gets easier.