OAuth Tokens in a Race: A Production Debugging Story


Note: After I reported and investigated the issue described in this post, Apideck published a guide outlining the issue and recommended mitigation strategies. If you’re looking for a practical reference, you can find it here.

Part 1: The Problem

Context: An Integration’s Sneaky Failure

In a previous role, I worked on a feature that depended on a unified API to integrate with a third-party service. This feature required making several API calls in parallel, and as it evolved, more requests were added to the process.

What started as two requests sent together grew to four, and eventually to six. The last two requests were optional, triggered only under specific conditions, and the feature did not error if those responses failed or never came back. That detail will matter later.
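The shape of that batch matters, so here's a minimal sketch of it. All names and endpoints are hypothetical, not the real feature's code; the point is that failures in the optional tail are swallowed rather than surfaced.

```python
import asyncio

# Hypothetical request helper -- a stand-in for real HTTP calls.
async def fetch(name: str, should_fail: bool = False) -> dict:
    if should_fail:
        raise RuntimeError(f"{name} failed")
    return {"request": name, "ok": True}

async def run_batch() -> list:
    required = [fetch("accounts"), fetch("contacts"),
                fetch("invoices"), fetch("payments")]
    optional = [fetch("attachments", should_fail=True), fetch("notes")]

    # Required requests propagate errors; optional ones have failures
    # swallowed -- which is exactly how a 401 on an optional request
    # can go unnoticed.
    results = list(await asyncio.gather(*required))
    optional_results = await asyncio.gather(*optional, return_exceptions=True)
    results.extend(r for r in optional_results if not isinstance(r, Exception))
    return results

results = asyncio.run(run_batch())
```

With this structure, a failed optional request simply produces a shorter result list, and nothing downstream complains.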

At the time, I made a few assumptions. One of them was that the unified API could safely handle concurrent requests.

Pro tip I’ve learned the hard way: do not assume an API or service provider does something, even if it feels obvious, unless they explicitly document that they do!

The Only Symptom: A “Random” 401

I had just wrapped up a project a little early when my PM sent me a new ticket. The entire ticket consisted of eight words and a screenshot of some logs:

“Look into this random 401 error in Apideck.”

Hmmmm. Not exactly detailed, but I actually love tickets like this. What developer doesn’t enjoy a little mystery?!

Early Hypotheses and Immediate Confusion

I started by combing through the Apideck logs, focusing first on the batch of requests that included the 401. The failure always came from one request in the larger group, usually toward the end. At first, I assumed the user’s session might have expired right as that final request was being processed.

That assumption didn’t hold up.

Next, I filtered the logs to show only 401 responses. That’s when the confusion really set in, as I couldn’t identify a clear pattern.

Most of the time, the failure appeared in one of the last two requests. Remember how those were optional? That started to feel suspicious. But every once in a while, one of the first requests in the batch would fail instead.

Seeing failures on requests that came earlier in the batch is when I started to feel genuinely stuck.

I knew the requests themselves were not malformed. They were all sent together, using the same authentication context, and most (usually all) of them succeeded. When the failure happened, it looked totally arbitrary.

Signals That Didn’t Quite Line Up

The error message referenced an invalid refresh token, but even that wasn’t consistent.

Sometimes the request failed while using what appeared to be the “new” refresh token. Other times it failed with the old one. There was no obvious correlation between which token was used and which request failed.

Part of the problem was that I couldn’t reason clearly about timing. I didn’t control the OAuth flow, and token exchange lived entirely inside an external system.

To make matters more confusing, the documentation for the individual API we were integrating with didn’t line up perfectly with the unified API documentation. Each described token refresh behavior slightly differently, including when refreshes occurred and under what conditions.

That led me down several unproductive paths. If the error said “invalid refresh token,” how could I guarantee every request was using the most up-to-date token? When exactly did the token exchange happen? Which system was the source of truth?

None of those questions had clean answers.

After a lot of log analysis, I was eventually able to narrow down the conditions under which the bug might appear. A consumer needed to have an active session, and they needed to wait at least one hour between making requests. If both of those were true, the failure could occur. Not would occur, but could.

The Lengths We Go to Reproduce Bugs We Don’t Control

This was easily the hardest part of the entire investigation. Now that I had my theory, it was time to test it. My eyes hurt from scrolling through countless logs and my brain was scrambled from hours of scanning “Okay, request A token ended in 6b788, and request D ended in…”

Because the OAuth flow was external and time-based, reproducing the issue meant a lot of waiting. I literally set timers on my phone for days, at work and at home, so I could re-trigger the flow at the right interval. To reproduce the bug, I had to keep my session from timing out while still waiting at least an hour between requests.

Meetings with other engineers were occasionally interrupted with, “Sorry, one second. My timer just went off,” which is not a phrase you expect to hear in a work meeting, but it was worth it to slowly piece together the puzzle. There was no quick feedback loop. No reliable reproduction steps. Just a narrow window where the timing might line up well enough (or badly enough, depending on your outlook) for the issue to surface.

I didn’t fully understand what was happening, only that the timing and token exchange were not behaving the way I expected.

Part 2: The Root Cause and the Fix

Up until this point, everything about the issue felt inconsistent. Eventually, that inconsistency itself became the clue.

The Eureka Moment: When Inconsistency Becomes the Tell

After spending enough time staring at logs, something finally clicked.

Even when I managed to reproduce the issue, the failure never occurred the same way twice.

At first, that variability felt like chaos, like there were a dozen different paths to explore at once. Over time though, it became clear that the inconsistency itself was the signal. It wasn’t hiding the problem, but rather pointing directly at it.

This wasn’t a single bad request or a malformed payload. This was a timing issue. A concurrency issue. A race condition.

More specifically, the race was happening during the token refresh exchange itself.

Confirming the Diagnosis: A Token Exchange Race Condition

Once I had that framing in mind, everything started to make a lot more sense.

Under the right timing conditions, multiple concurrent requests could independently trigger a token refresh. Instead of one request refreshing the token and the others reusing it, each request could attempt its own exchange. When that happened, one refresh would invalidate another, and whichever request lost that race would fail with a 401.
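The failure mode above can be modeled with a toy auth server that rotates refresh tokens, where each exchange invalidates the token that was used. The class and token names here are illustrative assumptions, not Apideck's internals; this only demonstrates the race in miniature.

```python
# Toy model of rotating refresh tokens: each successful exchange
# invalidates the token that was used for it.
class AuthServer:
    def __init__(self):
        self.valid_refresh_token = "rt-1"
        self.counter = 1

    def exchange(self, refresh_token: str) -> str:
        if refresh_token != self.valid_refresh_token:
            raise PermissionError("401: invalid refresh token")
        # Rotation: issue a new refresh token, invalidating the old one.
        self.counter += 1
        self.valid_refresh_token = f"rt-{self.counter}"
        return self.valid_refresh_token

server = AuthServer()
stale = server.valid_refresh_token  # both requests read the same token state

# Request A wins the race: its exchange succeeds and rotates the token.
new_token = server.exchange(stale)

# Request B loses: it still holds the pre-rotation token, so it 401s.
try:
    server.exchange(stale)
    outcome = "ok"
except PermissionError as e:
    outcome = str(e)
```

Which request plays the role of "B" depends entirely on scheduling, which is why the failing request looked random in the logs.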

That explained why different requests failed each time, why the failures weren’t tied to a specific endpoint or token, and why the issue only surfaced under specific timing conditions.

As I suspected, nothing about our requests was incorrect. The behavior lived entirely inside the OAuth flow of an external provider.

“Can We Control This From Our Side?”

Once the root cause was clear, the next question was obvious: could we fix or control this ourselves?

I explored every option I could think of. Could we serialize requests? Could we add locking? Could we ensure only one refresh happened at a time?

The answer, unfortunately, was no.

The token refresh logic was completely outside our system. We had no way to coordinate or single-flight the exchange on our end. Once the requests left our service, we were at the mercy of how the external OAuth flow handled concurrency.
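For contrast, here is roughly what single-flighting the refresh would look like if the exchange lived in our own code, which it did not. This is a sketch of the pattern we could not apply, with hypothetical class and method names: a lock plus a re-check ensures only one exchange happens no matter how many requests race to it.

```python
import asyncio

# Single-flight token refresh: only one coroutine performs the exchange;
# the rest wait on the lock and reuse the result.
class TokenManager:
    def __init__(self):
        self._token = None
        self._lock = asyncio.Lock()
        self.exchanges = 0  # counts real exchanges, for demonstration

    async def _do_exchange(self) -> str:
        self.exchanges += 1
        await asyncio.sleep(0)  # stand-in for the network round trip
        return f"token-{self.exchanges}"

    async def get_token(self) -> str:
        if self._token is None:          # fast path: token still valid
            async with self._lock:
                if self._token is None:  # re-check after acquiring the lock
                    self._token = await self._do_exchange()
        return self._token

async def main():
    mgr = TokenManager()
    tokens = await asyncio.gather(*(mgr.get_token() for _ in range(6)))
    return mgr.exchanges, tokens

exchanges, tokens = asyncio.run(main())
```

Six concurrent callers, one exchange. Without control over the exchange itself, though, there was nothing on our side to put the lock around.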

Documentation Gaps and Dead Ends

At this point, I went back to the documentation. Thoroughly.

I reread the unified API docs. I cross-referenced them with the documentation for the individual API we were integrating with. Each source described token refresh behavior slightly differently, including when refreshes occurred and how long tokens were expected to remain valid.

None of it mentioned how concurrent refresh attempts were handled. Nothing warned about race conditions. Nothing suggested that multiple requests might each trigger their own exchange under certain timing conditions.

I noticed an endpoint that seemed related to token states, but the documentation at the time read as if it didn’t fully match our needs.

Working With Support (and Why “Just Break Up the Requests” Didn’t Work)

Eventually, I reached out to Apideck support to talk through what I was seeing.

One early suggestion was to break up the API calls and make them sequential instead of concurrent. In isolation, that makes sense. In practice, it didn’t work for our feature. The requests were grouped for a reason: partial results weren’t acceptable, and neither was the major increase in latency from sending them one by one.

This issue was especially concerning because the failures were silent. Since some of the requests were optional, the feature wouldn’t hard error. It would simply fail to retrieve data that the user should have had access to.

Even though the feature was still only rolled out to a small group of users, we couldn’t afford to quietly miss data. That meant fixing the root cause, not just logging the failure and moving on.

I also started thinking about other defensive options. Could we track when a user was approaching a refresh window and proactively force them to re-authenticate with the third party entirely? Re-authentication was at least something we had control over, and it felt preferable to returning incomplete data.

The Actual Fix: Forcing a Token Refresh Up Front

After more back and forth with support, and eventually speaking directly with one of the Apideck founders (shoutout Apideck!), something important became clear.

That endpoint I had dismissed earlier actually did fully support our use case. It allowed us to explicitly force a token refresh before making any other requests.

Once I pointed out the race condition and how it surfaced under concurrency, Apideck updated their documentation regarding that endpoint, and published a guide outlining the issue and potential solutions, including the exact approach we ended up using:

https://developers.apideck.com/guides/refresh-token-race-condition

The fix was simple in concept. Before making any of the grouped API calls, we would explicitly trigger a token refresh. That ensured all subsequent requests shared the same, valid token and fully eliminated the race condition.
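In code, the fix amounts to serializing only the refresh and keeping everything else concurrent. The helper names below are hypothetical stand-ins, not Apideck's actual endpoint or SDK; the structure is what matters.

```python
import asyncio

# Stand-in for an explicit token-refresh call to the provider.
async def force_token_refresh() -> str:
    return "fresh-token"

# Stand-in for one of the grouped API calls.
async def call_endpoint(name: str, token: str) -> dict:
    return {"endpoint": name, "token": token}

async def run_feature() -> list:
    # Step 1: one explicit refresh completes BEFORE anything else starts.
    token = await force_token_refresh()

    # Step 2: the grouped calls stay concurrent. The race is gone because
    # none of them can trigger its own refresh mid-flight.
    endpoints = ["accounts", "contacts", "invoices", "payments"]
    return await asyncio.gather(*(call_endpoint(e, token) for e in endpoints))

results = asyncio.run(run_feature())
```

Only the refresh is sequential, so the latency cost is a single extra round trip rather than serializing the whole batch.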

The Tradeoff: 500 ms Well Spent

This solution added some overhead, as forcing a token refresh increased the request flow round trip by roughly 500 ms.

In this case, that increased waiting time was more than acceptable when the alternative was silently missing data and returning incomplete results to users. A slightly slower request was a small price to pay for correctness, predictability, and trust in the integration.

After we made that change, the random 401s disappeared entirely, and my coworkers never had to hear my cursed timers going off again.

Once the issue was resolved, I realized this bug had taught me more than I expected.

Lessons Learned

This issue stuck with me not because it was particularly flashy, but because of what it revealed about assumptions I didn’t realize I was making.

Concurrency makes assumptions visible.
Everything about this integration worked until it didn’t, and the moment multiple requests overlapped in just the wrong way is when it silently broke. If an API doesn’t explicitly guarantee single-flight behavior for something like token refresh, assume it won’t happen for you.

Inconsistency is often a signal, not just a symptom.
The fact that different requests failed under different conditions made the problem feel impossible to narrow down at first. In reality, that variability was the defining characteristic of the bug. Once I stopped looking at individual failures and started looking at system behavior over time, the root cause became much clearer. You (hopefully) won’t catch me getting fooled by a race condition again.

External OAuth flows are a different class of problem.
When token exchange lives outside your system, you lose a lot of control. Debugging becomes slower, reproduction gets harder, and “just add logging” only takes you so far. Designing defensively around that goes a long way.

Optional data still has correctness requirements.
This was one of the biggest takeaways for me. Just because a piece of data is optional does not mean it’s acceptable to silently fail to retrieve it in every case. In our case, those optional requests masked the problem for a long time. The feature didn’t hard error, but users could have been missing data they should have had. Optional does not mean unimportant, and it definitely doesn’t mean safe to ignore when you explicitly request it and don’t get it.

Correctness is often worth a small performance cost.
Forcing a token refresh added latency, but it eliminated an entire class of silent failures. A slightly slower request was a small price to pay for predictable behavior and data integrity.

Key Takeaway

Looking back, this bug wasn’t just about OAuth or race conditions. The sleuthing was an exercise in learning where abstractions end, where assumptions creep in, and how easily things that “work” on the surface can hide real problems until the timing is just right.