Konstantin Tarkus
@koistya.com
Building the rails that AI workflows run on. Principal Engineer architecting the foundation for safe & scalable intelligence. ☕️ // Code #DevOps #Startups #AI
November 5, 2025 at 8:59 AM
9/9 Need the full checklist, recovery-time calculator, and TypeScript snippets for per-user, per-route, and cost-based buckets? I broke it all down here →

levelup.gitconnected.com/designing-fa...

github.com/kriasoft/ws-...
November 3, 2025 at 5:59 PM
8/9 Before rollout: 10× baseline load, jitter burst test, abuse flood, mixed workloads. After launch: watch rate-limit hit %, histogram tokens-at-reject, and keep limiter latency <1 ms or it becomes the bottleneck.
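A rough sketch of those two launch metrics, assuming a hypothetical limiter whose tryConsume() reports remaining tokens, and a generic metrics.observe():

```ts
type Decision = { allowed: boolean; tokensLeft: number };

function checkWithMetrics(
  limiter: { tryConsume(key: string, cost: number): Decision },
  metrics: { observe(name: string, value: number): void },
  key: string,
  cost: number,
): boolean {
  const start = performance.now();
  const { allowed, tokensLeft } = limiter.tryConsume(key, cost);
  // Budget check: if this histogram drifts past ~1 ms, the limiter
  // itself is becoming the bottleneck.
  metrics.observe("limiter_latency_ms", performance.now() - start);
  if (!allowed) {
    // Tokens remaining at the moment of rejection: rejects near zero
    // suggest sustained overload; rejects with tokens left point at
    // expensive operations being priced above the remaining balance.
    metrics.observe("tokens_at_reject", tokensLeft);
  }
  return allowed;
}
```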
November 3, 2025 at 5:59 PM
7/9 Cost-based limiting is underrated. Set text = 1 token, file upload = 20, admin command = 10. One bucket per user, different prices. Suddenly your limiter tracks backend spend instead of treating every packet as equal.
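A sketch of the pricing, reusing the same token-bucket primitive sketched under 2/9 below (costs and operation names are illustrative):

```ts
// Illustrative prices: tokens charged per operation type.
const COST: Record<string, number> = {
  "chat:text": 1,
  "file:upload": 20,
  "admin:command": 10,
};

// One bucket per user, same refill; only the price per call differs.
function allowOp(
  bucket: { tryConsume(cost: number): boolean },
  op: string,
): boolean {
  return bucket.tryConsume(COST[op] ?? 1);
}
```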
November 3, 2025 at 5:59 PM
6/9 Layer limits like a pyramid: Global → per-IP/device → per-user-per-route → per-operation cost. Global caps stop botnets, lower tiers keep heavy routes or chatty users from starving everyone else.
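Sketched in TypeScript, assuming a hypothetical getBucket(key) store that returns one token bucket per key:

```ts
type Bucket = { tryConsume(cost: number): boolean };

function allow(
  getBucket: (key: string) => Bucket,
  ip: string,
  userId: string,
  route: string,
  cost: number,
): boolean {
  const layers = [
    getBucket("global"),                  // stops botnet-scale floods
    getBucket(`ip:${ip}`),                // per-IP/device
    getBucket(`user:${userId}:${route}`), // per-user-per-route
  ];
  // Caveat: this consumes from upper tiers even when a lower tier
  // rejects; a production version would refund, or peek before consuming.
  return layers.every((bucket) => bucket.tryConsume(cost));
}
```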
November 3, 2025 at 5:59 PM
5/9 Three failure modes I still see weekly:

• Capacity ≫ rate → giant attack window
• Only per-user buckets → aggregate floods melt the DB
• Zero jitter headroom → works on Ethernet, fails on 4G

All three come from sizing purely in the lab.
November 3, 2025 at 5:59 PM
4/9 Starting points that work well:

• Chat: 100–200 capacity @ 1–2 tokens/sec (lets users paste history; yes, recovery is ~50–200 s, but bursts are rare)
• 30 Hz games: 10–15 capacity @ 35–40 tokens/sec (≈30 updates/sec plus jitter)
• Streaming, when tokens track KB: capacity ≈ bitrate × buffer seconds
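As bucket configs, with mid-range picks from the ranges above (purely illustrative numbers):

```ts
const PRESETS = {
  chat:     { capacity: 150, refillPerSec: 1.5 }, // paste-history bursts OK
  game30hz: { capacity: 12,  refillPerSec: 38 },  // ~30 updates/sec + jitter
  // Streaming, if 1 token = 1 KB: capacity ≈ bitrate × buffer seconds,
  // e.g. 500 KB/s with a 2 s buffer:
  stream:   { capacity: 1000, refillPerSec: 500 },
};
```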
November 3, 2025 at 5:59 PM
3/9 Shortcut: recovery_time = capacity / refill_rate. Want a user to bounce back from empty in 3 s? At 10 tokens/sec you need ~30 capacity; at 40 tokens/sec you need ~120. Pick the recovery window first, then solve for both numbers.
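The same math as a two-line helper:

```ts
// recovery_time = capacity / refill_rate, so fix the recovery window
// first and solve for capacity.
const capacityFor = (recoverySec: number, refillPerSec: number) =>
  recoverySec * refillPerSec;

capacityFor(3, 10); // => 30 tokens
capacityFor(3, 40); // => 120 tokens
```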
November 3, 2025 at 5:59 PM
2/9 Two knobs drive everything: capacity = burst tolerance, refill rate = sustained throughput (tokens/sec). Treat them as a pair. Set them independently and you either throttle legitimate spikes or hand attackers a long grace period.
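A minimal token bucket with exactly those two knobs (lazy refill on each call, no timers):

```ts
class TokenBucket {
  private tokens: number;
  private last = Date.now();

  constructor(
    private capacity: number,     // burst tolerance
    private refillPerSec: number, // sustained throughput
  ) {
    this.tokens = capacity; // start full so the first burst is allowed
  }

  tryConsume(cost = 1): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, clamped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.last) / 1000) * this.refillPerSec,
    );
    this.last = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```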
November 3, 2025 at 5:59 PM
TL;DR

- "A distributed lock ≠ a mutex"
- "Locks are leases"
- "Leases expire"
- "Fencing tokens make expiry safe"

🧩 Full explanation + examples (Redis, Postgres, Firestore):

levelup.gitconnected.com/beyond-the-l...
October 18, 2025 at 2:38 PM
Even if a process wakes up from a 1-minute pause, it can’t corrupt data anymore — its stale token is rejected.

The system becomes safe by design, not by timing.
October 18, 2025 at 2:38 PM
The fix is adding a third party — the resource itself.

Every lock acquisition returns a fencing token, a monotonically increasing number.

Each write includes the token, and the resource rejects any write with an older token.

✅ Deterministic correctness.
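A single-process sketch of the resource-side check; real resources (a Postgres row version, an object-store write precondition) would do this compare atomically:

```ts
const lastToken = new Map<string, number>();

function fencedWrite(key: string, token: number, write: () => void): boolean {
  const newest = lastToken.get(key) ?? -1;
  if (token <= newest) return false; // stale holder: write rejected
  lastToken.set(key, token);         // remember the newest token seen
  write();                           // no older holder can now overwrite
  return true;
}
```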
October 18, 2025 at 2:38 PM
That’s not a bug in Redis.
Not even a bug in your code.

It’s a two-party problem:

- "Client thinks it holds the lock"
- "Lock manager knows it expired"

There’s no one to stop the stale client from writing bad data.
October 18, 2025 at 2:38 PM
While your process is frozen, the lock expires.
Another instance grabs the same lock and processes the payment again.

When your original process wakes up, it still thinks it holds the lock — and charges the customer twice.
October 18, 2025 at 2:38 PM
Imagine this:

Your service acquires a Redis lock with a 30-second TTL to process a payment.

Looks safe, right?

But then a JVM GC pause or a network stall freezes your process for 35 seconds… 🧟
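The acquisition step from that scenario, sketched with ioredis as one atomic SET ... NX PX (key names are illustrative):

```ts
import Redis from "ioredis";

const redis = new Redis();

async function acquirePaymentLock(paymentId: string, holder: string) {
  // Sets the key only if it doesn't already exist, with a 30 s TTL;
  // that TTL is exactly the lease that can expire mid-pause.
  const ok = await redis.set(
    `lock:payment:${paymentId}`, holder, "PX", 30_000, "NX",
  );
  return ok === "OK";
}
```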
October 18, 2025 at 2:38 PM