

Good API rate limit design protects your service from abuse without punishing the customers paying you. Most teams get the protection half right and botch the other half: they key on IP, return a bare 429 with no information, and reject the first short burst of an otherwise well-behaved client. The fix is three deliberate choices made before you write any throttling code: key on the tenant, not the IP; tell the client its remaining budget in every response; and allow short bursts while capping the sustained rate. Get those three right and you can be aggressive about protection while almost never blocking someone who should be allowed through. This post is about the design. The Laravel mechanics are a separate write-up.

What should you actually key the limit on?

The default everyone reaches for is the client IP, and it is the wrong default for an authenticated API. Keying on IP punishes shared infrastructure: a corporate office, a university, a mobile carrier, or any cloud NAT gateway funnels hundreds of independent users through one address. One noisy script behind that NAT exhausts the budget, and everyone else behind it starts getting 429s for traffic they never sent. You have now turned a single misbehaving client into an outage for a customer's entire office.

For anything authenticated, key on the API key or the tenant ID. That is the unit of fairness you actually care about: each paying account gets its own bucket, and one tenant cannot starve another. IP-based limiting still has a place: at the unauthenticated edge, on login and signup endpoints, and as a coarse backstop in front of the real per-tenant limiter. It should not be the primary control. The brute-force angle on those auth endpoints is its own topic, which I cover in Laravel rate limiting and brute-force protection.

  • Authenticated API traffic: key on the API key or tenant ID. This is the fairness boundary and the one that matters most.
  • Login, signup, password reset, OTP: key on IP plus the submitted identifier (email or username), because there is no authenticated key yet and the threat is brute force.
  • Unauthenticated public reads: key on IP, but keep the limit generous and treat it as a backstop, not a precision tool.
  • Per-endpoint cost: weight expensive endpoints (search, report generation, bulk export) so one heavy call counts as several cheap ones against the same bucket.
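
The keying rules above can be sketched as one small function. This is a minimal illustration rather than a reference implementation: the `Request` shape, the path sets, and the cost weights are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights: one heavy call counts as several cheap ones.
ENDPOINT_COST = {"/search": 5, "/reports": 10, "/export": 20}

# Hypothetical auth endpoints where no authenticated key exists yet.
AUTH_PATHS = {"/login", "/signup", "/password-reset", "/otp"}


@dataclass
class Request:
    path: str
    ip: str
    tenant_id: Optional[str] = None    # set once the API key is authenticated
    identifier: Optional[str] = None   # email or username submitted to an auth endpoint


def limit_key(req: Request) -> str:
    """Pick the bucket key: auth endpoints, then tenant, then IP as a backstop."""
    if req.path in AUTH_PATHS:
        # Brute-force surface: key on IP plus the normalised submitted identifier.
        ident = (req.identifier or "").strip().lower()
        return f"auth:{req.ip}:{ident}"
    if req.tenant_id is not None:
        # Authenticated traffic: the tenant is the fairness boundary.
        return f"tenant:{req.tenant_id}"
    # Unauthenticated public reads: coarse IP backstop.
    return f"ip:{req.ip}"


def request_cost(req: Request) -> int:
    """Tokens charged for this call; cheap endpoints default to 1."""
    return ENDPOINT_COST.get(req.path, 1)
```

Two users behind the same NAT with different API keys resolve to different `tenant:` keys here, so one cannot exhaust the other's budget.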

How do you set different limits for different plans?

A flat limit across every customer is a pricing decision you made by accident. If your free tier and your enterprise tier hit the same wall, you are either throttling the people paying you the most or leaving the free tier wide open. Tie the limit to the plan stored on the tenant and resolve it at request time, so the budget is a property of who is calling, not a constant baked into a middleware.

Plan-based rate limit tiers
{
  "free":       { "requests_per_minute": 60,    "burst": 20 },
  "pro":        { "requests_per_minute": 600,   "burst": 120 },
  "enterprise": { "requests_per_minute": 6000,  "burst": 1200 }
}

// Resolve the tier from the authenticated tenant at request time.
// The limiter reads `requests_per_minute` as the sustained refill rate
// and `burst` as the bucket capacity (see the token-bucket section).

Keep the tiers in config or in the tenant record, never hard-coded in the throttling path. When sales closes a deal with a custom limit, or when you need to raise a single customer during a migration, you want that to be a data change, not a deploy.
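
As a rough sketch of resolving the budget from data, the lookup might look like the following. The `Tenant` shape and its `custom_limit` field are assumptions for illustration, and falling back to the free tier for an unknown plan is one possible policy, not the only sensible one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    requests_per_minute: int  # sustained refill rate
    burst: int                # bucket capacity


# Mirrors the tier table above; in practice this would be loaded from config.
TIERS = {
    "free": Tier(60, 20),
    "pro": Tier(600, 120),
    "enterprise": Tier(6000, 1200),
}


@dataclass
class Tenant:
    plan: str
    custom_limit: Optional[Tier] = None  # hypothetical per-tenant override stored as data


def resolve_tier(tenant: Tenant) -> Tier:
    """Resolve the limit at request time: custom override first, then the plan."""
    if tenant.custom_limit is not None:
        return tenant.custom_limit
    # Assumption: an unknown plan falls back to the most restrictive tier.
    return TIERS.get(tenant.plan, TIERS["free"])
```

Raising one customer during a migration then means writing a `custom_limit` on that tenant record; the throttling path itself does not change.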

Why do the response headers matter so much?

A client that cannot see its budget has only one way to discover the limit: hit it. So it will, repeatedly, hammering you with retries the instant it gets a 429 because it has no idea when the window resets. A client that can see its budget paces itself. The fix is to return the budget on every response, success or failure, using the headers clients already expect. Send X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every response, and add Retry-After on the 429 so the client knows exactly how long to back off.

Rate limit response headers
# Successful request, budget still available
HTTP/1.1 200 OK
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 417
X-RateLimit-Reset: 1743945600   # unix epoch when the window refills

# Budget exhausted: tell the client exactly how long to wait
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1743945600
Retry-After: 23                 # seconds; honour this and clients will too
Content-Type: application/json

{ "error": "rate_limited", "retry_after": 23 }
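
A minimal sketch of building those headers on the server side follows. The function name and parameters are hypothetical, and it assumes the absolute-epoch convention for X-RateLimit-Reset shown in the example above.

```python
import math
from typing import Dict, Optional


def rate_limit_headers(limit: int, remaining: float, reset_epoch: int,
                       retry_after_seconds: Optional[float] = None) -> Dict[str, str]:
    """Build the budget headers; pass retry_after_seconds only on a 429."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        # Never report a negative or fractional budget.
        "X-RateLimit-Remaining": str(max(0, math.floor(remaining))),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if retry_after_seconds is not None:
        # Retry-After takes whole seconds; round up so a client never retries early.
        headers["Retry-After"] = str(max(1, math.ceil(retry_after_seconds)))
    return headers
```

Rounding Retry-After up rather than down is a deliberate choice in this sketch: a client that waits one second too long is fine, while one that retries a fraction of a second early earns another 429.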

Two practical notes from shipping this. First, pick one convention for X-RateLimit-Reset (I use an absolute unix timestamp so clients do not have to do clock math) and document it, because the ecosystem is split between epoch seconds and seconds-remaining and that ambiguity causes real bugs. Second, make Retry-After the single source of truth on a 429: a well-built client, such as the one in the write-up on a resilient third-party API client with retries and backoff, will read it and wait exactly that long, which is the behaviour you want to reward.
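
From the client's side, honouring these headers can be sketched as below. This is an illustration under assumptions: the function name is made up, header names are expected in the exact casing shown, only the delay-in-seconds form of Retry-After is handled (not the HTTP-date form), and the one-second default is arbitrary.

```python
from typing import Mapping, Optional


def seconds_to_wait(status: int, headers: Mapping[str, str],
                    now_epoch: float) -> Optional[float]:
    """Return how long to pause before retrying, or None if no pause is needed."""
    if status != 429:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        # Retry-After is the source of truth on a 429.
        return float(retry_after.strip())
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None and reset.strip().isdigit():
        # Fallback: treat the reset value as an absolute unix timestamp.
        return max(0.0, float(reset.strip()) - now_epoch)
    # Hypothetical default when the server gives no hint at all.
    return 1.0
```

The fallback branch is exactly where the epoch versus seconds-remaining ambiguity bites: a client written like this would wait far too long against a server that sends seconds-remaining, which is why the convention needs documenting.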

A 429 with no Retry-After is not rate limiting, it is a guessing game. The header is the difference between a client that backs off politely and one that retries you into the ground.
Md Raihan Hasan
Every response carries the budget. Clients that can read X-RateLimit-Remaining and Retry-After pace themselves instead of hammering you.

How do you allow bursts without losing control of the sustained rate?

A hard per-minute counter does not match how real clients behave. A dashboard loads and fires twelve requests in 300 milliseconds, then goes quiet for the rest of the minute. Under a naive fixed-window counter, that legitimate burst trips the limit even though the client's average rate is trivial. Worse, fixed windows have an edge effect: a client can send a full window's worth at 11:59:59 and another full window at 12:00:01, doubling your intended rate across the boundary.
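
The boundary effect is easy to reproduce. The sketch below is a deliberately naive fixed-window counter (the class name and the limit of 60 per minute are chosen for the example) fed 60 requests just before a window boundary and 60 just after it.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Naive limiter: at most `limit` requests per aligned window."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # window index -> requests seen

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True


limiter = FixedWindowCounter(limit=60)

# 60 requests in the last second of window 0, then 60 in the first second of window 1.
before = sum(limiter.allow(59.0 + i / 100) for i in range(60))
after = sum(limiter.allow(60.0 + i / 100) for i in range(60))
accepted = before + after  # 120 accepted in under two seconds of wall-clock time
```

All 120 requests pass even though the intended limit is 60 per minute, because each batch lands in a different window.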

A token bucket fixes both. The bucket holds up to `capacity` tokens and refills at a steady `rate` tokens per second. Each request spends one token (or several, for a weighted expensive endpoint). A client that has been quiet has a full bucket and can spend it in a quick burst; a client hammering you drains the bucket and is then limited to the refill rate. Bursts are absorbed, sustained abuse is capped, and there is no window edge to exploit.

Token bucket: burst capacity vs sustained refill
// Pro tier: 600 requests/minute sustained, allow a burst of 120
{
  "capacity":      120,        // max tokens the bucket holds (the burst)
  "refill_rate":   10,         // tokens added per second = 600/min sustained
  "cost_per_call": 1           // expensive endpoints can charge 5, 10, ...
}

// On each request:
//   1. refill: tokens = min(capacity, tokens + elapsed_seconds * refill_rate)
//   2. if tokens >= cost_per_call: tokens -= cost_per_call -> allow (200)
//   3. else: reject (429) and set Retry-After to time until enough tokens
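
The three steps above can be sketched as a small in-process class. This is a single-node illustration only: the class and method names are made up, time is passed in explicitly so the behaviour is easy to follow, and it does not address the shared-state and atomicity concerns that a multi-server deployment raises.

```python
import math
from typing import Tuple


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # max tokens held (the burst)
        self.refill_rate = refill_rate  # tokens added per second (the sustained rate)
        self.tokens = capacity          # a quiet client starts with a full bucket
        self.last_refill = now

    def try_consume(self, now: float, cost: float = 1.0) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds)."""
        # 1. Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # 2. Spend tokens if the bucket can cover the cost.
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0
        # 3. Otherwise reject and report the whole seconds until enough tokens exist.
        deficit = cost - self.tokens
        return False, math.ceil(deficit / self.refill_rate)


# Pro tier from the config above: burst of 120, 10 tokens per second sustained.
bucket = TokenBucket(capacity=120, refill_rate=10, now=0.0)
burst = [bucket.try_consume(now=0.0)[0] for _ in range(120)]  # whole burst allowed
rejected = bucket.try_consume(now=0.0)                        # bucket is now empty
```

In a real service `now` would come from a monotonic clock, and the returned retry value would feed the Retry-After header on the 429.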

Store the bucket state (token count and last-refill timestamp) per key in something fast and shared like Redis, so the limit holds across every app server behind your load balancer. A per-node in-memory counter silently multiplies the real limit by your instance count, which means the limit loosens every time you autoscale. The implementation details, including the atomic Redis update that avoids a race between read and decrement, are in the Laravel rate limiting write-up.

What does a non-hostile rate limit policy look like?

The throttling algorithm is only half the design. The other half is the policy around it, the part your users actually feel. Be generous by default, because a limit that legitimate traffic never touches still stops abuse, and a limit set too tight just generates support tickets and churn. Then make the limit discoverable before anyone hits it.

  • Document the limits in your API docs with the exact numbers per tier, not a vague 'reasonable use' clause nobody can code against.
  • Return the headers on every response so clients self-pace instead of probing for the wall.
  • Degrade gracefully: prefer shedding the most expensive or least critical work first rather than returning a blanket 429 for everything.
  • Send a 429 with Retry-After, never a 503 or a dropped connection, so clients can tell deliberate throttling apart from an outage.
  • Give yourself an override knob: a per-tenant multiplier you can raise without a deploy when a customer has a legitimate spike or a migration window.

Rate limiting fails users when it is invisible, IP-keyed, and unforgiving of normal bursts. It succeeds when it is keyed to the tenant, communicated in every response, and built on a token bucket that treats a quick burst as the normal traffic it usually is. Decide those three things first. The middleware that enforces them is the easy part, and once the design is right, almost no legitimate client will ever notice the limit is there. That is the goal: protection your real users never feel.