Rate Limiter System Design: Token Bucket, Leaky Bucket, Scaling

By ByteByteGo

Key Concepts

  • Rate Limiting: A mechanism to control the number of requests a client can make to an API within a given time window.
  • Fixed Window Counting: A basic rate limiting algorithm that divides time into fixed intervals and resets a counter at the start of each window.
  • Token Bucket Algorithm: An industry-standard rate limiting algorithm that uses a "bucket" to hold "tokens," which are consumed by requests and refilled at a steady rate.
  • Sliding Window Logs/Counters: Alternative rate limiting algorithms that address the limitations of fixed window counting.
  • Leaky Bucket Algorithm: Another rate limiting algorithm, similar in concept to the token bucket but with a different flow: requests are queued and processed at a constant outflow rate.
  • Client-side Rate Limiting: Implementing rate limiting logic within the client application.
  • Server-side Rate Limiting: Embedding rate limiting logic directly into the application's backend code.
  • Middleware Rate Limiting: Using a dedicated service (e.g., API gateway, reverse proxy) between clients and APIs to enforce rate limits.
  • HTTP 429 (Too Many Requests): The standard HTTP status code returned when a client exceeds a rate limit.
  • P95 Latency: The 95th percentile of latency, meaning 95% of requests complete within this time.
  • Redis: An in-memory data store often used for fast, shared storage of counters and token bucket states in rate limiting systems.
  • Atomic Operations: Operations that are guaranteed to complete entirely or not at all, preventing race conditions in concurrent systems.
  • Race Conditions: A situation where multiple processes or threads access and manipulate shared data concurrently, and the outcome depends on the order of execution.

Introduction to API Rate Limiters

Every major API, such as GitHub, Stripe, or AWS, implements rate limits to control the number of requests a client can make within a specific timeframe. This prevents system overload and ensures fair access for legitimate users.

Core Design Requirements:

  • Configurable Rules: The system must limit incoming requests based on defined rules (e.g., "100 API requests per minute per user").
  • Error Handling: When limits are exceeded, the system should reject requests with an HTTP 429 (Too Many Requests) status code. It should also include helpful headers like RateLimit-Remaining and RateLimit-Reset to inform the client.
  • Minimal Latency Overhead: The rate limit check should introduce very little delay, ideally under 3 milliseconds P95 per check.
  • High Availability and Scalability: The system must be highly available and accessible by multiple servers to support distributed architectures.
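The rejection requirement above can be sketched as a small helper. This is a minimal illustration, not a prescribed interface: the function name `make_429_response` and the choice of expressing `RateLimit-Reset` as seconds-until-reset are assumptions.

```python
import time

def make_429_response(reset_at: float) -> dict:
    """Build a rate-limit rejection (illustrative shape).

    reset_at is the Unix timestamp at which the client's budget refreshes;
    RateLimit-Reset is expressed here as seconds remaining until then.
    """
    return {
        "status": 429,                  # HTTP 429 Too Many Requests
        "headers": {
            "RateLimit-Remaining": "0", # nothing left in the current budget
            "RateLimit-Reset": str(int(max(0, reset_at - time.time()))),
        },
        "body": "Too Many Requests",
    }
```

A real middleware would attach these headers to every response, not just rejections, so well-behaved clients can pace themselves before hitting the limit.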

Rate Limiting Algorithms

1. Fixed Window Counting

This is the simplest approach to rate limiting.

  • Methodology: Time is divided into fixed intervals (e.g., one-minute windows). Each user is assigned a counter that resets to zero at the beginning of each window.
  • Process:
    1. A user is allowed, for instance, 100 requests per minute.
    2. At the start of each minute, their counter resets to zero.
    3. Each incoming request increments the counter.
    4. Once the counter reaches 100, any subsequent requests within that minute are rejected until the next minute begins.
  • Storage Challenges:
    • Database: Using a traditional database for counters is too slow and could overload the very system it's meant to protect.
    • In-memory (Single Server): While fast, this approach only works for a single server. In a multi-server environment, each server would maintain its own independent counters, allowing a user to effectively bypass the limit (e.g., 100 requests to server A and 100 to server B results in 200 requests).
    • Solution: Redis: Redis is an ideal in-memory data store for this purpose. It's shared across all servers and provides primitives for atomically incrementing and resetting counters, ensuring consistency.
  • Critical Flaw (Window Boundary Problem): Fixed window counting has a significant vulnerability at window boundaries.
    • Example: A user makes 100 requests in the last 10 seconds of minute 1, and then another 100 requests in the first 10 seconds of minute 2. Individually, both bursts are within the 100 requests/minute limit. However, the user has made 200 requests in just 20 seconds, which clearly violates the intended rate limit. This "gaming" of the system occurs at every window boundary.
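The fixed window logic, and its boundary flaw, can be seen in a few lines. This is a single-process sketch with illustrative names (a production system would keep the counters in Redis, as described above); the `now` parameter exists only to make the boundary problem easy to demonstrate.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal in-process fixed window counter (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)  # (user, window_index) -> count

    def allow(self, user, now=None) -> bool:
        now = time.time() if now is None else now
        # The window index changes every `window_seconds`, which is what
        # "resets the counter" at each boundary.
        key = (user, int(now // self.window))
        if self.counters[key] >= self.limit:
            return False  # limit reached for this window -> reject
        self.counters[key] += 1
        return True
```

Driving it with timestamps straddling a boundary (100 requests at t=59s, 100 more at t=61s) shows all 200 being accepted within two seconds, exactly the flaw described above.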

2. Token Bucket Algorithm (Industry Standard)

The Token Bucket algorithm is widely adopted by companies like AWS and Stripe because it effectively solves the fixed window problem.

  • Concept: Imagine a bucket that holds "tokens." New tokens are added to the bucket at a steady "refill rate." Each incoming request consumes one token. If the bucket is empty (no tokens available), the request is rejected.
  • Mechanism:
    1. The algorithm uses two key parameters:
      • Bucket Capacity: Determines the maximum burst size allowed.
      • Refill Rate: Determines the sustained throughput (how many tokens are added per unit of time).
    2. Example: A bucket with a capacity of 100 tokens that refills at 100 tokens per minute.
    3. During periods of low activity, tokens accumulate in the bucket up to its capacity.
    4. When a user makes a burst of requests, even across window boundaries, they consume the accumulated tokens. However, they cannot exceed the long-term refill rate.
  • Key Insight: Tokens accumulate during quiet periods, allowing for legitimate traffic bursts while still enforcing the overall rate limit over time. This prevents users from exploiting window boundaries.
  • Other Algorithms: While token bucket is highly effective, other algorithms like sliding window logs, sliding window counters, and leaky buckets also address the fixed window problem, but token bucket generally offers the best balance of simplicity and effectiveness for most use cases.
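The mechanism above translates almost directly into code. The sketch below uses "lazy refill": instead of a background timer adding tokens, each check tops up the bucket based on elapsed time. Names and the injectable `now` parameter are illustrative, not a reference implementation.

```python
import time

class TokenBucket:
    """Sketch of one user's token bucket: capacity bounds burst size,
    refill_rate (tokens per second) bounds sustained throughput."""

    def __init__(self, capacity: float, refill_rate: float, now=None):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # bucket starts full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Lazily add tokens accrued since the last check, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # consume one token for this request
            return True
        return False
```

With capacity 100 and a refill rate of 100/60 tokens per second (the "100 per minute" example above), a full burst of 100 is allowed immediately, after which requests trickle in at the refill rate regardless of window boundaries.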

System Architecture: Placement of the Rate Limiter

There are three primary options for where to implement the rate limiter within a system architecture:

  1. Client-side Rate Limiting:

    • Description: The rate limiting logic is embedded directly into the client application.
    • Disadvantage: This approach is inherently untrustworthy. Malicious users can easily modify the client code or bypass these restrictions entirely, rendering the limits ineffective.
  2. Server-side Rate Limiting:

    • Description: The rate limiting logic is integrated directly into the application's backend code.
    • Advantages: Provides complete control over the algorithm and keeps all logic within the application.
    • Disadvantages: Rate limiting logic gets mixed with business logic, potentially increasing code complexity. Each individual service or microservice would require its own separate implementation, leading to duplication.
  3. Middleware Rate Limiting:

    • Description: A dedicated service sits between clients and the API servers to enforce rate limits. This can be an API gateway, a reverse proxy, or a custom-built service.
    • Advantages:
      • Separates rate limiting concerns from core business logic.
      • Provides a single, centralized place to manage and apply rate limiting policies across multiple APIs.
    • Disadvantage: Introduces an additional component, increasing overall system complexity.
  • Recommendation: The best choice depends on the specific situation.
    • If an API gateway is already in place for authentication or other purposes, adding rate limiting there is often logical.
    • For highly custom or experimental algorithms, server-side implementation offers maximum flexibility.
    • For most systems, middleware offers the optimal balance of control and operational simplicity.

Detailed Middleware Architecture and Race Conditions

A typical middleware-based rate limiting architecture would look like this:

  1. Configuration Service: Stores rate limiting rules (e.g., "premium users get 1000 requests per hour," "free users get 100 requests per hour").
  2. Middleware: Fetches these rules and stores the state of each user's token bucket in Redis.
  3. Request Flow:
    • When a request arrives at the middleware, it first identifies the user.
    • It then fetches the user's corresponding token bucket state from Redis.
    • The middleware checks if tokens are available in the bucket.
    • If tokens are available: It atomically decrements the token count in Redis and forwards the request to the appropriate API server.
    • If no tokens remain: It returns an HTTP 429 response to the client, including headers indicating when the user can retry (e.g., RateLimit-Reset).

Race Conditions in Multi-server Environments:

  • Problem: When scaling to multiple middleware servers, a race condition can occur. Suppose Server A reads a token count of 3 from Redis, and Server B simultaneously also reads 3. Both servers independently decide the request is permissible, decrement their local copy, and write 2 back to Redis. The final count is 2 instead of the correct 1, meaning one request was effectively not counted, allowing the user to exceed the limit.
  • Solution: Atomic Operations: This problem is solved using atomic operations. Redis supports Lua scripts that can bundle the read, check, and increment operations into a single, indivisible unit. This ensures that the entire sequence completes without interruption from other concurrent operations, completely preventing race conditions and maintaining accurate token counts.

Further Considerations

The video covered the essential building blocks, but several advanced topics related to rate limiting were not discussed:

  • Geographic Latency: How to handle rate limiting effectively in multi-region deployments where latency can vary.
  • Hot Keys: Strategies for managing situations where a few specific users or keys generate a disproportionately large amount of traffic.
  • Dynamic Rule Updates: How to update or modify rate limiting rules without requiring server restarts.
  • Failure Modes: Whether the rate limiting system should "fail open" (allow all requests) or "fail closed" (reject all requests) if key components go down.

Sponsor: The video was sponsored by Warp, an AI agent for coding that understands context and writes production-ready code, saving users over an hour a day.

Community Resource: For those looking to ace technical interviews, bytebytego.com offers comprehensive courses on system design, coding, behavioral questions, machine learning, and object-oriented design.
