Designing for Scale: From 0 to 1 Million Requests/Day

A practical system design walkthrough for scaling a product from a single server to 1 million requests per day, covering load balancing, caching, database bottlenecks, queues, observability, and operational trade-offs.

May 11, 2026 · 18 min read
System Design · Scalability · Load Balancing · Caching · Spring Boot · Databases

Every scalable system starts embarrassingly small.

One server. One database. A few users. Logs written to disk. Maybe a cron job running on the same machine because "we will move it later."

That is not a failure. It is usually the right starting point.

The mistake is not starting with a monolith. The mistake is pretending the same shape will survive every growth phase.

In this post, we will design the evolution of a web application from 0 to 1 million requests per day. Not as a fantasy architecture with every cloud service on day one, but as a sequence of pressure points:

  • What breaks first?
  • What do we change?
  • What trade-off did we just accept?
  • How do we know it is time for the next step?

The example system is intentionally familiar: an API-backed product with users, sessions, a relational database, and a few read-heavy endpoints. Think SaaS dashboard, job board, learning platform, internal workflow tool, or marketplace MVP.

At 1 million requests per day, you are averaging only about 12 requests per second:

1,000,000 requests/day / 86,400 seconds = 11.57 requests/second

That number looks small. The system still gets interesting because traffic is not evenly distributed. You may average 12 RPS but see 100 to 300 RPS during peak windows, deploys, campaigns, retries, webhooks, or bot traffic.

Scale is not just about averages. It is about peaks, bottlenecks, failure modes, and the blast radius of ordinary mistakes.


Phase 0: Make One Server Boring

At the very beginning, your architecture can be simple: one server running the application, and one database.

This is enough for a surprising amount of traffic if the application is written carefully.

For an early product, optimize for:

  • Fast iteration
  • Simple deploys
  • Good logs
  • Database indexes on common queries
  • Backups from day one
  • Clear ownership of configuration and secrets

Do not add Kafka, Kubernetes, service mesh, or sharding here. Those tools solve real problems, but they also create operational weight. At low scale, the highest-leverage work is boring: schema design, query planning, request validation, connection pooling, and observability.

A Simple Spring Boot Baseline

For a Java service, the first production baseline might look like this:

application.yml
server:
  tomcat:
    threads:
      max: 200
      min-spare: 20
  compression:
    enabled: true

spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      connection-timeout: 2000
      validation-timeout: 1000
  jpa:
    open-in-view: false

The important part is not the exact values. It is the discipline:

  • Bound your thread pool.
  • Bound your database pool.
  • Fail quickly when the database is unhealthy.
  • Do not keep database sessions open during view rendering.

The fastest way to melt a small system is to let every incoming request create uncontrolled downstream pressure.
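The same discipline applies outside Spring configuration. Here is a minimal sketch in plain Java of a worker pool with a fixed number of threads and a bounded queue that rejects work once both are full. The class name `BoundedExecutorFactory` and the sizes you pass in are illustrative assumptions, not part of the baseline above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: builds an executor that sheds load instead of queueing without limit.
final class BoundedExecutorFactory {

    private BoundedExecutorFactory() {
    }

    static ThreadPoolExecutor create(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads,                                 // core pool size
                threads,                                 // maximum pool size, so the pool is fixed
                30, TimeUnit.SECONDS,                    // keep-alive, unused while core == max
                new ArrayBlockingQueue<>(queueCapacity), // bounded backlog
                new ThreadPoolExecutor.AbortPolicy()     // throw RejectedExecutionException when full
        );
    }
}
```

With `create(1, 1)`, one task runs, one waits, and a third submission is rejected immediately. The caller can turn that rejection into a 503 rather than letting requests pile up.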

The First Bottleneck Is Usually the Database

Application servers are easy to add. Databases are harder.

Before scaling the app tier, check whether the database is already doing unnecessary work:

Sql
-- $1 is a bind parameter; substitute a real account_id when running this by hand.
EXPLAIN ANALYZE
SELECT id, title, status, created_at
FROM tickets
WHERE account_id = $1
  AND status = 'OPEN'
ORDER BY created_at DESC
LIMIT 50;

If this query scans a large table, fix it before adding more servers:

Sql
CREATE INDEX idx_tickets_account_status_created
ON tickets (account_id, status, created_at DESC);

At low scale, one good index often beats an entire caching layer.


Phase 1: Add a Reverse Proxy and Static Asset Strategy

Once traffic grows, the app server should stop doing work that a simpler layer can handle.

Add Nginx, Caddy, Apache, Cloudflare, or a managed load balancer in front of the app. Even before you have multiple app instances, a reverse proxy gives you:

  • TLS termination
  • Gzip or Brotli compression
  • Request size limits
  • Static asset caching
  • Basic rate limiting
  • Cleaner deploy boundaries

Example Nginx config:

nginx.conf
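upstream app_backend {
    server 127.0.0.1:8080;
}

server {
    # On nginx 1.25.1 and later, prefer "listen 443 ssl;" plus a separate "http2 on;".
    listen 443 ssl http2;
    server_name api.example.com;

    # Placeholder paths: an ssl listener needs a certificate and a key.
    ssl_certificate     /etc/nginx/tls/api.example.com.crt;
    ssl_certificate_key /etc/nginx/tls/api.example.com.key;

    # Reject oversized request bodies before they reach the app.
    client_max_body_size 2m;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Fail fast when the app is unreachable; cap slow reads and writes.
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
    }

    location /assets/ {
        # Placeholder directory: /assets/app.js is served from /var/www/app/assets/app.js.
        root /var/www/app;
        expires 30d;
        add_header Cache-Control "public, immutable";
    }
}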

This phase is also where you should move static assets to a CDN if you have meaningful global traffic. A CDN reduces latency and removes waste from your origin server.

If your product has pages with images, PDFs, JavaScript bundles, documentation, or downloadable reports, serving them through the app server is a quiet tax. It consumes CPU, memory, connections, and bandwidth that should be reserved for dynamic requests.


Phase 2: Horizontal Scaling the App Tier

Eventually, one app process is not enough. Maybe CPU is high. Maybe GC pauses hurt latency. Maybe deploys cause downtime. Maybe a single VM reboot takes the whole product down.

The next move is multiple app instances behind a load balancer.

This requires one architectural rule:

App instances must be stateless.

That means:

  • No user sessions stored only in process memory
  • No uploaded files stored only on local disk
  • No background jobs that assume exactly one app instance
  • No in-memory counters used for billing, quotas, or correctness

Move sessions to Redis, signed cookies, or your database. Move files to object storage. Move scheduled work to a dedicated worker or scheduler. Make every app instance replaceable.
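Of the session options above, signed cookies are the one that needs no shared store at all. Here is a minimal sketch, assuming an HMAC-SHA256 signature over the payload. The class name `SignedCookie` and the `payload.signature` layout are illustrative assumptions. The payload is signed, not encrypted, and a real session would also carry an expiry.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: signs a session payload so any instance can verify it without shared memory.
final class SignedCookie {

    private static final String ALGORITHM = "HmacSHA256";
    private final byte[] secret;

    SignedCookie(byte[] secret) {
        this.secret = secret.clone();
    }

    // Returns "<base64url(payload)>.<base64url(hmac)>".
    String sign(String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        Base64.Encoder encoder = Base64.getUrlEncoder().withoutPadding();
        return encoder.encodeToString(body) + "." + encoder.encodeToString(hmac(body));
    }

    // Returns the payload if the signature matches, or null if the cookie is malformed or tampered with.
    String verify(String cookie) {
        int dot = cookie.indexOf('.');
        if (dot < 0) {
            return null;
        }
        try {
            Base64.Decoder decoder = Base64.getUrlDecoder();
            byte[] body = decoder.decode(cookie.substring(0, dot));
            byte[] signature = decoder.decode(cookie.substring(dot + 1));
            // MessageDigest.isEqual compares in constant time.
            return MessageDigest.isEqual(signature, hmac(body))
                    ? new String(body, StandardCharsets.UTF_8)
                    : null;
        } catch (IllegalArgumentException ex) {
            return null; // not valid Base64
        }
    }

    private byte[] hmac(byte[] data) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(secret, ALGORITHM));
            return mac.doFinal(data);
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException("HmacSHA256 unavailable", ex);
        }
    }
}
```

Every instance that holds the same secret can verify the cookie, so a request can land on any instance.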

Health Checks Matter

The load balancer should send traffic only to healthy instances. A good service exposes both liveness and readiness:

HealthController.java
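import java.sql.Connection;
import java.sql.SQLException;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController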
@RequestMapping("/health")
public class HealthController {

    private final DataSource dataSource;

    public HealthController(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @GetMapping("/live")
    public Map<String, String> live() {
        return Map.of("status", "UP");
    }

    @GetMapping("/ready")
    public ResponseEntity<Map<String, String>> ready() {
        try (Connection connection = dataSource.getConnection()) {
            if (connection.isValid(1)) {
                return ResponseEntity.ok(Map.of("status", "READY"));
            }
            return ResponseEntity.status(503).body(Map.of("status", "NOT_READY"));
        } catch (SQLException ex) {
            return ResponseEntity.status(503).body(Map.of("status", "NOT_READY"));
        }
    }
}

Liveness asks: "Should this process be restarted?"

Readiness asks: "Should this process receive traffic right now?"

Those are different questions. Mixing them creates noisy restarts and failed deploys.


Phase 3: Connection Pooling and Database Protection

Adding app instances increases pressure on the database.

If each instance has a pool of 20 database connections and you scale to 10 instances, you may open 200 connections. That can be fine, or it can crush a small database.

The load balancer solved the app bottleneck. It may have exposed the database bottleneck.

Before:
1 app instance * 20 DB connections = 20 possible DB connections

After:
10 app instances * 20 DB connections = 200 possible DB connections

This is why scaling is not "add more servers" forever. Every tier pushes load to the next tier.
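The arithmetic above is worth keeping as a small check next to your deployment config. Here is a minimal sketch. The class `PoolBudget`, its method names, and the idea of reserving a few connections for migrations and admin sessions are assumptions for illustration.

```java
// Hypothetical helper for sizing per-instance pools against a database connection limit.
final class PoolBudget {

    private PoolBudget() {
    }

    // Worst case: every instance fills its pool at the same time.
    static int totalPossibleConnections(int instances, int poolSizePerInstance) {
        return Math.multiplyExact(instances, poolSizePerInstance);
    }

    // Largest per-instance pool that keeps the total under the database limit,
    // after reserving connections for migrations, admin sessions, and monitoring.
    static int maxPoolSizePerInstance(int dbMaxConnections, int reservedConnections, int instances) {
        if (instances <= 0) {
            throw new IllegalArgumentException("instances must be positive");
        }
        int usable = dbMaxConnections - reservedConnections;
        if (usable < instances) {
            throw new IllegalArgumentException("not enough connections for one per instance");
        }
        return usable / instances;
    }
}
```

For example, `totalPossibleConnections(10, 20)` returns 200. With a database limit of 100 connections and 10 reserved, `maxPoolSizePerInstance(100, 10, 10)` returns 9, far below the pool of 20 that felt safe with one instance.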

Use three patterns:

  1. Keep the application pool small and intentional.
  2. Add a database pooler if needed, such as PgBouncer for Postgres.
  3. Put timeouts on every database call path.

In Spring Boot, avoid unbounded waits:

application.yml
spring:
  datasource:
    hikari:
      maximum-pool-size: 15
      connection-timeout: 1500
      max-lifetime: 1800000
      idle-timeout: 600000

A request that waits 30 seconds for a database connection is not "resilient." It is occupying a web thread while the system is already unhealthy.

Fast failure is often kinder to the system than slow failure.


Phase 4: Add Caching Where It Pays Rent

At this point, you may be serving tens or hundreds of thousands of requests per day. The database is healthy, but some endpoints are read-heavy:

  • Product detail pages
  • User profiles
  • Configuration lookups
  • Feature flags
  • Dashboard summaries
  • Permission checks
  • Search filters

Add caching where the data is read often and can tolerate brief staleness.

The simplest pattern is cache-aside:

AccountSettingsService.java
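import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.Duration;
import java.util.UUID;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service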
public class AccountSettingsService {

    private final StringRedisTemplate redis;
    private final AccountSettingsRepository repository;
    private final ObjectMapper objectMapper;

    public AccountSettingsService(
            StringRedisTemplate redis,
            AccountSettingsRepository repository,
            ObjectMapper objectMapper
    ) {
        this.redis = redis;
        this.repository = repository;
        this.objectMapper = objectMapper;
    }

    public AccountSettings getSettings(UUID accountId) {
        String key = "account-settings:" + accountId;
        String cached = redis.opsForValue().get(key);

        if (cached != null) {
            return deserialize(cached);
        }

        AccountSettings settings = repository.findByAccountId(accountId)
                .orElseThrow(() -> new NotFoundException("Settings not found"));

        redis.opsForValue().set(key, serialize(settings), Duration.ofMinutes(10));
        return settings;
    }

    public void updateSettings(UUID accountId, AccountSettingsUpdate update) {
        repository.update(accountId, update);
        redis.delete("account-settings:" + accountId);
    }

    private String serialize(AccountSettings settings) {
        try {
            return objectMapper.writeValueAsString(settings);
        } catch (JsonProcessingException ex) {
            throw new IllegalStateException("Could not serialize settings", ex);
        }
    }

    private AccountSettings deserialize(String value) {
        try {
            return objectMapper.readValue(value, AccountSettings.class);
        } catch (JsonProcessingException ex) {
            throw new IllegalStateException("Could not deserialize settings", ex);
        }
    }
}

Cache-aside is easy to reason about:

  • Read from cache.
  • On miss, read from database.
  • Put result in cache.
  • On write, update database and invalidate cache.

But caching is not free. You now have two copies of data and a freshness problem.

What Not to Cache

Be careful caching:

  • Financial balances
  • Payment status
  • Inventory at low stock levels
  • Permission changes
  • Security-sensitive user state
  • Anything with complex invalidation rules

Cache data that improves latency and reduces load without risking correctness.

The best cache invalidation strategy is often a small TTL plus explicit invalidation on writes.
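To make "small TTL plus explicit invalidation" concrete without a Redis dependency, here is a minimal in-memory sketch. `TtlCache` is a hypothetical stand-in for the Redis calls in the service above. The injectable clock is an assumption added so the expiry behavior can be tested.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical in-memory stand-in for Redis, showing TTL plus explicit invalidation.
final class TtlCache<K, V> {

    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clockMillis; // injectable so tests can control time

    TtlCache(Duration ttl, LongSupplier clockMillis) {
        this.ttlMillis = ttl.toMillis();
        this.clockMillis = clockMillis;
    }

    // Cache-aside read: return a fresh entry, otherwise load from the source and store it.
    V get(K key, Function<K, V> loader) {
        long now = clockMillis.getAsLong();
        Entry<V> cached = entries.get(key);
        if (cached != null && now < cached.expiresAtMillis) {
            return cached.value;
        }
        V loaded = loader.apply(key);
        entries.put(key, new Entry<>(loaded, now + ttlMillis));
        return loaded;
    }

    // Call after a successful write so the next read reloads from the source of truth.
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

The TTL bounds how stale a value can get if an invalidation is ever missed. The explicit invalidation keeps the common case fresh.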


Phase 5: Split Reads from Writes

As traffic grows, read queries usually dominate. A common ratio is 80:20 or 90:10 reads to writes.

If your primary database is busy serving dashboard pages, search filters, and reporting queries, writes will suffer.

Add read replicas that follow the primary and serve read-only queries.

This introduces replication lag.

If a user updates their profile and immediately refreshes the page, a replica may still have the old value. For many pages, this is acceptable. For some flows, it is not.

A practical rule:

  • Read your own writes from the primary.
  • Use replicas for stale-tolerant reads.
  • Use replicas for admin reports, public pages, and browsing flows.
  • Keep critical transactional flows on the primary.

You can route at the repository/service level:

ReadOnlyContext.java
public final class ReadOnlyContext {
    private static final ThreadLocal<Boolean> READ_ONLY = ThreadLocal.withInitial(() -> false);

    public static void markReadOnly() {
        READ_ONLY.set(true);
    }

    public static boolean isReadOnly() {
        return READ_ONLY.get();
    }

    public static void clear() {
        READ_ONLY.remove();
    }
}

In many teams, I prefer starting simpler: explicitly separate query services that use a read-replica datasource. Hidden routing can be elegant, but it can also surprise people during incident debugging.
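The "read your own writes" rule from the list above can be sketched as a small routing decision. This is a minimal single-process sketch. `ReadRouter`, the pin window, and the in-memory map are assumptions. With several app instances, the last-write timestamp would have to live somewhere shared, such as Redis or a cookie, and old entries would need eviction.

```java
import java.time.Duration;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical router: pins a user to the primary for a short window after they write.
final class ReadRouter {

    enum Target { PRIMARY, REPLICA }

    private final Map<UUID, Long> lastWriteMillis = new ConcurrentHashMap<>();
    private final long pinMillis;
    private final LongSupplier clockMillis; // injectable so tests can control time

    ReadRouter(Duration pinWindow, LongSupplier clockMillis) {
        this.pinMillis = pinWindow.toMillis();
        this.clockMillis = clockMillis;
    }

    // Call after every successful write made on behalf of this user.
    void recordWrite(UUID userId) {
        lastWriteMillis.put(userId, clockMillis.getAsLong());
    }

    // Critical reads always use the primary; stale-tolerant reads use a replica
    // unless the user wrote recently and might otherwise see old data.
    Target targetFor(UUID userId, boolean staleTolerant) {
        if (!staleTolerant) {
            return Target.PRIMARY;
        }
        Long lastWrite = lastWriteMillis.get(userId);
        boolean recentlyWrote = lastWrite != null
                && clockMillis.getAsLong() - lastWrite < pinMillis;
        return recentlyWrote ? Target.PRIMARY : Target.REPLICA;
    }
}
```

The pin window should be longer than the replication lag you normally observe. Because the decision is an explicit method call, it stays visible during incident debugging.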


Phase 6: Move Slow Work Out of the Request Path

At 1 million requests per day, latency usually comes from doing too much synchronously.

Examples:

  • Sending emails
  • Generating PDFs
  • Calling webhooks
  • Updating analytics
  • Creating thumbnails
  • Running fraud checks
  • Writing audit events
  • Syncing data to a search index

If the user does not need the result immediately, do not make them wait.

Introduce a queue between the API and a pool of background workers.

For example, when a user signs up:

  1. Create the user in the database.
  2. Publish UserRegistered.
  3. Return success.
  4. Worker sends welcome email, creates CRM record, updates analytics, and warms recommendations.

The API path stays fast. The slow work becomes retryable.

Idempotency Is Not Optional

Queues usually provide at-least-once delivery. That means a worker may process the same event more than once.

Design handlers so duplicates are safe:

WelcomeEmailConsumer.java
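import java.time.Instant;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class WelcomeEmailConsumer {

    private final ProcessedEventRepository processedEvents;
    private final EmailService emailService;

    public WelcomeEmailConsumer(
            ProcessedEventRepository processedEvents,
            EmailService emailService
    ) {
        this.processedEvents = processedEvents;
        this.emailService = emailService;
    }

    @Transactional
    public void handle(UserRegistered event) {
        // The check and the save are not atomic. A unique constraint on the event ID
        // (assumed here) stops two concurrent deliveries from both recording success.
        if (processedEvents.existsByEventId(event.eventId())) {
            return;
        }

        // Sending email cannot be rolled back. If the save below fails, a redelivery
        // sends the email again, so pass the event ID to the provider as an
        // idempotency key where the provider supports one.
        emailService.sendWelcomeEmail(event.email());
        processedEvents.save(new ProcessedEvent(event.eventId(), Instant.now()));
    }
}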

This pattern matters more than the queue technology. Kafka, SQS, and RabbitMQ all fail if your consumer logic is not safe to retry.
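Here is a minimal sketch of the same idea with the claim made atomic, using an in-memory set in place of a database table. `IdempotentHandler` is a hypothetical wrapper. In a real consumer the set would be a table with a unique constraint on the event ID, shared by all workers.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical consumer wrapper: claims an event ID atomically before running the side effect.
final class IdempotentHandler<E> {

    private final Set<String> claimedEventIds = ConcurrentHashMap.newKeySet();
    private final Function<E, String> eventIdOf;
    private final Consumer<E> sideEffect;

    IdempotentHandler(Function<E, String> eventIdOf, Consumer<E> sideEffect) {
        this.eventIdOf = eventIdOf;
        this.sideEffect = sideEffect;
    }

    // Returns true if the side effect ran, false if the event was a duplicate.
    boolean handle(E event) {
        String eventId = eventIdOf.apply(event);
        // add() is atomic on this set: only one caller sees true for a given ID.
        if (!claimedEventIds.add(eventId)) {
            return false;
        }
        try {
            sideEffect.accept(event);
            return true;
        } catch (RuntimeException ex) {
            // Release the claim so a redelivery can try again.
            claimedEventIds.remove(eventId);
            throw ex;
        }
    }
}
```

Claiming first closes the window in which two workers both pass an "already processed?" check. Releasing the claim on failure keeps the event retryable.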


Phase 7: Protect the System with Backpressure

When systems fail, they often fail because every layer keeps trying harder.

The load balancer retries. The app retries. The HTTP client retries. The queue retries. Suddenly a small database slowdown becomes a retry storm.

Backpressure means the system can say "not right now" before it collapses.

Use:

  • Timeouts on every network call
  • Circuit breakers around unstable dependencies
  • Rate limits for users and API clients
  • Bounded queues
  • Bulkheads for expensive operations
  • Retry budgets instead of unlimited retries

Example with Resilience4j:

ExternalBillingClient.java
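import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestClient;

@Service
public class ExternalBillingClient {

    private final RestClient restClient;

    public ExternalBillingClient(RestClient restClient) {
        this.restClient = restClient;
    }

    // Resilience4j's default aspect order makes Retry the outermost wrapper, so the
    // fallback lives on @Retry. A fallback on @CircuitBreaker would swallow the
    // failure before Retry ever saw it, and no retry would happen.
    @Retry(name = "billing", fallbackMethod = "fallback")
    @CircuitBreaker(name = "billing")
    @TimeLimiter(name = "billing")
    public CompletableFuture<BillingStatus> fetchStatus(UUID accountId) {
        // supplyAsync without an executor runs on the common ForkJoinPool;
        // pass a dedicated, bounded executor for blocking HTTP calls in production.
        return CompletableFuture.supplyAsync(() ->
                restClient.get()
                        .uri("/accounts/{id}/billing-status", accountId)
                        .retrieve()
                        .body(BillingStatus.class)
        );
    }

    public CompletableFuture<BillingStatus> fallback(UUID accountId, Throwable ex) {
        return CompletableFuture.completedFuture(BillingStatus.unknown());
    }
}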

The fallback must be chosen carefully. For a non-critical dashboard badge, unknown may be fine. For payment authorization, failing closed is safer.

Resilience is not a library annotation. It is a product decision.
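"Retry budgets instead of unlimited retries" is the least familiar item in the list above, so here is a minimal sketch of one way to build it. `RetryBudget` and its parameters are assumptions, not a Resilience4j API. Each first attempt earns a fraction of a retry, and a retry is allowed only when a whole one has been earned.

```java
// Hypothetical retry budget: retries may use at most a fixed percentage of request volume.
final class RetryBudget {

    private static final int COST_PER_RETRY = 100; // integer units avoid floating-point drift

    private final int retryPercent; // e.g. 10 allows about one retry per ten requests
    private final int maxUnits;     // cap on banked retries, so an idle period cannot fund a storm
    private int units;

    RetryBudget(int retryPercent, int maxBankedRetries) {
        this.retryPercent = retryPercent;
        this.maxUnits = Math.multiplyExact(maxBankedRetries, COST_PER_RETRY);
    }

    // Call once for every first attempt.
    synchronized void recordRequest() {
        units = Math.min(maxUnits, units + retryPercent);
    }

    // Call before retrying; returns false when the budget is exhausted.
    synchronized boolean tryAcquireRetry() {
        if (units < COST_PER_RETRY) {
            return false;
        }
        units -= COST_PER_RETRY;
        return true;
    }
}
```

When a dependency slows down and every call wants a retry, the budget caps the extra load at roughly `retryPercent` of normal traffic instead of doubling or tripling it.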


Phase 8: Observability Before Heroics

You cannot scale what you cannot see.

By the time you approach 1 million requests per day, you need visibility across:

  • Request rate
  • Error rate
  • Latency percentiles
  • Database query latency
  • Cache hit ratio
  • Queue depth
  • Worker lag
  • External dependency latency
  • JVM memory and garbage collection
  • Saturation signals such as CPU, threads, and connection pools

The minimum useful dashboard is often called the RED method:

Signal     Question
Rate       How many requests are we serving?
Errors     How many are failing?
Duration   How long do successful and failed requests take?

For infrastructure, add USE:

Signal        Question
Utilization   How busy is the resource?
Saturation    Is there queued work waiting?
Errors        Is the resource failing?

Two of the strongest signals in a production system are p95 and p99 latency. Average latency hides pain.

Endpoint: GET /api/accounts/{id}/dashboard

p50:  85ms
p95:  420ms
p99:  1800ms
max:  9200ms

The average might look healthy while 1% of requests take 1.8 seconds or longer. At 1 million requests per day, 1% is 10,000 painful requests.
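To see how an average hides the tail, here is a minimal sketch that computes nearest-rank percentiles from raw samples. `LatencyStats` is a hypothetical helper. A production system would normally read percentiles from histogram metrics rather than sort raw samples.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentiles and the mean over latency samples.
final class LatencyStats {

    private LatencyStats() {
    }

    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    static double mean(long[] samplesMillis) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("need samples");
        }
        long sum = 0;
        for (long sample : samplesMillis) {
            sum += sample;
        }
        return (double) sum / samplesMillis.length;
    }
}
```

Take 100 samples where 98 requests took 100 ms and 2 took 2000 ms. The mean is 138 ms and p50 and p95 are both 100 ms, yet p99 is 2000 ms. Only the high percentile shows the slow requests.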

Correlation IDs

Every request should have a correlation ID:

CorrelationIdFilter.java
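import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Optional;
import java.util.UUID;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CorrelationIdFilter extends OncePerRequestFilter {

    private static final String HEADER = "X-Correlation-ID";

    @Override
    protected void doFilterInternal(
            HttpServletRequest request,
            HttpServletResponse response,
            FilterChain filterChain
    ) throws ServletException, IOException {
        // Reuse the caller's ID when present; orElseGet generates a UUID only when needed.
        String correlationId = Optional.ofNullable(request.getHeader(HEADER))
                .filter(id -> !id.isBlank())
                .orElseGet(() -> UUID.randomUUID().toString());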

        MDC.put("correlationId", correlationId);
        response.setHeader(HEADER, correlationId);

        try {
            filterChain.doFilter(request, response);
        } finally {
            MDC.remove("correlationId");
        }
    }
}

When an incident happens, this one field can save hours.


Phase 9: Deployment Without Drama

At low scale, deploying by SSH into a server feels fast. At meaningful scale, it becomes a liability.

You want repeatable deploys:

  • Build artifact once.
  • Run tests.
  • Publish image.
  • Deploy gradually.
  • Health check before routing traffic.
  • Roll back quickly.

Blue-green and rolling deployments both work.

For a service at this size, a simple rolling deploy behind a load balancer is usually enough:

  1. Start a new app instance with the new version.
  2. Wait for /health/ready.
  3. Add it to the load balancer.
  4. Drain one old instance.
  5. Repeat.

The risky part is often the database migration, not the app deploy.

Use backward-compatible migrations:

  1. Add nullable column.
  2. Deploy code that writes both old and new shape.
  3. Backfill data.
  4. Deploy code that reads new shape.
  5. Drop old column later.

Never make a deploy require the app and database to switch at the exact same millisecond.


Phase 10: When to Consider Sharding

Sharding is powerful, but it is not the first scaling move.

Before sharding, exhaust easier options:

  • Good indexes
  • Query optimization
  • Connection pooling
  • Caching
  • Read replicas
  • Archival of cold data
  • Partitioning large tables by time or tenant
  • Moving search/reporting workloads out of the primary database

You consider sharding when:

  • One primary database can no longer handle write load.
  • A single table is too large to maintain effectively.
  • One tenant or region needs isolation.
  • Data residency requires geographic separation.

At 1 million requests per day, many products do not need sharding. They need disciplined database usage.

If you do shard, choose a shard key that matches access patterns:

Shard key           Good for                     Risk
account_id          B2B SaaS, tenant isolation   Large tenants become hot
user_id             Consumer apps                Cross-user queries get harder
Geographic region   Data residency, latency      Users move, global queries need aggregation
Hash of entity ID   Even distribution            Harder range queries

Sharding changes application logic forever. Treat it as a serious boundary, not a routine optimization.
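For the "hash of entity ID" row, here is a minimal sketch of key-to-shard routing. `ShardRouter` and the choice of CRC32 are assumptions for illustration. Any hash that is stable across processes and JVM versions would do.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical router: maps a shard key to one of a fixed number of shards.
final class ShardRouter {

    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // Returns a shard index in [0, shardCount) for a key such as an account_id.
    int shardFor(String shardKey) {
        CRC32 crc = new CRC32();
        crc.update(shardKey.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative 32-bit value held in a long, so % is safe.
        return (int) (crc.getValue() % shardCount);
    }
}
```

Plain modulo has a cost: changing `shardCount` remaps most keys, which means moving most of the data. A directory table or consistent hashing reduces that movement, and that is part of why the shard decision is hard to reverse.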


A Practical 1 Million Requests/Day Architecture

For many products, the architecture at this stage combines the components below. This is not the only valid design, but it is a sane default:

  • CDN for static and cacheable content
  • Load balancer for traffic distribution and health checks
  • Stateless app cluster for horizontal scaling
  • Redis for sessions, cache, and rate limits
  • Postgres primary for writes and strong consistency
  • Read replica for stale-tolerant reads
  • Queue and workers for slow side effects
  • Observability across the full path

The biggest win is not any single component. It is separating responsibilities so one slow subsystem does not block the whole product.


Capacity Planning Cheat Sheet

Start with simple math:

Daily requests:        1,000,000
Average RPS:           12
Peak multiplier:       10x to 25x
Peak RPS estimate:     120 to 300

Average response size: 50 KB
Daily bandwidth:       50 GB

Read/write ratio:      90:10
Daily reads:           900,000
Daily writes:          100,000
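The cheat sheet above is plain arithmetic, so it is easy to keep as code and rerun when the inputs change. Here is a minimal sketch. `CapacityEstimate` and its method names are assumptions, and bandwidth uses decimal units (1 GB = 1,000,000 KB).

```java
// Hypothetical helper reproducing the cheat-sheet arithmetic.
final class CapacityEstimate {

    private static final long SECONDS_PER_DAY = 86_400;

    private CapacityEstimate() {
    }

    static double averageRps(long requestsPerDay) {
        return (double) requestsPerDay / SECONDS_PER_DAY;
    }

    static double peakRps(long requestsPerDay, double peakMultiplier) {
        return averageRps(requestsPerDay) * peakMultiplier;
    }

    // Decimal gigabytes: 1 GB = 1,000,000 KB.
    static double dailyBandwidthGb(long requestsPerDay, double averageResponseKb) {
        return requestsPerDay * averageResponseKb / 1_000_000.0;
    }

    static long dailyWrites(long requestsPerDay, int writePercent) {
        return requestsPerDay * writePercent / 100;
    }
}
```

With 1,000,000 requests per day, the unrounded average is about 11.57 RPS, so a 10x to 25x peak is roughly 116 to 289 RPS. The cheat sheet rounds that to 120 to 300.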

Then ask sharper questions:

  • Which endpoints are hot?
  • Which endpoints are slow?
  • Which endpoints write to the database?
  • Which endpoints call external services?
  • Which endpoints can be cached?
  • Which operations can be async?
  • What happens if Redis is down?
  • What happens if the email provider is slow?
  • What happens if the read replica lags by 5 seconds?

The answers matter more than the raw traffic number.


Common Mistakes at This Stage

1. Scaling the App Before Fixing Queries

Adding app instances increases database pressure. If the main bottleneck is a missing index, horizontal scaling can make the system fail faster.

2. Caching Everything

Caching is useful when the data is read often, expensive to compute, and safe to serve stale for a while. It is dangerous when used to hide unclear ownership of data freshness.

3. Using Queues Without Idempotency

Async processing improves latency, but duplicate events are normal. If workers are not idempotent, queues create correctness bugs at scale.

4. Ignoring Tail Latency

p99 latency is where user pain hides. Monitor percentiles, not just averages.

5. No Load Testing Until Launch Week

Load testing should start before the marketing campaign, not during it.

A basic test plan:

  1. Smoke test at 10 RPS.
  2. Run expected peak load for 30 minutes.
  3. Run 2x expected peak for 10 minutes.
  4. Inject one slow dependency.
  5. Watch latency, errors, DB connections, cache hit rate, and queue depth.

The goal is not to prove the system is invincible. The goal is to learn where it bends.


Final Takeaway

Scaling from 0 to 1 million requests per day is less about exotic distributed systems and more about sequencing.

Start with one well-built server. Add a reverse proxy. Make the app stateless. Scale horizontally. Protect the database. Cache carefully. Move slow work to queues. Add read replicas when reads dominate. Watch the system with real metrics. Delay sharding until simpler tools are no longer enough.

The architecture grows in response to pressure.

That is the senior engineering skill: not adding complexity early, and not waiting so long that the system collapses under its own success.

Next in this system design series: Database Scaling and Sharding Strategies, where we will go deeper into replicas, partitioning, shard keys, hotspot mitigation, and the operational cost of distributed data.


Ravi Kant Shukla

Senior Java + AI engineer. 9+ years in system design, Kafka, microservices, and LLM/RAG pipelines.
