Designed with DesignCopilot

Verdict.

A masterclass in how to build a real-time delivery app that can't handle real-time delivery.

Roasted by DesignCopilot

3/10
Fragile

This is a textbook example of scaling a monolith without understanding the workload. You're hammering Postgres with 30K queries per minute for live tracking, processing webhooks inline until they time out, and storing 5 years of GPS coordinates in a table with no indexes. The architecture worked fine at 1K users; at 50K, it's a ticking time bomb. The good news: most fixes are straightforward database and caching wins.

Top issues

  • Critical

    Every active tracking query runs SELECT * against the 800GB orders table every 2 seconds with no proper indexing

    This is why your database CPU is pegged at 95% during dinner rush — you're doing full table scans 30,000 times per minute

    Add composite index on (status, updated_at) and stop selecting all 47 columns when you only need 5

  • Critical

    Driver location lookups use ORDER BY timestamp DESC LIMIT 1 with no index on (driver_id, timestamp)

    Each location lookup triggers a full table scan of millions of GPS coordinates, creating 4+ second map delays

    Add composite index on (driver_id, timestamp) and consider a separate current_locations table

  • Critical

    Stripe webhooks processed synchronously with no deduplication locks, causing double-charges on retries

    You're literally charging customers multiple times when Stripe retries slow webhooks, creating chargeback liability

    Process webhooks asynchronously through an SQS queue, with idempotency keys and proper row locking

  • High

    All 50K users share a single Redis session store with no replication

    When that t3.medium dies, every customer gets force-logged out during peak hours

    Set up Redis cluster with failover or switch to stateless JWT refresh tokens

  • High

    Customer map view attempts to render 5,000 concurrent delivery markers without clustering

    Browser performance tanks when rendering thousands of DOM elements, making the map unusable

    Implement marker clustering or viewport-based filtering to show max 100 markers

  • Medium

    Rate limiting uses in-memory Maps across 12 servers, making it completely ineffective

    Attackers can bypass limits by hitting different servers, and normal users get inconsistent limiting

    Move rate limiting to Redis or add a proper API gateway like Kong

  • Medium

    Background job assignment runs as a cron job on a single EC2 instance with a 23-minute mean recovery time

    When the job server crashes during dinner rush, no new deliveries get assigned until someone manually notices

    Move to SQS with multiple workers or use ECS Scheduled Tasks with health checks
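
The Stripe webhook fix above comes down to two moves: acknowledge the request fast, then make the handler safe to run twice. Here is a minimal sketch of that idea, under loud assumptions: `sqlite3` stands in for Postgres and `queue.Queue` stands in for SQS so the example runs anywhere, and `receive_webhook`, `process_next`, `apply_order_update`, the `order_id` field, and the `processed_events` table are hypothetical names, not taken from the reviewed system.

```python
import queue
import sqlite3

# Stand-ins (assumptions): sqlite3 plays Postgres, queue.Queue plays SQS.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
work_queue = queue.Queue()


def receive_webhook(event: dict) -> int:
    """HTTP handler: do no real work, just enqueue and return 200 quickly."""
    work_queue.put(event)
    return 200


def apply_order_update(event: dict, applied: list) -> None:
    """Hypothetical side effect; here it only records the order id."""
    applied.append(event["order_id"])


def process_next(applied: list) -> bool:
    """Worker step: True if the event was applied, False if it was a duplicate."""
    event = work_queue.get()
    try:
        # One transaction: claiming the event id and applying the effect
        # either both commit or both roll back.
        with db:
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            apply_order_update(event, applied)
        return True
    except sqlite3.IntegrityError:
        # The primary key already holds this event id: a retry, so skip it.
        return False
```

If the same event is delivered twice, the first `process_next` call applies it and the second hits the primary-key constraint and is skipped. In a real deployment the uniqueness constraint would live in the same Postgres database as the order and payment rows, so the claim and the side effect share one transaction.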

Concerns

Overengineering: None flagged
Underengineering (3 items)
  • No database connection pooling mentioned for a workload hitting Postgres 30K times per minute
  • No CDN caching for restaurant search results that rarely change
  • Zero redundancy in critical job processing for a real-time delivery platform
Missing pieces (4 items)
  • Read replicas for the massive SELECT workload destroying your primary database
  • Message queue for async webhook processing and notifications
  • Application-level caching layer (Redis) for restaurant search and static data
  • Error tracking and APM for debugging production performance issues
Single points of failure (4 items)
  • Single Redis session store taking down all 50K active users when it fails
  • Single PostgreSQL primary handling all reads and writes with no failover
  • One EC2 box running all background job assignment with manual recovery
  • Single Stripe webhook endpoint with no backup processing mechanism
Security concerns (3 items)
  • CORS allows * meaning any website can make API calls on behalf of logged-in users
  • JWT tokens in localStorage with 7-day expiry are vulnerable to XSS attacks
  • No proper webhook signature validation, plus race conditions in Stripe processing
Scaling concerns (4 items)
  • Database will collapse when delivery volume doubles due to unindexed queries
  • Frontend polling every 2 seconds doesn't scale past 100K concurrent users
  • Synchronous notification sending will block order updates under high load
  • Marker rendering will crash browsers when concurrent deliveries exceed 10K
Operational concerns (3 items)
  • 15-minute rolling deploys with mixed versions cause data inconsistencies
  • No structured logging makes debugging production issues nearly impossible
  • Zero distributed tracing across 12 services makes latency root-causing hopeless
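
The webhook signature concern listed under security can be illustrated with the scheme Stripe documents for its `Stripe-Signature` header: a `t=` timestamp plus one or more `v1=` values, each an HMAC-SHA256 over `timestamp.payload` keyed with the endpoint's signing secret. The sketch below uses only the standard library to show the check. The function name `verify_stripe_signature`, the `now` parameter, and the 300-second tolerance are assumptions for illustration; in production the official `stripe` library's `stripe.Webhook.construct_event` is likely the safer choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance_s: int = 300,        # assumed replay window, in seconds
    now: Optional[float] = None,   # injectable clock, for testing
) -> bool:
    """Return True only if the header carries a fresh, matching v1 signature."""
    parts = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject stale timestamps so a captured request cannot be replayed later.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False

    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()

    # Constant-time comparison avoids leaking how many characters matched.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in candidates
    )
```

The check must run against the raw request body, before any JSON parsing or re-serialization, because the signature covers the exact bytes Stripe sent.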

What to change

  1. Add composite indexes on orders(status, updated_at) and driver_locations(driver_id, timestamp)
  2. Move Stripe webhook processing to async SQS queue with idempotency
  3. Implement Redis-based caching for restaurant search with 1-hour TTL
  4. Set up PostgreSQL read replicas for all location and search queries
  5. Replace polling with WebSocket connections for real-time tracking updates
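
To make the first change concrete, here is a small sketch of what the composite index does to the "latest driver location" query. It runs against an in-memory SQLite database purely so it is self-contained; SQLite is a stand-in, and its planner output differs from Postgres `EXPLAIN`. The column names `lat` and `lng`, the index name, and the sample data are assumptions. On the real table, Postgres's `CREATE INDEX CONCURRENTLY` is worth considering so the build does not block writes.

```python
import sqlite3

# In-memory SQLite as a stand-in for Postgres (assumption for runnability).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE driver_locations ("
    "driver_id INTEGER, timestamp INTEGER, lat REAL, lng REAL)"
)
db.executemany(
    "INSERT INTO driver_locations VALUES (?, ?, ?, ?)",
    [
        (d, t, 40.0 + d * 0.001, -74.0 + t * 0.001)
        for d in range(50)
        for t in range(100)
    ],
)

# The lookup described in the review: newest row for one driver,
# selecting only the columns the map needs.
LATEST = (
    "SELECT lat, lng FROM driver_locations "
    "WHERE driver_id = ? ORDER BY timestamp DESC LIMIT 1"
)


def plan(sql: str, params: tuple) -> str:
    """Return SQLite's query plan as one string (4th column is the detail)."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[3] for row in rows)


plan_before = plan(LATEST, (7,))  # no usable index yet: whole-table scan

db.execute(
    "CREATE INDEX idx_driver_locations_driver_ts "
    "ON driver_locations (driver_id, timestamp)"
)

plan_after = plan(LATEST, (7,))   # now an index search on driver_id
latest = db.execute(LATEST, (7,)).fetchone()
```

Printing `plan_before` and `plan_after` shows the switch from a scan to a search that names the new index. Because the index is ordered by `timestamp` within each `driver_id`, the `ORDER BY ... LIMIT 1` can be answered from the end of one driver's index range instead of sorting every row.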

Improved architecture

Monday morning todo

  • Add composite index on orders(status, updated_at) and driver_locations(driver_id, timestamp) this week
  • Set up one PostgreSQL read replica and route all location queries to it
  • Implement Redis caching for restaurant search with 1-hour expiration
  • Move Stripe webhook processing to SQS with proper idempotency keys
  • Add marker clustering to map view to limit visible markers to 100
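
The marker item above combines two ideas: drop everything outside the viewport, then merge what remains into a bounded number of clusters. A rough sketch of that logic follows, written in Python for consistency with the other examples even though the real code would run in the browser, most likely through the map library's own clustering plugin. The `Marker` and `Cluster` types, `cluster_markers`, and the 10x10 grid (which caps output at 100 clusters) are all assumptions for illustration, and the sketch ignores viewports that cross the antimeridian.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Marker(NamedTuple):
    lat: float
    lng: float


class Cluster(NamedTuple):
    lat: float   # mean latitude of the merged markers
    lng: float   # mean longitude of the merged markers
    count: int   # how many markers this cluster represents


def cluster_markers(
    markers: Iterable[Marker],
    south: float,
    west: float,
    north: float,
    east: float,
    grid: int = 10,
) -> List[Cluster]:
    """Viewport-filter markers, then merge them per grid cell.

    At most grid * grid clusters are returned (100 with the default).
    """
    lat_span = (north - south) or 1e-9  # guard against a zero-height viewport
    lng_span = (east - west) or 1e-9    # guard against a zero-width viewport
    cells = defaultdict(list)

    for m in markers:
        # Viewport filter: skip anything the user cannot currently see.
        if not (south <= m.lat <= north and west <= m.lng <= east):
            continue
        # Clamp so points exactly on the north/east edge land in the last cell.
        row = min(int((m.lat - south) / lat_span * grid), grid - 1)
        col = min(int((m.lng - west) / lng_span * grid), grid - 1)
        cells[(row, col)].append(m)

    return [
        Cluster(
            sum(m.lat for m in group) / len(group),
            sum(m.lng for m in group) / len(group),
            len(group),
        )
        for group in cells.values()
    ]
```

Each cluster keeps a `count`, so the map can render one badge per cell instead of thousands of individual DOM elements, and the total number of rendered items stays at or below 100 regardless of how many deliveries are active.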
