Verdict.
A masterclass in how to build a real-time delivery app that can't handle real-time delivery.
Roasted by DesignCopilot
This is a textbook example of scaling a monolith without understanding the workload. You're hammering Postgres with 30K queries per minute for live tracking, processing webhooks inline until they time out, and storing 5 years of GPS coordinates in a table with no indexes. The architecture worked fine at 1K users; at 50K, it's a ticking time bomb. The good news: most fixes are straightforward database and caching wins.
Top issues
- Critical
Every active tracking query runs SELECT * against an 800GB orders table every 2 seconds with no usable index
This is why your database CPU is pegged at 95% during dinner rush — you're doing full table scans 30,000 times per minute
Add composite index on (status, updated_at) and stop selecting all 47 columns when you only need 5
- Critical
Driver location lookups use ORDER BY timestamp DESC LIMIT 1 with no index on (driver_id, timestamp)
Each location lookup triggers a full table scan of millions of GPS coordinates, creating 4+ second map delays
Add composite index on (driver_id, timestamp) and consider a separate current_locations table
- Critical
Stripe webhooks processed synchronously with no deduplication locks, causing double-charges on retries
You're literally charging customers multiple times when Stripe retries slow webhooks, creating chargeback liability
Process webhooks asynchronously via SQS, with idempotency keys and proper row locking
- High
All 50K users share a single Redis session store with no replication
When that t3.medium dies, every customer gets force-logged out during peak hours
Set up Redis cluster with failover or switch to stateless JWT refresh tokens
- High
Customer map view attempts to render 5,000 concurrent delivery markers without clustering
Browser performance tanks when rendering thousands of DOM elements, making the map unusable
Implement marker clustering or viewport-based filtering to show max 100 markers
- Medium
Rate limiting uses in-memory Maps across 12 servers, making it completely ineffective
Attackers can bypass limits by hitting different servers, and normal users get inconsistent limiting
Move rate limiting to Redis or add a proper API gateway like Kong
- Medium
Delivery assignment runs as a cron job on a single EC2 instance with a 23-minute mean recovery time
When the job server crashes during dinner rush, no new deliveries get assigned until someone manually notices
Move to SQS with multiple workers or use ECS Scheduled Tasks with health checks
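The rate-limiting issue above comes down to where the counter lives. Below is a minimal sketch of a fixed-window limiter whose counts sit in one shared store, so every server sees the same total. `SharedCounterStore` and `FixedWindowLimiter` are hypothetical names; the dict-backed store only stands in for Redis (or a gateway) so the sketch runs on its own, and it never expires old windows the way a real store with key expiry would.

```python
import time
from typing import Callable, Dict, Tuple


class SharedCounterStore:
    """Stand-in for a shared store such as Redis (hypothetical interface).

    In production all app servers would talk to the same external store;
    a dict plays that role here. Old windows are never evicted in this sketch.
    """

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, int], int] = {}

    def incr(self, key: Tuple[str, int]) -> int:
        # Mirrors an atomic increment-and-return on a shared counter.
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class FixedWindowLimiter:
    """Allows at most `limit` requests per client per time window."""

    def __init__(
        self,
        store: SharedCounterStore,
        limit: int,
        window_seconds: int,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._store = store
        self._limit = limit
        self._window = window_seconds
        self._clock = clock

    def allow(self, client_id: str) -> bool:
        # All requests in the same window share one counter key.
        window = int(self._clock() // self._window)
        return self._store.incr((client_id, window)) <= self._limit


# Two "servers" share one store, so the limit holds across both.
store = SharedCounterStore()
frozen_clock = lambda: 1000.0  # fixed time keeps the demo deterministic
server_a = FixedWindowLimiter(store, limit=3, window_seconds=60, clock=frozen_clock)
server_b = FixedWindowLimiter(store, limit=3, window_seconds=60, clock=frozen_clock)

decisions = [
    server_a.allow("client-1"),
    server_b.allow("client-1"),
    server_a.allow("client-1"),
    server_b.allow("client-1"),  # fourth request in the window is rejected
]
```

With per-server in-memory Maps, the same client would get three requests on each of the 12 servers instead of three in total.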
Concerns
Overengineering: None flagged
Underengineering: 3 items
- No database connection pooling mentioned for a workload hitting Postgres 30K times per minute
- No CDN caching for restaurant search results that rarely change
- Zero redundancy in critical job processing for a real-time delivery platform
Missing pieces: 4 items
- Read replicas for the massive SELECT workload destroying your primary database
- Message queue for async webhook processing and notifications
- Application-level caching layer (Redis) for restaurant search and static data
- Error tracking and APM for debugging production performance issues
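The caching layer mentioned above is usually a cache-aside pattern: check the cache, fall back to the database on a miss, and store the result with an expiry. Here is a minimal in-process sketch of that idea. `TTLCache` and `search_restaurants_in_db` are hypothetical names, and a real deployment would put the entries in Redis rather than a Python dict so that all servers share them.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class TTLCache:
    """Cache-aside helper: returns a cached value until its TTL expires."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, no database call
        value = loader()  # cache miss or expired entry: go to the source
        self._entries[key] = (now + self._ttl, value)
        return value


# Demo with a fake clock so expiry can be shown without sleeping.
fake_now = [0.0]
db_calls: List[str] = []


def search_restaurants_in_db() -> List[str]:
    # Hypothetical stand-in for the expensive Postgres search query.
    db_calls.append("search:pizza")
    return ["Pizza Place", "Noodle Bar"]


cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
first = cache.get_or_load("search:pizza", search_restaurants_in_db)   # miss
second = cache.get_or_load("search:pizza", search_restaurants_in_db)  # hit
fake_now[0] = 3601.0                                                  # an hour later
third = cache.get_or_load("search:pizza", search_restaurants_in_db)   # expired, reloads
```

Three lookups cost two database calls here; with search results that rarely change, the hit rate in practice should be far higher.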
Single points of failure: 4 items
- Single Redis session store taking down all 50K active users when it fails
- Single PostgreSQL primary handling all reads and writes with no failover
- One EC2 box running all background job assignment with manual recovery
- Single Stripe webhook endpoint with no backup processing mechanism
Security concerns: 3 items
- CORS allows any origin (*), so any website can call the API from a visitor's browser and read the responses
- JWT tokens in localStorage with 7-day expiry are vulnerable to XSS attacks
- Webhook signatures are not properly validated, and Stripe processing has race conditions
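For the webhook signature concern, the check itself is small. The sketch below follows Stripe's documented scheme as I understand it: the `Stripe-Signature` header carries a timestamp (`t=`) and one or more `v1=` signatures, each an HMAC-SHA256 of `"<timestamp>.<raw body>"` keyed with the endpoint secret. `verify_stripe_signature`, the secret, and the event body are illustrative assumptions; in production the official Stripe library's own verification helper is the safer choice.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    sig_header: str,
    secret: str,
    tolerance_seconds: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Returns True only if the header carries a fresh, matching v1 signature."""
    timestamp = None
    candidates = []
    for part in sig_header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject old events so a captured request cannot be replayed later.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_seconds:
        return False

    # Sign the raw request body exactly as received, before any JSON parsing.
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking match position through timing.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in candidates
    )


# Demo with a made-up secret and event body.
secret = "whsec_example_secret"
body = b'{"id": "evt_123", "type": "payment_intent.succeeded"}'
ts = 1_700_000_000
signature = hmac.new(
    secret.encode(), f"{ts}.".encode() + body, hashlib.sha256
).hexdigest()
header = f"t={ts},v1={signature}"

valid = verify_stripe_signature(body, header, secret, now=ts + 10)
tampered = verify_stripe_signature(body + b" ", header, secret, now=ts + 10)
stale = verify_stripe_signature(body, header, secret, now=ts + 3600)
```

Signature checking only proves the event came from Stripe; it does nothing about retries, which is why the idempotency fix is a separate item.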
Scaling concerns: 4 items
- Database will collapse when delivery volume doubles due to unindexed queries
- Frontend polling every 2 seconds doesn't scale past 100K concurrent users
- Synchronous notification sending will block order updates under high load
- Marker rendering will crash browsers when concurrent deliveries exceed 10K
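The marker problem above can be handled before anything reaches the browser. The following is a rough sketch of viewport filtering plus grid clustering: drop points outside the visible bounds, bucket the rest into grid cells, and coarsen the grid until at most 100 clusters remain. `cluster_markers`, the cell sizes, and the sample coordinates are assumptions for illustration; map libraries ship their own clustering plugins, and this sketch ignores viewports that cross the antimeridian.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import floor
from typing import Dict, List, Tuple

Point = Tuple[float, float]                    # (lat, lng)
Viewport = Tuple[float, float, float, float]   # (south, west, north, east)


@dataclass(frozen=True)
class Cluster:
    lat: float
    lng: float
    count: int


def _bucket(points: List[Point], cell: float) -> Dict[Tuple[int, int], List[Point]]:
    # Points whose coordinates fall in the same grid cell share a bucket.
    cells: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for lat, lng in points:
        cells[(floor(lat / cell), floor(lng / cell))].append((lat, lng))
    return cells


def cluster_markers(
    points: List[Point],
    viewport: Viewport,
    cell_degrees: float = 0.002,
    max_clusters: int = 100,
) -> List[Cluster]:
    if max_clusters < 4:
        # Even an arbitrarily coarse grid can leave one cell per quadrant.
        raise ValueError("max_clusters must be at least 4")
    south, west, north, east = viewport
    visible = [
        (lat, lng)
        for lat, lng in points
        if south <= lat <= north and west <= lng <= east
    ]
    cell = cell_degrees
    cells = _bucket(visible, cell)
    while len(cells) > max_clusters:
        cell *= 2  # coarsen the grid instead of dropping markers
        cells = _bucket(visible, cell)
    return [
        Cluster(
            lat=sum(p[0] for p in members) / len(members),
            lng=sum(p[1] for p in members) / len(members),
            count=len(members),
        )
        for members in cells.values()
    ]


# 5,000 synthetic delivery positions laid out on a regular lattice.
points = [
    (40.70 + (i % 50) * 0.002, -74.02 + (i // 50) * 0.001) for i in range(5000)
]
viewport = (40.70, -74.02, 40.75, -73.97)
clusters = cluster_markers(points, viewport)
```

Each cluster carries a count, so the client can draw one badge per cluster and expand it on zoom instead of rendering thousands of individual markers.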
Operational concerns: 3 items
- 15-minute rolling deploys with mixed versions cause data inconsistencies
- No structured logging makes debugging production issues nearly impossible
- Zero distributed tracing across 12 servers makes latency root-causing hopeless
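Structured logging, as flagged above, mostly means emitting one machine-parseable record per event with consistent fields. Here is a minimal sketch using Python's standard `logging` module. The field names (`order_id`, `request_id`, `duration_ms`) and the `context` convention are assumptions, not anything the reviewed app is known to use. Carrying a `request_id` on every line is a cheap first step toward correlating requests across servers before full tracing is in place.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Renders each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge per-call fields passed via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


# Log to an in-memory stream so the demo output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info(
    "order status updated",
    extra={"context": {"order_id": "ord_42", "request_id": "req_abc", "duration_ms": 187}},
)

logged = json.loads(stream.getvalue())
```

Once every line is JSON, a log aggregator can filter by `order_id` or `request_id` directly instead of grepping free text.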
What to change
- 01 Add composite indexes on orders(status, updated_at) and driver_locations(driver_id, timestamp)
- 02 Move Stripe webhook processing to async SQS queue with idempotency
- 03 Implement Redis-based caching for restaurant search with 1-hour TTL
- 04 Set up PostgreSQL read replicas for all location and search queries
- 05 Replace polling with WebSocket connections for real-time tracking updates
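The idempotency part of item 02 can be sketched independently of SQS. The idea is to record each webhook event ID under a uniqueness constraint, in the same transaction as the side effect, so a retry of the same event becomes a no-op. The example uses an in-memory SQLite database as a stand-in for Postgres, and the table names, `handle_payment_event`, and the event ID are all hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The primary key on event_id is what rejects duplicate deliveries.
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
    return conn


def handle_payment_event(
    conn: sqlite3.Connection, event_id: str, order_id: str, amount_cents: int
) -> bool:
    """Applies the event once; returns False if it was already processed."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate event_id: a retry of an event that was already handled.
        return False


conn = make_db()
first_delivery = handle_payment_event(conn, "evt_123", "ord_42", 2599)
retry_delivery = handle_payment_event(conn, "evt_123", "ord_42", 2599)
charge_count = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Because the event record and the charge are written in one transaction, a crash between the two cannot leave an event marked as processed without its charge, or the reverse.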
Monday morning todo
- Add composite index on orders(status, updated_at) and driver_locations(driver_id, timestamp) this week
- Set up one PostgreSQL read replica and route all location queries to it
- Implement Redis caching for restaurant search with 1-hour expiration
- Move Stripe webhook processing to SQS with proper idempotency keys
- Add marker clustering to map view to limit visible markers to 100
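To see why the first todo item matters, it helps to look at a query plan before and after the index exists. Below is a small sketch using an in-memory SQLite database as a stand-in for Postgres. The table layout, index name, and sample data are assumptions. The planner output differs between engines, but the shift from a full scan to an index search is the same idea. On the live Postgres tables you would check with EXPLAIN and likely build the index with CREATE INDEX CONCURRENTLY to avoid blocking writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE driver_locations ("
    "driver_id INTEGER, timestamp INTEGER, lat REAL, lng REAL)"
)
# 50 drivers with 200 location pings each (synthetic data).
conn.executemany(
    "INSERT INTO driver_locations VALUES (?, ?, ?, ?)",
    [
        (driver, tick, 40.7 + driver * 0.001, -74.0 + tick * 0.0001)
        for driver in range(50)
        for tick in range(200)
    ],
)

LATEST_LOCATION = (
    "SELECT timestamp, lat, lng FROM driver_locations "
    "WHERE driver_id = ? ORDER BY timestamp DESC LIMIT 1"
)


def query_plan(connection: sqlite3.Connection) -> str:
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail column.
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + LATEST_LOCATION, (7,)
    ).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(conn)  # full scan plus a sort for ORDER BY

conn.execute(
    "CREATE INDEX idx_driver_locations_driver_ts "
    "ON driver_locations (driver_id, timestamp)"
)

plan_after = query_plan(conn)   # index search on driver_id, already ordered
latest = conn.execute(LATEST_LOCATION, (7,)).fetchone()
```

The column order matters: putting `driver_id` first lets the engine jump to one driver's rows, and `timestamp` second means those rows are already sorted for the `ORDER BY ... LIMIT 1`.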