Verdict.

A masterclass in how to build a real-time delivery app that can't handle real-time delivery.

Roasted by DesignCopilot

3/10

Fragile

This is a textbook example of scaling a monolith without understanding the workload. You're hammering Postgres with 30K queries per minute for live tracking, processing webhooks inline until they time out, and storing 5 years of GPS coordinates in a table with no indexes. The architecture worked fine at 1K users; at 50K, it's a ticking time bomb. The good news: most fixes are straightforward database and caching wins.

Top issues

Critical
Every active tracking query runs SELECT * across the 800GB orders table every 2 seconds with no proper indexing
This is why your database CPU is pegged at 95% during dinner rush — you're doing full table scans 30,000 times per minute
Add composite index on (status, updated_at) and stop selecting all 47 columns when you only need 5
Critical
Driver location lookups use ORDER BY timestamp DESC LIMIT 1 with no index on (driver_id, timestamp)
Each location lookup triggers a full table scan of millions of GPS coordinates, creating 4+ second map delays
Add composite index on (driver_id, timestamp) and consider a separate current_locations table
Critical
Stripe webhooks processed synchronously with no deduplication locks, causing double-charges on retries
You're literally charging customers multiple times when Stripe retries slow webhooks, creating chargeback liability
Process webhooks asynchronously via SQS, with idempotency keys and proper row locking
High
All 50K users share a single Redis session store with no replication
When that t3.medium dies, every customer gets force-logged out during peak hours
Set up Redis cluster with failover or switch to stateless JWT refresh tokens
High
Customer map view attempts to render 5,000 concurrent delivery markers without clustering
Browser performance tanks when rendering thousands of DOM elements, making the map unusable
Implement marker clustering or viewport-based filtering to show max 100 markers
Medium
Rate limiting uses in-memory Maps across 12 servers, making it completely ineffective
Attackers can bypass limits by hitting different servers, and normal users get inconsistent limiting
Move rate limiting to Redis or add a proper API gateway like Kong
Medium
Background job assignment runs as a cron job on a single EC2 instance, with a 23-minute mean recovery time
When the job server crashes during dinner rush, no new deliveries get assigned until someone manually notices
Move to SQS with multiple workers or use ECS Scheduled Tasks with health checks
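
The indexing advice in the first two issues can be sanity-checked at small scale. The sketch below is only an illustration: it uses Python's built-in sqlite3 as a stand-in for Postgres so that it runs anywhere, and the column layout (lat, lng), the index name idx_driver_ts, and the sample data are all assumptions. The CREATE INDEX statement has the same shape in Postgres, where EXPLAIN plays the role of EXPLAIN QUERY PLAN.

```python
import sqlite3

# In-memory SQLite database standing in for the real Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE driver_locations ("
    " driver_id INTEGER, timestamp INTEGER, lat REAL, lng REAL)"
)

# Hypothetical sample data: 50 drivers with 200 GPS pings each.
rows = [
    (d, t, 40.0 + d * 0.001, -74.0 + t * 0.001)
    for d in range(50)
    for t in range(200)
]
conn.executemany("INSERT INTO driver_locations VALUES (?, ?, ?, ?)", rows)

# The "latest position for one driver" lookup described in the review.
LATEST = (
    "SELECT lat, lng FROM driver_locations"
    " WHERE driver_id = ? ORDER BY timestamp DESC LIMIT 1"
)


def plan(sql, params):
    """Return the human-readable query plan steps joined into one string."""
    steps = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(step[3] for step in steps)


plan_before = plan(LATEST, (7,))  # no index yet: the whole table is scanned
conn.execute(
    "CREATE INDEX idx_driver_ts ON driver_locations (driver_id, timestamp)"
)
plan_after = plan(LATEST, (7,))  # the composite index is now used

latest = conn.execute(LATEST, (7,)).fetchone()
print("before:", plan_before)
print("after: ", plan_after)
print("latest position for driver 7:", latest)
```

On a large, busy Postgres table, building the index with CREATE INDEX CONCURRENTLY avoids blocking writes while it is created; whether that trade-off fits here depends on details the review does not give.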

Concerns

Overengineering (none flagged)

Underengineering (3 items)

No database connection pooling mentioned for a workload hitting Postgres 30K times per minute
No CDN caching for restaurant search results that rarely change
Zero redundancy in critical job processing for a real-time delivery platform

Missing pieces (4 items)

Read replicas for the massive SELECT workload destroying your primary database
Message queue for async webhook processing and notifications
Application-level caching layer (Redis) for restaurant search and static data
Error tracking and APM for debugging production performance issues

Single points of failure (4 items)

Single Redis session store taking down all 50K active users when it fails
Single PostgreSQL primary handling all reads and writes with no failover
One EC2 box running all background job assignment with manual recovery
Single Stripe webhook endpoint with no backup processing mechanism

Security concerns (3 items)

CORS allows *, meaning any website can make API calls on behalf of logged-in users
JWT tokens in localStorage with 7-day expiry are vulnerable to XSS attacks
No proper webhook signature validation, plus race conditions in Stripe processing
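
To make the signature-validation point concrete, here is a minimal sketch of the check Stripe documents for webhooks: the Stripe-Signature header carries a timestamp (t=) and one or more v1 signatures, each an HMAC-SHA256 of "timestamp.payload" keyed with the endpoint's signing secret. The function names, the 300-second tolerance default, and the test secret are assumptions for illustration. In a real service the official Stripe library's stripe.Webhook.construct_event is likely the better choice than hand-rolled code.

```python
import hashlib
import hmac
import time
from typing import Optional


def make_header(payload: bytes, secret: str, timestamp: int) -> str:
    """Test helper (hypothetical): build a header in the documented format."""
    signed = f"{timestamp}.".encode() + payload
    digest = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def verify_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True only if the header matches the raw payload and is fresh."""
    pairs = [
        item.strip().split("=", 1) for item in header.split(",") if "=" in item
    ]
    timestamps = [value for key, value in pairs if key == "t"]
    signatures = [value for key, value in pairs if key == "v1"]
    if not timestamps or not signatures:
        return False
    try:
        timestamp = int(timestamps[0])
    except ValueError:
        return False

    # Reject stale timestamps so that captured requests cannot be replayed.
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance:
        return False

    # Sign the raw request bytes exactly as received, before any JSON parsing.
    signed = f"{timestamp}.".encode() + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to avoid leaking match position.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in signatures
    )


body = b'{"id": "evt_test", "type": "payment_intent.succeeded"}'
header = make_header(body, "whsec_test_secret", 1_700_000_000)
print(verify_signature(body, header, "whsec_test_secret", now=1_700_000_010))
```

Signature checking addresses spoofed requests only. The race conditions mentioned in the same item need separate handling, for example deduplication on the event ID.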

Scaling concerns (4 items)

Database will collapse when delivery volume doubles due to unindexed queries
Frontend polling every 2 seconds doesn't scale past 100K concurrent users
Synchronous notification sending will block order updates under high load
Marker rendering will crash browsers when concurrent deliveries exceed 10K
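
For the marker-rendering concern, one simple approach is to combine viewport filtering with grid clustering: drop everything outside the visible map bounds, then bucket what remains into a fixed grid so the number of rendered markers has a hard cap. The sketch below is an assumption-laden illustration, not the product's actual map code. The names, the coordinates, and the 10x10 grid derived from the 100-marker cap are all hypothetical, and it ignores viewports that cross the antimeridian.

```python
import math
import random
from collections import defaultdict
from typing import List, NamedTuple, Sequence, Tuple


class Cluster(NamedTuple):
    lat: float   # centroid latitude of the markers in this grid cell
    lng: float   # centroid longitude of the markers in this grid cell
    count: int   # how many deliveries the cluster represents


Bounds = Tuple[float, float, float, float]  # south, west, north, east


def cluster_markers(
    points: Sequence[Tuple[float, float]],
    bounds: Bounds,
    max_markers: int = 100,
) -> List[Cluster]:
    """Keep only points inside the viewport and merge them per grid cell."""
    south, west, north, east = bounds
    side = max(1, math.isqrt(max_markers))     # 100 markers -> 10x10 grid
    lat_step = (north - south) / side or 1.0   # guard zero-height viewports
    lng_step = (east - west) / side or 1.0     # guard zero-width viewports

    cells = defaultdict(list)
    for lat, lng in points:
        if not (south <= lat <= north and west <= lng <= east):
            continue  # viewport filtering: skip markers that are off screen
        row = min(int((lat - south) / lat_step), side - 1)
        col = min(int((lng - west) / lng_step), side - 1)
        cells[(row, col)].append((lat, lng))

    return [
        Cluster(
            sum(p[0] for p in cell) / len(cell),
            sum(p[1] for p in cell) / len(cell),
            len(cell),
        )
        for cell in cells.values()
    ]


# Hypothetical data: 5,000 deliveries spread over a city-sized area.
rng = random.Random(42)
deliveries = [
    (rng.uniform(40.60, 40.90), rng.uniform(-74.10, -73.80))
    for _ in range(5000)
]
viewport = (40.70, -74.02, 40.80, -73.92)
clusters = cluster_markers(deliveries, viewport)
print(len(clusters), "markers represent", sum(c.count for c in clusters), "deliveries")
```

Because the grid has at most side * side cells, the output never exceeds max_markers regardless of how many deliveries are active.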

Operational concerns (3 items)

15-minute rolling deploys with mixed versions cause data inconsistencies
No structured logging makes debugging production issues nearly impossible
Zero distributed tracing across 12 servers makes latency root-causing hopeless

What to change

01. Add composite indexes on orders(status, updated_at) and driver_locations(driver_id, timestamp)
02. Move Stripe webhook processing to async SQS queue with idempotency
03. Implement Redis-based caching for restaurant search with 1-hour TTL
04. Set up PostgreSQL read replicas for all location and search queries
05. Replace polling with WebSocket connections for real-time tracking updates
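
The idempotency part of the webhook change can be sketched as follows. The idea is to record each event ID under a uniqueness constraint in the same transaction that applies its effect, so a retried delivery of the same event becomes a no-op. This is a minimal sketch under stated assumptions: sqlite3 stands in for Postgres, the table names and the flattened event fields (order_id, amount_cents) are hypothetical, and a queue worker would call handle_event once per message. In Postgres the same effect is commonly achieved with INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key on event_id is what rejects a second copy of an event.
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE payments (order_id TEXT, amount_cents INTEGER)")


def handle_event(event: dict) -> bool:
    """Apply an event at most once; return False for a duplicate delivery."""
    try:
        # "with conn" wraps both inserts in one transaction: it commits on
        # success and rolls back if either statement raises.
        with conn:
            conn.execute(
                "INSERT INTO processed_events VALUES (?)", (event["id"],)
            )
            conn.execute(
                "INSERT INTO payments VALUES (?, ?)",
                (event["order_id"], event["amount_cents"]),
            )
    except sqlite3.IntegrityError:
        return False  # event ID already recorded, so this is a retry
    return True


# Hypothetical event, delivered twice as a webhook retry would be.
event = {"id": "evt_123", "order_id": "order_42", "amount_cents": 2599}
first = handle_event(event)
second = handle_event(event)
payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(first, second, payment_count)
```

Keeping the deduplication record and the side effect in one transaction matters: if they were committed separately, a crash between the two could either lose the payment record or allow it to be applied twice.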

Improved architecture

Monday morning todo

Add composite index on orders(status, updated_at) and driver_locations(driver_id, timestamp) this week
Set up one PostgreSQL read replica and route all location queries to it
Implement Redis caching for restaurant search with 1-hour expiration
Move Stripe webhook processing to SQS with proper idempotency keys
Add marker clustering to map view to limit visible markers to 100
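
The restaurant-search caching item follows the cache-aside pattern: check the cache, fall back to the database on a miss, and store the result with an expiry. The sketch below uses an in-process dictionary as a stand-in for Redis so that it runs without a server. The class, the key format, and the loader function are hypothetical. With redis-py, the store step would roughly correspond to a set call with ex=3600.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny cache-aside helper; a dict stands in for Redis in this sketch."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling loader only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value


# Fake clock and a loader that counts calls, standing in for the database.
fake_now = [0.0]
db_calls = [0]


def search_restaurants() -> list:
    db_calls[0] += 1
    return ["Thai Palace", "Pizza Corner"]  # hypothetical search results


cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
cache.get_or_load("search:thai:10001", search_restaurants)  # miss, hits the DB
cache.get_or_load("search:thai:10001", search_restaurants)  # hit, served from cache
calls_within_ttl = db_calls[0]

fake_now[0] = 3601.0  # just over one hour later
cache.get_or_load("search:thai:10001", search_restaurants)  # expired, reloads
calls_after_expiry = db_calls[0]
print(calls_within_ttl, calls_after_expiry)
```

A one-hour TTL means menu or availability changes can take up to an hour to appear in search, so data that changes faster may need a shorter expiry or explicit invalidation.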
