Event-Driven Microservices: When to Break the Monolith (and When Not To)
We migrated from a modular monolith to event-driven microservices. Here's the honest story — what went well, what went wrong, and the 3 questions you should ask before starting your own migration.
The Problem
Our modular monolith was working fine. That’s the uncomfortable truth. It served 50K requests per minute, had 99.9% uptime, and a team of 8 engineers could ship features in it.
But we could see the ceiling approaching:
- Deploy coordination — A change to the billing module required coordinated deploys with the notifications module
- Scaling asymmetry — The ingestion pipeline needed 10x the compute of the dashboard, but they shared the same process
- Team coupling — Two teams working in the same codebase meant constant merge conflicts and release trains
The Investigation
We spent 3 months measuring before writing a line of new code. The analysis revealed:
- Billing was coupled to 4 downstream services
- Notification was a hidden bottleneck — every team depended on it
- Ingestion had the most distinct scaling requirements
The Solution: Event-Driven Architecture
We chose event-driven over request-driven microservices for one reason: decoupling. With events, the billing service doesn’t call notification — it emits a billing.invoice.paid event, and notification picks it up.
The Event Schema
{
  "id": "evt_01J2XYZ...",
  "type": "billing.invoice.paid",
  "source": "billing-service",
  "time": "2026-05-20T14:30:00Z",
  "data": {
    "invoice_id": "inv_20260520_001",
    "customer_id": "cus_abc123",
    "amount": 49900,
    "currency": "USD"
  },
  "specversion": "1.0"
}
We used the CloudEvents spec for interoperability.
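For illustration, an envelope of this shape can be assembled with nothing but the standard library. This is a minimal sketch rather than our production code: `make_event` is a hypothetical helper name, and the `evt_` prefix plus a UUID is an assumed ID scheme, not necessarily the one behind the IDs shown above.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(event_type: str, source: str, data: dict) -> dict:
    """Build a CloudEvents 1.0 style envelope (hypothetical helper)."""
    return {
        "specversion": "1.0",
        "id": f"evt_{uuid.uuid4().hex}",  # assumed ID scheme, unique per event
        "type": event_type,
        "source": source,
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": data,
    }


event = make_event(
    "billing.invoice.paid",
    "billing-service",
    {
        "invoice_id": "inv_20260520_001",
        "customer_id": "cus_abc123",
        "amount": 49900,
        "currency": "USD",
    },
)
payload = json.dumps(event).encode("utf-8")  # bytes ready to hand to a producer
```

Keeping the envelope in one helper means every service stamps `id`, `time`, and `specversion` the same way, which matters later when consumers deduplicate on `id`.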
Key Implementation Details
Idempotent Consumers
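The hardest lesson: with the usual commit-after-processing consumer setup, Kafka gives you at-least-once delivery, not exactly-once. You will get duplicate events, for example when a consumer crashes after handling a message but before committing its offset.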
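class IdempotentConsumer:
    def __init__(self):
        # Remember processed event IDs for 24 hours (86400 seconds).
        self.processed = RedisSet("processed-events", ttl=86400)

    async def process(self, event: CloudEvent):
        # Duplicate delivery: this event was already handled, so skip it.
        if await self.processed.contains(event.id):
            return
        await self.handle(event)
        # Record the ID only after the handler succeeds, so failures are retried.
        await self.processed.add(event.id)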
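To see the duplicate handling in action without Redis, here is a self-contained sketch. An in-memory set stands in for the Redis-backed store, and `DemoConsumer`, `seen`, and `handled` are illustrative names, not part of our codebase.

```python
import asyncio


class DemoConsumer:
    """Idempotent consumer with an in-memory set standing in for Redis."""

    def __init__(self):
        self.seen = set()    # processed event IDs (no TTL in this sketch)
        self.handled = []    # side effects, recorded so we can inspect them

    async def handle(self, event: dict):
        self.handled.append(event["id"])

    async def process(self, event: dict):
        if event["id"] in self.seen:
            return  # duplicate delivery: drop it
        await self.handle(event)
        self.seen.add(event["id"])  # mark done only after success


async def main():
    consumer = DemoConsumer()
    # The broker redelivers evt_1, as at-least-once delivery allows.
    for event_id in ["evt_1", "evt_2", "evt_1"]:
        await consumer.process({"id": event_id})
    return consumer.handled


handled = asyncio.run(main())
print(handled)  # ['evt_1', 'evt_2']
```

Marking the ID after the handler runs means a crash between the two steps replays the event, so the handler itself should still be safe to repeat.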
Dead Letter Queue
Not all events can be processed. We built a DLQ with automatic retry:
class DeadLetterQueue:
    def __init__(self, topic: str, producer, max_retries: int = 3):
        self.producer = producer  # producer used to republish failed events
        self.dlq_topic = f"{topic}.dlq"
        self.retry_topic = f"{topic}.retry"
        self.max_retries = max_retries

    async def handle_failure(self, event: CloudEvent, error: Exception):
        # The retry count travels with the event itself.
        retry_count = event.get("retry_count", 0)
        if retry_count < self.max_retries:
            event["retry_count"] = retry_count + 1
            await self.producer.send(self.retry_topic, event)
        else:
            # Retries exhausted: park the event on the DLQ topic.
            await self.producer.send(self.dlq_topic, event)
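The routing rule is easier to reason about when you can watch it run. The sketch below replays one permanently failing event through the same rule as `handle_failure`, using plain dicts and a fake producer. `FakeProducer` and `route_failure` are hypothetical names for this illustration, and the `billing` topic is an assumption.

```python
import asyncio


class FakeProducer:
    """Records sends instead of talking to Kafka (test double)."""

    def __init__(self):
        self.sent = []

    async def send(self, topic: str, event: dict):
        self.sent.append((topic, event.get("retry_count", 0)))


async def route_failure(producer, topic: str, event: dict, max_retries: int = 3):
    """Same routing rule as handle_failure, applied to plain dicts."""
    retry_count = event.get("retry_count", 0)
    if retry_count < max_retries:
        event["retry_count"] = retry_count + 1
        await producer.send(f"{topic}.retry", event)
    else:
        await producer.send(f"{topic}.dlq", event)


async def main():
    producer = FakeProducer()
    event = {"id": "evt_1", "type": "billing.invoice.paid"}
    # Simulate a handler that fails on every attempt.
    for _ in range(4):
        await route_failure(producer, "billing", event)
    return producer.sent


sent = asyncio.run(main())
print(sent)
# [('billing.retry', 1), ('billing.retry', 2), ('billing.retry', 3), ('billing.dlq', 3)]
```

With `max_retries=3` the event is attempted four times in total: the original delivery plus three retries, after which it lands on the DLQ topic.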
The Results
| Metric | Before (Monolith) | After (Event-Driven) |
|---|---|---|
| Deploy time | 45 min (coordinated) | 8 min (independent) |
| P95 latency | 320ms | 280ms |
| Team throughput | 3 features/sprint | 7 features/sprint |
| Incidents/month | 4 | 2 |
| Infrastructure cost | $12K/mo | $15K/mo |
What Went Wrong
1. Event schema evolution
We didn’t plan for schema changes. When the billing team added a field to the event payload, the notification service crashed.
Fix: We adopted Avro with Schema Registry, enforcing backward compatibility.
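One common cause of this kind of crash is a consumer that rejects fields it does not recognize. Independently of the serialization format, a tolerant reader sidesteps that. The sketch below is an illustration under assumptions: `InvoicePaid`, `parse_invoice_paid`, and the added `tax_amount` field are hypothetical, and it uses plain dataclasses rather than Avro or Schema Registry.

```python
from dataclasses import dataclass, fields


@dataclass
class InvoicePaid:
    invoice_id: str
    customer_id: str
    amount: int
    currency: str = "USD"  # a default lets older producers omit the field


def parse_invoice_paid(data: dict) -> InvoicePaid:
    """Keep only the fields this consumer knows; ignore anything newer."""
    known = {f.name for f in fields(InvoicePaid)}
    return InvoicePaid(**{k: v for k, v in data.items() if k in known})


# A newer producer adds tax_amount (hypothetical field); parsing still works.
new_payload = {
    "invoice_id": "inv_20260520_001",
    "customer_id": "cus_abc123",
    "amount": 49900,
    "currency": "USD",
    "tax_amount": 4158,
}
invoice = parse_invoice_paid(new_payload)
print(invoice.amount)  # 49900
```

The two rules at work are the ones a schema registry enforces mechanically: unknown fields are ignored, and any field a producer may omit has a default.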
2. Observability debt
In the monolith, a single trace covered the entire request. With events, we lost that.
Fix: We added OpenTelemetry instrumentation with trace context propagation through Kafka message headers.
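In practice OpenTelemetry's propagation API handles this for you, but it helps to see what actually travels in the header. The sketch below builds and parses a W3C Trace Context `traceparent` value by hand. It assumes message headers are a list of `(key, bytes)` pairs, and all four function names are hypothetical.

```python
import secrets
from typing import Optional


def new_traceparent() -> str:
    """Start a trace: version 00, random trace ID and span ID, sampled flag 01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def inject(headers: list, traceparent: str) -> list:
    """Producer side: attach the trace context as a message header."""
    return headers + [("traceparent", traceparent.encode("ascii"))]


def extract(headers: list) -> Optional[dict]:
    """Consumer side: recover the trace ID and parent span ID, if present."""
    for key, value in headers:
        if key == "traceparent":
            _version, trace_id, parent_id, flags = value.decode("ascii").split("-")
            return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}
    return None


def child_traceparent(ctx: dict) -> str:
    """Continue the same trace in the consumer with a fresh span ID."""
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


producer_tp = new_traceparent()
headers = inject([], producer_tp)          # sent along with the event
ctx = extract(headers)                     # read back in the consumer
consumer_tp = child_traceparent(ctx)       # same trace ID, new span ID
```

Because the trace ID survives the hop through the broker, the producer span and the consumer span show up in one trace again, which is what the monolith gave us for free.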
3. Testing complexity
Testing an event flow requires running Kafka, the producer service, and the consumer service.
Fix: We built a test harness that uses an in-memory event bus for unit tests and reserves real Kafka for integration tests.
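An in-memory bus for unit tests can be very small. The following is a rough sketch of the idea, not our actual harness: `InMemoryEventBus` and its methods are hypothetical, and `billing.invoice.voided` is an invented event type used only to show that unrelated events are not delivered.

```python
from collections import defaultdict


class InMemoryEventBus:
    """Synchronous stand-in for Kafka in unit tests (hypothetical helper)."""

    def __init__(self):
        self.handlers = defaultdict(list)  # event type -> subscribed handlers
        self.published = []                # every event, kept for assertions

    def subscribe(self, event_type: str, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event: dict):
        self.published.append(event)
        for handler in self.handlers[event["type"]]:
            handler(event)


# Unit test: billing emits, notification reacts, no broker involved.
bus = InMemoryEventBus()
emails = []
bus.subscribe(
    "billing.invoice.paid",
    lambda e: emails.append(e["data"]["customer_id"]),
)
bus.publish({"type": "billing.invoice.paid", "data": {"customer_id": "cus_abc123"}})
bus.publish({"type": "billing.invoice.voided", "data": {"customer_id": "cus_abc123"}})
print(emails)  # ['cus_abc123']
```

A synchronous bus like this hides redelivery, ordering, and partitioning behavior, which is why those cases belong in the Kafka-backed integration tests.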
When NOT to Break the Monolith
If I could go back, I’d ask these 3 questions first:
- Is the monolith actually the bottleneck? If your deploys take 10 minutes and your team is 4 people, don’t migrate.
- Can you modularize in-place? Strict module boundaries, shared nothing, and clear interfaces can get you 80% of the benefit without the operational cost.
- Do you have observability? If you can’t trace a request through your monolith, you definitely can’t trace it through 12 microservices.