Event-Driven Microservices: When to Break the Monolith (and When Not To)

The Problem

Our modular monolith was working fine. That’s the uncomfortable truth. It served 50K requests per minute, had 99.9% uptime, and a team of 8 engineers could ship features in it.

But we could see the ceiling approaching:

Deploy coordination — A change to the billing module required coordinated deploys with the notifications module
Scaling asymmetry — The ingestion pipeline needed 10x the compute of the dashboard, but they shared the same process
Team coupling — Two teams working in the same codebase meant constant merge conflicts and release trains

The Investigation

We spent 3 months measuring before writing a line of new code. The analysis revealed:

Billing was coupled to 4 downstream services
Notification was a hidden bottleneck — every team depended on it
Ingestion had the most distinct scaling requirements

The Solution: Event-Driven Architecture

We chose event-driven over request-driven microservices for one reason: decoupling. With events, the billing service doesn’t call notification — it emits a billing.invoice.paid event, and notification picks it up.

The Event Schema

{
  "id": "evt_01J2XYZ...",
  "type": "billing.invoice.paid",
  "source": "billing-service",
  "time": "2026-05-20T14:30:00Z",
  "data": {
    "invoice_id": "inv_20260520_001",
    "customer_id": "cus_abc123",
    "amount": 49900,
    "currency": "USD"
  },
  "specversion": "1.0"
}

We used the CloudEvents spec for interoperability.
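
As a rough illustration of producing that envelope, here is a minimal sketch using plain dicts. The helper names `make_event` and `missing_attributes` are hypothetical, and the uuid4-based ID is an assumption (the ID scheme behind `evt_01J2XYZ...` is not shown here). The required-attribute list follows CloudEvents 1.0, where `id`, `source`, `type`, and `specversion` are mandatory and `time` is optional.

```python
import uuid
from datetime import datetime, timezone

# Attributes that CloudEvents 1.0 marks as required on every event.
REQUIRED_ATTRIBUTES = ("id", "source", "type", "specversion")


def make_event(event_type: str, source: str, data: dict) -> dict:
    """Build a CloudEvents-style envelope around a domain payload."""
    return {
        "id": f"evt_{uuid.uuid4().hex}",  # assumption: any unique string works
        "type": event_type,
        "source": source,
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": data,
        "specversion": "1.0",
    }


def missing_attributes(event: dict) -> list:
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]


event = make_event(
    "billing.invoice.paid",
    "billing-service",
    {"invoice_id": "inv_20260520_001", "amount": 49900, "currency": "USD"},
)
```

Running `missing_attributes` on the producer side before publishing is a cheap way to catch a malformed envelope before any consumer sees it.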

Key Implementation Details

Idempotent Consumers

The hardest lesson: in the common configuration, where offsets are committed after processing, Kafka gives you at-least-once delivery, not exactly-once. You will get duplicate events.

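class IdempotentConsumer:
    """Skips events whose ID has already been handled."""

    def __init__(self):
        # RedisSet is a small helper (not shown) around a Redis set.
        # Processed IDs expire after 24 hours (86400 seconds).
        self.processed = RedisSet("processed-events", ttl=86400)

    async def process(self, event: CloudEvent):
        # Duplicate delivery: this ID was already handled, so skip it.
        if await self.processed.contains(event.id):
            return
        await self.handle(event)
        # Record the ID only after handle() succeeds, so a failed event is retried.
        await self.processed.add(event.id)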
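
One gap in a check-then-add approach is that two concurrent deliveries of the same event can both pass the check before either records the ID. A possible variant, sketched below under stated assumptions, claims the ID atomically before handling. `InMemoryClaimStore` and `AtomicIdempotentConsumer` are hypothetical names; the in-memory store stands in for an atomic Redis command such as `SET key value NX EX ttl`, and events are plain dicts.

```python
import asyncio
import time


class InMemoryClaimStore:
    """Hypothetical stand-in for an atomic 'set if not exists, with expiry'."""

    def __init__(self):
        self._expires_at = {}

    async def claim(self, key: str, ttl: int) -> bool:
        """Return True only for the first caller to claim `key` within `ttl` seconds."""
        now = time.monotonic()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl
        return True

    async def release(self, key: str) -> None:
        self._expires_at.pop(key, None)


class AtomicIdempotentConsumer:
    def __init__(self, store, handler, ttl: int = 86400):
        self.store = store
        self.handler = handler
        self.ttl = ttl

    async def process(self, event: dict) -> bool:
        """Handle the event once; return False when it was a duplicate."""
        if not await self.store.claim(event["id"], self.ttl):
            return False
        try:
            await self.handler(event)
        except Exception:
            # Release the claim so a redelivery can retry the failed event.
            await self.store.release(event["id"])
            raise
        return True
```

The trade-off is that a process crash in the middle of handling leaves the claim in place until the TTL expires, so a redelivery inside that window is skipped. Which failure mode is preferable depends on whether the handler itself is safe to repeat.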

Dead Letter Queue

Not all events can be processed. We built a DLQ with automatic retry:

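class DeadLetterQueue:
    """Routes failed events to a retry topic, then to the DLQ once retries run out."""

    def __init__(self, producer, topic: str, max_retries: int = 3):
        # Producer used to publish to the retry and DLQ topics.
        self.producer = producer
        self.dlq_topic = f"{topic}.dlq"
        self.retry_topic = f"{topic}.retry"
        self.max_retries = max_retries

    async def handle_failure(self, event: CloudEvent, error: Exception):
        # retry_count travels with the event, so the budget survives re-publishing.
        retry_count = event.get("retry_count", 0)
        if retry_count < self.max_retries:
            event["retry_count"] = retry_count + 1
            await self.producer.send(self.retry_topic, event)
        else:
            # Retries exhausted: park the event on the DLQ topic.
            await self.producer.send(self.dlq_topic, event)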
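
To make the routing concrete, the sketch below walks one event that always fails through the same rule. It is only an illustration: `RecordingProducer` is a hypothetical in-memory stand-in for a Kafka producer, `route_failure` is a hypothetical free-function version of the rule above, and events are plain dicts.

```python
import asyncio


class RecordingProducer:
    """Hypothetical stand-in for a Kafka producer; records sends in memory."""

    def __init__(self):
        self.sent = []

    async def send(self, topic: str, event: dict) -> None:
        self.sent.append((topic, dict(event)))


async def route_failure(producer, topic: str, event: dict, max_retries: int = 3) -> None:
    """Retry until the budget is spent, then dead-letter the event."""
    retry_count = event.get("retry_count", 0)
    if retry_count < max_retries:
        retried = {**event, "retry_count": retry_count + 1}
        await producer.send(f"{topic}.retry", retried)
    else:
        await producer.send(f"{topic}.dlq", event)


async def run_poison_event() -> list:
    """Simulate a handler that fails on every attempt for a single event."""
    producer = RecordingProducer()
    event = {"id": "evt_1", "type": "billing.invoice.paid"}
    for _ in range(4):
        await route_failure(producer, "billing", event)
        event = producer.sent[-1][1]  # the copy a retry consumer would receive
    return [topic for topic, _ in producer.sent]


topics = asyncio.run(run_poison_event())
```

With `max_retries=3`, the event is published to the retry topic three times and lands on the DLQ topic on the fourth failure.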

The Results

Metric	Before (Monolith)	After (Event-Driven)
Deploy time	45 min (coordinated)	8 min (independent)
P95 latency	320ms	280ms
Team throughput	3 features/sprint	7 features/sprint
Incidents/month	4	2
Infrastructure cost	$12K/mo	$15K/mo

What Went Wrong

1. Event schema evolution

We didn’t plan for schema changes. When the billing team added a field to the event payload, the notification service crashed.

Fix: We adopted Avro with Schema Registry, enforcing backward compatibility.
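
For intuition, backward compatibility means a reader on the new schema can still decode data written with the previous one. For Avro records, the most common way to break that is adding a field without a default. The sketch below checks only that one rule; `breaking_fields` is a hypothetical helper, the schemas are illustrative, and a real registry check also covers type changes, aliases, and nested records.

```python
def breaking_fields(old_schema: dict, new_schema: dict) -> list:
    """Return fields added in new_schema that have no default.

    A reader using new_schema cannot fill these in when decoding
    data written with old_schema.
    """
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


# Illustrative Avro record schemas, written as Python dicts.
invoice_paid_v1 = {
    "type": "record",
    "name": "InvoicePaid",
    "fields": [
        {"name": "invoice_id", "type": "string"},
        {"name": "amount", "type": "long"},
    ],
}

# Adds a field with a default: old data can still be read.
invoice_paid_v2 = {
    **invoice_paid_v1,
    "fields": invoice_paid_v1["fields"]
    + [{"name": "currency", "type": "string", "default": "USD"}],
}

# Adds a field without a default: old data has nothing to put there.
invoice_paid_bad = {
    **invoice_paid_v1,
    "fields": invoice_paid_v1["fields"] + [{"name": "tax", "type": "long"}],
}
```

Here `breaking_fields(invoice_paid_v1, invoice_paid_v2)` is empty, while the same call with `invoice_paid_bad` flags `tax`.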

2. Observability debt

In the monolith, a single trace covered the entire request. With events, we lost that: the trace stopped at the producer, because nothing carried the trace context across Kafka.

Fix: We added OpenTelemetry instrumentation with trace context propagation through Kafka message headers.
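
OpenTelemetry's propagators normally handle this step, so the hand-rolled helpers below are not how the instrumentation is wired up. They only show what travels with each message, assuming the W3C Trace Context `traceparent` format and Kafka headers represented as a list of `(key, bytes)` pairs. The names `inject_traceparent` and `extract_traceparent` are hypothetical.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_traceparent(headers: list, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Producer side: append the current trace context to the message headers."""
    flags = "01" if sampled else "00"
    value = f"00-{trace_id}-{span_id}-{flags}"
    headers.append(("traceparent", value.encode("ascii")))


def extract_traceparent(headers: list):
    """Consumer side: recover the trace context, or None if absent or malformed."""
    for key, value in headers:
        if key != "traceparent":
            continue
        match = TRACEPARENT.match(value.decode("ascii"))
        if match is None:
            return None
        trace_id, parent_span_id, flags = match.groups()
        return {
            "trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": int(flags, 16) & 1 == 1,  # lowest bit is the sampled flag
        }
    return None


# Producer: attach the context of the span that emitted the event.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
message_headers = []
inject_traceparent(message_headers, trace_id, span_id)

# Consumer: start its span as a child of the producer's span.
context = extract_traceparent(message_headers)
```

Because the consumer reuses the producer's trace ID and treats the producer's span as its parent, both sides of the Kafka hop show up in one trace.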

3. Testing complexity

Testing an event flow requires running Kafka, the producer service, and the consumer service.

Fix: We built a test harness that uses an in-memory event bus for unit tests and reserves real Kafka for integration tests.
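
A minimal sketch of what such an in-memory bus could look like, assuming dict events routed by their `type` field. `InMemoryEventBus` and its methods are hypothetical names, not the actual harness.

```python
import asyncio
from collections import defaultdict


class InMemoryEventBus:
    """Delivers events synchronously to subscribers, with no broker involved."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.published = []  # kept so tests can assert on what was emitted

    def subscribe(self, event_type: str, handler) -> None:
        self._handlers[event_type].append(handler)

    async def publish(self, event: dict) -> None:
        self.published.append(event)
        for handler in self._handlers[event["type"]]:
            await handler(event)


# Unit-test style usage: billing emits, notification reacts.
sent_emails = []


async def notify_customer(event: dict) -> None:
    sent_emails.append(event["data"]["customer_id"])


bus = InMemoryEventBus()
bus.subscribe("billing.invoice.paid", notify_customer)
asyncio.run(
    bus.publish({"type": "billing.invoice.paid", "data": {"customer_id": "cus_abc123"}})
)
```

A bus like this delivers each event exactly once and in order, so it cannot surface duplicates, reordering, or consumer lag. Those behaviors still need the integration tests against real Kafka.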

When NOT to Break the Monolith

If I could go back, I’d ask these 3 questions first:

Is the monolith actually the bottleneck? — If your deploys take 10 minutes and your team is 4 people, don’t migrate.
Can you modularize in-place? — Strict module boundaries, shared nothing, and clear interfaces can get you 80% of the benefit without the operational cost.
Do you have observability? — If you can’t trace a request through your monolith, you definitely can’t trace it through 12 microservices.