Engineering

Exactly-Once Processing Is a Myth
(Here's What Actually Works)

9 min read

Key Takeaway

Exactly-once delivery is physically impossible because networks are unreliable. In distributed systems, you either get at-least-once (duplicates) or at-most-once (data loss). To simulate exactly-once results, you must accept retries and use idempotency keys to ensure the side effects only happen once.

Every system diagram says it: "Exactly-once processing." Sounds perfect. Sounds safe. Sounds like what you want.

But in distributed systems, exactly-once is not a guarantee — it's an illusion.

Whether you're dealing with Stripe webhooks, Zapier automations, payment buttons, or AI agents — the fundamental challenge is the same: networks are unreliable, and retries are inevitable.

🚨 The Hard Truth

Networks fail. Packets drop. Connections reset. Responses get lost.

When two systems talk, there are only two things you can know:

  1. You sent the request.
  2. You didn't receive confirmation.

You can never know for sure whether the other side processed it. So systems retry. And retries mean duplicates are inevitable.

The Two Generals Problem:

This is a classic thought experiment in distributed systems. Two armies need to coordinate an attack, but their only communication is through messengers who might be captured. Even if General A receives confirmation from General B, General B doesn't know if General A received that confirmation. This requires another confirmation, which itself needs confirmation—an infinite loop. There is no algorithm that guarantees both parties know the message was received in an unreliable network.

This fundamental limitation of distributed systems is why exactly-once delivery is impossible at the network level.
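The bind is easy to see in code. In the sketch below (the lossy channel, handler, and loss parameters are all illustrative), a timeout looks identical to the sender whether the request or the acknowledgment was lost, so its only safe move is to retry:

```python
import random

def send_with_retries(request_id, handler, rng, req_loss, ack_loss, max_attempts=3):
    """At-least-once sender over a lossy channel.

    A timeout looks the same whether the request or the ack was lost,
    so the sender's only safe move is to retry -- possibly redelivering
    a request the receiver already processed.
    """
    deliveries = 0
    for _ in range(max_attempts):
        if rng.random() < req_loss:    # request lost: receiver never saw it
            continue                   # timeout -> retry
        handler(request_id)            # receiver processes the request
        deliveries += 1
        if rng.random() >= ack_loss:   # ack survived: sender can stop
            return deliveries
        # ack lost: the request WAS processed, but the sender must assume it wasn't
    return deliveries

# Every ack is lost: the receiver processes the same request three times.
processed = []
count = send_with_retries("req-1", processed.append, random.Random(0),
                          req_loss=0.0, ack_loss=1.0)
print(processed)  # ['req-1', 'req-1', 'req-1']
```

Note that the duplicates come from lost *acknowledgments*, not lost requests: the work succeeded, but the sender couldn't know.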

🔁 What Systems Actually Guarantee

Most real-world systems choose at-least-once, because losing events is far worse than duplicating them.

| Guarantee | Risk | Reality |
|---|---|---|
| At-most-once | Data loss | Messages may never arrive. |
| At-least-once | Duplicates | Messages arrive, possibly multiple times. |
| Exactly-once | Complexity | A combination of at-least-once + idempotency. |

🧠 Why "Exactly-Once" Marketing Misleads

What providers really mean when they claim "exactly-once" is: "We deliver at-least-once, and you must handle duplicates."

Even systems like Kafka only achieve exactly-once within strict internal boundaries. Once your system touches the outside world — external APIs, webhooks, or third-party services — you're back in retry-land.

Kafka's "Exactly-Once" Explained

Kafka's exactly-once semantics (EOS) work within a controlled environment:

What Kafka Actually Guarantees:

  • Idempotent Producers: Kafka assigns each message a sequence number. If the producer retries, Kafka deduplicates based on this number.
  • Transactional Reads/Writes: Kafka can write multiple messages atomically across partitions and read them transactionally.
  • Consumer Group Offset Management: Offsets are committed transactionally with processing, preventing double-reads.

The boundary: This only works inside Kafka. Once your consumer calls an external API, sends an email, or updates a database, you need application-level idempotency.
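For reference, this is roughly what turning on EOS looks like in configuration. A minimal sketch using librdkafka-style config keys as exposed by Python's confluent-kafka client; the broker address, transactional id, and group id are placeholders:

```python
# Producer side: idempotence + transactions.
producer_config = {
    "bootstrap.servers": "localhost:9092",       # placeholder
    "enable.idempotence": True,   # broker dedupes on (producer id, sequence number)
    "acks": "all",                # required for idempotence
    "transactional.id": "orders-processor-1",    # stable id enables zombie fencing
}

# Consumer side: only read messages from committed transactions, and
# commit offsets inside the transaction rather than automatically.
consumer_config = {
    "bootstrap.servers": "localhost:9092",       # placeholder
    "group.id": "orders",                        # placeholder
    "isolation.level": "read_committed",
    "enable.auto.commit": False,
}
```

None of these settings extend past the broker: the moment the consumer calls out to an external service, the guarantee ends.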

How Other Message Queues Handle This

| System | Default Guarantee | How It Works |
|---|---|---|
| RabbitMQ | At-least-once | Messages redelivered if not acked; you must dedupe |
| AWS SQS | At-least-once | Retries on visibility timeout; duplicates possible |
| Google Pub/Sub | At-least-once | Redelivery on nack or timeout; idempotency required |
| Kafka | At-least-once* | EOS within Kafka; external calls need idempotency |
| Azure Service Bus | At-least-once | Peek-lock pattern; duplicates on timeout |

💥 Real-World Examples

This isn't theoretical. Every production system faces this challenge: Stripe retries webhooks for up to 3 days if your endpoint doesn't respond in time, Zapier re-runs tasks after timeouts, and AI agents retry tool calls that appear to have failed. Each retry is a potential duplicate charge, duplicate email, or duplicate record.

✅ The Approach That Actually Works

You don't stop duplicates. You make duplicates harmless. That's idempotency.

Instead of System A → System B, you do:

System A → Idempotency Layer → System B

The layer accepts an idempotency key, stores the result of the first execution, and blocks duplicate side effects. Retries stop being dangerous.

Implementation Patterns

Here's how to implement idempotency in different scenarios:

Pattern 1: Redis-Based Idempotency (Node.js)

const redis = require('redis');
const client = redis.createClient();
client.connect(); // node-redis v4+ requires an explicit connect before commands

async function executeIdempotent(key, action) {
  // Check if we've seen this key before
  const cached = await client.get(key);
  if (cached) {
    return JSON.parse(cached); // Return cached result
  }

  // Execute the action
  const result = await action();

  // Store result with a 24h TTL (setEx in node-redis v4+).
  // Note: this check-then-set is not atomic; concurrent retries can race.
  // For strict guarantees, claim the key first with SET ... NX EX 86400.
  await client.setEx(key, 86400, JSON.stringify(result));

  return result;
}

// Usage in payment handler
app.post('/checkout', async (req, res) => {
  const idempotencyKey = req.headers['idempotency-key'];
  if (!idempotencyKey) {
    return res.status(400).json({ error: 'Idempotency-Key header required' });
  }

  const result = await executeIdempotent(idempotencyKey, async () => {
    // This only runs once, even if retried
    const charge = await stripe.charges.create({
      amount: req.body.amount,
      currency: 'usd',
      source: req.body.token
    });

    await db.orders.create({
      chargeId: charge.id,
      userId: req.user.id
    });

    return { orderId: charge.id, status: 'success' };
  });

  res.json(result);
});

Pattern 2: Database-Based Idempotency (Python)

from datetime import datetime, timedelta
import json

def execute_idempotent(db, key, action):
    # Try to fetch existing result
    result = db.query(
        "SELECT result FROM idempotency_keys WHERE key = %s",
        (key,)
    ).first()

    if result:
        return json.loads(result['result'])

    # Execute action
    action_result = action()

    # Store with timestamp. ON CONFLICT absorbs a concurrent insert, but
    # two requests that both passed the SELECT above can still run the
    # action twice -- insert the key BEFORE acting to close that race.
    db.execute(
        """INSERT INTO idempotency_keys (key, result, created_at)
           VALUES (%s, %s, %s)
           ON CONFLICT (key) DO NOTHING""",
        (key, json.dumps(action_result), datetime.now())
    )

    # Expire old keys (shown inline for brevity; in production run this
    # as a periodic background job, not on every request)
    db.execute(
        "DELETE FROM idempotency_keys WHERE created_at < %s",
        (datetime.now() - timedelta(days=1),)
    )

    return action_result

# Usage in webhook handler
@app.route('/webhooks/stripe', methods=['POST'])
def stripe_webhook():
    event = request.json
    event_id = event['id']  # Stripe's unique event ID

    result = execute_idempotent(db, event_id, lambda: {
        'user_id': activate_subscription(event['data']['object']),
        'status': 'activated'
    })

    return jsonify(result)

Pattern 3: Using OnceOnly API

// Delegate duplicate detection to OnceOnly (check-lock)
const key = `order-${userId}-${sessionId}`;

const lockRes = await fetch("https://api.onceonly.tech/v1/check-lock", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.ONCEONLY_API_KEY}`, // once_live_***
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    key,
    ttl: 3600,
    metadata: { userId, sessionId }
  }),
});

const lock = await lockRes.json();

if (lock.status === "duplicate") {
  // Don't re-run side effects. Return your cached result (DB/Redis) if you store it.
  return await getCachedResult(key);
}

// New action: execute once, then store result under the same key.
const result = await processOrder({ userId, items: cartItems, total: cartTotal });
await saveCachedResult(key, result);
return result;

🧩 The Mental Model Shift

Stop thinking: "How do we prevent retries?"
Start thinking: "How do we make retries safe?"

Retries are built into the internet, cloud infrastructure, APIs, and AI systems. They're not bugs — they're reality. The physics of distributed systems require retries for reliability.
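In practice, "making retries safe" starts on the client: generate the idempotency key once, before the first attempt, and reuse it on every retry. A minimal sketch with a stand-in send function (all names here are illustrative):

```python
import uuid

def call_with_retries(send, payload, max_attempts=3):
    """Generate the idempotency key ONCE, then reuse it across retries.

    `send` stands in for any HTTP call that may raise on timeout.
    """
    key = str(uuid.uuid4())  # same key on every attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(payload, idempotency_key=key)
        except TimeoutError as exc:
            last_error = exc  # retry with the SAME key
    raise last_error

# Stand-in server: processes each key at most once, even when the
# first response "times out" on the way back to the client.
seen = {}
attempts = {"n": 0}

def flaky_send(payload, idempotency_key):
    attempts["n"] += 1
    if idempotency_key not in seen:
        seen[idempotency_key] = {"charged": payload["amount"]}
    if attempts["n"] == 1:
        raise TimeoutError("response lost")  # work done, ack lost
    return seen[idempotency_key]

result = call_with_retries(flaky_send, {"amount": 500})
print(result, len(seen))  # {'charged': 500} 1 -- charged exactly once
```

The common mistake is generating a fresh key per attempt, which defeats the dedup entirely: every retry then looks like a new request.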

⚡ Engineering Principle

In distributed systems, the question is never "Will this request be retried?" The question is "When this request is retried, will my system handle it safely?" Build for retries from day one.

📊 Delivery Guarantees Compared

| Aspect | At-Most-Once | At-Least-Once | "Exactly-Once" |
|---|---|---|---|
| Delivery | 0 or 1 times | 1+ times | 1 time (illusion) |
| Data Loss Risk | High | None | None |
| Duplicate Risk | None | High | Handled |
| Complexity | Low | Medium | High |
| Real Implementation | Fire-and-forget | Retry without dedup | At-least-once + idempotency |
| Use Cases | Metrics, logs | Most systems | Payments, critical ops |

Make Your System Safe

Stop fighting the network. Start using idempotency.

❓ Frequently Asked Questions

Why can't we just make networks reliable?

Networks are physical infrastructure spanning continents. Packets can be lost due to: hardware failures, routing issues, congestion, packet corruption, timeout windows being too short, or even cosmic rays flipping bits in memory. No amount of engineering can eliminate these physical realities—we can only build systems that handle them gracefully.

What about Kafka's "exactly-once" semantics?

Kafka achieves exactly-once within its internal boundaries: from producer to topic to consumer group. But once you call an external API, send a webhook, or trigger an email from your Kafka consumer, you're back to at-least-once delivery. Kafka's exactly-once is really "at-least-once with idempotent producers and transactional reads."

Is at-most-once ever the right choice?

Rarely. At-most-once (fire-and-forget) is acceptable only when losing data is preferable to processing it twice. Examples: real-time metrics where missing a few data points is acceptable, or logging systems where occasional log loss is tolerable. For anything involving money, user data, or state changes, at-most-once is dangerous.

How do distributed transactions (2PC) fit into this?

Two-phase commit (2PC) attempts to provide atomic operations across multiple systems but has serious downsides: it's slow, blocks resources during coordinator failures, and still doesn't solve the duplicate problem if a participant crashes after committing but before acknowledging. Modern systems prefer eventual consistency with idempotency over distributed transactions.

Can I use database transactions instead of idempotency?

Database transactions only ensure atomicity within the database. If your operation involves external services (payment gateways, email APIs, webhooks), transactions can't help. You might successfully roll back a database insert, but you can't roll back an email that's already been sent or a payment that's already been charged.
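A toy sketch makes the asymmetry concrete (everything here is illustrative): rolling back the "database" restores its state, but nothing can unsend the email.

```python
sent_emails = []   # external side effect: cannot be undone
orders = []        # "database" rows: can be rolled back

def place_order_with_email(amount):
    snapshot = list(orders)            # begin "transaction"
    try:
        orders.append({"amount": amount})
        sent_emails.append("receipt")  # external call inside the transaction
        if amount > 100:
            raise ValueError("fraud check failed")
    except ValueError:
        orders[:] = snapshot           # rollback undoes the DB write...
        # ...but the email is already gone; nothing here can unsend it

place_order_with_email(500)
print(orders, sent_emails)  # [] ['receipt']
```

This is why external side effects either need their own idempotency keys or should be deferred until after the transaction commits (e.g. via an outbox table).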

What systems commonly suffer from duplicate processing?

Nearly all production systems: payment webhooks, automation platforms like Zapier, AI agent tool calls, message queues, microservices architectures, and any API that experiences network failures. If your system doesn't handle duplicates gracefully, it will create them in production.

How long should I cache idempotency results?

It depends on your retry window. For HTTP APIs, 24 hours is typical. For webhooks, match the sending provider's retry period (Stripe retries for up to 3 days). For user-facing actions like checkout, 1-24 hours is usually sufficient. Balance between preventing duplicates and storage costs.

What's the difference between idempotency and deduplication?

Deduplication detects and discards duplicate messages. Idempotency ensures that processing the same message multiple times has the same effect as processing it once. Idempotency is stronger: even if duplicates reach your system, they don't cause problems. Deduplication alone can fail if the dedup check and the processing aren't atomic.
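A short example makes the distinction concrete (names are illustrative): setting a field to a value is naturally idempotent, while an increment needs a dedup guard in front of it.

```python
# Idempotent: setting a field. Running it N times == running it once.
order = {}
for _ in range(3):
    order["status"] = "paid"      # same final state regardless of repeats

# NOT idempotent: incrementing. Each duplicate changes the result,
# so a dedup check must guard it -- and check + update must be atomic,
# or two concurrent duplicates can both pass the guard.
balance = {"amount": 0}
seen_events = set()

def apply_credit(event_id, amount):
    if event_id in seen_events:   # dedup guard
        return
    seen_events.add(event_id)
    balance["amount"] += amount

for _ in range(3):
    apply_credit("evt_1", 100)    # duplicates are discarded

print(order["status"], balance["amount"])  # paid 100
```

Where you can, prefer operations shaped like the first kind (set-to-value, upsert-by-key); the guard then becomes an optimization rather than a correctness requirement.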

Do I need idempotency for read-only operations?

No. Idempotency matters for operations with side effects: writes, payments, emails, notifications, state changes. Read operations are naturally idempotent—fetching data multiple times doesn't change anything. However, you may still want caching for performance, which is a different concern.