Architecture, Resilience

Fault Tolerance by Default: OTP Meets Markdown

Jan 20, 2026 · 9 min

In most frameworks, error handling is an afterthought -- try/catch blocks sprinkled around hoping to catch whatever goes wrong. In goldclaw, fault tolerance is structural. It's built into the architecture at the supervision tree level.

The OTP Advantage

OTP (Open Telecom Platform) grew out of Ericsson's telephone switches: Erlang dates to the late 1980s, and OTP has shipped alongside it since the 1990s, keeping telephone networks running ever since. Its core insight: instead of trying to prevent all failures, design systems that recover from them automatically. Every process in goldclaw runs under a supervisor that knows how to restart it.
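To make the supervisor idea concrete, here is a minimal sketch in Python. It is not goldclaw's implementation or OTP's API: the `supervise` function, the `RestartLimitExceeded` error, and the `flaky_tts_handler` worker are all hypothetical names invented for illustration. It only shows the shape of the idea, which is to let the worker crash, restart it, and give up if it crashes too often in a short window.

```python
import time
from typing import Callable


class RestartLimitExceeded(Exception):
    """Raised when a worker crashes more often than the supervisor allows."""


def supervise(worker: Callable[[], object], max_restarts: int = 3,
              window_seconds: float = 5.0) -> object:
    """Run `worker`, restarting it whenever it raises.

    Gives up once `max_restarts` restarts have already happened within
    `window_seconds` (loosely modelled on OTP's restart intensity; an
    assumption, not goldclaw's actual policy).
    """
    restart_times: list[float] = []
    while True:
        try:
            return worker()
        except Exception as exc:
            now = time.monotonic()
            # Keep only the restarts that fall inside the sliding window.
            restart_times = [t for t in restart_times if now - t < window_seconds]
            if len(restart_times) >= max_restarts:
                raise RestartLimitExceeded(
                    f"gave up after {max_restarts} restarts"
                ) from exc
            restart_times.append(now)


# Hypothetical worker that crashes twice before succeeding.
attempts = {"count": 0}


def flaky_tts_handler() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("TTS provider unavailable")
    return "audio ready"


result = supervise(flaky_tts_handler)
print(result, "after", attempts["count"], "attempts")
```

The worker itself contains no error handling. The recovery policy lives entirely in the supervisor, which is the structural point the OTP model makes.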

What This Means for You

If your TTS provider has a hiccup, the voice handler restarts in milliseconds. If RabbitMQ drops a connection, the AMQP module reconnects automatically. If an LLM call times out, the circuit breaker trips and falls back to cached responses. All of this happens without you writing a single line of error handling.
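For readers unfamiliar with the circuit breaker pattern, the sketch below shows one way it can work. This is an illustrative Python version, not goldclaw's code: the `CircuitBreaker` class, its `call` method, and the injectable `clock` parameter are assumptions made for the example. The defaults (5 failures, 30 seconds) mirror the values in the MODULES.md example in this post.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; probes again after `timeout` seconds."""

    def __init__(self, threshold: int = 5, timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.timeout = timeout
        self.clock = clock  # injectable so the timeout can be tested without waiting
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.cached: Optional[str] = None  # last successful response, if any

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the timeout has elapsed, let the next call through as a probe.
        return self.clock() - self.opened_at < self.timeout

    def call(self, fn: Callable[[], str]) -> Optional[str]:
        if self.is_open():
            return self.cached  # skip the remote call entirely while open
        try:
            response = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return self.cached
        # A success closes the breaker and refreshes the cache.
        self.failures = 0
        self.opened_at = None
        self.cached = response
        return response


def llm_timeout() -> str:
    raise TimeoutError("LLM call timed out")


breaker = CircuitBreaker(threshold=2)
breaker.call(lambda: "Hello, how can I help?")  # succeeds and is cached
breaker.call(llm_timeout)                       # first failure, cache served
breaker.call(llm_timeout)                       # second failure trips the breaker
print(breaker.is_open())
```

While the breaker is open, callers get the cached response immediately instead of waiting on a service that is known to be failing. If nothing has been cached yet, this sketch returns `None`, and a real system would need a policy for that case.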

```markdown
# MODULES.md

## Resilience
**Plugin**: `Resilience.Plugin`
**Circuit Breaker**: threshold: 5, timeout: 30s
**Retry**: exponential backoff, max 3
**Fallback**: cached response
```
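The `Retry` line in the configuration above can be read as follows. This Python sketch is an illustration rather than goldclaw's behavior: `retry_with_backoff`, `base_delay`, and the injectable `sleep` parameter are hypothetical, and it assumes "max 3" means up to three retries after the first attempt, with the delay doubling each time.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], max_retries: int = 3,
                       base_delay: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`; on failure wait base_delay * 2**attempt, then try again.

    Makes at most `max_retries` retries after the initial attempt and
    re-raises the last error once they are exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: hand the failure to the caller
            sleep(base_delay * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
    raise AssertionError("unreachable")


# Hypothetical call that fails twice, then succeeds.
state = {"calls": 0}


def flaky_request() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("connection dropped")
    return "ok"


delays: list[float] = []
outcome = retry_with_backoff(flaky_request, sleep=delays.append)
print(outcome, delays)
```

Retries and the circuit breaker complement each other: retries absorb brief blips, and a run of exhausted retries is the kind of failure a breaker would count toward its threshold.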

The Numbers

SOS.Support has been running goldclaw in production for Wholesale Computers and Technology. Their system handles tech support calls and sales inquiries 24/7 with 99.99% uptime. When things go wrong (and they always do -- networks fail, APIs time out), the system self-heals in under 100 ms.