Why Resilience Matters
In distributed systems, failures are not exceptional - they're expected. Network calls fail, services become unavailable, and resources become exhausted. Resilience patterns help your application handle these failures gracefully.
Types of Failures
Transient Failures
Temporary failures that resolve themselves:
- Network timeouts
- Service temporarily unavailable (HTTP 503)
- Database connection pool exhausted
- Rate limiting (HTTP 429)
Strategy: Retry after a short delay.
Permanent Failures
Failures that won't resolve by retrying:
- Invalid credentials (HTTP 401)
- Resource not found (HTTP 404)
- Validation errors (HTTP 400)
Strategy: Fail fast, don't retry.
Cascading Failures
One failing service causes others to fail:
```
Service A → Service B → Service C
                            ↓ (fails)
Service A ← (timeout) ← Service B

When C fails, B waits, then A waits...
Resources exhausted across all services!
```
Strategy: Circuit breaker, fail fast.
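To make the "fail fast" idea concrete, here is a minimal hand-rolled circuit breaker sketch. It is not Polly's implementation (Polly ships its own circuit breaker policy). The class name `SimpleCircuitBreaker`, the injected clock, and the use of `InvalidOperationException` for rejected calls are all assumptions made for illustration, and the sketch is not thread-safe.

```csharp
using System;

// Illustrative only: a single-threaded circuit breaker, not production code.
public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTime> _clock; // injected so the behaviour can be tested
    private int _consecutiveFailures;
    private DateTime _openedAt;
    private bool _isOpen;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan openDuration, Func<DateTime> clock)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _clock = clock;
    }

    public T Execute<T>(Func<T> action)
    {
        if (_isOpen)
        {
            // While open, reject immediately instead of waiting on a failing dependency.
            if (_clock() - _openedAt < _openDuration)
                throw new InvalidOperationException("Circuit is open; failing fast.");

            // Break duration elapsed: allow one trial call (half-open).
            // The failure count is kept, so a failed trial re-opens the circuit at once.
            _isOpen = false;
        }

        try
        {
            T result = action();
            _consecutiveFailures = 0; // any success closes the circuit fully
            return result;
        }
        catch (Exception)
        {
            _consecutiveFailures++;
            if (_consecutiveFailures >= _failureThreshold)
            {
                _isOpen = true;
                _openedAt = _clock();
            }
            throw;
        }
    }
}
```

In the diagram above, Service B would wrap its calls to Service C in such a breaker. Once C has failed repeatedly, B rejects further calls immediately instead of holding threads and connections while it waits for timeouts.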
Without Resilience Patterns
```csharp
// Naive approach - fails on first error
var response = await httpClient.GetAsync(url);
var data = await response.Content.ReadAsStringAsync();
```
Problems:
- Single transient failure causes complete failure
- No protection against cascading failures
- User sees cryptic error messages
With Resilience Patterns
```csharp
using Polly;

// With retry and exponential backoff: up to 3 retries, waiting 2s, 4s, then 8s
var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var response = await policy.ExecuteAsync(() => httpClient.GetAsync(url));
```
Benefits:
- Automatic recovery from transient failures
- Protection against cascading failures, once timeouts and circuit breakers are used alongside retry
- Better user experience
Common Resilience Patterns
| Pattern | Purpose |
|---|---|
| Retry | Automatically retry failed operations |
| Exponential Backoff | Increase wait time between retries |
| Circuit Breaker | Stop calling failing services |
| Timeout | Limit how long to wait |
| Bulkhead | Isolate failures to prevent cascade |
| Fallback | Provide default when all else fails |
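As a worked illustration of the Exponential Backoff row, the sketch below computes the wait before each retry. The helper name `Backoff.Exponential`, the 1-based attempt numbering, and the cap parameter are assumptions for this example, not part of any library.

```csharp
using System;

public static class Backoff
{
    // Delay before retry number `attempt` (1-based):
    // baseDelay * 2^(attempt - 1), never longer than maxDelay.
    public static TimeSpan Exponential(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));

        double millis = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(Math.Min(millis, maxDelay.TotalMilliseconds));
    }
}
```

With a base delay of 1 second and a cap of 30 seconds, retries 1 through 5 wait 1, 2, 4, 8, and 16 seconds, and every later retry waits 30 seconds. The cap keeps a long outage from producing very long waits.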
When to Retry
Retry when:
- Network timeout occurred
- Service returns 5xx status
- Database connection failed
- Rate limit hit (with appropriate delay)
Don't retry when:
- Client error (4xx except 429)
- Invalid input/credentials
- Business logic failure
- Operation is not idempotent
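The two lists above can be condensed into a small decision helper for HTTP responses. This is a sketch under assumptions: the names `RetryRules.ShouldRetry` and `isIdempotent` are hypothetical, and it treats every 5xx status as transient, as the list does. Network timeouts and connection failures surface as exceptions rather than status codes, so they are not covered here.

```csharp
using System.Net;

public static class RetryRules
{
    // Returns true when a retry is reasonable for this response.
    public static bool ShouldRetry(HttpStatusCode status, bool isIdempotent)
    {
        // Never retry an operation that could apply its effect twice.
        if (!isIdempotent) return false;

        int code = (int)status;

        // 429 is the one client error worth retrying (after an appropriate delay).
        if (code == 429) return true;

        // Server errors (5xx) are treated as transient; other statuses are not retried.
        return code >= 500 && code <= 599;
    }
}
```

For example, `ShouldRetry(HttpStatusCode.ServiceUnavailable, isIdempotent: true)` returns `true`, while the same status on a non-idempotent call returns `false`.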
Real-World Example
Without resilience:
```
User clicks "Submit Order"
→ Payment service times out (network glitch)
→ Error: "Something went wrong"
→ User retries manually
→ Gets duplicate charges!
```
With resilience (assuming the payment request is idempotent, so an automatic retry cannot charge twice):
```
User clicks "Submit Order"
→ Payment service times out
→ System waits 1 second, retries
→ Second attempt succeeds
→ User sees "Order Confirmed"
```
The Cost of No Resilience
- User frustration: Errors for temporary problems
- Lost revenue: Users abandon failing checkouts
- Support burden: Tickets for transient issues
- Cascading outages: One service takes down others
Introduction to Polly
Polly is a .NET resilience library that provides:
- Retry policies
- Circuit breakers
- Timeouts
- Bulkhead isolation
- Fallback policies
- Policy composition
```csharp
// Install: dotnet add package Polly

using Polly;

var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(attempt));

var result = await policy.ExecuteAsync(async () =>
{
    return await httpClient.GetStringAsync(url);
});
```
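Policy composition, listed among Polly's features above, could look roughly like the following sketch. It assumes the classic `Policy`-based API (Polly v7 style). The helper `ResiliencePolicies.Build`, the 5-second timeout, and the breaker settings are illustrative choices, not recommendations from the original text.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Timeout;

public static class ResiliencePolicies
{
    // Builds retry -> circuit breaker -> timeout, listed from outermost to innermost.
    public static IAsyncPolicy Build(Func<int, TimeSpan> backoff)
    {
        // Innermost: each individual attempt gets at most 5 seconds.
        var timeout = Policy.TimeoutAsync(TimeSpan.FromSeconds(5));

        // Middle: after 5 consecutive handled failures, reject calls for 30 seconds.
        var breaker = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

        // Outermost: retry up to 3 times, waiting backoff(attempt) between tries.
        var retry = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .WaitAndRetryAsync(3, backoff);

        return Policy.WrapAsync(retry, breaker, timeout);
    }
}
```

A call might then look like `await policy.ExecuteAsync(ct => httpClient.GetStringAsync(url, ct), CancellationToken.None)`, assuming a runtime where `GetStringAsync` accepts a cancellation token (.NET 5 or later). Passing the token through matters because Polly's default timeout strategy relies on the delegate honouring cancellation. While the circuit is open, the breaker throws an exception the retry policy does not handle, so the call fails fast instead of being retried.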
Key Takeaways
- Failures are expected in distributed systems
- Transient failures often resolve with retry
- Permanent failures should fail fast
- Circuit breakers prevent cascading failures
- Polly provides composable resilience policies
- Resilience improves user experience and system stability