Why Resilience Matters

In distributed systems, failures are not exceptional - they're expected. Network calls fail, services become unavailable, and resources become exhausted. Resilience patterns help your application handle these failures gracefully.

Types of Failures

Transient Failures

Temporary failures that resolve themselves:

Network timeouts
Service temporarily unavailable (HTTP 503)
Database connection pool exhausted
Rate limiting (HTTP 429)

Strategy: Retry after a short delay.

Permanent Failures

Failures that won't resolve by retrying:

Invalid credentials (HTTP 401)
Resource not found (HTTP 404)
Validation errors (HTTP 400)

Strategy: Fail fast, don't retry.

Cascading Failures

One failing service causes others to fail:


Service A → Service B → Service C
                           ↓ (fails)
Service A ← (timeout) ← Service B

When C fails, B waits, then A waits...
Resources exhausted across all services!

Strategy: Circuit breaker, fail fast.
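
As a rough illustration of the circuit-breaker strategy, the sketch below uses Polly's classic `Policy` API (the same API used in the other examples in this section). `CallServiceC` is a hypothetical stand-in for a dependency that is currently down, and the thresholds (2 failures, 30 seconds) are arbitrary values chosen for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

// Open the circuit after 2 consecutive handled failures and keep it open for 30 seconds.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(2, TimeSpan.FromSeconds(30));

// Hypothetical stand-in for a call to Service C, which always fails here.
Task<string> CallServiceC() =>
    Task.FromException<string>(new HttpRequestException("Service C is down"));

var outcomes = new List<string>();
for (var i = 1; i <= 4; i++)
{
    try
    {
        await breaker.ExecuteAsync(() => CallServiceC());
    }
    catch (BrokenCircuitException)
    {
        // Circuit is open: the call is rejected immediately, CallServiceC is not invoked.
        outcomes.Add("rejected");
    }
    catch (HttpRequestException)
    {
        // Circuit is still closed: the real failure reaches the caller.
        outcomes.Add("failed");
    }
}

Console.WriteLine(string.Join(", ", outcomes));
```

With these settings the first two calls fail normally and the remaining calls are rejected without touching Service C, so Service B stops tying up threads and connections while C is down.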

Without Resilience Patterns


csharp
// Naive approach - fails on first error
var response = await httpClient.GetAsync(url);
var data = await response.Content.ReadAsStringAsync();

Problems:

Single transient failure causes complete failure
No protection against cascading failures
User sees cryptic error messages

With Resilience Patterns


csharp
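// With retry and exponential backoff (waits 2 s, 4 s, then 8 s)
var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

// GetAsync throws HttpRequestException on connection failures; error status codes
// such as 503 come back as ordinary responses and are not retried by this policy.
var response = await policy.ExecuteAsync(() => httpClient.GetAsync(url));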

Benefits:

Automatic recovery from transient failures
Protection against cascading failures (once timeouts and circuit breakers are added)
Better user experience

Common Resilience Patterns

| Pattern             | Purpose                                 |
|---------------------|-----------------------------------------|
| Retry               | Automatically retry failed operations   |
| Exponential Backoff | Increase wait time between retries      |
| Circuit Breaker     | Stop calling failing services           |
| Timeout             | Limit how long to wait                  |
| Bulkhead            | Isolate failures to prevent cascade     |
| Fallback            | Provide default when all else fails     |
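
To make the "Exponential Backoff" row concrete, here is a minimal sketch of how the wait time could grow between retries. The helper name `BackoffDelay`, the 1-second base, and the 10-second cap are illustrative assumptions, not part of any library.

```csharp
using System;
using System.Linq;

// Delay before retry number `attempt` (1-based): baseDelay * 2^(attempt - 1), capped at maxDelay.
TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
{
    var seconds = baseDelay.TotalSeconds * Math.Pow(2, attempt - 1);
    return TimeSpan.FromSeconds(Math.Min(seconds, maxDelay.TotalSeconds));
}

// Delays (in seconds) for the first five retries with a 1 s base and a 10 s cap.
var delays = Enumerable.Range(1, 5)
    .Select(n => BackoffDelay(n, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(10)).TotalSeconds)
    .ToList();

Console.WriteLine(string.Join(", ", delays)); // 1, 2, 4, 8, 10
```

The cap keeps a long outage from producing waits that grow without bound; the doubling gives a struggling service progressively more room to recover.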

When to Retry

Retry when:

Network timeout occurred
Service returns 5xx status
Database connection failed
Rate limit hit (with appropriate delay)

Don't retry when:

Client error (4xx except 429)
Invalid input/credentials
Business logic failure
Operation is not idempotent
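
The two lists above can be folded into a small decision helper. This is only a sketch: `ShouldRetry` is a hypothetical name, and the caller is assumed to know whether the operation is idempotent.

```csharp
using System;
using System.Net;

// Decide whether a failed HTTP call is worth retrying, following the lists above.
// isIdempotent is supplied by the caller, who knows whether repeating the call is safe.
bool ShouldRetry(HttpStatusCode status, bool isIdempotent)
{
    if (!isIdempotent) return false;       // never repeat non-idempotent operations
    var code = (int)status;
    if (code == 429) return true;          // rate limited: retry after an appropriate delay
    return code >= 500 && code <= 599;     // server errors are treated as transient
}

Console.WriteLine(ShouldRetry(HttpStatusCode.ServiceUnavailable, isIdempotent: true)); // True
Console.WriteLine(ShouldRetry(HttpStatusCode.NotFound, isIdempotent: true));           // False
Console.WriteLine(ShouldRetry((HttpStatusCode)429, isIdempotent: true));               // True
Console.WriteLine(ShouldRetry(HttpStatusCode.BadGateway, isIdempotent: false));        // False
```

Network timeouts and connection failures surface as exceptions rather than status codes, so they would be handled separately (for example with `Handle<HttpRequestException>()` in a Polly policy).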

Real-World Example

Without resilience:


User clicks "Submit Order"
→ Payment service times out (network glitch)
→ Error: "Something went wrong"
→ User retries manually
→ Gets duplicate charges!

With resilience:


User clicks "Submit Order"
→ Payment service times out
→ System waits 1 second, retries (safe because the request is idempotent)
→ Second attempt succeeds
→ User sees "Order Confirmed"
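
An automatic retry of a payment is only safe if the payment service can recognise the second attempt as a repeat of the first. One common way to achieve that is an idempotency key that stays the same across attempts. The sketch below is a hand-rolled illustration under that assumption: `SendWithIdempotencyKeyAsync` and `FakePayment` are hypothetical names, and the simulated payment call stands in for a real HTTP request that would carry the key (for example in a request header).

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Runs `send` up to maxAttempts times, passing the SAME idempotency key on every attempt
// so the receiving service can deduplicate repeated requests.
async Task<T> SendWithIdempotencyKeyAsync<T>(Func<string, Task<T>> send, int maxAttempts, TimeSpan delay)
{
    var key = Guid.NewGuid().ToString(); // generated once per order, reused on retries
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await send(key);
        }
        catch (Exception ex) when (attempt < maxAttempts &&
                                   (ex is TimeoutException || ex is HttpRequestException))
        {
            await Task.Delay(delay); // wait before the next attempt
        }
    }
}

// Simulated payment call: times out on the first attempt, succeeds on the second.
var keysSeen = new List<string>();
Task<string> FakePayment(string key)
{
    keysSeen.Add(key);
    return keysSeen.Count == 1
        ? Task.FromException<string>(new TimeoutException("network glitch"))
        : Task.FromResult("Order Confirmed");
}

var confirmation = await SendWithIdempotencyKeyAsync(
    key => FakePayment(key), maxAttempts: 3, delay: TimeSpan.FromSeconds(1));

Console.WriteLine($"{confirmation} after {keysSeen.Count} attempts, same key: {keysSeen[0] == keysSeen[1]}");
// Order Confirmed after 2 attempts, same key: True
```

If the first attempt actually reached the payment service before timing out, the repeated key lets the service return the original result instead of charging twice.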

The Cost of No Resilience

User frustration: Errors for temporary problems
Lost revenue: Users abandon failing checkouts
Support burden: Tickets for transient issues
Cascading outages: One service takes down others

Introduction to Polly

Polly is a .NET resilience library that provides:

Retry policies
Circuit breakers
Timeouts
Bulkhead isolation
Fallback policies
Policy composition


csharp
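// Install: dotnet add package Polly

using System;
using System.Net.Http;
using Polly;

var httpClient = new HttpClient();
var url = "https://example.com/api/data"; // placeholder URL for illustration

// Retry up to 3 times on HttpRequestException, waiting 1 s, 2 s, then 3 s.
var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(attempt));

// GetStringAsync throws HttpRequestException on non-success status codes,
// so failed responses are retried as well as connection errors.
var result = await policy.ExecuteAsync(() => httpClient.GetStringAsync(url));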
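
Policy composition, the last item in the feature list, might look roughly like the following sketch, which wraps a per-attempt timeout inside a retry inside a fallback using Polly's classic `Policy` API. `FlakyFetch` is a hypothetical stand-in for a real HTTP call, the `"[]"` default value is an assumption, and the delays are shortened to milliseconds so the demo finishes quickly.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

// Innermost: give each individual attempt at most 5 seconds.
var timeout = Policy.TimeoutAsync<string>(TimeSpan.FromSeconds(5));

// Middle: retry failed or timed-out attempts with exponential backoff (200, 400, 800 ms).
var retry = Policy<string>
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));

// Outermost: if every attempt fails, return a default value instead of throwing.
var fallback = Policy<string>
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .FallbackAsync("[]");

// Policies are listed from outermost to innermost.
var pipeline = Policy.WrapAsync(fallback, retry, timeout);

// Hypothetical dependency that always fails, standing in for a real HTTP call.
var calls = 0;
Task<string> FlakyFetch(CancellationToken ct)
{
    calls++;
    return Task.FromException<string>(new HttpRequestException("simulated outage"));
}

var data = await pipeline.ExecuteAsync(ct => FlakyFetch(ct), CancellationToken.None);

Console.WriteLine($"result: {data}, calls: {calls}"); // result: [], calls: 4
```

The order matters: with the timeout innermost, each attempt gets its own time budget, and the fallback only runs after the retry policy has given up.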

Key Takeaways

Failures are expected in distributed systems
Transient failures often resolve with retry
Permanent failures should fail fast
Circuit breakers prevent cascading failures
Polly provides composable resilience policies
Resilience improves user experience and system stability