15 min lesson

Why Resilience Matters

In distributed systems, failures are not exceptional; they are expected. Network calls fail, services become unavailable, and resources run out. Resilience patterns help your application handle these failures gracefully.

Types of Failures

Transient Failures

Temporary failures that resolve themselves:

  • Network timeouts
  • Service temporarily unavailable (HTTP 503)
  • Database connection pool exhausted
  • Rate limiting (HTTP 429)

Strategy: Retry after a short delay.
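
As a rough illustration of "retry after a short delay", here is a minimal hand-written sketch that uses no library. The helper name RetryAsync, the simulated FlakyCall, and the 50 ms delay are all assumptions made for the demo, not part of any real API.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: runs `operation` up to `maxAttempts` times in total,
// pausing `delay` after each HttpRequestException before trying again.
async Task<T> RetryAsync<T>(Func<Task<T>> operation, int maxAttempts, TimeSpan delay)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // Transient failure: wait briefly, then loop for another attempt.
            // On the final attempt the filter is false and the exception propagates.
            await Task.Delay(delay);
        }
    }
}

// Simulated flaky call (an assumption for the demo): fails twice, then succeeds.
int calls = 0;
Task<string> FlakyCall()
{
    calls++;
    return calls < 3
        ? Task.FromException<string>(new HttpRequestException("temporary glitch"))
        : Task.FromResult("ok");
}

string result = await RetryAsync(() => FlakyCall(), maxAttempts: 3, delay: TimeSpan.FromMilliseconds(50));
Console.WriteLine($"{result} after {calls} calls"); // ok after 3 calls
```

A fixed delay keeps the sketch short; a production policy would usually increase the delay between attempts.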

Permanent Failures

Failures that won't resolve by retrying:

  • Invalid credentials (HTTP 401)
  • Resource not found (HTTP 404)
  • Validation errors (HTTP 400)

Strategy: Fail fast, don't retry.

Cascading Failures

One failing service causes others to fail:

Service A → Service B → Service C
                          ↓ (fails)
Service A ← (timeout) ← Service B

When C fails, B waits, then A waits...
Resources exhausted across all services!

Strategy: Circuit breaker, fail fast.
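
The fail-fast behaviour can be sketched with the circuit breaker from Polly, the library introduced later in this lesson (it requires the Polly NuGet package). The thresholds (2 failures, 30 seconds) and the CallServiceC stand-in are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

// Open the circuit after 2 consecutive failures and keep it open for 30 seconds.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(2, TimeSpan.FromSeconds(30));

// Hypothetical stand-in for Service B calling a Service C that is down.
Task<string> CallServiceC() =>
    Task.FromException<string>(new HttpRequestException("Service C unavailable"));

var outcomes = new List<string>();
for (int i = 0; i < 4; i++)
{
    try
    {
        await breaker.ExecuteAsync(() => CallServiceC());
    }
    catch (HttpRequestException)
    {
        outcomes.Add("called");   // the call reached Service C and failed
    }
    catch (BrokenCircuitException)
    {
        outcomes.Add("rejected"); // circuit is open: failed fast without calling C
    }
}

Console.WriteLine(string.Join(", ", outcomes)); // called, called, rejected, rejected
```

Once the circuit is open, Service B stops spending threads and connections on a dependency that is known to be failing, which is what keeps the failure from spreading back to Service A.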

Without Resilience Patterns

csharp
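// Naive approach: a single network error throws and the whole operation fails.
// An error status such as 503 does not throw here; its body is read as if it were data.
var response = await httpClient.GetAsync(url);
var data = await response.Content.ReadAsStringAsync();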

Problems:

  • Single transient failure causes complete failure
  • No protection against cascading failures
  • User sees cryptic error messages

With Resilience Patterns

csharp
// With retry and exponential backoff (waits 2 s, 4 s, then 8 s)
var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

var response = await policy.ExecuteAsync(() => httpClient.GetAsync(url));

Benefits:

  • Automatic recovery from transient failures
  • Protection against cascading failures (once timeouts and circuit breakers are added; retry alone does not provide it)
  • Better user experience
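
To make the backoff arithmetic concrete, the sketch below evaluates the same formula as the policy above, Math.Pow(2, attempt) seconds, for attempts 1 to 3. The jitter variant is an addition that is not in the original policy, and its 500 ms bound is an arbitrary choice.

```csharp
using System;
using System.Linq;

// Same formula as the retry policy above: 2^attempt seconds, attempt starting at 1.
TimeSpan Backoff(int attempt) => TimeSpan.FromSeconds(Math.Pow(2, attempt));

// Optional jitter (an assumption, not part of the original policy): add a random
// 0-499 ms so that many clients do not all retry at the same instant.
TimeSpan BackoffWithJitter(int attempt, Random rng) =>
    Backoff(attempt) + TimeSpan.FromMilliseconds(rng.Next(0, 500));

double[] delays = Enumerable.Range(1, 3).Select(a => Backoff(a).TotalSeconds).ToArray();
Console.WriteLine(string.Join(", ", delays)); // 2, 4, 8

// With jitter the first delay lands somewhere between 2000 and 2499 ms.
double jittered = BackoffWithJitter(1, new Random()).TotalMilliseconds;
Console.WriteLine(jittered);
```

With three retries the caller waits up to 2 + 4 + 8 = 14 seconds in total before the final failure surfaces, which is worth weighing against how long a user is willing to wait.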

Common Resilience Patterns

| Pattern             | Purpose                               |
|---------------------|---------------------------------------|
| Retry               | Automatically retry failed operations |
| Exponential Backoff | Increase wait time between retries    |
| Circuit Breaker     | Stop calling failing services         |
| Timeout             | Limit how long to wait                |
| Bulkhead            | Isolate failures to prevent cascade   |
| Fallback            | Provide default when all else fails   |
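
Of these patterns, the bulkhead is perhaps the least intuitive, so here is a minimal sketch of the idea using only SemaphoreSlim from the base class library. The limit of 2 concurrent calls, the SlowCall stand-in, and the "rejected" result are assumptions chosen for the demo.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// At most 2 calls to the slow dependency may be in flight at once.
var bulkhead = new SemaphoreSlim(2);

// Hypothetical slow dependency.
async Task<string> SlowCall()
{
    await Task.Delay(200);
    return "done";
}

async Task<string> CallThroughBulkhead()
{
    // Zero timeout: if no slot is free right now, reject instead of queueing.
    if (!await bulkhead.WaitAsync(TimeSpan.Zero))
    {
        return "rejected";
    }
    try
    {
        return await SlowCall();
    }
    finally
    {
        bulkhead.Release();
    }
}

// Start 5 calls at once: only 2 get a slot, the rest are turned away immediately.
string[] results = await Task.WhenAll(Enumerable.Range(0, 5).Select(_ => CallThroughBulkhead()));
Console.WriteLine(string.Join(", ", results)); // done, done, rejected, rejected, rejected
```

Capping concurrency per dependency means a slow service can tie up only its own small pool of slots rather than every thread or connection in the application.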

When to Retry

Retry when:

  • Network timeout occurred
  • Service returns 5xx status
  • Database connection failed
  • Rate limit hit (with appropriate delay)

Don't retry when:

  • Client error (4xx except 429)
  • Invalid input/credentials
  • Business logic failure
  • Operation is not idempotent
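
The two lists above can be folded into a small decision function. This is a sketch under the rules stated in this lesson (retry 5xx and 429, and only idempotent operations); the function names are hypothetical, and using the HTTP method as a proxy for idempotency is a simplifying assumption.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Transient per the lists above: any 5xx status, plus 429 (rate limited).
bool IsTransientStatus(HttpStatusCode status) =>
    status == HttpStatusCode.TooManyRequests || (int)status >= 500;

// HTTP methods that are idempotent by definition; POST is not.
bool IsIdempotent(HttpMethod method) =>
    method == HttpMethod.Get || method == HttpMethod.Head || method == HttpMethod.Put
    || method == HttpMethod.Delete || method == HttpMethod.Options;

bool ShouldRetry(HttpMethod method, HttpStatusCode status) =>
    IsIdempotent(method) && IsTransientStatus(status);

Console.WriteLine(ShouldRetry(HttpMethod.Get, HttpStatusCode.ServiceUnavailable));  // True
Console.WriteLine(ShouldRetry(HttpMethod.Get, HttpStatusCode.NotFound));            // False
Console.WriteLine(ShouldRetry(HttpMethod.Post, HttpStatusCode.ServiceUnavailable)); // False
```

A POST can still be made safe to retry if the server deduplicates requests, for example with an idempotency key; in that case the method check would be relaxed for that endpoint.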

Real-World Example

Without resilience:

User clicks "Submit Order"
→ Payment service times out (network glitch)
→ Error: "Something went wrong"
→ User retries manually
→ Gets duplicate charges!

With resilience:

User clicks "Submit Order"
→ Payment service times out
→ System waits 1 second, retries the same request (safe only if the payment call is idempotent)
→ Second attempt succeeds
→ User sees "Order Confirmed"
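
An automatic retry of a payment is only safe if a request that timed out but actually went through is not charged a second time. One common way to get that property is an idempotency key; the sketch below simulates it in memory. The Charge function, the dictionary standing in for the payment service's records, and the confirmation format are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for the payment service's record of processed requests.
var processed = new Dictionary<string, string>();
int chargesMade = 0;

// Hypothetical payment handler: a given idempotency key is charged only once.
string Charge(string idempotencyKey, decimal amount)
{
    if (processed.TryGetValue(idempotencyKey, out var existing))
    {
        return existing; // a retry of a request that already went through
    }

    chargesMade++;
    var confirmation = $"confirmation-{chargesMade}";
    processed[idempotencyKey] = confirmation;
    return confirmation;
}

// The first attempt reaches the service, but the response is lost to a timeout.
string key = Guid.NewGuid().ToString();
Charge(key, 49.99m);

// The automatic retry reuses the same key, so the customer is not charged again.
string afterRetry = Charge(key, 49.99m);
Console.WriteLine($"{afterRetry}; total charges: {chargesMade}"); // confirmation-1; total charges: 1
```

The key is generated once per user action, not once per attempt, which is the difference between this flow and a user manually resubmitting the order.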

The Cost of No Resilience

  • User frustration: Errors for temporary problems
  • Lost revenue: Users abandon failing checkouts
  • Support burden: Tickets for transient issues
  • Cascading outages: One service takes down others

Introduction to Polly

Polly is a .NET resilience library that provides:

  • Retry policies
  • Circuit breakers
  • Timeouts
  • Bulkhead isolation
  • Fallback policies
  • Policy composition
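
csharp
// Install: dotnet add package Polly

using System;
using System.Net.Http;
using Polly;

// httpClient and url are assumed to be defined elsewhere.
// Retry up to 3 times on HttpRequestException, waiting 1 s, 2 s, then 3 s.
var policy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(attempt));

var result = await policy.ExecuteAsync(async () =>
{
    return await httpClient.GetStringAsync(url);
});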
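
Policy composition, the last item in the list above, might look like the following sketch, which wraps a retry around a circuit breaker around a per-attempt timeout. It assumes the Polly NuGet package with the Policy-based API used in this lesson. The thresholds, the shortened demo delays, and the simulated CallService are illustrative assumptions, not recommended values.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

// Retry on network errors, timeouts, and 5xx responses, with exponential backoff
// (200 ms, 400 ms, 800 ms here so the demo finishes quickly).
var retry = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));

// After 5 consecutive handled failures, reject calls for 30 seconds.
var breaker = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

// Give each individual attempt at most 10 seconds.
var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

// Outermost first: retry wraps the breaker, which wraps the timeout.
var pipeline = Policy.WrapAsync<HttpResponseMessage>(retry, breaker, timeout);

// Simulated downstream call (an assumption for the demo): 503 twice, then 200.
int calls = 0;
Task<HttpResponseMessage> CallService(CancellationToken ct)
{
    calls++;
    var status = calls < 3 ? HttpStatusCode.ServiceUnavailable : HttpStatusCode.OK;
    return Task.FromResult(new HttpResponseMessage(status));
}

var response = await pipeline.ExecuteAsync(ct => CallService(ct), CancellationToken.None);
Console.WriteLine($"{(int)response.StatusCode} after {calls} calls"); // 200 after 3 calls
```

In real code the delegate would be something like ct => httpClient.GetAsync(url, ct), so that the timeout policy can cancel an attempt that runs too long. The ordering matters: with the timeout innermost, each retry gets its own time budget, and the breaker sees every failed attempt.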

Key Takeaways

  • Failures are expected in distributed systems
  • Transient failures often resolve with retry
  • Permanent failures should fail fast
  • Circuit breakers prevent cascading failures
  • Polly provides composable resilience policies
  • Resilience improves user experience and system stability