Test for production
Correct code that passes all its tests can still fail once deployed to production.
This happens because tests typically do not account for failures in the environment the code runs in.
Examples of such failures are external APIs that are slow to respond, or a spike in user requests that saturates hardware resources.
To make our code resilient, we need to embrace failure and mitigate its impact.
Failures usually happen where our code accesses the network (e.g. a database or an external API). The following approaches help mitigate their impact:
- Fail fast
When an external dependency is down, it is better to show users errors for some functionality than to bring down the whole application. Use timeouts and circuit breakers on the client side.
- Isolate failures
Each integration point should have its own pool of resources (e.g. thread pool, connection pool), so that one failing dependency cannot exhaust the resources the others need. Use bulkheads.
- Buffer loads
Improve availability by absorbing load in buffers. Use queues and user notifications instead of synchronous responses.
- Test error scenarios
Use integration tests that simulate error cases such as server timeouts, connection errors, and slow responses.
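To make the "fail fast" idea more concrete, here is a minimal sketch of a circuit breaker in Python. It is an illustration under stated assumptions, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, `reset_after` and `simulate_outage` are hypothetical, the class is not thread-safe, and the wrapped call is assumed to enforce its own timeout. The clock is injectable so that an outage can be simulated in a test without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (single-threaded sketch)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the broken dependency at all.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cool-down elapsed: let a trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def simulate_outage():
    """Simulate a dependency that times out, then recovers after the cool-down."""
    now = [0.0]  # fake clock value, advanced manually
    breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])

    def slow_dependency():
        raise TimeoutError("simulated slow response")

    outcomes = []
    for _ in range(3):
        try:
            breaker.call(slow_dependency)
        except TimeoutError:
            outcomes.append("timeout")
        except CircuitOpenError:
            outcomes.append("rejected")

    now[0] = 11.0  # move past the cool-down; the dependency has recovered
    outcomes.append(breaker.call(lambda: "ok"))
    return outcomes
```

In this sketch, the first two calls reach the failing dependency and the third is rejected immediately, which is the behaviour a user-facing request wants during an outage. The same fake-clock approach is one way to exercise error scenarios in tests without depending on a real slow server.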
Recommended reads
- Release It!: Design and Deploy Production-Ready Software (2nd edition) - Michael T. Nygard
- Chaos Engineering - Wikipedia
- It takes more than a Circuit Breaker to create a resilient application - Bilgin Ibryam
- Fault Tolerance in a High Volume, Distributed System - Ben Christensen
- Circuit breaker pattern - Microsoft
- Bulkhead pattern - Microsoft
- Fuzz testing - GitLab
Teach me back
I really appreciate any feedback about the book and my current understanding of software design.