Test for production

Correct code that passes all its tests can still fail once deployed to production. This happens because tests rarely account for failures in the environment the code runs in.
Examples of such failures are external APIs that are slow to respond, or a spike in user requests that saturates hardware resources. To make our code resilient, we need to embrace failure and mitigate its impact.

Failures usually happen where our code accesses the network (e.g. a database or an external API). The following approaches help to mitigate their impact:

  • Fail fast
    When an external dependency is down, it is better to present users with errors for some functionality than to bring down the whole application. Use timeouts and circuit breakers on the client side.
  • Isolate failures
    Each integration point should have its own pool of resources (e.g. thread pool, connection pool, etc.), so that one failing dependency cannot exhaust the resources the others need. Use bulkheads.
  • Buffer loads
    Improve availability by absorbing load in buffers. Use queues and user notifications instead of synchronous responses.
  • Test error scenarios
    Use integration tests that simulate error cases such as server timeouts, connection errors, and slow responses.
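
To make "fail fast" more concrete, here is a minimal sketch of a circuit breaker in Python. It is only an illustration, not a production implementation: the names CircuitBreaker, CircuitOpenError, max_failures, reset_after and clock are all assumptions made for this example. After a number of consecutive failures the breaker "opens" and rejects calls immediately, without touching the dependency, until a cool-down period has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration only."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not wait on a dependency we believe is down.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    # Simulated error scenario: a dependency that always times out.
    def slow_dependency():
        raise TimeoutError("simulated server timeout")

    breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
    for attempt in range(4):
        try:
            breaker.call(slow_dependency)
        except CircuitOpenError as exc:
            print(f"attempt {attempt}: rejected by breaker ({exc})")
        except TimeoutError as exc:
            print(f"attempt {attempt}: dependency failed ({exc})")
```

The wrapped function is where the timeout itself would be set, for example on the HTTP or database client call. The injectable clock also ties in with the last bullet: a test can replace the dependency with a function that raises a timeout and advance time by hand, instead of waiting for a real outage.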

Recommended reads

Teach me back

I really appreciate any feedback about the book and my current understanding of software design.