DevOps Basics: Healthchecks
Hello network,
#healthchecks don’t give your services self-healing.
They’re one of the #devops_basics since they matter for monitoring and resiliency. Self-healing is a fancy term for essentially “restarting” containers, or running new containers if old ones cannot receive traffic.
Container orchestrators must understand when a container doesn’t work properly. Without that signal, an orchestrator has nothing to act on.
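Docker native healthchecks are informational. On a standalone Docker engine they only mark a container as healthy or unhealthy, and nothing restarts it (Swarm services are the exception, since Swarm replaces unhealthy tasks).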
Kubernetes healthchecks on the other hand are useful. There are 3 types of health checks (aka probes):
- Liveness - for restarting a container (to flush deadlocks for example)
- Readiness - for allowing/blocking traffic to a container (for example, while you load a giant config file during startup)
- Startup - for slow-starting containers
Let’s unpack these.
Liveness probes are the most basic. Many issues in the software world are fixed by restarting your application. However, if a dependency is down (let’s say an underlying API or a database), a restart won’t help you.
Readiness probes are the most useful. You just don’t serve traffic if you’re not ready. However, if you’re not serving traffic, you have downtime, so where is the self-healing here?
Startup probes are the most hilarious of all. Old apps tend to take time to start, so with only a liveness probe configured you might end up in a restart loop. The startup probe holds the liveness probe off until the app is up. It seems more like “covering software problems with infrastructure”. I’d rather concentrate on reducing the startup time.
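To make the difference between the three probes concrete, here is a minimal sketch in Python. It assumes the probes are HTTP checks, where Kubernetes treats any status from 200 to 399 as success. The names (HealthState, beat, max_heartbeat_age) are made up for illustration and only local state is inspected, with no calls to a database or a downstream API.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HealthState:
    """Local-only view of this service; no dependency calls are made."""
    started: bool = False            # set once initialization has finished
    ready: bool = False              # set once the service can take traffic
    last_heartbeat: float = field(default_factory=time.monotonic)
    max_heartbeat_age: float = 30.0  # seconds; hypothetical threshold

    def beat(self) -> None:
        """Called by the main work loop to show it is not stuck."""
        self.last_heartbeat = time.monotonic()


def startup_probe(state: HealthState) -> int:
    """Fails until initialization is done, so liveness is not judged too early."""
    return 200 if state.started else 503


def liveness_probe(state: HealthState, now: Optional[float] = None) -> int:
    """Fails only when the work loop stopped beating (for example, a deadlock)."""
    now = time.monotonic() if now is None else now
    stuck = (now - state.last_heartbeat) > state.max_heartbeat_age
    return 503 if stuck else 200


def readiness_probe(state: HealthState) -> int:
    """Fails while the service should not receive traffic."""
    return 200 if state.started and state.ready else 503
```

In a real service these functions would sit behind three separate HTTP paths, and the work loop would call beat() on every iteration. A failing liveness probe then means "restart me", while a failing readiness probe only means "don’t route traffic to me yet".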
What to do?
OK. What should we do then? I think the following:
- Design for failure. Don’t do synchronous communication if you don’t have to.
- Implement retries, but be careful with them. For example, send a report-generation job to a queue instead of aggregating all the data inside one request and retrying whenever a 3rd-party system takes a long time to respond.
- Don’t include 3rd-party systems’ availability in a healthcheck, or you risk getting throttled (you can find the opposite advice on the Internet, but if you get throttled your health checks won’t help you)
- Don’t ever do empty healthchecks, they just complicate troubleshooting and produce garbage data
- What matters is: “Is THIS service OK?” (do you need memory flushed, for example?)
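As an illustration of "retries, but careful ones" from the list above, here is a rough Python sketch of capped exponential backoff with jitter. The function name, the defaults, and the choice of ConnectionError as the retryable error are my assumptions, not a prescription.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on ConnectionError, waiting a random, capped delay between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide (queue it, report it)
            # The cap doubles with each attempt but never exceeds max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random delay keeps clients from retrying in lockstep.
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The bounded attempt count and the jitter are the "careful" part: unbounded, synchronized retries are an easy way to get throttled by the very system you’re waiting for.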
And your service should be OK even if a database or a downstream service is down. It has to process errors and anticipate them. Restarting the service itself won’t solve the problem and could make things worse.
If your service cannot handle dependency unavailability, the healthchecks will just produce data that doesn’t help you.
Let’s take examples of databases and downstream unavailability.
Database unavailability
Database availability & health is a monitoring task. I see this as “if your database is unavailable, this is not your application’s problem”.
I’d expect the application to respond gracefully with a message similar to “Your request is being processed” and, if the request has to be executed, to queue it until the database becomes available.
Then dealing with the database becomes a bit less of a firefight.
Of course, everything depends on the context, and such behavior isn’t always possible. If you serve your API to consumers, all of them have to be designed to anticipate failure, and that’s certainly not always the case.
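Here is roughly what I mean by queuing, as a small Python sketch. WriteBuffer, the status codes, and the in-memory deque are all assumptions for illustration. A real system would more likely use a durable queue, because an in-memory buffer is lost when the process restarts.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Request = Dict[str, str]


class WriteBuffer:
    """Accepts writes even while the database is down and replays them later."""

    def __init__(self, save: Callable[[Request], None], max_pending: int = 1000):
        self._save = save                       # hypothetical function that writes to the DB
        self._pending: Deque[Request] = deque()
        self._max_pending = max_pending

    def submit(self, request: Request) -> Tuple[int, str]:
        # Write directly only when nothing is queued, so ordering is preserved.
        if not self._pending:
            try:
                self._save(request)
                return 201, "Saved"
            except ConnectionError:
                pass  # database is unavailable; fall through and queue
        if len(self._pending) >= self._max_pending:
            return 503, "Try again later"
        self._pending.append(request)
        return 202, "Your request is being processed"

    def drain(self) -> int:
        """Replay queued requests in order; stop at the first failure."""
        flushed = 0
        while self._pending:
            try:
                self._save(self._pending[0])
            except ConnectionError:
                break
            self._pending.popleft()
            flushed += 1
        return flushed
```

A background task would call drain() periodically. The bounded queue matters: once it is full the service says so honestly instead of accepting work it cannot keep.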
Downstream system unavailability
The system’s availability is a problem of that system. Queuing again would be my best answer. And again, it’s not always possible, but that is a question of requirements and system design. If queuing isn’t an option, it just means that your system is designed for the happy path.
A good example is auth through a 3rd-party, like Auth0. If it’s down, your healthchecks won’t fix anything. But if you send HTTP requests to Auth0 on every healthcheck execution, you’re maximizing your bill, your chance of getting throttled, and your service’s load.
I’d rather introduce caching and do auth only WHEN it’s necessary. I don’t need to make a request to Auth0 for every API call if the token has already been validated successfully.
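A minimal Python sketch of that caching idea follows. TokenCache and verify_remote are hypothetical names: verify_remote stands in for whatever call your service makes to the auth provider, and the 5-minute TTL is an arbitrary assumption.

```python
import time
from typing import Callable, Dict


class TokenCache:
    """Remembers successful validations so the provider isn't called on every request."""

    def __init__(
        self,
        verify_remote: Callable[[str], bool],  # hypothetical call to the auth provider
        ttl: float = 300.0,                    # seconds a validation is trusted
        clock: Callable[[], float] = time.monotonic,
    ):
        self._verify_remote = verify_remote
        self._ttl = ttl
        self._clock = clock
        self._valid_until: Dict[str, float] = {}

    def is_authorized(self, token: str) -> bool:
        now = self._clock()
        expires = self._valid_until.get(token)
        if expires is not None and now < expires:
            return True                        # cache hit, no remote call
        self._valid_until.pop(token, None)     # drop a stale entry, if any
        if self._verify_remote(token):
            self._valid_until[token] = now + self._ttl
            return True
        return False                           # failures are not cached
```

The trade-off is that a revoked token stays accepted until its cache entry expires, so the TTL should be as short as your requirements allow.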
Conclusion
So, let’s clarify the original statement: you could have self-healing IF you add meaningful healthchecks in the right places, but this won’t help you if your app isn’t designed to anticipate failure.
I’m curious about your comments. I haven’t seen an article mentioning health checks in this form yet.