tl;dr: Avoid surprises by restarting your workloads often to ensure your apps start as expected.
Server outages can produce some interesting surprises. As we continue to strive for dependable interfaces and services between our squads, I am reminded of the snowflake vs. phoenix analogy, a re-branding of pets vs. cattle.
Snowflake: unique, fragile, unpredictable
Phoenix: rise from nothing at any time, reliable, unsurprising
Snowflake servers are unique, fragile, carefully tended to, and manually modified. A snowflake might have a random set of scripts running in a screen or tmux session, or a process that only a few people know how to start correctly. If something unexpected happens to these snowflakes that causes a reboot, it is all hands on deck to revive them when their apps do not start. Lots of scenarios produce an unexpected outcome.
Phoenix servers can be created from nothing at any time. These servers “rise from the ashes” to serve your application in production. Everything that makes up a server (all configuration files, services, the application, etc.) is packaged together and applied to it. Phoenixes are repeatable and predictable. These packages can be tested before being promoted into a production environment. A phoenix server can be destroyed at any time, and a new one can replace it easily.
One way to test whether your workloads are phoenixes or snowflakes is to restart them frequently. Ensure that all of the necessary services and processes start as expected. If something does not start, fix it in configuration management such as Chef, Puppet, or Ansible, and test those changes. Ideally, test these changes by spinning up an ephemeral VM.
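To make that fix repeatable rather than a one-off manual tweak, encode it in your configuration management tool. As a minimal sketch in Ansible (the `app_servers` host group and the `myapp` service name are hypothetical placeholders), a playbook can ensure the service is running and enabled to start at boot:

```yaml
# Minimal sketch, assuming a systemd-managed service called "myapp"
# already exists on hosts in the "app_servers" group.
- hosts: app_servers
  become: true
  tasks:
    - name: Ensure myapp is running and starts on boot
      ansible.builtin.service:
        name: myapp
        state: started
        enabled: true
```

Running a playbook like this against a freshly booted ephemeral VM is a quick check that the workload comes up without manual intervention.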
To take it another step further, put your application in a container and deploy it to a Kubernetes cluster. When your application changes, deploy a new container. Your application should:
- Align with the 12 factors
- Have a stateless, volatile application layer, i.e. the application container can be rebooted or recreated at any time
- Send logs to a remote location such as an ELK cluster or the AWS Elasticsearch Service
- Collect metrics using Datadog or Prometheus so that you can debug your app long after the container has been terminated
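As a sketch of that container approach (the Deployment name, image, and port below are all hypothetical placeholders), a minimal Kubernetes Deployment might look like:

```yaml
# Hypothetical Deployment: the name, image, and port are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.2.3
          ports:
            - containerPort: 8080
          # Logs go to stdout/stderr; a node-level log shipper forwards
          # them off the machine so they outlive any single container.
```

Because the pods are stateless and logs and metrics live elsewhere, any pod can be deleted at any time and the Deployment controller will replace it, which is exactly the phoenix property.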
Ensure that your workloads are phoenixes, not snowflakes.
This is part of Friday Thoughts, a post series on improving best practices throughout LiveRamp’s engineering organization. Do you like engineering teams that continuously seek to improve themselves? We’re always hiring.