A book I read - Software Engineering at Google (check out my reading list) - had an interesting snippet (emphasis mine):
<…> For example, in 2002, Jeff Dean, one of Google’s most senior engineers, wrote the following about running an automated data-processing task as a part of the release process:
[Running the task] is a logistical, time-consuming nightmare. It currently requires getting a list of 50+ machines, starting up a process on each of these 50+ machines, and monitoring its progress on each of the 50+ machines. There is no support for automatically migrating the computation to another machine if one of the machines dies, and monitoring the progress of the jobs is done in an ad hoc manner […] Furthermore, since processes can interfere with each other, there is a complicated, human-implemented “sign up” file to throttle the use of machines, which results in less-than-optimal scheduling, and increased contention for the scarce machine resources
This was an early trigger in Google’s efforts to tame the compute environment, <…>
This led to Google developing the Borg cluster management system, which influenced the design and development of the open source successor that we all know today - Kubernetes. Why is that interesting?
I enjoy uncovering the reasons for a system’s existence - the ‘why’ behind it, the original problem. It is part of the same mindset of deepening understanding that I wrote about in my fundamentals article. It is not about where I am going to use something, but where it is, or in this case was, used.
By now there are countless articles on the internet about why and how one should use Kubernetes, for anything from a one-person blog to a small startup and beyond. It is very easy to catch engineer’s FOMO - everyone seems to be using it, so I should too. We get on board, the brakes are off, and we ride the hype train like there’s no tomorrow.
Still, it is prudent to consider the original problem a system was built to solve before committing to it as the solution to another. If for no other reason, then to confirm that our problem or environment is some permutation of the original. Sometimes other viable solutions surface along the way too.
Just from the snippet above, we can see that the original problem was scheduling, resource constraint management, automated recovery, and execution of tasks across a large fleet of machines. That should immediately raise a question: is this the only way these problems can be solved today? It depends on the environment.
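To make those four concerns concrete, here is a minimal toy sketch in Python. It is not how Borg or Kubernetes work internally; the `Machine` and `ToyScheduler` names, the CPU-only resource model, and the "most free capacity wins" placement rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    cpus: int                                  # total capacity (hypothetical unit)
    healthy: bool = True
    tasks: dict = field(default_factory=dict)  # task name -> CPUs reserved

    def free_cpus(self) -> int:
        return self.cpus - sum(self.tasks.values())


class ToyScheduler:
    """Illustrative only: placement, resource limits, and recovery in one loop."""

    def __init__(self, machines):
        self.machines = {m.name: m for m in machines}
        self.pending = {}  # task name -> CPUs it needs

    def submit(self, task, cpus):
        self.pending[task] = cpus
        self.reconcile()

    def mark_dead(self, machine_name):
        # Automated recovery: tasks on a dead machine go back to pending
        # and are placed elsewhere, instead of a human restarting them.
        machine = self.machines[machine_name]
        machine.healthy = False
        self.pending.update(machine.tasks)
        machine.tasks = {}
        self.reconcile()

    def reconcile(self):
        # Scheduling under resource constraints: a task is only placed on a
        # healthy machine with enough free CPUs; otherwise it stays pending.
        for task, cpus in list(self.pending.items()):
            candidates = [
                m for m in self.machines.values()
                if m.healthy and m.free_cpus() >= cpus
            ]
            if not candidates:
                continue
            target = max(candidates, key=lambda m: m.free_cpus())
            target.tasks[task] = cpus
            del self.pending[task]

    def placement(self):
        return {t: m.name for m in self.machines.values() for t in m.tasks}


scheduler = ToyScheduler([Machine("a", 4), Machine("b", 4), Machine("c", 4)])
scheduler.submit("index", 3)
scheduler.submit("crawl", 2)
scheduler.mark_dead("a")        # "index" is moved off the dead machine
print(scheduler.placement())
```

The "sign up" file from the quote is effectively the `free_cpus` check done by hand, and `mark_dead` is the migration step that did not exist at the time.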
For a company that uses a lot of resources, runs many long-running batch jobs, has a large engineering team, or operates many services, it may well be the best long-term choice.
Take, for instance, my current workplace - Lyst. It has an engineering team of ~100 people, many different workloads, and services that each team is responsible for managing. When I arrived, it was using a largely obscure container orchestration tool that had been deprecated and had quite a few limitations. Our platform team executed a migration to Kubernetes, and it was a massive improvement, from engineering productivity to easier platform management.
On the other hand, for a company that just has a website, runs very few long-running jobs, has a two-pizza-sized engineering team with one or two services, and does not have a lot of clients - an early startup - it is objectively not the best choice today.
Before I get written off as the next Kubernetes hater, let me redeem myself, because I’ve been in this situation. What matters in a small company, assuming it is a startup, is speed of development with very low maintenance overhead. Optimizing for growth means having a plan, not deploying a Google-level stack on day 1.
As an example, this was my plan when I worked at Genus AI where we used AWS:
- AWS Lambda - good early on, when the product was not massively complex. It allowed us to ship code fast and limited the infrastructure maintenance cost. The startup was new and B2B, so while we were working out product-market fit, the costs were low.
- AWS Fargate - when the product grew and we needed more complex and long-running tasks, or when Lambda would have been more expensive and the cost trade-off no longer held.
- AWS EKS - I left before this stage, but I was considering it as the next step, for when the company grew to multiple engineering teams and I’d need standardized tools for them instead of asking the teams to fiddle with AWS directly. As good as that would be, not everyone knows how to do that kind of work, or enjoys it.
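The plan above boils down to a handful of triggers rather than a technology preference. As a rough sketch, it could be written down like this; the `CompanyState` fields and the `choose_stage` function are hypothetical names that only restate the conditions from the list, not a formula for anyone else's company.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    LAMBDA = "AWS Lambda"
    FARGATE = "AWS Fargate"
    EKS = "AWS EKS"


@dataclass
class CompanyState:
    # Hypothetical flags mirroring the triggers described in the list above.
    needs_long_running_tasks: bool = False
    lambda_costs_more_than_containers: bool = False
    multiple_engineering_teams: bool = False


def choose_stage(state: CompanyState) -> Stage:
    # Several teams needing standardized tooling is the trigger for Kubernetes.
    if state.multiple_engineering_teams:
        return Stage.EKS
    # Outgrowing Lambda, in complexity or in cost, is the trigger for Fargate.
    if state.needs_long_running_tasks or state.lambda_costs_more_than_containers:
        return Stage.FARGATE
    # Otherwise stay on the option with the least infrastructure to maintain.
    return Stage.LAMBDA


print(choose_stage(CompanyState()).value)
print(choose_stage(CompanyState(needs_long_running_tasks=True)).value)
print(choose_stage(CompanyState(multiple_engineering_teams=True)).value)
```

Every step is taken only when a concrete problem shows up, which is the opposite of starting with the most capable stack on day 1.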
This shows how looking for the deeper, fundamental reason behind a system can yield better results, and more creative and adaptable solutions, than blindly adopting a premade solution with all the maintenance baggage that comes with it. Beyond Kubernetes, this mindset lets me ignore every ‘don’t look here - this is a legacy system’ and do some code anthropology. Where someone else would jump at the chance to rewrite a legacy system from scratch, I see an opportunity to evolve it. Which reminds me of another book and another quote:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.
The infamous Gall’s Law - the reason Kubernetes exists.