We had to figure this out the hard way, and ended up with this approach (approximately).
K8S provides two (well three, now) health checks.
How this interacts with ALB is quite important.
Liveness should always return 200 OK unless you have hit some fatal condition where your container considers itself dead and wants to be restarted.
Readiness should only return 200 OK if you are ready to serve traffic.
We configure the ALB to only point to the readiness check.
So our application lifecycle looks like this:
* Container starts
* Application loads
* Liveness begins serving 200
* Some internal health checks run and set readiness state to True
* Readiness checks now return 200
* ALB checks begin passing and so pod is added to the target group
* Pod starts getting traffic.
Time passes. Eventually, for some reason, the pod needs to shut down.
* Kube calls the preStop hook
* PreStop sends SIGUSR1 to app and waits for N seconds.
* App handler for SIGUSR1 tells readiness hook to start failing.
* ALB health checks begin failing, and no new requests should be sent.
* ALB takes the pod out of the target group.
* PreStop hook finishes waiting and returns
* Kube sends SIGTERM
* App wraps up any remaining in-flight requests and shuts down.
This allows the app to do graceful shut down, and ensures the ALB doesn't send traffic to a pod that knows it is being shut down.
Oh, and on the Readiness check - your app can use this to (temporarily) signal that it is too busy to serve more traffic. Handy as another signal you can monitor for scaling.
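For anyone who wants to see it concretely, here is a minimal Go net/http sketch of that lifecycle (the endpoint paths, port, and timeouts are illustrative assumptions, not the parent's actual setup): liveness always returns 200, readiness starts failing once the preStop hook sends SIGUSR1, and SIGTERM triggers a graceful shutdown of in-flight requests.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var draining atomic.Bool // set when the preStop hook sends SIGUSR1

	mux := http.NewServeMux()

	// Liveness: 200 unless the process considers itself dead.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the ALB target-group health check points here.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGTERM)
	for sig := range sigs {
		if sig == syscall.SIGUSR1 {
			// preStop: start failing readiness so the ALB drains us,
			// but keep serving whatever traffic still arrives.
			draining.Store(true)
			continue
		}
		// SIGTERM (after the preStop wait): finish in-flight requests and exit.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		srv.Shutdown(ctx)
		return
	}
}
```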
A lot of this seems like the fault of the ALB, is it? I had the same problem and eventually moved off of it to cloudflare tunnels pointed at service load balancers directly, which changed immediately when pods went bad. With a grace period for normal shutdowns, I haven't seen any downtime for deploys or errors.
The issue with the above setup (maybe I'm doing it wrong?) is that if a pod is removed suddenly, say if it crashes, then some portion of traffic gets errors until the ALB updates. And that can be an agonizingly long time, which seems to be because it's pointed at IP addresses in the cluster and not the service. It seemed like a shortcoming of the ALB. GKE doesn't have the same behavior.
I'm not the expert but found something that worked.
> A lot of this seems like the fault of the ALB, is it?
I definitely think the ALB Controller should be taking a more active hand in termination of pods that are targets of an ALB.
But the ALB Controller is exhibiting the same symptom I keep running into throughout Kubernetes.
The amount of "X is a problem because the pod dies too quickly before Y has a chance to clean up/whatever, so we add a preStop sleep of 30 seconds" in the Kubernetes world is truly frustrating.
> A lot of this seems like the fault of the ALB, is it?
People forget to enable pod readiness gates.
Racing against an ASG/ALB combo is always a horrifying adrenaline rush.
Nobody should be using ASGs anymore. EKS Auto Mode or Karpenter.
Why the additional SIGUSR1 vs just doing those (failing health, sleeping) on SIGTERM?
Presumably, because it'd be annoying waiting for lame duck mode when you actually do want the application to terminate quickly. SIGKILL usually needs special privileges/root and doesn't give the application any time to clean-up/flush/etc. The other workaround I've seen is having the application clean-up immediately upon a second signal, which I reckon could also work, but either solution seems reasonable.
We have a number of concurrent issues.
We don't want to kill in-flight requests - terminating while a request is outstanding will result in clients connected to the ALB getting some HTTP 5xx response.
The AWS ALB Controller inside Kubernetes doesn't give us a nice way to specifically say "deregister this target"
The ALB will continue to send us traffic while we return 'healthy' to its health checks.
So we need some way to signal the application to stop serving 'healthy' responses to the ALB Health Checks, which will force the ALB to mark us as unhealthy in the target group and stop sending us traffic.
SIGUSR1 was an otherwise unused signal that we can send to the application without impacting how other signals might be handled.
Istio automates this (at the risk of adding more complexity)
Or nginx. In both cases it’s probably more expensive than an ALB but you have better integration with the app side, plus traffic mesh benefits if you’re using istio. The caveat is that you are managing your own public-facing nodes.
> App handler for SIGUSR1 tells readiness hook to start failing.
Doesn't the kubernetes pod shutdown already mark the pod as not-ready before it calls the pre-stop hook?
I know this won't be helpful to folks committed to EKS, but AWS ECS (i.e. running Docker containers with AWS doing the orchestration) does a really great job on this. We've been running ECS for years (at multiple companies), and basically no hiccups.
One of my former co-workers went to a K8S shop, and longs for the simplicity of ECS.
No software is a panacea, but ECS seems to be one of those "it just works" technologies.
I agree that ECS works great for stateless containerized workloads. But you will need other AWS-managed services for state (RDS), caching (ElastiCache), and queueing (SQS).
So your application is now suddenly spread across multiple services, and you'll need an IaC tool like Terraform, etc.
The beauty (and the main reason we use K8s) is that everything is inside our cluster. We use cloudnative-pg, Redis pods, and RabbitMQ if needed, so everything is maintained in a GitOps project, and we have no IaC management overhead.
(We do manually provision S3 buckets for backups and object storage, though.)
Mentioning “no IaC management overhead” is weird. If you’re not using IaC, you’re doing it wrong.
However, GitOps is IaC, just by another name, so you actually do have IaC “overhead”.
Many companies run k8s for compute and use rds/sqs/redis outside of it. For example RDS is not just hosted PG, it has a whole bunch of features that don’t come out of the box (you do pay for it, I’m not giving an opinion as to whether it’s worth the price)
In everything you've listed, my conclusion is the opposite. The spread across multiple managed services is not a bad thing, that's actually better considering that using them reduces operational overhead. That is, the spread is irrelevant if the services are managed.
The ugliness of k8s is that you're bringing your points of failure together into one mega point of failure and complexity.
Final aside - you absolutely should be using IaC for any serious deployments. If you're using clickops or CLI then the context of the discussion is different and the same criteria do not apply.
You make a great point that when everything is on kube it’s easier to manage.
But… if you are maintaining storage buckets and stuff elsewhere (to avoid accidental deletion etc, a worthy cause) then you are using terraform regardless. So adding RDS etc to the mix is not as tough as you make it sound.
I see both sides of the fence and both have their pros and cons.
If you have great operational experience with kube though I’d go all in on that. AWS bends you over with management fees… it’s far more affordable to run a DB, RMQ, etc on your own versus RDS, AMQ
You’ve replaced IaC overhead with k8s overhead
How do you run all this on developer's machine?
> One of my former co-workers went to a K8S shop, and longs for the simplicity of ECS.
I was using K8s previously, and I’m currently using ECS in my current team, and I hate it. I would _much_ rather have K8s back. The UX is all over the place, none of my normal tooling works, deployment configs are so much worse than the K8s equivalent.
I think like a lot of things, once you’re used to having the knobs of k8s and its DX, you’ll want them always. But a lot of teams adopt k8s because they need a containerized service in AWS, and have no real opinions about how, and in those cases ECS is almost always easier (even with all its quirks).
(And it’s free, if you don’t mind the mild lock-in).
Checks out - I was reading the rest of this and thought "geez, I use ECS and it's nowhere near as complicated as this". Glad I wasn't missing anything.
I've never used Kubernetes myself, but ECS seems to "just work" for my use case of running a simple web app with autoscaling and no downtime.
Completely agree, unless you are operating a platform for others to deploy to, ECS is a lot simpler, and works really well for a lot of common setups.
If you're on GCP, Google Cloud Run also "just works" quite well, too.
Amazing product, doesn’t get nearly the attention it deserves. ECS is a hot spaghetti mess in comparison.
We've been moving away from K8S to ECS...it just works without all the complexity.
I run https://BareMetalSavings.com.
The amount of companies who use K8s when they have no business nor technological justification for it is staggering. It is the number one blocker in moving to bare metal/on prem when costs become too much.
Yes, on prem has its gotchas just like the EKS deployment described in the post, but everything is so much simpler and more straightforward that it's much easier to grasp the on-prem side of things.
I've come at this from a slightly different angle. I've seen many clients running k8s on expensive cloud instances, but to me that is solving the same problems twice. Both k8s and cloud instances solve a highly related and overlapping set of problems.
Instead you can take k8s, deploy it to bare metal, and have much more power for a much lower cost. Of course this requires some technical knowledge, but the benefits are significant (lower costs, stable costs, no vendor lock-in, all the postgres extensions you want, response times halved, etc).
k8s smoothes over the vagaries of bare-metal very nicely.
If you'll excuse a quick plug for my work: We [1] offer a middle ground for this, whereby we do and manage all this for you. We take over all DevOps and infrastructure responsibility while also cutting spend by around 50%. (cloud hardware really is that expensive in comparison).
[1]: https://lithus.eu
> Instead you can take k8s, deploy it to bare metal, and have much more power for a much lower cost. Of course this requires some technical knowledge, but the benefits are significant (lower costs, stable costs, no vendor lock-in, all the postgres extensions you want, response times halved, etc).
> all the postgres extensions you want
You can run Postgres in any managed K8s environment (say AWS EKS) just fine and enable any extensions you want as well. Unless you're conflating managed Postgres solutions like RDS, which would imply that the only way to run databases is by using a managed service of your cloud of choice, which obviously isn't true.
Could you expand a bit on the point of K8S being a blocker to moving to on-prem?
Naively, I would think it would be neutral, since I would assume that if a customer gets k8s running on-prem, then apps designed for running in k8s should have a straightforward migration path?
I can expand a little bit, but based on your question, I suspect you may know everything I'm going to type.
In cloud environments, it's pretty common that your cloud provider has specific implementations of Kubernetes objects, either by creating custom resources that you can make use of, or just building opinionated default instances of things like storage classes, load balancers, etc.
It's pretty easy to not think about the implementation details of, say, an object-storage-backed PVC until you need to do it in a K8s instance that doesn't already have your desired storage class. Then you've got to figure out how to map your simple-but-custom $thing from provider-managed to platform-managed. If you're moving into Rancher, for instance, it's relatively batteries-included, but there are definitely considerations you need to make for things like how machines are built from a disk storage perspective and where Longhorn drives are mapped.
It's like that for a ton of stuff, and a whole lot of the Kubernetes/OutsideInfra interface is like that. Networking, storage, maybe even certificate management, those all need considerations if you're migrating from cloud to on-prem.
Here is your business justification: K8s / Helm charts have become the de-facto standard for packaging applications for on-premise deployments. If you choose any other deployment option on a setup/support contract, the supplier will likely charge you for additional hours.
This is also what we observe while building Distr. ISVs are in need of a container registry to hand these images over to their customers. Our container registry will be purpose-built for this use case.
> The amount of companies who use K8s when they have no business nor technological justification for it is staggering.
I remember a guy I used to work with telling me he'd been at a consulting shop and they used Kubernetes for everything - including static marketing sites. I assume it was a combination of resume and bill padding.
I'm using k8s for my static marketing site. It's in the same cluster as my app tho, so I'm not paying extra for it. Don't think I'd do it otherwise.
Out of interest do you recommend any good places to host a machine in the US? A major part of why I like cloud is because it really simplifies the hardware maintenance.
I'm running kubernetes on digital ocean. It was under $100/mo until last week when I upgraded a couple nodes because memory was getting a bit tight. That was just a couple clicks so not a big deal. We've been with them over 10 years now. Mostly pretty happy. They've had a couple small outages.
Talos for on prem k8s is dead simple
I'm not sure why they state "although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime."
The AWS Load Balancer Controller uses readiness gates by default, exactly as described in the article. Am I missing something?
Edit: Ah, it's not by default, it requires a label in the namespace. I'd forgotten about this. To be fair though, the AWS docs tell you to add this label.
I think the "label (edit: annotation) based configuration" has got to be my least favorite thing about the k8s ecosystem. They're super magic, completely undiscoverable outside the documentation, not typed, not validated (for mutually exclusive options), and rely on introspecting the cluster and so aren't part of the k8s solver.
AWS uses them for all of their integrations and they're never not annoying.
I think you mean annotations. Labels and annotations are different things. And btw, annotations can be validated and typed, with validation webhooks.
Yes, that is what we thought as well, but it turns out that there is still a delay between the load balancer controller registering a target as offline and the pod actually being terminated. We did some benchmarks to highlight that gap.
You mean the problem you describe in "Part 3" of the article?
Damn it, now you've made me paranoid. I'll have to check the ELB logs for 502 errors during our deployment windows.
A few years ago, while helping build a platform on Google Cloud & GKE for a client, we found the same issues.
At that point we already had a CRD used by most of our tenant apps, which deployed an opinionated (but generally flexible enough) full app stack (Deployment, Service, PodMonitor, many sane defaults for affinity/anti-affinity, etc., lots of which were configurable, and other things).
Because we didn't have an opinion on what tenant apps would use in their containers, we needed a way to make the pre-stop sleep small but OS-agnostic.
We ended up with a 1 LOC (plus headers) C app that compiled to a tiny static binary. This was put in a ConfigMap, which the controller mounted on the Pod, from where it could be executed natively.
Perhaps not the most elegant solution, but a simple enough one that got the job done and was left alone with zero required maintenance for years - it might still be there to this day. It was quite fun to watch the reaction of new platform engineers the first time they'd come across it in the codebase. :D
An executable in a ConfigMap? That's interesting.
I realized somewhat recently I could put my Nginx and PHP ini in a ConfigMap, and that seems to work OK. Even that seems a bit dirty though; doesn't it base64 it and save it with all the other YAML configs? Doesn't seem like it's made for files.
> doesn't it base64 it and save it with all the other yaml configs
It does! It's mountable in the filesystem though. In this case, the data key is the filename, and its un-base64'd data, the file contents.
> Even that seems a bit dirty though
As I mentioned in the previous comment, "Perhaps not the most elegant solution" :D
It's been maintenance-free for years though, and since its introduction there were 0 rollout-related 502s.
Yeah, it's been working for me too! Feels weird but if it works it works I guess
This is actually a fascinatingly complex problem. Some notes about the article:
* The 20s delay before shutdown is called “lame duck mode.” As implemented it’s close to good, but not perfect.
* When in lame duck mode you should fail the pod’s health check. That way you don’t rely on the ALB controller to remove your pod. Your pod is still serving other requests, but gracefully asking everyone to forget about it.
* Make an effort to close http keep-alive connections. This is more important if you’re running another proxy that won’t listen to the health checks above (eg AWS -> Node -> kube-proxy -> pod). Note that you can only do that when a request comes in - but it’s as simple as a Connection: close header on the response.
* On a fun note, the new-ish kubernetes graceful node shutdown feature won’t remove your pod readiness when shutting down.
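On the keep-alive point, a minimal Go sketch of one way to do it (the draining flag and port are assumptions for illustration): once lame-duck mode starts, every response carries Connection: close, which in Go's net/http also makes the server close that connection after the response is written, nudging keep-alive clients and intermediate proxies onto other backends.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

var draining atomic.Bool // flipped to true when lame-duck mode begins

// lameDuck asks keep-alive clients to reconnect elsewhere while draining.
// Setting "Connection: close" on the response also tells Go's http server
// to close the underlying connection once the response has been sent.
func lameDuck(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			w.Header().Set("Connection", "close")
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.ListenAndServe(":8080", lameDuck(http.DefaultServeMux))
}
```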
With health I presume you mean readiness check. right? Otherwise it will kill the container when the liveness check fails.
By health check, do you mean the kubernetes liveness check? Does that make kube try to kill or restart your container?
More likely they mean "readiness check" - this is the one that removes you from the Kubernetes load balancer service. Liveness check failing does indeed cause the container to restart.
Yes, sorry for not qualifying - that’s right. IMO the liveness check is only rarely useful - but I've not really run any bleeding edge services on kube. I assume it’s more useful if you're actually working on dangerous code - locking, threading, etc. I’ve mostly only run web apps.
Liveness is great for Java apps that spend all their time fencing locks. I've seen too many completely deadlock.
> The truth is that although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime.
20 years ago we used simple bash scripts with curl to make REST calls to take one host out of our load balancers, then scp'd to the host and shut down the app gracefully, updated the app using scp again, then put it back into the load balancer after testing the host on its own. We had 4 or 5 scripts max, straightforward stuff.
They charge $$$ and you get downtime in this simple scenario?
I used to work in this world, too. What is described here about EKS/K8s sounds tricky but it is actually pretty simple and quite a lot more standardized than what we all used to do. You have two health checks and using those, the app has total control over whether it’s serving traffic or not and gives the scheduler clear guidance about whether or not to restart it. You build it once (20 loc maybe) and then all your apps work the same way. We just have this in our cookie cutter repo.
The fact that the state of the art container orchestration system requires you to run a sleep command in order to not drop traffic on the floor is a travesty of system design.
We had perfectly good rolling deploys before k8s came on the scene, but k8s insistence on a single-phase deployment process means we end up with this silly workaround.
I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system. I'm pretty sure it could still have had a 2 phase pod shutdown by encoding a timeout on the first stage. Sure, it would have made some internals require more complex state - but isn't that the point of k8s? Instead everyone has to rediscover the sleep hack over and over again.
In fairness to Kubernetes, this is partially due to AWS and how their ALB/NLB interact with Kubernetes. When Kubernetes starts to replace Pods, the Amazon ALB/NLB Controller starts reacting; however, it must make calls to the Amazon API and wait for the ALB/NLB to catch up with the changing state of the cluster. Kubernetes is not aware of this and continues on blindly. If the Ingress Controller were more integrated into the cluster, you wouldn't have this problem. We run Ingress-Nginx at work instead of ALB for this reason.
Thus, this entire system of "Mark me not ready, wait for ALB/NLB to realize I'm not ready and stop sending traffic, wait for that to finish, terminate and Kubernetes continues with rollout."
You would have the same problem if you just started up new images in an autoscaling group and randomly SSH'd into the old ones to run "shutdown -h now". The ALB would be shocked by the sudden departure of the VMs and you would probably get traffic going to the old VMs until health checks caught up.
EDIT: Azure/GCP have the same issue if you use their provided ALBs.
Nginx ingress has the same problem, it's just much faster at switching over when a pod is marked as unready because it's continuously watching the endpoints.
Kubernetes is missing a mechanism for load balancing services (like ingress, gateways) to ack pods being marked as not ready before the pod itself is terminated.
There are a few warts like this with the core/apps controllers. Nothing unfixable within the general k8s design IMHO, but unfortunately most of the community has moved on to newer, shinier things.
It shouldn't. I've not had the braincells yet to fully internalize the entire article, but it seems like we go wrong about here:
> The AWS Load Balancer keeps sending new requests to the target for several seconds after the application is sent the termination signal!
And then concluded a wait is required…? Yes, traffic might not cease immediately, but you drain the connections to the load balancer, and then exit. A decent HTTP framework should be doing this by default on SIGTERM.
> I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system.
Yeah, I wouldn't agree with that either. A terminating pod is inherently "not ready", that not-ready state should cause the load balancer to remove it from rotation. Similarly, the pod itself can drain its connections to the load balancer. That could take time; there's always going to be some point at which you'd have to give up on a slowloris request.
The fundamental gap, in my opinion, is that k8s has no mechanism (that I am aware of) to notify the load balancing mechanism (whether that's a service, ingress, or gateway) that it intends to remove a node - and for the load balancer to confirm this is complete.
This is how all pre-k8s rolling deployment systems I've used have worked.
So instead we move the logic to the application, and put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
K8s made simple things complicated, yet it doesn't have obvious safety (or sanity) mechanisms, making everyday life a PITA. I wonder why it was adopted so quickly despite its flaws, and the only thing coming to mind is, like Java in the 90s: massive marketing and propaganda that it's "inevitable".
> put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
Again, I don't see why the sleep is required. You're removed from the load balancer when the last connection from the LB closes.
That’s how you’d expect it to work, but that’s not how pod deletion works.
The pod delete event is sent out, and the load balancer and the pod itself both receive and react to it at the same time.
So unless the LB switchover is very quick, or the pod shutdown is slow - you get dropped requests - usually 502s.
Try googling for graceful k8s deploys and every article will say you have to put a preStop sleep in
Most http frameworks don't do this right. They typically wait until all known in-flight requests complete and then exit. That's usually too fast for a load balancer that's still sending new requests. Instead you should just wait 30 seconds or so while still accepting new requests and replying not ready to load balancer health checks, and then if you want to wait additional time for long running requests, you can. You can also send clients "connection: close" to convince them to reopen connections against different backends.
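A rough Go sketch of that sequence, as a contrast to the preStop/SIGUSR1 approach further up: here the app itself owns the drain delay on SIGTERM. The 30-second figure is the comment's, not a universal constant, and the pod's terminationGracePeriodSeconds would have to be long enough to cover the drain plus the in-flight wait.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var draining atomic.Bool

	// Readiness endpoint watched by the load balancer's health check.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":8080"} // nil Handler: uses DefaultServeMux
	go srv.ListenAndServe()

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// 1. Fail readiness, but keep accepting requests while the LB catches up.
	draining.Store(true)
	time.Sleep(30 * time.Second)

	// 2. Stop accepting new connections and wait for in-flight requests.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```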
> That's usually too fast for a load balancer that's still sending new requests.
How?
A load balancer can't send a new request on a connection that doesn't exist. (Existing connections being gracefully torn down as requests conclude on them & as the underlying protocol permits.) If it cannot open a connection to the backend (the backend should not allow new connections when the drain starts) then by definition new requests cannot end up at the backend.
The server in http is limited in its ability to initiate connection closures. Remember that when you close a connection in TCP, that sends a FIN packet, and the other end of the connection doesn't know that that's happened yet and might still be sending data packets. In http, the server can request that the client stop using a connection and close it with the "connection: close" header. If the server closes the connection abruptly, there could be requests in flight on the network. With http pipelining, the server may even receive requests on the same connection after sending "connection: close" since they could have been sent by the client before that header was received. With pipelining, the client needs to close the TCP connection to achieve a graceful shutdown.
K8S is overrated, it's actually pretty terrible but everyone has been convinced it's the solution to all of their problems because it's slightly better than what we had 15 years ago (Ansible/Puppet/Bash/immutable deployments) at 10x the complexity. There are so many weird edge cases just waiting to completely ruin your day. Like subPath mounts. If you use subPath then changes to a ConfigMap don't get reflected into the container. The container doesn't get restarted either of course, so you have config drift built in, unless you install one of those weird hacky controllers that restarts pods for you.
It's not slightly better, it's way better than Ansible/Puppet/Bash/immutable deployments, because everything follows the same pattern and is standard.
You get observability pretty much for free; solutions from 15 years ago were crap - remember Nagios and the like?
Old solutions would put trash all over the disk in /etc/. How many times did we have to ssh in to fix / repair stuff?
All the health check / load balancer handling is also much better on Kubernetes.
I wouldn't throw away k8s just for subPath weirdness, but I hear your general point about complexity. But if you are throwing away Ansible and Puppet, what is your solution? Also I'm not entirely sure what you are getting at with bash (what does shell scripting have to do with it?) and immutable deployments.
That's only one example of K8s weirdness that can wake you up at 3am. How: a change is rolled out during business hours that modifies the service config inside a ConfigMap. The pod doesn't get notified and doesn't reload the change. The pod crashes at night, loads the new (bad/invalid) config, and takes down production. To add insult to injury, the engineers spend hours debugging the issue because it's completely unintuitive that ConfigMap changes are not reflected ONLY when using subPath.
That's totally valid. I understand the desire of k8s maintainers to prevent "cascading changes" from happening, but this one is a very reasonable feature they seem not to support.
There's a pretty common hack to make things restart on a config change by adding a pod annotation containing a hash of the ConfigMap, so any config change also changes the pod template and triggers a rollout.
But I agree that it shouldn't be needed. There should be built-in and sensible ways to notify of changes and react. This is an argument for 12-factor and env vars for config.
Also, Kustomize can help with some of this since it will rotate the name of ConfigMaps so when any change happens, new ConfigMap, new Deployment.
That's how I do it, with kustomize. Definitely confused me before I learned that, but it hasn't been an issue for years. And if you don't use kustomize, you just do... what was it, kubectl rollout restart? Add that to the end of your deploy script and you're good.
I told you that I hear you on K8s complexity. But since you throw out Ansible/Puppet/etc., what technology are you advocating?
Nit: "How we archived" subheading should be "How we achieved".
Thanks, fixed
We’re using Argo Rollouts without issue. It’s a superset of a Deployment with configuration-based blue/green deploys or canary. Works great for us and allows us to get around the problem laid out in this article.
Argo Rollouts is an extra orchestration layer on top of a traffic management provider. Which one are you using? If you use the ALB controller you still have to deal with pod shutdown / target deregistration timing issues.
https://argoproj.github.io/argo-rollouts/features/traffic-ma...
Does this or any of the strategies listed in the comments properly handle long-lived client connections? It's sufficient to wait for the LB to stop sending traffic when connections are 100s of ms or less, but when connections are minutes or even hours long it doesn't work out well.
Is there a slick strategy for this? Is it possible to have minutes-long pre-stop hooks? Is the only option to give client connections an abandon-ship message and kick them out, hopefully fast enough?
Might be noteworthy that in recent enough k8s, lifecycle.preStop.sleep.seconds is implemented (https://github.com/kubernetes/enhancements/blob/master/keps/...), so there is no longer any need to run an external sleep command.
highly recommend porter if you are a startup who doesn't wanna think about things like this
https://www.porter.run/
somewhat related https://architect.run/
> Seamless Migrations with Zero Downtime
(I don't work for them but they are friends ;))