Production is (very) close to the current default kubelet limit of 110 pods/node
During the massive re-deploy after this week's swh.core/swh.model update, I kept bumping into the 110 pods/node limit. I had to disable auto-sync on ArgoCD, scale down the large deployments, then wait... a long time.
This is made worse by the local-path-provisioner, which has to spawn helper pods to clean up PVCs when another pod ends; those cleanup pods hit the OutOfpods condition a lot, making recovery even slower. And of course, our k8s-based workload is not going to go down any time soon.
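For reference, a quick way to see how much headroom each node has left (the `pods` entry in a node's allocatable resources is what the kubelet `maxPods` setting controls). This is just a sketch using the Python `kubernetes` client, assuming a working kubeconfig against the production cluster:

```python
from collections import Counter

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Pods in a terminal phase (Succeeded/Failed) no longer hold a pod slot,
# but Terminating pods still do, which is exactly what bites us here.
pods = v1.list_pod_for_all_namespaces(
    field_selector="status.phase!=Succeeded,status.phase!=Failed"
)
per_node = Counter(p.spec.node_name for p in pods.items if p.spec.node_name)

for node in v1.list_node().items:
    # allocatable["pods"] reflects the kubelet's maxPods setting (110 by default)
    capacity = int(node.status.allocatable["pods"])
    used = per_node.get(node.metadata.name, 0)
    print(f"{node.metadata.name}: {used}/{capacity} pods")
```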
I think we have a few approaches we could look at:

- swh-sysadmin-provisioning!110 (merged): Review whether we could increase the default value from 110 pods/node to something more sensible for the hardware we're deploying on. This is doable, see for example this blog post from RedHat on tuning their OpenShift k8s distribution to support 500 pods/node [1] on specs similar to ours [2]
- #5215 (closed), #5226 (closed): Spawn more workloads on the dedicated nodes (e.g. saam and banco, which we'll be joining to the k8s cluster to serve as local objstorage backends, could also run a bunch of storage RPC pods)
- #5224 (closed): Recycle belvedere as a workload node (-> rancher-node-metal04)
- Solve the "scale down kills an active pod instead of an idle one" issue, which would let us downscale the deployments of critical workloads and reduce the "number-of-pods pressure" (see the pod-deletion-cost sketch below)
- Find a way to be smarter when deploying new versions, by ramping the large deployments down and then scaling them back up, instead of surging them above their nominal size (especially critical for the loaders, which hang around in the Terminating state for a long time) (-> probably a Kubernetes operator; see the rollout-strategy sketch below)
- ...?
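On the scale-down point: Kubernetes has a `controller.kubernetes.io/pod-deletion-cost` annotation (beta since 1.22) that the ReplicaSet controller takes into account when picking which pods to remove first (lower cost = removed first). A rough sketch of what marking an idle pod could look like, again with the Python client; the pod name, namespace and cost value below are made up, and something (the workers themselves, or a small controller) would still have to decide what counts as idle:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def set_deletion_cost(pod_name: str, namespace: str, cost: int) -> None:
    """Annotate a pod so the ReplicaSet controller prefers deleting it
    on scale-down: lower cost = deleted first."""
    body = {
        "metadata": {
            "annotations": {
                "controller.kubernetes.io/pod-deletion-cost": str(cost)
            }
        }
    }
    v1.patch_namespaced_pod(pod_name, namespace, body=body)

# Hypothetical example: an idle loader pod in an (assumed) swh namespace
set_deletion_cost("loader-git-7f9c8d-abcde", "swh", -1000)
```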
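On the rollout side, the stock mechanism for "never go above nominal size during a rollout" is a RollingUpdate strategy with `maxSurge: 0` and a non-zero `maxUnavailable`: the old ReplicaSet gets scaled down before replacements are created. In our setup this would live in the helm charts / ArgoCD values rather than a live patch, but as a sketch of the effective change (deployment name and namespace are hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Roll pods in place instead of surging above the nominal replica count:
# take down up to 25% of the old ReplicaSet first, then create replacements.
strategy_patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 0, "maxUnavailable": "25%"},
        }
    }
}
apps.patch_namespaced_deployment("loader-git", "swh", body=strategy_patch)
```

Caveat: pods stuck in Terminating still hold their node slot until they are actually gone, so for the loaders this only really helps if combined with scaling the deployment down first, which is probably where the operator idea comes in.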