[dynamic infra] Correctly manage Save Code Now workers autoscaling
Kubernetes is not very smart about downscaling running pods: it picks a pod to kill at random.
A typical scenario is:
- 2 Save Code Now messages arrive in the RabbitMQ queue
- task 1 is a long task
- task 2 is a short task
- 2 pods are created
- task 2 is done
- Kubernetes chooses a random pod to kill
- A warm shutdown message is sent to pod1, which is still running the long task
- If the task is not finished after 1 hour, the pod is killed anyway
- pod2 does nothing in the meantime
Possible solutions:
- Use a ScaledJob instead of a ScaledObject: the loaders must be able to load only one repo and then exit (see the first sketch after this list)
- Find a way to manage the controller.kubernetes.io/pod-deletion-cost annotation of the pods doing nothing, so that idle pods are evicted first on scale down (see the second sketch after this list) - ???
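To illustrate the ScaledJob option, here is a minimal sketch of a worker that pulls exactly one message and exits, which is the behavior a ScaledJob-driven loader would need. The queue name, connection parameters and the load_one_repo() helper are assumptions for illustration, not the actual swh loader entry points:

```python
# Hypothetical sketch of a "one task then exit" worker suitable for a KEDA
# ScaledJob: each Job pod consumes a single Save Code Now message, runs the
# loader, and exits, so Kubernetes never has to pick a long-running pod to kill.
import json

import pika


def load_one_repo(task: dict) -> None:
    # Placeholder for the actual loader invocation (assumed name).
    print(f"loading {task.get('origin_url')}")


def main() -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
    channel = connection.channel()

    # Fetch at most one message; if the queue is empty the pod exits immediately.
    method, _properties, body = channel.basic_get("save_code_now", auto_ack=False)
    if method is None:
        connection.close()
        return

    try:
        load_one_repo(json.loads(body))
        channel.basic_ack(method.delivery_tag)
    except Exception:
        # Requeue the message so another Job can retry it.
        channel.basic_nack(method.delivery_tag, requeue=True)
        raise
    finally:
        connection.close()


if __name__ == "__main__":
    main()
```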
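For the pod-deletion-cost idea, one possible approach is a small script (run periodically or on task completion) that annotates idle loader pods with a negative deletion cost, so the ReplicaSet controller prefers to remove them before the pods still running a long task. A minimal sketch with the Kubernetes Python client, where the namespace, the label selector and the is_idle() check are assumptions that would still have to be defined:

```python
# Hypothetical sketch: give idle loader pods a negative
# controller.kubernetes.io/pod-deletion-cost so they are deleted first
# when the Deployment scales down.
from kubernetes import client, config


def set_deletion_cost(pod_name: str, namespace: str, cost: int) -> None:
    body = {
        "metadata": {
            "annotations": {
                # Lower cost = preferred for deletion on scale down.
                "controller.kubernetes.io/pod-deletion-cost": str(cost)
            }
        }
    }
    client.CoreV1Api().patch_namespaced_pod(pod_name, namespace, body)


def is_idle(pod) -> bool:
    # How to detect an idle worker (e.g. asking celery/the loader itself)
    # is exactly the open question; only assumed here.
    raise NotImplementedError("depends on how the loader exposes its state")


def main() -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    pods = client.CoreV1Api().list_namespaced_pod(
        "swh", label_selector="app=save-code-now-loader"
    )
    for pod in pods.items:
        if is_idle(pod):
            set_deletion_cost(pod.metadata.name, "swh", -100)


if __name__ == "__main__":
    main()
```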
The easy workaround is to disable the autoscaling of the high-priority loaders until a proper solution is found; a possible way to do this is sketched below.
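One way to implement the workaround without deleting the ScaledObject is KEDA's autoscaling.keda.sh/paused-replicas annotation, which pins the workload to a fixed replica count. A minimal sketch, assuming hypothetical names for the ScaledObject and the namespace:

```python
# Hypothetical sketch of the workaround: pause KEDA autoscaling for the
# high priority loaders by pinning the ScaledObject to a fixed replica count.
from kubernetes import client, config


def pause_scaledobject(name: str, namespace: str, replicas: int) -> None:
    body = {
        "metadata": {
            "annotations": {
                # KEDA stops scaling and keeps exactly this many replicas.
                "autoscaling.keda.sh/paused-replicas": str(replicas)
            }
        }
    }
    client.CustomObjectsApi().patch_namespaced_custom_object(
        group="keda.sh",
        version="v1alpha1",
        namespace=namespace,
        plural="scaledobjects",
        name=name,
        body=body,
    )


if __name__ == "__main__":
    config.load_kube_config()
    # Names below are assumptions, not the actual resource names.
    pause_scaledobject("save-code-now-high-priority", "swh", replicas=2)
```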