components/alerting: Create a custom KubeHpaMaxedOut rule
These changes deploy a rule identical to the default KubeHpaMaxedOut rule from kube-prometheus-stack, except that it excludes KEDA HPAs.
The default KubeHpaMaxedOut rule needs to be disabled (already done in the staging environment).
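For reference, disabling the default rule could look like the values fragment below. This is only a sketch: it assumes the monitoring chart exposes the upstream kube-prometheus-stack `defaultRules.disabled` map, and the exact nesting in the values files of this repository may differ.

```yaml
# Sketch, assuming the upstream kube-prometheus-stack values layout.
# The key path is an assumption for this repository's monitoring chart.
defaultRules:
  disabled:
    # Turn off the stock alert so only the custom, KEDA-aware rule fires.
    KubeHpaMaxedOut: true
```

With both rules active, a maxed-out non-KEDA HPA would raise two alerts, so the default one should be disabled wherever the custom rule is deployed.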
Helm diff
[cluster-components] Comparing changes between branches production and prometheus_rules_hpamaxedout...
Your branch is up to date with 'origin/production'.
[cluster-components] Generate config in production branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/minikube.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in production branch for cluster-components/values/test-staging-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/admin-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/archive-production-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/archive-staging-rke2.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/gitlab-production.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/gitlab-staging.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/minikube.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/rancher.yaml...
[cluster-components] Generate config in prometheus_rules_hpamaxedout branch for cluster-components/values/test-staging-rke2.yaml...
------------- diff for cluster-components/values/admin-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/admin-rke2.yaml.before, 29 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/admin-rke2.yaml.after, 29 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/archive-production-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/archive-production-rke2.yaml.before, 15 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/archive-production-rke2.yaml.after, 15 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned one difference
|___/
spec.groups.swh-production.rules.rules (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-production.rules)
  + one list entry added:
    - alert: HPA_Maxed_Out_In_Production
      annotations:
        description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes."
        runbook_url: "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout"
        summary: "HPA is running at max replicas"
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
          ==
        kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
      for: 15m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
------------- diff for cluster-components/values/archive-staging-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/archive-staging-rke2.yaml.before, 15 documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/archive-staging-rke2.yaml.after, 15 documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned one difference
|___/
spec.groups.swh-staging.rules.rules (monitoring.coreos.com/v1/PrometheusRule/cattle-monitoring-system/swh-staging.rules)
  + one list entry added:
    - alert: HPA_Maxed_Out_In_Staging
      annotations:
        description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been running at max replicas for longer than 15 minutes."
        runbook_url: "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout"
        summary: "HPA is running at max replicas"
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
          ==
        kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler!~"keda-hpa-.*", job="kube-state-metrics", namespace=~".*"}
      for: 15m
      labels:
        severity: warning
        namespace: cattle-monitoring-system
------------- diff for cluster-components/values/gitlab-production.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-production.yaml.before
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-production.yaml.after
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/gitlab-staging.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-staging.yaml.before
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/gitlab-staging.yaml.after
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/minikube.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/minikube.yaml.before
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/minikube.yaml.after
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/rancher.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/rancher.yaml.before
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/rancher.yaml.after
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
------------- diff for cluster-components/values/test-staging-rke2.yaml -------------
_ __ __
_| |_ _ / _|/ _| between /tmp/swh-chart.cluster-components.JiFwlfd4/test-staging-rke2.yaml.before, four documents
/ _' | | | | |_| |_ and /tmp/swh-chart.cluster-components.JiFwlfd4/test-staging-rke2.yaml.after, four documents
| (_| | |_| | _| _|
\__,_|\__, |_| |_| returned no differences
|___/
Rules validity check
ᐅ cd cluster-components
ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting . | \
grep groups -A 61 | promtool check rules
Checking standard input
SUCCESS: 6 rules found
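Beyond the syntax check, the KEDA exclusion itself could be exercised with a promtool unit test (`promtool test rules <file>`). The sketch below is not part of this change. The file names, the HPA names (`web`, `keda-hpa-loader`) and the `swh` namespace are hypothetical, and it assumes the `groups:` section has been extracted to a plain rule file and that `namespace: cattle-monitoring-system` sits under the rule's `labels`.

```yaml
# hpa_maxed_out_test.yaml (hypothetical file name)
rule_files:
  - swh-production.rules.yaml  # hypothetical: the extracted "groups:" section

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Regular HPA pinned at its maximum for 20 minutes: expected to fire.
      - series: 'kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="web", job="kube-state-metrics", namespace="swh"}'
        values: '10+0x20'
      - series: 'kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="web", job="kube-state-metrics", namespace="swh"}'
        values: '10+0x20'
      # KEDA-managed HPA, also at its maximum: expected to be ignored.
      - series: 'kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="keda-hpa-loader", job="kube-state-metrics", namespace="swh"}'
        values: '5+0x20'
      - series: 'kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="keda-hpa-loader", job="kube-state-metrics", namespace="swh"}'
        values: '5+0x20'
    alert_rule_test:
      # After the 15m "for" window, only the non-KEDA HPA should be firing.
      - eval_time: 20m
        alertname: HPA_Maxed_Out_In_Production
        exp_alerts:
          - exp_labels:
              severity: warning
              # The rule's static label replaces the series' namespace label.
              namespace: cattle-monitoring-system
              horizontalpodautoscaler: web
              job: kube-state-metrics
            exp_annotations:
              # Templates are expanded from the original series labels.
              description: "HPA swh/web has been running at max replicas for longer than 15 minutes."
              runbook_url: "https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout"
              summary: "HPA is running at max replicas"
```

If the `keda-hpa-.*` matcher were dropped from the expression, this test would fail because a second alert for `keda-hpa-loader` would appear.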