Kube cronjobs alerting
These changes add two alerts (severity: warning) that fire when:
- a cronjob is suspended and therefore no longer executed;
- a cronjob's concurrency policy allows concurrent jobs (any value other than Forbid).
I define all rules in one group, swh.<environment>.rules.
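As a rough way to see which cronjobs would trigger these alerts before the rules are deployed, the same two conditions can be checked directly against a cluster. The sketch below is only an illustration: the flag_cronjobs helper is hypothetical (it is not part of swh-charts), and it assumes one cronjob per input line in the form "namespace name suspend concurrencyPolicy", as produced by the kubectl command shown in the trailing comment.

```shell
# Hypothetical helper, not part of swh-charts.
# Reads lines of the form "<namespace> <name> <suspend> <concurrencyPolicy>"
# on stdin and prints the cronjobs matching either alert condition.
flag_cronjobs() {
  awk '
    NF < 4         { next }  # skip blank or incomplete lines
    $3 == "true"   { print $1 "/" $2 ": suspended" }
    $4 != "Forbid" { print $1 "/" $2 ": concurrency policy is " $4 }
  '
}

# Possible usage against a live cluster (requires kubectl access):
# kubectl get cronjobs --all-namespaces --no-headers \
#   -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,SUSPEND:.spec.suspend,POLICY:.spec.concurrencyPolicy' \
#   | flag_cronjobs
```

A cronjob that is both suspended and not set to Forbid is reported twice, once per condition, which mirrors the two separate alerts.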
Diff production
ᐅ diff -u kube-production kube-cronjobs-production
--- kube-production 2023-11-21 14:58:25.604870534 +0100
+++ kube-cronjobs-production 2023-11-21 14:57:51.296344215 +0100
@@ -45,17 +45,15 @@
key: password
name: alertmanager-irc-relay-config
---
-# Source: cluster-config/templates/alerting/cassandra-alerting.yaml
+# Source: cluster-config/templates/alerting/swh-alerting.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
- labels:
- app: cassandra
- name: cassandra-service.rules
+ name: swh.production.rules
namespace: cattle-monitoring-system
spec:
groups:
- - name: cassandra-service.rules
+ - name: swh.production.rules
rules:
- alert: Cassandra_Degraded_Service_In_Production
annotations:
@@ -75,3 +73,21 @@
labels:
severity: critical
namespace: cattle-monitoring-system
+ - alert: Concurrent_Cronjob_Is_Allowed_In_Production
+ annotations:
+ description: "The concurrency_policy of cronjob {{ $labels.cronjob }} is {{ $labels.concurrency_policy }}."
+ summary: "Please set the concurrency_policy of cronjob {{ $labels.cronjob }} to 'Forbid' on cluster {{ $labels.cluster_name }}."
+ expr: kube_cronjob_info{concurrency_policy!="Forbid"}
+ for: 15m
+ labels:
+ severity: warning
+ namespace: cattle-monitoring-system
+ - alert: Cronjob_Is_Suspended_In_Production
+ annotations:
+ description: "The cronjob {{ $labels.cronjob }} is suspended for more than 5 minutes."
+ summary: "Please set the suspension status of cronjob {{ $labels.cronjob }} to 'false' on cluster {{ $labels.cluster_name }}."
+ expr: kube_cronjob_spec_suspend > 0
+ for: 5m
+ labels:
+ severity: warning
+ namespace: cattle-monitoring-system
Diff staging
ᐅ diff -u kube-staging kube-cronjobs-staging
--- kube-staging 2023-11-21 14:58:53.629285154 +0100
+++ kube-cronjobs-staging 2023-11-21 14:58:08.764614846 +0100
@@ -210,17 +210,15 @@
key: password
name: alertmanager-irc-relay-config
---
-# Source: cluster-config/templates/alerting/cassandra-alerting.yaml
+# Source: cluster-config/templates/alerting/swh-alerting.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
- labels:
- app: cassandra
- name: cassandra-service.rules
+ name: swh.staging.rules
namespace: cattle-monitoring-system
spec:
groups:
- - name: cassandra-service.rules
+ - name: swh.staging.rules
rules:
- alert: Cassandra_Degraded_Service_In_Staging
annotations:
@@ -240,3 +238,21 @@
labels:
severity: critical
namespace: cattle-monitoring-system
+ - alert: Concurrent_Cronjob_Is_Allowed_In_Staging
+ annotations:
+ description: "The concurrency_policy of cronjob {{ $labels.cronjob }} is {{ $labels.concurrency_policy }}."
+ summary: "Please set the concurrency_policy of cronjob {{ $labels.cronjob }} to 'Forbid' on cluster {{ $labels.cluster_name }}."
+ expr: kube_cronjob_info{concurrency_policy!="Forbid"}
+ for: 15m
+ labels:
+ severity: warning
+ namespace: cattle-monitoring-system
+ - alert: Cronjob_Is_Suspended_In_Staging
+ annotations:
+ description: "The cronjob {{ $labels.cronjob }} is suspended for more than 5 minutes."
+ summary: "Please set the suspension status of cronjob {{ $labels.cronjob }} to 'false' on cluster {{ $labels.cluster_name }}."
+ expr: kube_cronjob_spec_suspend > 0
+ for: 5m
+ labels:
+ severity: warning
+ namespace: cattle-monitoring-system
Check rules
~/_swh_src/sysadm-environment/swh-charts/cluster-components (kube_cronjobs_alerting ✔) ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting . | \
grep groups -A 38
groups:
- name: swh.production.rules
rules:
- alert: Cassandra_Degraded_Service_In_Production
annotations:
description: "The {{ $labels.instance }} node is unreachable for more than 15 minutes. This node seems down."
summary: "The {{ $labels.service }} is degraded. Please check the {{ $labels.instance }} status."
expr: up{service="cassandra-servers-svc"} == 0
for: 15m
labels:
severity: warning
namespace: cattle-monitoring-system
- alert: Cassandra_Unrepaired_Table_In_Production
annotations:
description: "The unrepaired bytes of table {{ $labels.table }} is more than 200 Gb."
summary: "Please trigger a repair on the table {{ $labels.table }} in keyspace {{ $labels.keyspace }}."
expr: sum by (keyspace, table) (cassandra_table_bytesunrepaired{table!="",job="cassandra-servers-svc"}) > 2.147483648e+11
for: 5m
labels:
severity: critical
namespace: cattle-monitoring-system
- alert: Concurrent_Cronjob_Is_Allowed_In_Production
annotations:
description: "The concurrency_policy of cronjob {{ $labels.cronjob }} is {{ $labels.concurrency_policy }}."
summary: "Please set the concurrency_policy of cronjob {{ $labels.cronjob }} to 'Forbid' on cluster {{ $labels.cluster_name }}."
expr: kube_cronjob_info{concurrency_policy!="Forbid"}
for: 15m
labels:
severity: warning
namespace: cattle-monitoring-system
- alert: Cronjob_Is_Suspended_In_Production
annotations:
description: "The cronjob {{ $labels.cronjob }} is suspended for more than 5 minutes."
summary: "Please set the suspension status of cronjob {{ $labels.cronjob }} to 'false' on cluster {{ $labels.cluster_name }}."
expr: kube_cronjob_spec_suspend > 0
for: 5m
labels:
severity: warning
namespace: cattle-monitoring-system
~/_swh_src/sysadm-environment/swh-charts/cluster-components (kube_cronjobs_alerting ✔) ᐅ helm template -f values.yaml -f values/archive-production-rke2.yaml alerting . | \
grep groups -A 38 | promtool check rules
Checking standard input
SUCCESS: 4 rules found
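The diffs above cover both production and staging, while the check is shown for production only. A minimal sketch of a loop that runs the same pipeline for several values files could look like the following. The check_rules function name is hypothetical, the values files are passed as arguments rather than assumed, and the 38-line grep window is copied from the command above and would need adjusting if more rules are added.

```shell
# Hypothetical wrapper around the pipeline shown above; run it from the
# chart directory (cluster-components).
check_rules() {
  for values_file in "$@"; do
    echo "== ${values_file}"
    # Render the chart, keep the rule groups, and let promtool validate them.
    # Stop at the first values file whose rules do not pass the check.
    helm template -f values.yaml -f "${values_file}" alerting . \
      | grep groups -A 38 \
      | promtool check rules || return 1
  done
}

# Example: check_rules values/archive-production-rke2.yaml
```

The function returns a non-zero status as soon as promtool rejects one rendered file, so it can be chained with other commands if needed.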