
Commit ab98e41

Integration of the new pod recovery monitoring strategy implemented in krkn-lib (#609)
* pod monitoring integration in plugin scenario
* pod monitoring integration in container scenario
* removed wait-for-pod step from plugin scenario config files
* introduced global pod recovery time
* introduced krkn_pod_recovery_time in plugin scenario and removed all the references to wait-for-pods
* functional test fix
* main branch functional test fix
* increased recovery times

Signed-off-by: Tullio Sebastiani <[email protected]>
1 parent 19ad2d1 commit ab98e41

22 files changed: +124 −85 lines changed

CI/tests/common.sh

+5 −3

@@ -1,7 +1,7 @@
 ERRORED=false
 
 function finish {
-  if [ $? -eq 1 ] && [ $ERRORED != "true" ]
+  if [ $? != 0 ] && [ $ERRORED != "true" ]
   then
     error
   fi
@@ -13,8 +13,10 @@ function error {
   then
     echo "Error caught."
     ERRORED=true
-  else
-    echo "Exit code greater than zero detected: $exit_code"
+  elif [ $exit_code == 2 ]
+  then
+    echo "Run with exit code 2 detected, it is expected, wrapping the exit code with 0 to avoid pipeline failure"
+    exit 0
   fi
 }

CI/tests/test_container.sh

+4 −4

@@ -8,11 +8,11 @@ trap finish EXIT
 pod_file="CI/scenarios/hello_pod.yaml"
 
 function functional_test_container_crash {
-  yq -i '.scenarios[0].namespace="default"' scenarios/openshift/app_outage.yaml
-  yq -i '.scenarios[0].label_selector="scenario=container"' scenarios/openshift/app_outage.yaml
-  yq -i '.scenarios[0].container_name="fedtools"' scenarios/openshift/app_outage.yaml
+  yq -i '.scenarios[0].namespace="default"' scenarios/openshift/container_etcd.yml
+  yq -i '.scenarios[0].label_selector="scenario=container"' scenarios/openshift/container_etcd.yml
+  yq -i '.scenarios[0].container_name="fedtools"' scenarios/openshift/container_etcd.yml
   export scenario_type="container_scenarios"
-  export scenario_file="- scenarios/openshift/app_outage.yaml"
+  export scenario_file="- scenarios/openshift/container_etcd.yml"
   export post_config=""
   envsubst < CI/config/common_test_config.yaml > CI/config/container_config.yaml

CI/tests/test_telemetry.sh

+4 −4

@@ -22,14 +22,14 @@ function functional_test_telemetry {
   export scenario_file="scenarios/arcaflow/cpu-hog/input.yaml"
   export post_config=""
   envsubst < CI/config/common_test_config.yaml > CI/config/telemetry.yaml
-  python3 -m coverage run -a run_kraken.py -c CI/config/telemetry.yaml
+  retval=$(python3 -m coverage run -a run_kraken.py -c CI/config/telemetry.yaml)
   RUN_FOLDER=`cat CI/out/test_telemetry.out | grep amazonaws.com | sed -rn "s#.*https:\/\/.*\/files/(.*)#\1#p"`
   $AWS_CLI s3 ls "s3://$AWS_BUCKET/$RUN_FOLDER/" | awk '{ print $4 }' > s3_remote_files
   echo "checking if telemetry files are uploaded on s3"
   cat s3_remote_files | grep events-00.json || ( echo "FAILED: events-00.json not uploaded" && exit 1 )
-  cat s3_remote_files | grep critical-alerts-00.json || ( echo "FAILED: critical-alerts-00.json not uploaded" && exit 1 )
-  cat s3_remote_files | grep prometheus-00.tar || ( echo "FAILED: prometheus backup not uploaded" && exit 1 )
-  cat s3_remote_files | grep telemetry.json || ( echo "FAILED: telemetry.json not uploaded" && exit 1 )
+  cat s3_remote_files | grep critical-alerts-00.log || ( echo "FAILED: critical-alerts-00.log not uploaded" && exit 1 )
+  cat s3_remote_files | grep prometheus-00.tar || ( echo "FAILED: prometheus backup not uploaded" && exit 1 )
+  cat s3_remote_files | grep telemetry.json || ( echo "FAILED: telemetry.json not uploaded" && exit 1 )
   echo "all files uploaded!"
   echo "Telemetry Collection: Success"
 }

docs/getting_started.md

+1 −5

@@ -14,11 +14,7 @@ For example, for adding a pod level scenario for a new application, refer to the
     namespace_pattern: ^<namespace>$
     label_selector: <pod label>
     kill: <number of pods to kill>
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^<namespace>$
-    label_selector: <pod label>
-    count: <expected number of pods that match namespace and label>
+    krkn_pod_recovery_time: <expected time for the pod to become ready>
 ```
 
 #### Node Scenario Yaml Template

docs/pod_scenarios.md

+2 −5

@@ -17,11 +17,8 @@ You can then create the scenario file with the following contents:
   config:
     namespace_pattern: ^kube-system$
     label_selector: k8s-app=kube-scheduler
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^kube-system$
-    label_selector: k8s-app=kube-scheduler
-    count: 3
+    krkn_pod_recovery_time: 120
+
 ```
 
 Please adjust the schema reference to point to the [schema file](../scenarios/plugin.schema.json). This file will give you code completion and documentation for the available options in your IDE.

kraken/plugins/__init__.py

+56 −3

@@ -2,11 +2,14 @@
 import json
 import logging
 from os.path import abspath
-from typing import List, Dict
+from typing import List, Dict, Any
 import time
 
 from arcaflow_plugin_sdk import schema, serialization, jsonschema
 from arcaflow_plugin_kill_pod import kill_pods, wait_for_pods
+from krkn_lib.k8s import KrknKubernetes
+from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
+
 import kraken.plugins.node_scenarios.vmware_plugin as vmware_plugin
 import kraken.plugins.node_scenarios.ibmcloud_plugin as ibmcloud_plugin
 from kraken.plugins.run_python_plugin import run_python_file
@@ -47,11 +50,14 @@ def __init__(self, steps: List[PluginStep]):
         )
         self.steps_by_id[step.schema.id] = step
 
+    def unserialize_scenario(self, file: str) -> Any:
+        return serialization.load_from_file(abspath(file))
+
     def run(self, file: str, kubeconfig_path: str, kraken_config: str):
         """
         Run executes a series of steps
         """
-        data = serialization.load_from_file(abspath(file))
+        data = self.unserialize_scenario(abspath(file))
         if not isinstance(data, list):
             raise Exception(
                 "Invalid scenario configuration file: {} expected list, found {}".format(file, type(data).__name__)
@@ -241,18 +247,37 @@ def json_schema(self):
         )
 
 
-def run(scenarios: List[str], kubeconfig_path: str, kraken_config: str, failed_post_scenarios: List[str], wait_duration: int, telemetry: KrknTelemetryKubernetes) -> (List[str], list[ScenarioTelemetry]):
+def run(scenarios: List[str],
+        kubeconfig_path: str,
+        kraken_config: str,
+        failed_post_scenarios: List[str],
+        wait_duration: int,
+        telemetry: KrknTelemetryKubernetes,
+        kubecli: KrknKubernetes
+        ) -> (List[str], list[ScenarioTelemetry]):
+
     scenario_telemetries: list[ScenarioTelemetry] = []
     for scenario in scenarios:
         scenario_telemetry = ScenarioTelemetry()
         scenario_telemetry.scenario = scenario
         scenario_telemetry.startTimeStamp = time.time()
         telemetry.set_parameters_base64(scenario_telemetry, scenario)
         logging.info('scenario ' + str(scenario))
+        pool = PodsMonitorPool(kubecli)
+        kill_scenarios = [kill_scenario for kill_scenario in PLUGINS.unserialize_scenario(scenario) if kill_scenario["id"] == "kill-pods"]
+
        try:
+            start_monitoring(pool, kill_scenarios)
            PLUGINS.run(scenario, kubeconfig_path, kraken_config)
+            result = pool.join()
+            scenario_telemetry.affected_pods = result
+            if result.error:
+                raise Exception(f"unrecovered pods: {result.error}")
+
        except Exception as e:
+            logging.error(f"scenario exception: {str(e)}")
            scenario_telemetry.exitStatus = 1
+            pool.cancel()
            failed_post_scenarios.append(scenario)
            log_exception(scenario)
        else:
@@ -263,3 +288,31 @@ def run(scenarios: List[str], kubeconfig_path: str, kraken_config: str, failed_p
         scenario_telemetry.endTimeStamp = time.time()
 
     return failed_post_scenarios, scenario_telemetries
+
+
+def start_monitoring(pool: PodsMonitorPool, scenarios: list[Any]):
+    for kill_scenario in scenarios:
+        recovery_time = kill_scenario["config"]["krkn_pod_recovery_time"]
+        if ("namespace_pattern" in kill_scenario["config"] and
+                "label_selector" in kill_scenario["config"]):
+            namespace_pattern = kill_scenario["config"]["namespace_pattern"]
+            label_selector = kill_scenario["config"]["label_selector"]
+            pool.select_and_monitor_by_namespace_pattern_and_label(
+                namespace_pattern=namespace_pattern,
+                label_selector=label_selector,
+                max_timeout=recovery_time)
+            logging.info(
+                f"waiting {recovery_time} seconds for pod recovery, "
+                f"pod label selector: {label_selector} namespace pattern: {namespace_pattern}")
+
+        elif ("namespace_pattern" in kill_scenario["config"] and
+                "name_pattern" in kill_scenario["config"]):
+            namespace_pattern = kill_scenario["config"]["namespace_pattern"]
+            name_pattern = kill_scenario["config"]["name_pattern"]
+            pool.select_and_monitor_by_name_pattern_and_namespace_pattern(
+                pod_name_pattern=name_pattern,
+                namespace_pattern=namespace_pattern,
+                max_timeout=recovery_time)
+            logging.info(
+                f"waiting {recovery_time} seconds for pod recovery, "
+                f"pod name pattern: {name_pattern} namespace pattern: {namespace_pattern}")
+        else:
+            raise Exception(f"impossible to determine monitor parameters, check {kill_scenario} configuration")
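For orientation, here is a self-contained, cluster-free sketch of the dispatch that the new start_monitoring() performs. The FakePodsMonitorPool stub and the sample scenario dict are hypothetical stand-ins, not part of this commit; the method names and config keys are exactly the ones used in the diff above.

# Hypothetical illustration of the start_monitoring() dispatch;
# FakePodsMonitorPool only mimics the two krkn-lib selection methods
# that the real PodsMonitorPool exposes in the diff above.
from typing import Any


class FakePodsMonitorPool:
    def select_and_monitor_by_namespace_pattern_and_label(
            self, namespace_pattern: str, label_selector: str, max_timeout: int):
        print(f"label monitor: {namespace_pattern} / {label_selector}, timeout {max_timeout}s")

    def select_and_monitor_by_name_pattern_and_namespace_pattern(
            self, pod_name_pattern: str, namespace_pattern: str, max_timeout: int):
        print(f"name monitor: {namespace_pattern} / {pod_name_pattern}, timeout {max_timeout}s")


def start_monitoring(pool, scenarios: list[Any]):
    # A kill-pods config either pairs namespace_pattern with label_selector,
    # or namespace_pattern with name_pattern; krkn_pod_recovery_time bounds
    # how long the pool waits for the affected pods to come back.
    for kill_scenario in scenarios:
        config = kill_scenario["config"]
        recovery_time = config["krkn_pod_recovery_time"]
        if "namespace_pattern" in config and "label_selector" in config:
            pool.select_and_monitor_by_namespace_pattern_and_label(
                namespace_pattern=config["namespace_pattern"],
                label_selector=config["label_selector"],
                max_timeout=recovery_time)
        elif "namespace_pattern" in config and "name_pattern" in config:
            pool.select_and_monitor_by_name_pattern_and_namespace_pattern(
                pod_name_pattern=config["name_pattern"],
                namespace_pattern=config["namespace_pattern"],
                max_timeout=recovery_time)
        else:
            raise Exception(f"cannot determine monitor parameters, check {kill_scenario}")


# Shaped like scenarios/openshift/etcd.yml after this commit.
sample = [{"id": "kill-pods",
           "config": {"namespace_pattern": "^openshift-etcd$",
                      "label_selector": "k8s-app=etcd",
                      "krkn_pod_recovery_time": 120}}]
start_monitoring(FakePodsMonitorPool(), sample)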

kraken/pod_scenarios/setup.py

+22 −12

@@ -1,9 +1,13 @@
 import logging
 import time
+from typing import Any
+
 import yaml
 import sys
 import random
 import arcaflow_plugin_kill_pod
+from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool
+
 import kraken.cerberus.setup as cerberus
 import kraken.post_actions.actions as post_actions
 from krkn_lib.k8s import KrknKubernetes
@@ -79,6 +83,7 @@ def container_run(kubeconfig_path,
 
     failed_scenarios = []
     scenario_telemetries: list[ScenarioTelemetry] = []
+    pool = PodsMonitorPool(kubecli)
 
     for container_scenario_config in scenarios_list:
         scenario_telemetry = ScenarioTelemetry()
@@ -91,23 +96,17 @@ def container_run(kubeconfig_path,
         pre_action_output = ""
         with open(container_scenario_config[0], "r") as f:
             cont_scenario_config = yaml.full_load(f)
+            start_monitoring(kill_scenarios=cont_scenario_config["scenarios"], pool=pool)
             for cont_scenario in cont_scenario_config["scenarios"]:
                 # capture start time
                 start_time = int(time.time())
                 try:
                     killed_containers = container_killing_in_pod(cont_scenario, kubecli)
-                    if len(container_scenario_config) > 1:
-                        failed_post_scenarios = post_actions.check_recovery(
-                            kubeconfig_path,
-                            container_scenario_config,
-                            failed_post_scenarios,
-                            pre_action_output
-                        )
-                    else:
-                        failed_post_scenarios = check_failed_containers(
-                            killed_containers, cont_scenario.get("retry_wait", 120), kubecli
-                        )
-
+                    logging.info(f"killed containers: {str(killed_containers)}")
+                    result = pool.join()
+                    if result.error:
+                        raise Exception(f"pods failed to recovery: {result.error}")
+                    scenario_telemetry.affected_pods = result
                     logging.info("Waiting for the specified duration: %s" % (wait_duration))
                     time.sleep(wait_duration)
@@ -117,6 +116,7 @@ def container_run(kubeconfig_path,
                 # publish cerberus status
                 cerberus.publish_kraken_status(config, failed_post_scenarios, start_time, end_time)
             except (RuntimeError, Exception):
+                pool.cancel()
                 failed_scenarios.append(container_scenario_config[0])
                 log_exception(container_scenario_config[0])
                 scenario_telemetry.exitStatus = 1
@@ -129,6 +129,16 @@ def container_run(kubeconfig_path,
 
     return failed_scenarios, scenario_telemetries
 
+def start_monitoring(kill_scenarios: list[Any], pool: PodsMonitorPool):
+    for kill_scenario in kill_scenarios:
+        namespace_pattern = f"^{kill_scenario['namespace']}$"
+        label_selector = kill_scenario["label_selector"]
+        recovery_time = kill_scenario["expected_recovery_time"]
+        pool.select_and_monitor_by_namespace_pattern_and_label(
+            namespace_pattern=namespace_pattern,
+            label_selector=label_selector,
+            max_timeout=recovery_time)
+
 
 def container_killing_in_pod(cont_scenario, kubecli: KrknKubernetes):
     scenario_name = get_yaml_item_value(cont_scenario, "name", "")
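A condensed sketch of the container-scenario lifecycle this diff implements: arm the monitor before killing containers, join() to wait for recovery, cancel() on failure. kill_and_wait is a hypothetical helper, not part of this commit; the PodsMonitorPool calls and the scenario keys (namespace, label_selector, expected_recovery_time) are taken from the diff above.

# Hypothetical helper condensing the container_run() flow above.
from krkn_lib.k8s import KrknKubernetes
from krkn_lib.k8s.pods_monitor_pool import PodsMonitorPool


def kill_and_wait(kubecli: KrknKubernetes, scenario: dict):
    # scenario is one entry of a container scenario file, e.g.
    # {"namespace": "openshift-etcd", "label_selector": "k8s-app=etcd",
    #  "container_name": "etcd", "expected_recovery_time": 120}
    pool = PodsMonitorPool(kubecli)
    # Arm monitoring first, so pods affected by the kill are observed.
    pool.select_and_monitor_by_namespace_pattern_and_label(
        namespace_pattern=f"^{scenario['namespace']}$",
        label_selector=scenario["label_selector"],
        max_timeout=scenario["expected_recovery_time"])
    try:
        # ... kill the target containers here (container_killing_in_pod) ...
        result = pool.join()  # blocks until recovery or max_timeout
        if result.error:
            raise Exception(f"pods failed to recover: {result.error}")
        return result  # recovery telemetry (affected pods)
    except Exception:
        pool.cancel()  # stop any monitors still running
        raise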

requirements.txt

+1 −1

@@ -15,7 +15,7 @@ google-api-python-client==2.116.0
 ibm_cloud_sdk_core==3.18.0
 ibm_vpc==0.20.0
 jinja2==3.1.3
-krkn-lib==2.1.1
+krkn-lib==2.1.2
 lxml==5.1.0
 kubernetes==26.1.0
 oauth2client==4.1.3

run_kraken.py

+2 −1

@@ -264,7 +264,8 @@ def main(cfg):
                 kraken_config,
                 failed_post_scenarios,
                 wait_duration,
-                telemetry_k8s
+                telemetry_k8s,
+                kubecli
             )
             chaos_telemetry.scenarios.extend(scenario_telemetries)
             # krkn_lib

scenarios/kind/scheduler.yml

+1 −5

@@ -3,8 +3,4 @@
   config:
     namespace_pattern: ^kube-system$
     label_selector: component=kube-scheduler
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^kube-system$
-    label_selector: component=kube-scheduler
-    count: 3
+    krkn_pod_recovery_time: 120

scenarios/kube/pod.yml

+1 −0

@@ -4,3 +4,4 @@
     name_pattern: ^nginx-.*$
     namespace_pattern: ^default$
     kill: 1
+    krkn_pod_recovery_time: 120

scenarios/kube/scheduler.yml

+1 −5

@@ -3,8 +3,4 @@
   config:
     namespace_pattern: ^kube-system$
     label_selector: k8s-app=kube-scheduler
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^kube-system$
-    label_selector: k8s-app=kube-scheduler
-    count: 3
+    krkn_pod_recovery_time: 120

scenarios/openshift/container_etcd.yml

+1 −1

@@ -5,4 +5,4 @@ scenarios:
   container_name: "etcd"
   action: 1
   count: 1
-  expected_recovery_time: 60
+  expected_recovery_time: 120

scenarios/openshift/customapp_pod.yaml

+1 −5

@@ -3,8 +3,4 @@
   config:
     namespace_pattern: ^acme-air$
     name_pattern: .*
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^acme-air$
-    name_pattern: .*
-    count: 8
+    krkn_pod_recovery_time: 120

scenarios/openshift/etcd.yml

+1 −5

@@ -3,8 +3,4 @@
   config:
     namespace_pattern: ^openshift-etcd$
     label_selector: k8s-app=etcd
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^openshift-etcd$
-    label_selector: k8s-app=etcd
-    count: 3
+    krkn_pod_recovery_time: 120

scenarios/openshift/openshift-apiserver.yml

+2 −5

@@ -3,8 +3,5 @@
   config:
     namespace_pattern: ^openshift-apiserver$
     label_selector: app=openshift-apiserver-a
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^openshift-apiserver$
-    label_selector: app=openshift-apiserver-a
-    count: 3
+    krkn_pod_recovery_time: 120
+
scenarios/openshift/openshift-kube-apiserver.yml

+2 −5

@@ -3,8 +3,5 @@
   config:
     namespace_pattern: ^openshift-kube-apiserver$
     label_selector: app=openshift-kube-apiserver
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^openshift-kube-apiserver$
-    label_selector: app=openshift-kube-apiserver
-    count: 3
+    krkn_pod_recovery_time: 120
+

scenarios/openshift/post_action_prometheus.yml

+1 −5

@@ -3,8 +3,4 @@
   config:
     namespace_pattern: ^openshift-monitoring$
     label_selector: app=prometheus
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^openshift-monitoring$
-    label_selector: app=prometheus
-    count: 2
+    krkn_pod_recovery_time: 120

scenarios/openshift/prom_kill.yml

+1 −5

@@ -2,8 +2,4 @@
   config:
     namespace_pattern: ^openshift-monitoring$
     label_selector: statefulset.kubernetes.io/pod-name=prometheus-k8s-0
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^openshift-monitoring$
-    label_selector: statefulset.kubernetes.io/pod-name=prometheus-k8s-0
-    count: 1
+    krkn_pod_recovery_time: 120

scenarios/openshift/prometheus.yml

+1 −6

@@ -3,9 +3,4 @@
   config:
     namespace_pattern: ^openshift-monitoring$
     label_selector: app=prometheus
-- id: wait-for-pods
-  config:
-    namespace_pattern: ^openshift-monitoring$
-    label_selector: app=prometheus
-    count: 2
-    timeout: 180
+    krkn_pod_recovery_time: 120

scenarios/openshift/regex_openshift_pod_kill.yml

+1 −0

@@ -4,3 +4,4 @@
     namespace_pattern: ^openshift-.*$
     name_pattern: .*
     kill: 3
+    krkn_pod_recovery_time: 120
