This repository was archived by the owner on Apr 20, 2024. It is now read-only.

Commit fb9beee

Merge pull request #25 from aojea/monitoring
Metrics and docs
2 parents 2b60cc5 + 1061e09 · commit fb9beee

12 files changed: +452 -46 lines

.github/workflows/e2e.yml (+1 -1)

@@ -127,7 +127,7 @@ jobs:
         /usr/local/bin/kubectl get nodes -o wide
         /usr/local/bin/kubectl get pods -A
         /usr/local/bin/kubectl wait --timeout=1m --for=condition=ready pods --namespace=kube-system -l k8s-app=kube-dns
-        /usr/local/bin/kubectl wait --timeout=1m --for=condition=ready pods --namespace=kube-system -l app=kube-netpol
+        /usr/local/bin/kubectl wait --timeout=1m --for=condition=ready pods --namespace=kube-system -l app=kube-network-policies

     - name: Run tests
       run: |

README.md (+22)

@@ -9,6 +9,28 @@ This project takes a different approach. It uses the NFQUEUE functionality imple
 There are some performance improvements that can be applied, such as to restrict in the dataplane the packets that are sent to userspace to the ones that have network policies only, so only
 the Pods affected by network policies will hit the first byte performance.

+## Metrics
+
+Prometheus metrics are exposed on the address defined by the flag:
+
+```
+-metrics-bind-address string
+      The IP address and port for the metrics server to serve on (default ":9080")
+```
+
+Currently implemented metrics are:
+
+* packet_process_time: Time it has taken to process each packet (microseconds)
+* packet_process_duration_microseconds: A summary of the packet processing durations in microseconds
+* packet_count: Number of packets
+* nfqueue_queue_total: The number of packets currently queued and waiting to be processed by the application
+* nfqueue_queue_dropped: Number of packets that had to be dropped by the kernel because too many packets are already waiting for user space to send back the mandatory accept/drop verdicts
+* nfqueue_user_dropped: Number of packets that were dropped within the netlink subsystem. Such drops usually happen when the corresponding socket buffer is full; that is, user space is not able to read messages fast enough
+* nfqueue_packet_id: ID of the most recent packet queued
+
+## Testing
+
+See [docs/testing/README.md](docs/testing/README.md)

 ## References

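
The metrics listed above are plain Prometheus collectors served on the /metrics endpoint. As an illustration only, here is a minimal sketch of how a summary such as packet_process_duration_microseconds could be registered and observed with prometheus/client_golang; the actual definitions live elsewhere in the repository and may differ in detail.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// packetProcessDuration mirrors the packet_process_duration_microseconds
// summary named in the README; the repository's real definition may differ.
var packetProcessDuration = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "packet_process_duration_microseconds",
	Help: "A summary of the packet processing durations in microseconds",
})

func main() {
	prometheus.MustRegister(packetProcessDuration)

	// Simulate timing the processing of a single packet and recording it.
	start := time.Now()
	// ... process the packet ...
	packetProcessDuration.Observe(float64(time.Since(start).Microseconds()))

	// Expose the metrics on the default bind address used by the controller.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9080", nil)
}
```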

cmd/main.go (+12 -5)

@@ -4,6 +4,7 @@ import (
     "context"
     "flag"
     "fmt"
+    "net"
     "net/http"
     "os"
     "os/signal"
@@ -20,13 +21,15 @@ import (
 )

 var (
-    failOpen bool
-    queueID  int
+    failOpen           bool
+    queueID            int
+    metricsBindAddress string
 )

 func init() {
-    flag.BoolVar(&failOpen, "fail-open", false, "If set, don't drop packets if the controller is not running (default false)")
-    flag.IntVar(&queueID, "nfqueue-id", 100, "Number of the nfqueue used (default 100)")
+    flag.BoolVar(&failOpen, "fail-open", false, "If set, don't drop packets if the controller is not running")
+    flag.IntVar(&queueID, "nfqueue-id", 100, "Number of the nfqueue used")
+    flag.StringVar(&metricsBindAddress, "metrics-bind-address", ":9080", "The IP address and port for the metrics server to serve on")

     flag.Usage = func() {
         fmt.Fprint(os.Stderr, "Usage: kube-netpol [options]\n\n")
@@ -39,6 +42,10 @@ func main() {
     klog.InitFlags(nil)
     flag.Parse()
     //
+    if _, _, err := net.SplitHostPort(metricsBindAddress); err != nil {
+        klog.Fatalf("error parsing metrics bind address %s : %v", metricsBindAddress, err)
+    }
+
     cfg := networkpolicy.Config{
         FailOpen: failOpen,
         QueueID:  queueID,
@@ -69,7 +76,7 @@ func main() {
     informersFactory := informers.NewSharedInformerFactory(clientset, 0)

     http.Handle("/metrics", promhttp.Handler())
-    go http.ListenAndServe(":9080", nil)
+    go http.ListenAndServe(metricsBindAddress, nil)

     networkPolicyController := networkpolicy.NewController(
         clientset,
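
Condensed for illustration, the pattern this change introduces — validate the bind address up front, then serve the Prometheus handler on it in a goroutine — looks roughly like the following standalone sketch (it uses the standard log package instead of klog and omits the controller wiring).

```go
package main

import (
	"flag"
	"log"
	"net"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var metricsBindAddress string

func init() {
	flag.StringVar(&metricsBindAddress, "metrics-bind-address", ":9080",
		"The IP address and port for the metrics server to serve on")
}

func main() {
	flag.Parse()

	// Reject malformed addresses early, before the controller starts.
	if _, _, err := net.SplitHostPort(metricsBindAddress); err != nil {
		log.Fatalf("error parsing metrics bind address %s : %v", metricsBindAddress, err)
	}

	// Serve the Prometheus registry on the configured address in the background.
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		if err := http.ListenAndServe(metricsBindAddress, nil); err != nil {
			log.Fatalf("metrics server failed: %v", err)
		}
	}()

	select {} // block forever, standing in for the controller's Run loop
}
```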

docs/testing/README.md (+128)

@@ -0,0 +1,128 @@
+# Testing
+
+This is an example of how to do some microbenchmarking.
+
+1. Collect the existing metrics from the agents
+
+Example [deployment with Prometheus](./monitoring.yaml)
+
+2. Deploy some Pods running an HTTP server behind a Service
+
+Since network policies act on the first packet of a connection, we need to generate new connections:
+* We cannot use HTTP keep-alives, HTTP/2, or other protocols that multiplex requests over the same connection
+* A pair of endpoints will be limited by the number of ephemeral ports at the origin, since the destination IP and port are fixed
+
+```
+cat /proc/sys/net/ipv4/ip_local_port_range
+32768   60999
+```
+
+3. Run a [Job that polls the Service created previously](job_poller.yaml)
+
+Each Pod runs requests in parallel
+
+```
+kubectl logs abtest-t7wjd
+This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
+Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
+Licensed to The Apache Software Foundation, http://www.apache.org/
+
+Benchmarking test-service (be patient)
+Completed 1000 requests
+Completed 2000 requests
+Completed 3000 requests
+Completed 4000 requests
+Completed 5000 requests
+Completed 6000 requests
+Completed 7000 requests
+Completed 8000 requests
+Completed 9000 requests
+Completed 10000 requests
+Finished 10000 requests
+
+
+Server Software:
+Server Hostname:        test-service
+Server Port:            80
+
+Document Path:          /
+Document Length:        60 bytes
+
+Concurrency Level:      1000
+Time taken for tests:   4.317 seconds
+Complete requests:      10000
+Failed requests:        1274
+   (Connect: 0, Receive: 0, Length: 1274, Exceptions: 0)
+Total transferred:      1768597 bytes
+HTML transferred:       598597 bytes
+Requests per second:    2316.61 [#/sec] (mean)
+Time per request:       431.666 [ms] (mean)
+Time per request:       0.432 [ms] (mean, across all concurrent requests)
+Transfer rate:          400.11 [Kbytes/sec] received
+
+Connection Times (ms)
+              min  mean[+/-sd] median   max
+Connect:        0  188 571.9      4    4121
+Processing:     0    2   5.3      0      42
+Waiting:        0    1   2.8      0      32
+Total:          0  190 571.8      5    4122
+
+Percentage of the requests served within a certain time (ms)
+  50%      5
+  66%      7
+  75%     22
+  80%     24
+  90%   1023
+  95%   1046
+  98%   2063
+  99%   3080
+ 100%   4122 (longest request)
+```
+
+You may have to tune your system, as it is likely you will hit limits on some resources, especially the conntrack table
+
+```
+[1825525.815672] net_ratelimit: 411 callbacks suppressed
+[1825525.815676] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.827617] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.834317] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.841058] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.847764] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.854458] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.861131] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.867814] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.874505] nf_conntrack: nf_conntrack: table full, dropping packet
+[1825525.881186] nf_conntrack: nf_conntrack: table full, dropping packet
+```
+
+Check the current maximum number of conntrack entries allowed and tune accordingly
+
+```
+cat /proc/sys/net/netfilter/nf_conntrack_max
+262144
+```
+
+
+4. Observe the metrics in Prometheus or Grafana
+
+
+![Packet Processing Latency](network_policies_latency.png "Packet Processing Latency")
+![Packet Rate](network_policies_packet_rate.png "Packet Rate")
+
+
+## Future work
+
+We are interested in understanding the following variables:
+
+* Memory and CPU consumption
+* Latency of packet processing
+* Latency to apply a network policy after it has been created
+
+This can be microbenchmarked easily using one Node or a KIND cluster, adding fake nodes and pods (https://developer.ibm.com/tutorials/awb-using-kwok-to-simulate-a-large-kubernetes-openshift-cluster/) and running scenarios on just one node with the different variables
+
+
+Inputs:
+
+* New connections per second
+* Number of Pods on the cluster (affected or not affected by network policies)
+* Number of Network Policies impacting the connections
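
Step 2 of the testing document above stresses that every request must open a fresh connection, because only the first packet of a connection is evaluated against network policies. As a hypothetical alternative to ab, a small Go load generator can enforce this by disabling HTTP keep-alives; the test-service URL below matches the example manifests, and the request count is arbitrary.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// DisableKeepAlives forces a new TCP connection per request, so each
	// request exercises the network-policy evaluation on its first packet.
	client := &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   5 * time.Second,
	}

	for i := 0; i < 100; i++ {
		resp, err := client.Get("http://test-service:80/")
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		resp.Body.Close()
	}
	fmt.Println("done")
}
```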

docs/testing/backend.yaml (+35)

@@ -0,0 +1,35 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: server-deployment
+  labels:
+    app: MyApp
+spec:
+  replicas: 10
+  selector:
+    matchLabels:
+      app: MyApp
+  template:
+    metadata:
+      labels:
+        app: MyApp
+    spec:
+      containers:
+      - name: agnhost
+        image: k8s.gcr.io/e2e-test-images/agnhost:2.39
+        args:
+        - netexec
+        - --http-port=80
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: test-service
+spec:
+  type: ClusterIP
+  selector:
+    app: MyApp
+  ports:
+  - protocol: TCP
+    port: 80
+    targetPort: 80

docs/testing/job_poller.yaml (+14)

@@ -0,0 +1,14 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: abtest
+spec:
+  completions: 50
+  parallelism: 10
+  template:
+    spec:
+      containers:
+      - name: ab
+        image: httpd:2
+        command: ["ab", "-n", "10000", "-c", "1000", "-v", "1", "http://test-service:80/"]
+      restartPolicy: Never
