Description
On Arista devices, thermalctld leaks memory (it is unclear whether other vendors are affected).
Each loop iteration of thermalctld consumes a few more MB of memory (~3 MiB every 60 s).
After running for a few hours, pmon consumes up to 75% of memory, at which point the device becomes unresponsive:
- Existing ssh sessions freeze
- The console becomes unresponsive
- Pings still go through
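To quantify the leak rate described above, here is a minimal sketch that samples a process's RSS once per thermalctld loop interval. It assumes it is run inside the pmon container with the thermalctld PID passed as an argument (found via e.g. `pgrep -f thermalctld`); both of these are assumptions for illustration, not part of the original report.

```python
#!/usr/bin/env python3
# Sketch: sample a process's RSS from /proc and print the per-minute growth.
# Assumption: run inside the pmon container with the thermalctld PID as argv[1].
import sys
import time

def rss_kib(pid):
    """Return VmRSS in KiB from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError(f"VmRSS not found for pid {pid}")

pid = int(sys.argv[1])
prev = rss_kib(pid)
while True:
    time.sleep(60)  # thermalctld loops roughly once per 60 s
    cur = rss_kib(pid)
    print(f"RSS: {cur} KiB ({cur - prev:+d} KiB over the last minute)")
    prev = cur
```

With the leak present, this should print a steady increase on the order of +3000 KiB per sample.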
Steps to reproduce the issue:
- Install latest master image
- Wait for a few hours while monitoring (you can make this faster by filling a tmpfs to reduce available memory)
- Run `docker stats` to see the memory usage of pmon growing
- Run `watch -d -n 1 'docker exec -ti pmon sh -c "ps aux | grep -v aux"'` to see memory growing at the pmon process level (a scripted version of these two checks is sketched after this list)
- Witness ssh/console hanging while ping still works
- After some more time the kernel will panic
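The two monitoring steps above can also be scripted from the host. The sketch below polls `ps aux` inside the pmon container and prints the largest processes by RSS; the container name `pmon` and the standard `ps aux` column layout are taken from the steps above, everything else is illustrative.

```python
#!/usr/bin/env python3
# Sketch: poll per-process RSS inside the pmon container from the host,
# replacing the manual `watch -d -n 1 'docker exec ...'` loop above.
import subprocess
import time

def pmon_rss_by_command():
    """Return a {command: rss_kib} map for processes in the pmon container."""
    out = subprocess.check_output(["docker", "exec", "pmon", "ps", "aux"],
                                  text=True)
    procs = {}
    for line in out.splitlines()[1:]:   # skip the ps header line
        cols = line.split(None, 10)     # USER PID %CPU %MEM VSZ RSS ... COMMAND
        if len(cols) == 11:
            procs[cols[10]] = int(cols[5])  # RSS is reported in KiB
    return procs

while True:
    top = sorted(pmon_rss_by_command().items(), key=lambda kv: -kv[1])[:5]
    for cmd, rss in top:
        print(f"{rss:>9} KiB  {cmd[:60]}")
    print("-" * 72)
    time.sleep(60)
```

If thermalctld is the leaking process, its RSS should climb toward the top of this list over successive samples.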
Output seen on the console after the kernel panic:

```
[17758.188885] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[17758.188885]
[17758.305661] CPU: 2 PID: 4197 Comm: supervisord Tainted: G OE 4.19.0-12-2-amd64 #1 Debian 4.19.152-1
[17758.428678] Hardware name: Intel Camelback Mountain CRB/Camelback Mountain CRB, BIOS Aboot-norcal7-rook-2x4--6128821 09/14/2017
[17758.566281] Call Trace:
[17758.595557] dump_stack+0x66/0x90
[17758.635242] panic+0xe7/0x24a
[17758.670765] out_of_memory.cold.33+0x5e/0x82
[17758.721909] __alloc_pages_slowpath+0xbd8/0xcb0
[17758.776180] __alloc_pages_nodemask+0x28b/0x2b0
[17758.830450] filemap_fault+0x333/0x780
[17758.875346] ? alloc_set_pte+0x49e/0x560
[17758.922325] ? filemap_map_pages+0x139/0x3a0
[17758.973494] ext4_filemap_fault+0x2c/0x40 [ext4]
[17759.028809] __do_fault+0x36/0x130
[17759.069538] __handle_mm_fault+0xdf9/0x11f0
[17759.119642] handle_mm_fault+0xd6/0x200
[17759.165579] __do_page_fault+0x249/0x4f0
[17759.212560] ? page_fault+0x8/0x30
[17759.253287] page_fault+0x1e/0x30
[17759.292974] RIP: 0033:0x54e6ba
```
Additional information you deem important (e.g. issue happens only occasionally):
This issue happens consistently on master.
It is currently being investigated; this issue was opened for awareness.