[master] thermalctld leak on Arista devices makes them unreachable when memory is exhausted #7515

Closed
@Staphylo

Description

On Arista devices, thermalctld leaks memory (it is unclear whether other vendors are affected).
Each loop iteration of thermalctld consumes a few more MiB (~3 MiB every 60 s).
After running for a few hours, pmon consumes up to 75% of memory, at which point the device becomes unresponsive:

  • Existing ssh sessions will freeze
  • Console becomes unresponsive
  • Pings still go through
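
To put the reported rate in perspective, here is a minimal arithmetic sketch. It assumes the leak is linear and stays at roughly 3 MiB every 60 s, which the report only approximates. The function names and the 1024 MiB budget in the usage lines are hypothetical and chosen for illustration, not measured on a device.

```python
def leaked_mib(elapsed_s, rate_mib=3.0, interval_s=60.0):
    """MiB leaked after elapsed_s seconds, assuming a constant linear leak.

    The defaults mirror the ~3 MiB per 60 s figure from the report.
    """
    return elapsed_s * rate_mib / interval_s


def seconds_until(budget_mib, rate_mib=3.0, interval_s=60.0):
    """Seconds needed to leak budget_mib at the same assumed constant rate."""
    return budget_mib * interval_s / rate_mib


# One hour at the reported rate.
print(leaked_mib(3600))     # 180.0

# Hypothetical budget of 1024 MiB of spare memory (not a measured value).
print(seconds_until(1024))  # 20480.0
```

At this rate the leak grows by about 180 MiB per hour, so how quickly the device degrades depends mostly on how much memory is free when thermalctld starts.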

Steps to reproduce the issue:

  1. Install latest master image
  2. Wait a few hours while monitoring memory usage (filling a tmpfs to reduce available memory speeds this up)
    Run docker stats to see the memory usage of pmon growing
    Run watch -d -n 1 'docker exec -ti pmon sh -c "ps aux | grep -v aux"' to see memory growing at the process level inside pmon
  3. Observe ssh and console sessions hanging while ping still works
  4. After some more time, the kernel will panic
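
As a complement to the docker stats and watch commands in step 2, the growth rate can also be estimated by sampling a process's resident set size from /proc. This is a sketch only, not part of SONiC or thermalctld: the helper names are hypothetical, and it assumes a Linux /proc filesystem and that it runs somewhere the thermalctld PID is visible (for example inside the pmon container, or on the host with the host-side PID).

```python
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     20480 kB".
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def growth_rate_kib_per_s(samples):
    """Average growth in KiB/s between the first and last (time_s, rss_kib) sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("sample timestamps must increase")
    return (rss1 - rss0) / (t1 - t0)


def sample_rss(pid, count=5, interval_s=60.0):
    """Collect `count` (time_s, rss_kib) samples for `pid`, spaced by interval_s."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            rss = parse_vmrss_kib(f.read())
        if rss is not None:
            samples.append((time.monotonic(), rss))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


# Example usage, with the PID supplied by the operator:
#   samples = sample_rss(pid=1234, count=5, interval_s=60.0)
#   print(growth_rate_kib_per_s(samples) * 60 / 1024, "MiB per minute")
```

A sustained positive rate across several samples is consistent with a leak, whereas a single jump may only reflect normal allocation.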

Output seen on the console after the kernel panic:

[17758.188885] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled             
[17758.188885]                                                                                            
[17758.305661] CPU: 2 PID: 4197 Comm: supervisord Tainted: G           OE     4.19.0-12-2-amd64 #1 Debian 4.19.152-1                                                                                                
[17758.428678] Hardware name: Intel Camelback Mountain CRB/Camelback Mountain CRB, BIOS Aboot-norcal7-rook-2x4--6128821 09/14/2017                                                                                  
[17758.566281] Call Trace:                                                                                
[17758.595557]  dump_stack+0x66/0x90                                                                      
[17758.635242]  panic+0xe7/0x24a                                                                         
[17758.670765]  out_of_memory.cold.33+0x5e/0x82                                                           
[17758.721909]  __alloc_pages_slowpath+0xbd8/0xcb0                                                        
[17758.776180]  __alloc_pages_nodemask+0x28b/0x2b0                                                       
[17758.830450]  filemap_fault+0x333/0x780                                                                 
[17758.875346]  ? alloc_set_pte+0x49e/0x560         
[17758.922325]  ? filemap_map_pages+0x139/0x3a0     
[17758.973494]  ext4_filemap_fault+0x2c/0x40 [ext4] 
[17759.028809]  __do_fault+0x36/0x130               
[17759.069538]  __handle_mm_fault+0xdf9/0x11f0                                                           
[17759.119642]  handle_mm_fault+0xd6/0x200           
[17759.165579]  __do_page_fault+0x249/0x4f0                                                              
[17759.212560]  ? page_fault+0x8/0x30                                                                     
[17759.253287]  page_fault+0x1e/0x30                                                                     
[17759.292974] RIP: 0033:0x54e6ba

Additional information you deem important (e.g. issue happens only occasionally):

This issue happens consistently on master.
It is currently being investigated; this issue was opened for awareness.
