[WIP] [Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] mm/vmscan: don't try to reclaim hwpoison folio #896
base: linux-6.6.y
mainline inclusion
from mainline-v6.15-rc1
category: bugfix

commit 1b04495 upstream.

Syzkaller reports a bug as follows:

Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000
Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users
Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed
page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e
memcg:ffff0000dd6d9000
anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff)
raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9
raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000
page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio))
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:184!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3
Hardware name: linux,dummy-virt (DT)
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : add_to_swap+0xbc/0x158
lr : add_to_swap+0xbc/0x158
sp : ffff800087f37340
x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780
x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0
x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4
x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000
x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c
x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b
x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000
x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001
x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000
Call trace:
 add_to_swap+0xbc/0x158
 shrink_folio_list+0x12ac/0x2648
 shrink_inactive_list+0x318/0x948
 shrink_lruvec+0x450/0x720
 shrink_node_memcgs+0x280/0x4a8
 shrink_node+0x128/0x978
 balance_pgdat+0x4f0/0xb20
 kswapd+0x228/0x438
 kthread+0x214/0x230
 ret_from_fork+0x10/0x20

I can reproduce this issue with the following steps:

1) When a dirty swapcache page is isolated by the reclaim process and the
page isn't locked, inject memory failure for the page.
me_swapcache_dirty() clears the uptodate flag and tries to delete the page
from the lru, but fails. The reclaim process will put the hwpoisoned page
back to the lru.

2) The process that maps the hwpoisoned page exits, the page is deleted,
yet it will never be freed and will stay in the lru forever.

3) If reclaim is triggered again and tries to reclaim the page,
add_to_swap() will trigger VM_BUG_ON_FOLIO because the uptodate flag is
cleared.

To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the
hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, so
unmap it in shrink_folio_list(); otherwise the folio will fail to be
unmapped by hwpoison_user_mappings() since the folio isn't in the lru list.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 1b04495)
[Guan Wentao: add helper from commit ("mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper")]
Signed-off-by: Wentao Guan <[email protected]>
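The three reproduction steps can be sketched as a small userspace model. This is not kernel code: `struct bug_folio`, `model_memory_failure()`, `model_putback()`, `model_add_to_swap()` and `model_reclaim()` are hypothetical stand-ins introduced only for illustration, and the VM_BUG_ON_FOLIO is represented by a return value rather than a crash.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few folio states the report depends on. */
struct bug_folio {
	bool uptodate;
	bool hwpoison;
	bool on_lru;
};

enum reclaim_result { RECLAIM_SWAPPED, RECLAIM_SKIPPED, RECLAIM_WOULD_BUG };

/* Step 1: memory failure hits a folio that reclaim has already isolated. */
static void model_memory_failure(struct bug_folio *f)
{
	f->hwpoison = true;
	f->uptodate = false;	/* what me_swapcache_dirty() is described as doing */
	/* Deleting from the lru fails here because the folio is isolated. */
}

/* Still step 1: reclaim puts the (now poisoned) folio back on the lru. */
static void model_putback(struct bug_folio *f)
{
	f->on_lru = true;
}

/* Stand-in for add_to_swap(), which requires an uptodate folio. */
static enum reclaim_result model_add_to_swap(const struct bug_folio *f)
{
	return f->uptodate ? RECLAIM_SWAPPED : RECLAIM_WOULD_BUG;
}

/* Step 3: a later reclaim pass, with or without the hwpoison check. */
static enum reclaim_result model_reclaim(const struct bug_folio *f,
					 bool skip_hwpoison)
{
	assert(f->on_lru);	/* reclaim only scans folios taken from the lru */

	if (skip_hwpoison && f->hwpoison)
		return RECLAIM_SKIPPED;
	return model_add_to_swap(f);
}
```

With `skip_hwpoison` set to false the poisoned folio reaches the swap path and the model reports `RECLAIM_WOULD_BUG`; with it set to true the folio is skipped before that point, which is the behaviour the patch describes.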
Reviewer's Guide
This PR introduces a folio_contain_hwpoisoned_page helper and integrates it into the vmscan reclaim path so that hwpoisoned folios are unconditionally skipped and unmapped during shrink operations, preventing BUGs when reclaiming poisoned pages.
Sequence diagram for handling hwpoisoned folios during shrink_folio_list
sequenceDiagram
participant shrink_folio_list
participant folio
participant folio_contain_hwpoisoned_page
participant unmap_poisoned_folio
shrink_folio_list->>folio: folio_trylock(folio)
alt folio is locked
shrink_folio_list->>folio_contain_hwpoisoned_page: check if folio is hwpoisoned
alt folio is hwpoisoned
shrink_folio_list->>unmap_poisoned_folio: unmap_poisoned_folio(folio, folio_pfn(folio), false)
shrink_folio_list->>folio: folio_unlock(folio)
shrink_folio_list->>folio: folio_put(folio)
Note right of shrink_folio_list: Continue to next folio
else folio is not hwpoisoned
shrink_folio_list->>folio: continue normal reclaim
end
else folio is not locked
shrink_folio_list->>shrink_folio_list: keep folio
end
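The ordering in the sequence diagram (trylock, poison check, unmap, unlock, put, move on) can be sketched in plain C. Everything here is a hypothetical userspace model: `struct seq_folio`, `model_trylock()` and `model_handle_folio()` are not kernel symbols, and the flag and refcount updates only imitate what the diagram attributes to unmap_poisoned_folio(), folio_unlock() and folio_put().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the diagram touches; not kernel types. */
struct seq_folio {
	bool locked;
	bool hwpoison;
	bool mapped;
	int refcount;
};

enum folio_outcome {
	FOLIO_KEPT,			/* trylock failed */
	FOLIO_DROPPED_POISONED,		/* unmapped, unlocked, reference dropped */
	FOLIO_RECLAIM_CONTINUES,	/* normal reclaim path */
};

/* Imitates folio_trylock(): succeeds only if the lock is free. */
static bool model_trylock(struct seq_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static enum folio_outcome model_handle_folio(struct seq_folio *f)
{
	assert(f->refcount > 0);	/* reclaim holds a reference after isolation */

	if (!model_trylock(f))
		return FOLIO_KEPT;

	if (f->hwpoison) {
		f->mapped = false;	/* unmap_poisoned_folio() in the diagram */
		f->locked = false;	/* folio_unlock() */
		f->refcount--;		/* folio_put() */
		return FOLIO_DROPPED_POISONED;
	}

	/* Normal reclaim proceeds with the folio still locked. */
	return FOLIO_RECLAIM_CONTINUES;
}
```

The point of the ordering is that the poisoned branch leaves the folio unlocked and with reclaim's reference dropped before the loop moves to the next folio, while a folio that could not be locked is left untouched.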
Class diagram for folio_contain_hwpoisoned_page helper and vmscan changes
classDiagram
class folio {
+bool folio_test_hwpoison()
+bool folio_test_large()
+bool folio_test_has_hwpoisoned()
+void folio_unlock()
+void folio_put()
}
class shrink_folio_list {
+unsigned int shrink_folio_list(struct list_head *folio_list, ...)
}
class folio_contain_hwpoisoned_page {
+bool folio_contain_hwpoisoned_page(struct folio *folio)
}
class unmap_poisoned_folio {
+void unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool)
}
folio_contain_hwpoisoned_page ..> folio : uses
shrink_folio_list ..> folio_contain_hwpoisoned_page : calls
shrink_folio_list ..> unmap_poisoned_folio : calls
shrink_folio_list ..> folio : uses
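As a rough illustration of how the helper could combine the three folio predicates listed in the class diagram, here is a userspace sketch. The flag bits, `struct flag_folio` and `model_folio_contain_hwpoisoned_page()` are hypothetical; the rule shown (the folio itself is poisoned, or it is a large folio with a poisoned subpage) is an assumption about the helper's logic, not a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real page flags are laid out differently. */
enum {
	MODEL_HWPOISON       = 1 << 0,	/* stands in for folio_test_hwpoison() */
	MODEL_LARGE          = 1 << 1,	/* stands in for folio_test_large() */
	MODEL_HAS_HWPOISONED = 1 << 2,	/* stands in for folio_test_has_hwpoisoned() */
};

struct flag_folio {
	unsigned int flags;
};

/* Assumed rule: poisoned itself, or large with at least one poisoned subpage. */
static bool model_folio_contain_hwpoisoned_page(const struct flag_folio *folio)
{
	assert(folio != NULL);

	return (folio->flags & MODEL_HWPOISON) ||
	       ((folio->flags & MODEL_LARGE) &&
		(folio->flags & MODEL_HAS_HWPOISONED));
}
```

Under this assumption the "has hwpoisoned" bit is only consulted for large folios, so a single check covers both the small-folio and large-folio cases at the call site in shrink_folio_list.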
[APPROVALNOTIFIER] This PR is NOT APPROVED
deepin pr auto review
Key summary:
Whether immediate changes are recommended:
Pull Request Overview
This PR fixes a kernel BUG by skipping and unmapping hwpoisoned folios during memory reclamation in the vmscan subsystem.
- Skip hwpoisoned folios in shrink_folio_list to prevent VM_BUG_ON errors.
- Introduce the folio_contain_hwpoisoned_page helper and corresponding unmap call.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
mm/vmscan.c | Adds a check to unmap and release hwpoisoned folios during shrink_folio_list. |
include/linux/page-flags.h | Introduces the folio_contain_hwpoisoned_page helper to detect hwpoison conditions. |
Comments suppressed due to low confidence (2)
mm/vmscan.c:1744
- Consider adding an inline comment to explain why unmapping and releasing the folio is necessary for hwpoisoned pages in this reclaim path.
if (folio_contain_hwpoisoned_page(folio)) {
include/linux/page-flags.h:1042
- [nitpick] Consider renaming the function to 'folio_contains_hwpoisoned_page' for improved grammatical clarity.
static inline bool folio_contain_hwpoisoned_page(struct folio *folio)
Hey @opsiff - I've reviewed your changes - here's some feedback:
- Rename folio_contain_hwpoisoned_page to folio_contains_hwpoisoned_page to match the common ‘contains’ naming convention.
- Add a brief comment above the hwpoisoned folio branch in shrink_folio_list to explain why we unmap-and-skip these folios for future maintainers.
- Consider moving the folio_contain_hwpoisoned_page helper into the mm/hwpoison subsystem (or a more relevant header) alongside other hwpoison utilities.
Summary by Sourcery
Skip and unmap hwpoisoned folios during reclaim to avoid VM_BUG_ON crashes on poisoned pages, and introduce a helper to detect hwpoisoned folios.
Bug Fixes:
Enhancements: