
[WIP] [Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] mm/vmscan: don't try to reclaim hwpoison folio #896

Draft
wants to merge 1 commit into
base: linux-6.6.y
Conversation

opsiff
Member

@opsiff opsiff commented Jun 23, 2025

Summary by Sourcery

Skip and unmap hwpoisoned folios during reclaim to avoid VM_BUG_ON crashes on poisoned pages and introduce a helper to detect hwpoisoned folios

Bug Fixes:

  • Prevent kernel BUG in mm/vmscan by skipping and unmapping hwpoisoned folios in shrink_folio_list

Enhancements:

  • Add folio_contain_hwpoisoned_page() helper in page-flags.h to detect hwpoisoned folios

mainline inclusion
from mainline-v6.15-rc1
category: bugfix

commit 1b04495 upstream.

Syzkaller reports a bug as follows:

Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000
Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users
Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed
page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e
memcg:ffff0000dd6d9000
anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff)
raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9
raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000
page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio))
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:184!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
Modules linked in:
CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3
Hardware name: linux,dummy-virt (DT)
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : add_to_swap+0xbc/0x158
lr : add_to_swap+0xbc/0x158
sp : ffff800087f37340
x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780
x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0
x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4
x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000
x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c
x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b
x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000
x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001
x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000
Call trace:
 add_to_swap+0xbc/0x158
 shrink_folio_list+0x12ac/0x2648
 shrink_inactive_list+0x318/0x948
 shrink_lruvec+0x450/0x720
 shrink_node_memcgs+0x280/0x4a8
 shrink_node+0x128/0x978
 balance_pgdat+0x4f0/0xb20
 kswapd+0x228/0x438
 kthread+0x214/0x230
 ret_from_fork+0x10/0x20

I can reproduce this issue with the following steps:

1) When a dirty swapcache page is isolated by reclaim process and the
   page isn't locked, inject memory failure for the page.
   me_swapcache_dirty() clears uptodate flag and tries to delete from lru,
   but fails.  Reclaim process will put the hwpoisoned page back to lru.

2) The process that maps the hwpoisoned page exits and the page is
   deleted, but the page will never be freed and will stay in the lru
   forever.

3) If we trigger reclaim again and try to reclaim the page,
   add_to_swap() will trigger VM_BUG_ON_FOLIO because the uptodate flag is
   cleared.

To fix it, skip the hwpoisoned page in shrink_folio_list().  Besides, the
hwpoison folio may not have been unmapped by hwpoison_user_mappings() yet,
so unmap it in shrink_folio_list(); otherwise the folio will fail to be
unmapped by hwpoison_user_mappings() since the folio isn't in the lru list.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjiang Tu <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <[email protected]>
(cherry picked from commit 1b04495)
[Guan Wentao: add helper from commit ("mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper")]
Signed-off-by: Wentao Guan <[email protected]>
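
The three reproduction steps in the commit message can be modelled in userspace. The sketch below is not kernel code: `struct swap_model_folio`, `swap_model_memory_failure()` and `swap_model_reclaim_would_bug()` are hypothetical names, and the flags are plain booleans rather than real page flags. It only illustrates why a folio left on the LRU with the uptodate flag cleared would trip the `VM_BUG_ON_FOLIO(!folio_test_uptodate(folio))` check once reclaim reaches add_to_swap(), and why skipping hwpoisoned folios avoids that path.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the folio state named in the report. */
struct swap_model_folio {
	bool uptodate;  /* cleared by me_swapcache_dirty() in step 1 */
	bool hwpoison;  /* set when the memory failure is injected */
	bool on_lru;    /* step 1: LRU removal fails, reclaim puts it back */
};

/* Step 1: memory failure hits a folio that reclaim has isolated. */
void swap_model_memory_failure(struct swap_model_folio *f)
{
	f->hwpoison = true;
	f->uptodate = false;
	f->on_lru = true;
}

/*
 * Step 3: returns true if a later reclaim pass would reach the modelled
 * add_to_swap() with the uptodate flag cleared, which is the BUG condition.
 * skip_hwpoison models the behaviour with the fix applied.
 */
bool swap_model_reclaim_would_bug(const struct swap_model_folio *f,
				  bool skip_hwpoison)
{
	if (!f->on_lru)
		return false;   /* reclaim never sees the folio */
	if (skip_hwpoison && f->hwpoison)
		return false;   /* fix: poisoned folio is skipped */
	return !f->uptodate;    /* modelled add_to_swap() precondition */
}
```

Calling `swap_model_memory_failure()` on a healthy folio and then `swap_model_reclaim_would_bug(&f, false)` reports the BUG condition, while passing `true` for `skip_hwpoison` does not.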

sourcery-ai bot commented Jun 23, 2025

Reviewer's Guide

This PR introduces a folio_contain_hwpoisoned_page helper and integrates it into the vmscan reclaim path so that hwpoisoned folios are unconditionally skipped and unmapped during shrink operations, preventing BUGs when reclaiming poisoned pages.

Sequence diagram for handling hwpoisoned folios during shrink_folio_list

sequenceDiagram
    participant shrink_folio_list
    participant folio
    participant folio_contain_hwpoisoned_page
    participant unmap_poisoned_folio
    shrink_folio_list->>folio: folio_trylock(folio)
    alt folio is locked
        shrink_folio_list->>folio_contain_hwpoisoned_page: check if folio is hwpoisoned
        alt folio is hwpoisoned
            shrink_folio_list->>unmap_poisoned_folio: unmap_poisoned_folio(folio, folio_pfn(folio), false)
            shrink_folio_list->>folio: folio_unlock(folio)
            shrink_folio_list->>folio: folio_put(folio)
            Note right of shrink_folio_list: Continue to next folio
        else folio is not hwpoisoned
            shrink_folio_list->>folio: continue normal reclaim
        end
    else folio is not locked
        shrink_folio_list->>shrink_folio_list: keep folio
    end
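
The ordering in the sequence diagram can be sketched as a small userspace model. Everything here is an assumption made for illustration: `struct seq_folio`, `seq_trylock()` and `seq_handle_folio()` are hypothetical names, and the boolean fields stand in for the real lock bit, mapping state and reference count. The point is only that the hwpoison check runs after a successful trylock, and that the poisoned branch unmaps, unlocks and drops a reference before moving on.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the state the diagram touches. */
struct seq_folio {
	bool locked;      /* models the folio lock */
	bool hwpoisoned;  /* models folio_contain_hwpoisoned_page() */
	bool mapped;      /* models whether user mappings still exist */
	int refcount;     /* models the folio reference count */
};

enum seq_result { SEQ_KEEP, SEQ_SKIPPED_POISON, SEQ_RECLAIM };

/* Models folio_trylock(): fails if somebody else holds the lock. */
static bool seq_trylock(struct seq_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

/* One iteration of the modelled shrink loop for a single folio. */
enum seq_result seq_handle_folio(struct seq_folio *f)
{
	if (!seq_trylock(f))
		return SEQ_KEEP;            /* "keep folio" branch */

	if (f->hwpoisoned) {
		f->mapped = false;          /* models unmap_poisoned_folio() */
		f->locked = false;          /* models folio_unlock() */
		f->refcount--;              /* models folio_put() */
		return SEQ_SKIPPED_POISON;  /* continue to the next folio */
	}

	return SEQ_RECLAIM;                 /* normal reclaim, lock still held */
}
```

In this model a folio that is already locked by another party is kept untouched, which mirrors the "folio is not locked" branch of the diagram.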

Class diagram for folio_contain_hwpoisoned_page helper and vmscan changes

classDiagram
    class folio {
        +bool folio_test_hwpoison()
        +bool folio_test_large()
        +bool folio_test_has_hwpoisoned()
        +void folio_unlock()
        +void folio_put()
    }
    class shrink_folio_list {
        +unsigned int shrink_folio_list(struct list_head *folio_list, ...)
    }
    class folio_contain_hwpoisoned_page {
        +bool folio_contain_hwpoisoned_page(struct folio *folio)
    }
    class unmap_poisoned_folio {
        +void unmap_poisoned_folio(struct folio *folio, unsigned long pfn, bool)
    }
    folio_contain_hwpoisoned_page --|> folio : uses
    shrink_folio_list --|> folio_contain_hwpoisoned_page : calls
    shrink_folio_list --|> unmap_poisoned_folio : calls
    shrink_folio_list --|> folio : uses

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Introduce folio_contain_hwpoisoned_page helper | Define inline function combining folio_test_hwpoison and large folio hwpoison tests; expose new helper in page-flags.h for use by vmscan | include/linux/page-flags.h |
| Skip and unmap hwpoisoned folios in shrink_folio_list | Check folio_contain_hwpoisoned_page at start of shrink loop; call unmap_poisoned_folio, then folio_unlock and folio_put; continue to next folio without further reclaim steps | mm/vmscan.c |
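
As a rough illustration of the predicate described for the helper, the sketch below models it in userspace. The flag bits, `struct flag_folio` and `flag_folio_contain_hwpoisoned_page()` are hypothetical and do not correspond to the kernel's real page-flag values or definitions. It assumes the helper is true when the folio itself is marked hwpoison, or when it is a large folio flagged as having a hwpoisoned subpage.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; not the kernel's PG_* values. */
enum {
	FLAG_HWPOISON       = 1u << 0, /* models folio_test_hwpoison() */
	FLAG_LARGE          = 1u << 1, /* models folio_test_large() */
	FLAG_HAS_HWPOISONED = 1u << 2, /* models folio_test_has_hwpoisoned() */
};

struct flag_folio {
	unsigned int flags;
};

/*
 * Assumed shape of the helper: the folio is poisoned itself, or it is a
 * large folio with at least one poisoned subpage.
 */
bool flag_folio_contain_hwpoisoned_page(const struct flag_folio *folio)
{
	return (folio->flags & FLAG_HWPOISON) ||
	       ((folio->flags & FLAG_LARGE) &&
		(folio->flags & FLAG_HAS_HWPOISONED));
}
```

In this model the "has hwpoisoned" bit is consulted only for large folios, so a small folio with just that bit set is not reported as poisoned.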

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from opsiff. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deepin-ci-robot

deepin pr auto review

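Key summary:

  • In the new folio_contain_hwpoisoned_page function, the use of folio_test_has_hwpoisoned may need its definition and purpose confirmed to ensure the logic is correct.
  • In shrink_folio_list, there is no check after the call to folio_contain_hwpoisoned_page that the folio was locked successfully, which may lead to a race condition.
  • In the call to unmap_poisoned_folio, confirm that folio_pfn(folio) returns the expected value and that it is called while the folio is locked.

Is an immediate change recommended: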


@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes a kernel BUG by skipping and unmapping hwpoisoned folios during memory reclamation in the vmscan subsystem.

  • Skip hwpoisoned folios in shrink_folio_list to prevent VM_BUG_ON errors.
  • Introduce the folio_contain_hwpoisoned_page helper and corresponding unmap call.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| mm/vmscan.c | Adds a check to unmap and release hwpoisoned folios during shrink_folio_list. |
| include/linux/page-flags.h | Introduces the folio_contain_hwpoisoned_page helper to detect hwpoison conditions. |
Comments suppressed due to low confidence (2)

mm/vmscan.c:1744

  • Consider adding an inline comment to explain why unmapping and releasing the folio is necessary for hwpoisoned pages in this reclaim path.
		if (folio_contain_hwpoisoned_page(folio)) {

include/linux/page-flags.h:1042

  • [nitpick] Consider renaming the function to 'folio_contains_hwpoisoned_page' for improved grammatical clarity.
static inline bool folio_contain_hwpoisoned_page(struct folio *folio)


@sourcery-ai sourcery-ai bot left a comment

Hey @opsiff - I've reviewed your changes - here's some feedback:

  • Rename folio_contain_hwpoisoned_page to folio_contains_hwpoisoned_page to match the common ‘contains’ naming convention.
  • Add a brief comment above the hwpoisoned folio branch in shrink_folio_list to explain why we unmap-and-skip these folios for future maintainers.
  • Consider moving the folio_contain_hwpoisoned_page helper into the mm/hwpoison subsystem (or a more relevant header) alongside other hwpoison utilities.

@opsiff opsiff marked this pull request as draft June 23, 2025 03:28
@Avenger-285714 Avenger-285714 changed the title [Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] mm/vmscan: don't try to reclaim hwpoison folio [WIP] [Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] mm/vmscan: don't try to reclaim hwpoison folio Jun 23, 2025