[linux-6.6.y][Feature]Backport some mm related patches from recent (202504) Linux upstream #775
Conversation
…le nodes

mainline inclusion
from mainline-v6.7-rc1
category: bugfix

---------------------------

commit 2eaa6c2 upstream.

Decreasing the number of hugetlb pages failed with the following message given:

sh: page allocation failure: order:0, mode:0x204cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_THISNODE)
CPU: 1 PID: 112 Comm: sh Not tainted 6.5.0-rc7-... deepin-community#45
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.6+0x84/0xe4
 show_stack+0x18/0x24
 dump_stack_lvl+0x48/0x60
 dump_stack+0x18/0x24
 warn_alloc+0x100/0x1bc
 __alloc_pages_slowpath.constprop.107+0xa40/0xad8
 __alloc_pages+0x244/0x2d0
 hugetlb_vmemmap_restore+0x104/0x1e4
 __update_and_free_hugetlb_folio+0x44/0x1f4
 update_and_free_hugetlb_folio+0x20/0x68
 update_and_free_pages_bulk+0x4c/0xac
 set_max_huge_pages+0x198/0x334
 nr_hugepages_store_common+0x118/0x178
 nr_hugepages_store+0x18/0x24
 kobj_attr_store+0x18/0x2c
 sysfs_kf_write+0x40/0x54
 kernfs_fop_write_iter+0x164/0x1dc
 vfs_write+0x3a8/0x460
 ksys_write+0x6c/0x100
 __arm64_sys_write+0x1c/0x28
 invoke_syscall+0x44/0x100
 el0_svc_common.constprop.1+0x6c/0xe4
 do_el0_svc+0x38/0x94
 el0_svc+0x28/0x74
 el0t_64_sync_handler+0xa0/0xc4
 el0t_64_sync+0x174/0x178
Mem-Info: ...

The reason is that the hugetlb pages being released were allocated from movable nodes, and with hugetlb_optimize_vmemmap enabled, vmemmap pages need to be allocated from the same node while the hugetlb pages are being released. With GFP_KERNEL and __GFP_THISNODE set, allocating from a movable node always fails. Fix this problem by removing __GFP_THISNODE.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ad2fa37 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Yuan Can <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
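For context, the fix boils down to dropping __GFP_THISNODE from the vmemmap allocation mask. A minimal sketch of the kind of change described (the function name and surrounding code are illustrative, not the exact upstream diff):

/* mm/hugetlb_vmemmap.c (illustrative): allocate a vmemmap page while
 * restoring a HugeTLB page's vmemmap.  Preferring the folio's node is
 * still fine, but __GFP_THISNODE must not be set, otherwise the
 * allocation can never succeed for pages on a movable-only node.
 */
static struct page *alloc_vmemmap_page(int nid)
{
	/* Before the fix: GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_THISNODE */
	gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL;

	/* alloc_pages_node() merely prefers nid; without __GFP_THISNODE the
	 * allocator may fall back to other nodes instead of failing outright. */
	return alloc_pages_node(nid, gfp_mask, 0);
}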
hygon inclusion
category: bugfix
CVE: NA

---------------------------

The commit 31314f6 ("KVM: SEV: Pin SEV guest memory out of CMA area") has its corresponding upstream version. We'll backport the upstream version to this repo.

Fixes: 31314f6 ("KVM: SEV: Pin SEV guest memory out of CMA area")
Signed-off-by: hanliyang <[email protected]>
mainline inclusion
from mainline-v6.15-rc1
category: feature
CVE: NA

---------------------------

commit 7e066cb upstream.

When registering an encrypted memory region for SEV-MEM/SEV-ES guests, pin the pages with FOLL_LONGTERM so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE. Failure to do so violates the CMA/MOVABLE mechanisms and can result in fragmentation due to unmovable pages, e.g. it can make CMA allocations fail.

Signed-off-by: Ge Yang <[email protected]>
Reviewed-by: Tom Lendacky <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelog, make @flags an unsigned int]
Signed-off-by: Sean Christopherson <[email protected]>
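For readers unfamiliar with the GUP flags involved, here is a rough sketch of what long-term pinning looks like at a registration call site (the helper name and signature are hypothetical, simplified from what the changelog describes):

/* arch/x86/kvm/svm/sev.c (illustrative): pin guest memory long-term when
 * an encrypted region is registered.  FOLL_LONGTERM makes the pin path
 * migrate the pages out of ZONE_MOVABLE/MIGRATE_CMA before taking the
 * reference, so CMA is not fragmented by unmovable pinned pages.
 */
static long sev_pin_region(unsigned long uaddr, int npages,
			   struct page **pages, bool write)
{
	unsigned int flags = FOLL_LONGTERM | (write ? FOLL_WRITE : 0);

	return pin_user_pages_fast(uaddr, npages, flags, pages);
}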
hygon inclusion
category: bugfix
CVE: NA

---------------------------

The commit 4d71a47 ("x86/mm: CSV allows CMA allocation concurrently") has its corresponding upstream version. We'll backport the upstream version to this repo.

Fixes: 4d71a47 ("x86/mm: CSV allows CMA allocation concurrently")
Signed-off-by: hanliyang <[email protected]>

hygon inclusion
category: bugfix
CVE: NA

---------------------------

The commit 83138d4 ("mm/cma: add API to enable concurrent allocation from the CMA") has its corresponding upstream version. We'll backport the upstream version to this repo.

Fixes: 83138d4 ("mm/cma: add API to enable concurrent allocation from the CMA")
Signed-off-by: hanliyang <[email protected]>
mainline inclusion
from mainline-v6.15-rc1
category: feature
CVE: NA

---------------------------

commit 24ac6fb upstream.

For different CMAs, concurrent allocation of CMA memory ideally should not require synchronization using locks. Currently, a global cma_mutex lock is employed to synchronize all CMA allocations, which can impact the performance of concurrent allocations across different CMAs.

To test the performance impact, follow these steps:
1. Boot the kernel with the command line argument hugetlb_cma=30G to allocate a 30GB CMA area specifically for huge page allocations. (Note: on my machine, which has 3 nodes, each node is initialized with 10G of CMA.)
2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G count=30 to fully utilize the CMA area by writing zeroes to a file in /dev/shm.
3. Open three terminals and execute the following commands simultaneously. (Note: each of these commands attempts to allocate 10GB [2621440 * 4KB pages] of CMA memory.)
   On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
   On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
   On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc

We attempt to allocate pages through the CMA debug interface and use the time command to measure the duration of each allocation.

Performance comparison:

             Without this patch   With this patch
Terminal 1   ~7s                  ~7s
Terminal 2   ~14s                 ~8s
Terminal 3   ~21s                 ~7s

To solve the problem above, we can use per-CMA locks to improve concurrent allocation performance. This allows each CMA to be managed independently, reducing the need for a global lock and thus improving scalability and performance.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ge Yang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Aisheng Dong <[email protected]>
Cc: Baolin Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
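A condensed sketch of the locking change the changelog describes (field and helper names are illustrative; the real diff touches more call sites):

/* mm/cma.c (illustrative): move the allocation lock from one global
 * cma_mutex into struct cma, so allocations from different CMA areas
 * no longer serialize against each other.
 */
struct cma {
	/* ... existing fields: base_pfn, count, bitmap, ... */
	struct mutex alloc_mutex;	/* per-area lock, was global cma_mutex */
};

static int cma_range_alloc(struct cma *cma, unsigned long pfn,
			   unsigned long count, gfp_t gfp)
{
	int ret;

	/* Only allocations within the same CMA area contend here. */
	mutex_lock(&cma->alloc_mutex);
	ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
	mutex_unlock(&cma->alloc_mutex);

	return ret;
}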
mainline inclusion
from mainline-v6.14-rc1
category: feature
CVE: NA

---------------------------

commit 04f13d2 upstream.

My machine has 4 NUMA nodes, each equipped with 32GB of memory. I have configured each NUMA node with 16GB of CMA and 16GB of in-use hugetlb pages. The allocation of contiguous memory via cma_alloc() can fail probabilistically.

When there are free hugetlb folios in the hugetlb pool, during the migration of in-use hugetlb folios, new folios are allocated from the free hugetlb pool. After the migration is completed, the old folios are released back to the free hugetlb pool instead of being returned to the buddy system. This can cause the test_pages_isolated() check to fail, ultimately leading to the failure of cma_alloc().

Call trace:
cma_alloc()
    __alloc_contig_migrate_range()  // migrate in-use hugepage
    test_pages_isolated()
        __test_page_isolated_in_pageblock()
            PageBuddy(page)  // check if the page is in buddy

To address this issue, we introduce a function named replace_free_hugepage_folios(). This function replaces the hugepage in the free hugepage pool with a new one and releases the old one to the buddy system. After the migration of in-use hugetlb pages is completed, we invoke replace_free_hugepage_folios() to ensure that these hugepages are properly released to the buddy system. Following this step, when test_pages_isolated() is executed for inspection, it will successfully pass.

Additionally, when alloc_contig_range() is used to migrate multiple in-use hugetlb pages, it can result in some in-use hugetlb pages being released back to the free hugetlb pool and subsequently being reallocated and used again. For example:

[huge 0] [huge 1]

To migrate huge 0, we obtain huge x from the pool. After the migration is completed, we return the now-freed huge 0 back to the pool. When it's time to migrate huge 1, we can simply reuse the now-freed huge 0 from the pool. As a result, when replace_free_hugepage_folios() is executed, it cannot release huge 0 back to the buddy system. To address this issue, we should prevent the reuse of isolated free hugepages during the migration process.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: yangge <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: SeongJae Park <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
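A sketch of where such a helper sits in the contiguous-allocation flow, based on the changelog's description (the exact upstream diff may differ; this is a fragment, not a complete function):

/* alloc_contig_range()-style flow (illustrative fragment): after in-use
 * hugetlb folios in [start, end) have been migrated away, folios that
 * were merely returned to the free hugetlb pool must be swapped for
 * fresh ones and freed to buddy, or test_pages_isolated() will fail.
 */
ret = __alloc_contig_migrate_range(&cc, start, end);
if (ret)
	goto done;

/* Helper introduced by the patch (per the changelog): replace free
 * hugetlb folios inside the isolated range and free the old ones. */
ret = replace_free_hugepage_folios(start, end);
if (ret)
	goto done;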
mainline inclusion
from mainline-v6.14-rc6
category: feature
CVE: NA

---------------------------

commit 67bab13 upstream.

Since the introduction of commit c77c0a8 ("mm/hugetlb: defer freeing of huge pages if in non-task context"), which supports deferring the freeing of hugetlb pages, the allocation of contiguous memory through cma_alloc() may fail probabilistically.

In the CMA allocation process, if the CMA area is found to be occupied by in-use hugetlb folios, these in-use hugetlb folios need to be migrated to another location. When there are no available hugetlb folios in the free hugetlb pool during the migration of in-use hugetlb folios, new folios are allocated from the buddy system. A temporary state is set on the newly allocated folio. Upon completion of the hugetlb folio migration, the temporary state is transferred from the new folios to the old folios. Normally, when the old folios with the temporary state are freed, they are released directly back to the buddy system. However, due to the deferred freeing of hugetlb pages, the PageBuddy() check fails, ultimately leading to the failure of cma_alloc().

Here is a simplified call trace illustrating the process:

cma_alloc()
    ->__alloc_contig_migrate_range()       // Migrate in-use hugetlb folios
        ->unmap_and_move_huge_page()
            ->folio_putback_hugetlb()      // Free old folios
    ->test_pages_isolated()
        ->__test_page_isolated_in_pageblock()
            ->PageBuddy(page)              // Check if the page is in buddy

To resolve this issue, we have implemented a function named wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb folios are properly released back to the buddy system after their migration is completed. By invoking wait_for_freed_hugetlb_folios() before calling PageBuddy(), we ensure that PageBuddy() will succeed.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c77c0a8 ("mm/hugetlb: defer freeing of huge pages if in non-task context")
Signed-off-by: Ge Yang <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
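Given the deferred-free machinery that commit c77c0a8 added (a lockless list drained by a workqueue), a plausible minimal implementation looks like the following; this is an assumption-based sketch, not a verified copy of the upstream diff:

/* mm/hugetlb.c (illustrative): hugetlb folios freed from non-task
 * context are queued on hpage_freelist and freed later by
 * free_hpage_work.  Flushing that work before test_pages_isolated()
 * checks PageBuddy() guarantees the folios are back in buddy.
 */
void wait_for_freed_hugetlb_folios(void)
{
	if (llist_empty(&hpage_freelist))
		return;		/* nothing pending, no work to flush */

	flush_work(&free_hpage_work);
}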
…r target order

mainline inclusion
from mainline-v6.7-rc1
category: feature
CVE: NA

--------------------------------

commit e19a3f5 upstream.

We always do the zone_watermark_ok check and the compaction_suitable check together to test whether compaction for the target order should be run. Factor this code out to remove the repetition.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mainline inclusion
from mainline-v6.14-rc1
category: feature
CVE: NA

--------------------------------

commit 6268f0a upstream.

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of memory. I have configured 16GB of CMA memory on each NUMA node, and starting a 32GB virtual machine with device passthrough is extremely slow, taking almost an hour.

Long-term GUP cannot allocate memory from the CMA area, so at most 16GB of non-CMA memory on a NUMA node can be used as virtual machine memory. There is 16GB of free CMA memory on a NUMA node, which is sufficient to pass the order-0 watermark check, causing the __compaction_suitable() function to consistently return true. For costly allocations, if __compaction_suitable() always returns true, it causes __alloc_pages_slowpath() to fail to exit at the appropriate point. This prevents timely fallback to allocating memory on other nodes, ultimately resulting in excessively long virtual machine startup times.

Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage;  // should exit __alloc_pages_slowpath() from here

We can use the real unmovable allocation context to have __zone_watermark_unusable_free() subtract CMA pages, so that we no longer pass the order-0 check once the non-CMA part is exhausted. There is some risk that in a different scenario the compaction could in fact migrate pages from the exhausted non-CMA part of the zone to the CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY allocations should be affected by the immediate "goto nopage" when compaction is skipped; others will attempt with DEF_COMPACT_PRIORITY anyway and won't fail without trying to compact-migrate the non-CMA pageblocks into CMA pageblocks first, so it should be fine.

After this fix, it only takes a few tens of seconds to start a 32GB virtual machine with device passthrough.

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: yangge <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
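The essence of the fix is to redo the order-0 watermark check while excluding free CMA pages whenever the allocation context cannot use CMA. A sketch consistent with the snippet quoted in the review below (an illustrative excerpt, not the full function):

/* compaction_suit_allocation_order() (illustrative excerpt): for costly
 * async requests that cannot dip into CMA (no ALLOC_CMA), recheck the
 * order-0 watermark with alloc_flags == 0 so that free CMA pages are
 * subtracted.  An exhausted non-CMA part then fails the check and
 * compaction is skipped instead of being retried pointlessly.
 */
if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
    !(alloc_flags & ALLOC_CMA)) {
	unsigned long watermark = low_wmark_pages(zone) + compact_gap(order);

	if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
				 0, zone_page_state(zone, NR_FREE_PAGES)))
		return COMPACT_SKIPPED;
}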
Reviewer's Guide

This pull request backports seven memory management (mm) patches from upstream Linux and reverts three conflicting local patches. Implementation involves introducing new functions for compaction suitability checks and hugetlb folio handling after migration, modifying KVM/SEV memory pinning to use standard flags, replacing the global CMA mutex with per-CMA mutexes, and adjusting memory allocation flags.

File-Level Changes
Hi @wojiaohanliyang. Thanks for your PR. I'm waiting for a deepin-community member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hey @wojiaohanliyang - I've reviewed your changes - here's some feedback:
- Consider splitting these backports into smaller PRs based on the subsystem (e.g., compaction, hugetlb, CMA/KVM) for easier review and potential bisection.
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
 * Return COMPACT_SKIPPED if compaction for target order is likely to fail
 * Return COMPACT_CONTINUE if compaction for target order should be ran
 */
static enum compact_result
issue (complexity): Consider flattening the conditionals in compaction_suit_allocation_order by extracting the async condition checks into a helper function and using early returns.
Consider flattening the conditionals by extracting the async condition checks into a small helper and using early returns. This can make the control flow clearer without inlining every check. For example:
static inline enum compact_result
check_async_compaction(struct zone *zone, unsigned int order, int highest_zoneidx,
unsigned int alloc_flags, bool async)
{
if (order > PAGE_ALLOC_COSTLY_ORDER && async && !(alloc_flags & ALLOC_CMA)) {
unsigned long watermark = low_wmark_pages(zone) + compact_gap(order);
if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
0, zone_page_state(zone, NR_FREE_PAGES)))
return COMPACT_SKIPPED;
}
return COMPACT_CONTINUE;
}
Then simplify compaction_suit_allocation_order by using early returns:
static enum compact_result compaction_suit_allocation_order(struct zone *zone,
unsigned int order, int highest_zoneidx, unsigned int alloc_flags, bool async)
{
unsigned long watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
if (zone_watermark_ok(zone, order, watermark, highest_zoneidx, alloc_flags))
return COMPACT_SUCCESS;
if (check_async_compaction(zone, order, highest_zoneidx, alloc_flags, async) != COMPACT_CONTINUE)
return COMPACT_SKIPPED;
if (!compaction_suitable(zone, order, highest_zoneidx))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
}
This approach retains all functionality while clarifying the flow.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Backport 7 mm-related patches from Linux upstream:

Revert 3 non-upstream patches that conflict with the backported patches:
Summary by Sourcery
Backport memory management (mm) patches from Linux upstream, improving memory allocation, compaction, and the handling of huge pages and encrypted memory regions.
New Features:
Bug Fixes:
Enhancements:
Chores: