koordlet: add psi qos reconciler #2463

Open
wants to merge 1 commit into
base: main

Wheat2018

Ⅰ. Describe what this PR does

Add a PSI QoS manager, which currently supports four operators:

  1. PSIExport: Collects PSI metrics for Pods and reports them via Pod Conditions.
  2. MemorySuppress: Applies pressure to Pod memory allocation, increasing with the growth of allocated memory.
  3. GroupShare: Groups Pods and allows CPU weight sharing within a group.
  4. BudgetBalance: Balances CPU usage among Pods over time, beneficial for burstable Pods.
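For context on what PSIExport would have to read: the kernel exposes pressure data in files such as `cpu.pressure`, one line per kind (`some` or `full`) with `avg10`, `avg60`, `avg300`, and `total` fields. The following is a minimal sketch of parsing one such line, not the PR's collector; the `PSILine` type and `ParsePSILine` function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PSILine holds one line of a kernel pressure file (hypothetical type).
type PSILine struct {
	Kind   string  // "some" or "full"
	Avg10  float64 // percent of time stalled over the last 10s
	Avg60  float64
	Avg300 float64
	Total  uint64 // cumulative stall time in microseconds
}

// ParsePSILine parses a line such as
// "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456".
func ParsePSILine(line string) (PSILine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return PSILine{}, fmt.Errorf("unexpected PSI line: %q", line)
	}
	out := PSILine{Kind: fields[0]}
	for _, f := range fields[1:] {
		key, val, ok := strings.Cut(f, "=")
		if !ok {
			return PSILine{}, fmt.Errorf("malformed field %q", f)
		}
		var err error
		switch key {
		case "avg10":
			out.Avg10, err = strconv.ParseFloat(val, 64)
		case "avg60":
			out.Avg60, err = strconv.ParseFloat(val, 64)
		case "avg300":
			out.Avg300, err = strconv.ParseFloat(val, 64)
		case "total":
			out.Total, err = strconv.ParseUint(val, 10, 64)
		default:
			err = fmt.Errorf("unknown key %q", key)
		}
		if err != nil {
			return PSILine{}, err
		}
	}
	return out, nil
}

func main() {
	l, err := ParsePSILine("some avg10=1.50 avg60=0.75 avg300=0.10 total=123456")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", l)
}
```

How the parsed values would map onto Pod Conditions is not shown in this conversation and is left out of the sketch.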

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign eahydra, zwzhang0107 after the PR has been reviewed.
You can assign the PR to them by writing /assign @eahydra @zwzhang0107 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Jun 3, 2025

Codecov Report

Attention: Patch coverage is 15.61822% with 389 lines in your changes missing coverage. Please review.

Project coverage is 65.51%. Comparing base (73fa323) to head (d6c9369).
Report is 43 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...et/qosmanager/plugins/psi/operator/psi_exporter.go     0.00%     130 Missing ⚠️
...qosmanager/plugins/psi/operator/memory_suppress.go     0.00%     97 Missing ⚠️
.../qosmanager/plugins/psi/operator/budget_balance.go     0.00%     61 Missing ⚠️
pkg/util/sloconfig/nodeslo_config.go                      0.00%     39 Missing ⚠️
...let/qosmanager/plugins/psi/operator/group_share.go     0.00%     35 Missing ⚠️
.../koordlet/qosmanager/plugins/psi/operator/types.go     0.00%     13 Missing ⚠️
pkg/slo-controller/nodeslo/resource_strategy.go           84.21%    4 Missing and 2 partials ⚠️
...slo-controller/nodeslo/nodeslo_cm_event_handler.go     42.85%    3 Missing and 1 partial ⚠️
pkg/slo-controller/nodeslo/nodeslo_controller.go          42.85%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2463      +/-   ##
==========================================
- Coverage   65.93%   65.51%   -0.42%     
==========================================
  Files         477      483       +6     
  Lines       56194    56655     +461     
==========================================
+ Hits        37049    37118      +69     
- Misses      16461    16847     +386     
- Partials     2684     2690       +6     
Flag Coverage Δ
unittests 65.51% <15.61%> (-0.42%) ⬇️

Wheat2018 force-pushed the psi branch 3 times, most recently from 6910235 to 006665f, on June 4, 2025 09:19

Signed-off-by: wheat2018 <[email protected]>

saintube commented Jun 6, 2025

++ /cc @songtao98 for PSI collector

saintube requested review from hormes and songtao98 on June 6, 2025 08:09

saintube left a comment

It's a delicate job, but it's also complex :) So please add a document describing its design and how to use it when you have time.


type PSIThreshold struct {
	// Avg10 indicates the average 10-second PSI threshold, range [0,10000] indicating [0%,100%].
	Avg10 int64 `json:"avg10,omitempty" validate:"min=0,max=10000"`

To enable the CRD validation, it needs some kubebuilder tags like:

// +kubebuilder:validation:Minimum=0

https://book.kubebuilder.io/reference/markers
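A sketch of what the suggested markers could look like on the field from the excerpt. The `Minimum` and `Maximum` markers are standard kubebuilder CRD validation markers; the `Validate` method is a hypothetical runtime check added here only so the bounds can be exercised without generating a CRD, and it is not part of the PR.

```go
package main

import "fmt"

// PSIThreshold mirrors the struct under review; only Avg10 appears in the excerpt.
type PSIThreshold struct {
	// Avg10 indicates the average 10-second PSI threshold, range [0,10000] indicating [0%,100%].
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=10000
	Avg10 int64 `json:"avg10,omitempty"`
}

// Validate is a hypothetical helper that applies the same bounds at runtime.
func (t PSIThreshold) Validate() error {
	if t.Avg10 < 0 || t.Avg10 > 10000 {
		return fmt.Errorf("avg10 %d out of range [0,10000]", t.Avg10)
	}
	return nil
}

func main() {
	fmt.Println(PSIThreshold{Avg10: 5000}.Validate())
	fmt.Println(PSIThreshold{Avg10: 10001}.Validate())
}
```

With markers in place, the bounds end up in the generated CRD schema, so out-of-range values are rejected by the API server rather than only by in-process validation.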

	return int64(float64(new-old) / interval.Seconds())
}

func max(a, b int64) int64 {

math.MaxInt64?
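The intent of this comment is not spelled out in the thread. As a point of reference, `math.MaxInt64` is a constant (the largest `int64` value), not a function for comparing two values; since Go 1.21 the language provides built-in `max` and `min`, which may make a hand-written helper unnecessary. The sketch below assumes a Go 1.21+ toolchain, and `ratePerSecond` is a hypothetical name wrapping the expression from the excerpt.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// ratePerSecond mirrors the expression in the excerpt:
// the change in a counter divided by the interval in seconds.
func ratePerSecond(oldVal, newVal int64, interval time.Duration) int64 {
	return int64(float64(newVal-oldVal) / interval.Seconds())
}

func main() {
	r := ratePerSecond(1000, 4000, 2*time.Second)

	// Built-in max (Go 1.21+) clamps negative rates to zero.
	fmt.Println(max(r, 0))

	// math.MaxInt64 is a constant, usable as an upper bound.
	fmt.Println(min(r, math.MaxInt64))
}
```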

func DefaultPSIStrategy() *slov1alpha1.PSIStrategy {
	return &slov1alpha1.PSIStrategy{
		PSIExport: &slov1alpha1.PSIExportConfig{
			Enable: pointer.Bool(true),

For an alpha feature, please disable it by default.
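A minimal sketch of a disabled-by-default variant. The types below are local stand-ins for `slov1alpha1.PSIStrategy` and `slov1alpha1.PSIExportConfig` (only the fields visible in the excerpt are modeled), and `boolPtr` and `exportEnabled` are hypothetical helpers, so this illustrates the shape of the change rather than the actual patch.

```go
package main

import "fmt"

// Local stand-ins for the slov1alpha1 types (assumption: only these fields matter here).
type PSIExportConfig struct {
	Enable *bool
}

type PSIStrategy struct {
	PSIExport *PSIExportConfig
}

// boolPtr plays the role of pointer.Bool in the excerpt.
func boolPtr(b bool) *bool { return &b }

// DefaultPSIStrategy returns defaults with the alpha feature turned off.
func DefaultPSIStrategy() *PSIStrategy {
	return &PSIStrategy{
		PSIExport: &PSIExportConfig{
			Enable: boolPtr(false),
		},
	}
}

// exportEnabled treats any nil along the path as "disabled",
// so an unset config never turns the feature on by accident.
func exportEnabled(s *PSIStrategy) bool {
	return s != nil && s.PSIExport != nil && s.PSIExport.Enable != nil && *s.PSIExport.Enable
}

func main() {
	fmt.Println(exportEnabled(DefaultPSIStrategy()))
}
```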

}

func (p *psiReconcile) Enabled() bool {
	return features.DefaultKoordletFeatureGate.Enabled(features.BlkIOReconcile) && p.reconcileInterval > 0

features.DefaultKoordletFeatureGate.Enabled(features.BlkIOReconcile)

Please add a new feature gate.
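A self-contained sketch of the requested shape: the reconciler checks its own gate instead of reusing `BlkIOReconcile`. The gate name `PSIReconcile`, the `gate` type, and `defaultGate` are hypothetical stand-ins; the real change would presumably register a new feature in koordlet's `features` package, which is not shown in this conversation.

```go
package main

import (
	"fmt"
	"time"
)

// Feature is a stand-in for a feature gate key.
type Feature string

// PSIReconcile is a hypothetical gate name for the new reconciler.
const PSIReconcile Feature = "PSIReconcile"

// gate is a minimal stand-in for a feature gate registry.
type gate map[Feature]bool

func (g gate) Enabled(f Feature) bool { return g[f] }

// defaultGate keeps the alpha feature off unless explicitly enabled.
var defaultGate = gate{PSIReconcile: false}

type psiReconcile struct {
	reconcileInterval time.Duration
}

// Enabled requires both the dedicated gate and a positive interval.
func (p *psiReconcile) Enabled() bool {
	return defaultGate.Enabled(PSIReconcile) && p.reconcileInterval > 0
}

func main() {
	p := &psiReconcile{reconcileInterval: time.Second}
	fmt.Println(p.Enabled()) // gate is off by default

	defaultGate[PSIReconcile] = true
	fmt.Println(p.Enabled())
}
```

A dedicated gate keeps the two features independent: turning block I/O reconciliation on or off no longer changes whether the PSI reconciler runs.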

	return &CpuQuota{Quota: quota, Period: period}, nil
}

func WriteCpuMax(cgroupPath string, max *CpuQuota) error {
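For readers unfamiliar with the file this helper targets: in cgroup v2, `cpu.max` holds two space-separated values, the quota and the period in microseconds, with the literal `max` meaning no limit (for example `max 100000`). The following sketch round-trips that format. It is not the PR's implementation: `parseCPUMax` and `formatCPUMax` are hypothetical names, the `int64` field types are assumed, and representing "unlimited" as a negative quota is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// CpuQuota mirrors the struct name in the excerpt; field types are assumed.
type CpuQuota struct {
	Quota  int64 // microseconds per period; negative means unlimited (assumption)
	Period int64 // microseconds
}

// parseCPUMax parses cpu.max content such as "50000 100000" or "max 100000".
func parseCPUMax(content string) (*CpuQuota, error) {
	fields := strings.Fields(content)
	if len(fields) != 2 {
		return nil, fmt.Errorf("unexpected cpu.max content: %q", content)
	}
	period, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid period: %w", err)
	}
	if fields[0] == "max" {
		return &CpuQuota{Quota: -1, Period: period}, nil
	}
	quota, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid quota: %w", err)
	}
	return &CpuQuota{Quota: quota, Period: period}, nil
}

// formatCPUMax renders the value in the form the kernel expects.
func formatCPUMax(q *CpuQuota) string {
	if q.Quota < 0 {
		return fmt.Sprintf("max %d", q.Period)
	}
	return fmt.Sprintf("%d %d", q.Quota, q.Period)
}

func main() {
	q, err := parseCPUMax("50000 100000\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(formatCPUMax(q))
}
```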

3 participants