Skip to content

Gpu levelzero sidecar #1803

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 19, 2024
Merged

Gpu levelzero sidecar #1803

merged 6 commits into from
Sep 19, 2024

Conversation

tkatila
Copy link
Contributor

@tkatila tkatila commented Aug 9, 2024

Add GPU Levelzero sidecar to allow fetching health data for the GPU devices from the Levelzero API.

As a bonus, this also adds support for detecting Intel GPUs within Windows Subsystem for Linux (WSL).

@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch 4 times, most recently from fc65eff to d3afded Compare August 13, 2024 11:59
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few doc comments.

PS. I would also suggest running following (untested) command on the code to fix PCI abbreviation use in error messages:
sed -i 's/ pci / PCI /' $(git grep -l ' pci ')

@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch from 4312bf0 to 80440d7 Compare August 20, 2024 11:30
@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch from 82a3877 to 76b1e5a Compare September 6, 2024 06:34
@tkatila
Copy link
Contributor Author

tkatila commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

This probably requires some restructuring after the GPU CDI support has been merged (assuming it will happen first).

(Failed check is due to internal certificate issue, I do not suspect it to be a real problem.)

@eero-t
Copy link
Contributor

eero-t commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

What are the differences to earlier version? (you did force push, so github does not show those.)

@tkatila
Copy link
Contributor Author

tkatila commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

What are the differences to earlier version? (you did force push, so github does not show those.)

https://github.com/intel/intel-device-plugins-for-kubernetes/compare/82a387760e4c2a9a12cc366405e19153f67e7efd..76b1e5a04abc600ec1992a1adf0474d42ffad1b9

~mid third is my changes. Top third and bottom third are baseline changes. But to summarize:

  1. Some logging to C code
  2. Added temperature read from device and its handling in the plugin

@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch 2 times, most recently from 6a923f8 to 0e6ee98 Compare September 11, 2024 08:57
@tkatila tkatila requested a review from ozhuraki as a code owner September 11, 2024 08:57
@tkatila
Copy link
Contributor Author

tkatila commented Sep 11, 2024

Rebased on top of the CDI changes. Also added an e2e test for the levelzero deployment.

Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. There were few minor C-code items which could be fixed before merge.

@mythi
Copy link
Contributor

mythi commented Sep 12, 2024

should it be mentioned somewhere that these new features are not available for the operator users?

@tkatila
Copy link
Contributor Author

tkatila commented Sep 12, 2024

should it be mentioned somewhere that these new features are not available for the operator users?

Sure, I'll add a note. This PR is large enough as it is so I didn't want to touch the operator use case. Once this is merged, I'll tackle the operator support.

@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch from 0e6ee98 to 089138a Compare September 13, 2024 13:01
@tkatila
Copy link
Contributor Author

tkatila commented Sep 13, 2024

Rebased the content. Added C fixes, fixed compile flags for the build and decided to change the levelzero enabling to a boolean (instead of a unis socket path).

Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, looks OK on quick review.

@tkatila
Copy link
Contributor Author

tkatila commented Sep 17, 2024

Since I broke wsl in the previous commits, I verified it to work on the current HEAD.

Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

tkatila and others added 6 commits September 19, 2024 14:51
Signed-off-by: Tuomas Katila <[email protected]>
In addition to the levelzero's health data use, this adds support to
scan devices in WSL. Scanning happens by retrieving Intel device
indices from the Level-Zero API.

Signed-off-by: Tuomas Katila <[email protected]>
Co-authored-by: Eero Tamminen <[email protected]>
Signed-off-by: Tuomas Katila <[email protected]>
@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch from 1b2ba17 to 291798e Compare September 19, 2024 11:59
@tkatila
Copy link
Contributor Author

tkatila commented Sep 19, 2024

Squashed commits, no code changes.

@uniemimu uniemimu merged commit 9bd3975 into intel:main Sep 19, 2024
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants