Commit 085555b

yansun1996 authored and sajmera-pensando committed
[DOC] Add v1.2.2 release notes
1 parent fc64952 commit 085555b

1 file changed: +46 -0 lines changed

docs/releasenotes.md

Lines changed: 46 additions & 0 deletions
@@ -1,5 +1,51 @@
# Release Notes

## GPU Operator v1.2.2 Release Notes

The AMD GPU Operator v1.2.2 release adds support for integrating the Device Metrics Exporter with the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) through the ServiceMonitor custom resource, and includes several bug fixes.

### Release Highlights

- **Enhanced Metrics Integration with Prometheus Operator**

  This release introduces a streamlined method for integrating the metrics endpoint of the metrics exporter with the Prometheus Operator.

  Users can now leverage the `DeviceConfig` custom resource to specify the necessary configuration for metrics collection. The GPU Operator will automatically read the relevant `DeviceConfig` and manage the creation and lifecycle of a corresponding ServiceMonitor custom resource.

  This automation simplifies the process of exposing metrics to the Prometheus Operator, allowing for easier scraping and monitoring of GPU-related metrics within your Kubernetes environment.
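
For orientation, a ServiceMonitor is a standard Prometheus Operator resource in the `monitoring.coreos.com/v1` API group. The sketch below builds a minimal one as a Python dictionary to show its general shape. The resource name, namespace, label selector, port name, metrics path, and scrape interval are illustrative assumptions, not the values the GPU Operator generates, and the `DeviceConfig` fields that drive it are not shown.

```python
import json


def build_service_monitor(name, namespace, match_labels, port_name, interval="30s"):
    """Return a minimal ServiceMonitor manifest as a plain dictionary."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects the Service that fronts the metrics exporter pods.
            "selector": {"matchLabels": dict(match_labels)},
            # Each endpoint names a Service port for Prometheus to scrape.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


# All values below are hypothetical placeholders.
monitor = build_service_monitor(
    name="example-metrics-exporter",
    namespace="example-namespace",
    match_labels={"app": "example-metrics-exporter"},
    port_name="metrics",
)
print(json.dumps(monitor, indent=2))
```

With the feature described above, users do not write this resource by hand; the operator creates and maintains it based on the `DeviceConfig`.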

### Documentation Updates

- Updated [Release notes](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/releasenotes.html) detailing new features in v1.2.2.

### Known Limitations

> **Note:** All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked on the following documentation page: [Known Issues and Limitations](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html).
> Please refer to this page regularly for the most up-to-date information.

### Fixes

1. **Node labeller failed to report node labels when `DeviceConfig` uses `spec.driver.enable=false` together with a customized node selector in `spec.selector`** [[#183]](https://github.com/ROCm/gpu-operator/issues/183)
   - *Issue*: Users running the inbox driver set `spec.driver.enable=false` in the `DeviceConfig` spec. If they also set a customized node selector in `spec.selector`, the GPU properties labels did not appear among the Node resource labels after the node labeller was brought up.
   - *Root Cause*: With `spec.driver.enable=false` and a customized, non-default `spec.selector`, the operator controller manager used the wrong selector when cleaning up the node labeller's labels on non-GPU nodes.
   - *Resolution*: This issue is fixed in v1.2.2. After upgrading to v1.2.2, the GPU properties node labels appear once the node labeller is brought up again.

2. **User-defined node labels under the `amd.com` domain are unexpectedly removed** [[#151]](https://github.com/ROCm/gpu-operator/issues/151)
   - *Issue*: Node labels that users created under the `amd.com` domain for their own use (e.g. `amd.com/gpu: "true"`) were unexpectedly removed during bootstrapping.
   - *Root Cause*:
     - When the node labeller pod launched, it removed all node labels under `amd.com` and `beta.amd.com` from the current node and then posted the labels it manages itself.
     - When the operator executed the reconcile function, the removal of `DevicePlugin` removed all node labels under the `amd.com` or `beta.amd.com` domain, even those not managed by the node labeller.
   - *Resolution*: This issue is fixed in v1.2.2 on both the operator and the node labeller side. After upgrading to the v1.2.2 operator Helm chart and the latest node labeller image, only labels managed by the node labeller are removed automatically. Other user-defined labels under `amd.com` or `beta.amd.com` are no longer removed by the operator or the node labeller.

3. **During automatic driver upgrade, nodes can get stuck in reboot-in-progress**
   - *Issue*: When users upgrade the driver version through the `DeviceConfig` automatic upgrade feature with `spec.driver.upgradePolicy.enable=true` and `spec.driver.upgradePolicy.rebootRequired=true`, some nodes may get stuck in the reboot-in-progress state.
   - *Root Cause*:
     - Upgrademgr checked the generationID of `DeviceConfig` to make sure that a spec change made during an upgrade does not interfere with the upgrade already in progress. However, the check also triggered when the CR changed in parts of the `DeviceConfig` spec unrelated to the upgrade, which was a problem because no new driver upgrade starts for such unrelated changes.
     - When a node rebooted during the driver upgrade, the controller manager pod could also be affected and rescheduled to another node. When the pod came back, its init phase checked for reboot-in-progress and attempted to delete the reboot pod, but the reboot pod might already have terminated by then.
   - *Resolution*: The upgrade manager module of the controller manager has been patched in v1.2.2. Upgrading to the new controller manager image should resolve this issue.
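
The label handling behind fixes 1 and 2 can be illustrated with a short Python sketch. This is a simplified model, not the operator's or node labeller's implementation: the function names, the node's labels, the set of labeller-managed keys, and both selectors are hypothetical and chosen only to show the difference between removing labels by domain and removing only managed labels, and why the choice of selector matters.

```python
# Label domains named in the release notes.
AMD_DOMAINS = ("amd.com/", "beta.amd.com/")


def matches_selector(node_labels, selector):
    """Return True when every selector key/value pair is set on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def labels_removed_by_domain(node_labels):
    """Root-cause behaviour: drop everything under the AMD domains."""
    return sorted(key for key in node_labels if key.startswith(AMD_DOMAINS))


def labels_removed_if_managed(node_labels, managed_keys):
    """Resolution behaviour: drop only labeller-managed keys."""
    return sorted(key for key in node_labels if key in managed_keys)


# Hypothetical node with one user-defined label under amd.com.
node_labels = {
    "amd.com/gpu": "true",              # user-defined, should survive
    "amd.com/gpu.family": "AI",         # assumed to be labeller-managed
    "beta.amd.com/gpu.family.AI": "1",  # assumed to be labeller-managed
    "example.com/gpu-pool": "inbox",    # hypothetical custom selector label
}
managed_keys = {"amd.com/gpu.family", "beta.amd.com/gpu.family.AI"}

# Stand-ins for spec.selector and for a selector that does not fit this node.
custom_selector = {"example.com/gpu-pool": "inbox"}
other_selector = {"example.com/some-other-label": "true"}

print(labels_removed_by_domain(node_labels))
print(labels_removed_if_managed(node_labels, managed_keys))
print(matches_selector(node_labels, custom_selector))
print(matches_selector(node_labels, other_selector))
```

In this model, a node that matches only the customized selector is treated as a non-GPU node if the cleanup logic evaluates a different selector, which is one way to read the root cause of fix 1.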
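
The release notes do not describe the upgrade manager patch itself. As a hedged illustration of the two root causes listed for fix 3, the Python sketch below shows one common way to address them: compare a fingerprint of only the driver section of the spec instead of the whole resource generation, and treat an already-deleted reboot pod as a non-error. The function names, the spec layout, the version strings, and the in-memory pod store are assumptions made for this example.

```python
import hashlib
import json


def upgrade_fingerprint(spec):
    """Hash only the driver section, so unrelated spec edits leave it unchanged."""
    driver = spec.get("driver", {})
    payload = json.dumps(driver, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def delete_reboot_pod(pods, name):
    """Remove the reboot pod if present; a missing pod is not an error."""
    return pods.pop(name, None) is not None


# Hypothetical specs: only the metrics exporter setting differs in `edited`.
original = {"driver": {"enable": True, "version": "1.0"}, "metricsExporter": {"enable": False}}
edited = {"driver": {"enable": True, "version": "1.0"}, "metricsExporter": {"enable": True}}
upgraded = {"driver": {"enable": True, "version": "2.0"}, "metricsExporter": {"enable": False}}

print(upgrade_fingerprint(original) == upgrade_fingerprint(edited))    # unrelated change
print(upgrade_fingerprint(original) == upgrade_fingerprint(upgraded))  # driver change

# Hypothetical pod store standing in for the cluster API.
pods = {"reboot-node-1": "Running"}
print(delete_reboot_pod(pods, "reboot-node-1"))  # pod existed and was removed
print(delete_reboot_pod(pods, "reboot-node-1"))  # pod already gone, still no error
```

In both outcomes of `delete_reboot_pod`, the caller can go on to clear the reboot-in-progress state instead of waiting on a pod that no longer exists.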
</br></br>

## GPU Operator v1.2.1 Release Notes

The AMD GPU Operator v1.2.1 release introduces expanded platform support and new features to enhance GPU workload management. Notably, this release adds support for OpenShift and Microsoft Azure Kubernetes Service (AKS), and introduces two new **beta features**:
