Commit 7154424

chore: release notes 0.38.0 (#10231)
1 parent 13e49a7 commit 7154424

13 files changed: +175 -198 lines changed

docs/release-notes.rst

Lines changed: 175 additions & 0 deletions
@@ -6,6 +6,181 @@
Release Notes
###############

**************
Version 0.38
**************

Version 0.38.0
==============

**Release Date:** November 22, 2024

**Breaking Changes**

- ASHA: All experiments using ASHA hyperparameter search must now configure ``max_time`` and
  ``time_metric`` in the experiment config, instead of ``max_length``. Additionally, training code
  must report the configured ``time_metric`` in validation metrics. As a convenience, Determined
  training loops now automatically report ``batches`` and ``epochs`` with metrics, which you can
  use as your ``time_metric``. ASHA experiments without this modification will no longer run.

- Custom Searchers: All custom searchers, including DeepSpeed Autotune, were deprecated in
  ``0.36.0`` and have now been removed from the API. We will maintain first-class support for a
  variety of preset searchers, which can be easily :ref:`configured
  <experiment-configuration_searcher>` for any experiment. Visit :ref:`search-methods` for
  details.
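
As a rough illustration of the ASHA change described above, the sketch below writes the relevant
part of a searcher section as a Python dictionary (mirroring the YAML experiment config) and checks
it against the new requirement. The searcher name ``adaptive_asha``, the metric name, and the helper
``check_asha_searcher`` are assumptions made for this example; only ``max_time``, ``time_metric``,
``max_length``, and ``epochs`` come from the notes above.

```python
# Illustrative only: a searcher section written as a Python dict that mirrors
# the YAML experiment config. Everything except max_time, time_metric, and
# max_length is an assumption made for this sketch.
searcher = {
    "name": "adaptive_asha",      # assumed searcher name
    "metric": "validation_loss",  # hypothetical metric name
    "time_metric": "epochs",      # reported automatically by Determined training loops
    "max_time": 10,               # budget, expressed in units of time_metric
}


def check_asha_searcher(cfg):
    """Return a list of problems with an ASHA searcher config (hypothetical helper)."""
    problems = []
    for field in ("max_time", "time_metric"):
        if field not in cfg:
            problems.append(f"missing required field: {field}")
    if "max_length" in cfg:
        problems.append("max_length is no longer used by ASHA; use max_time and time_metric")
    return problems


print(check_asha_searcher(searcher))
print(check_asha_searcher({"name": "adaptive_asha", "max_length": 100}))
```

The second call shows what a pre-0.38 style config would trip over: both new fields are missing and
``max_length`` is still present.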

**New Features**

- API/CLI: Add support for access tokens, which users can create and administer to authenticate
  in automated workflows. Users can define the lifespan of each token, making it easier to
  securely authenticate and run processes. Global defaults and limits for the validity of access
  tokens can be set by configuring ``default_lifespan_days`` and ``max_lifespan_days`` in the
  master configuration. Setting ``max_lifespan_days`` to ``-1`` indicates an **infinite** lifespan
  for the access token. This allows tighter control over token usage and expiration. This feature
  requires Determined Enterprise Edition.

  - CLI:

    - ``det token create``: Create a new access token.
    - ``det token login``: Sign in with an access token.
    - ``det token edit``: Update an access token's description.
    - ``det token list``: List all active access tokens, with options for displaying revoked
      tokens.
    - ``det token describe``: Show details of specific access tokens.
    - ``det token revoke``: Revoke an access token.

  - API:

    - ``POST /api/v1/tokens``: Create a new access token.
    - ``GET /api/v1/tokens``: Retrieve a list of access tokens.
    - ``PATCH /api/v1/tokens/{token_id}``: Edit an existing access token.
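
To make the interaction between ``default_lifespan_days`` and ``max_lifespan_days`` concrete, here
is a minimal sketch of how a requested lifespan might be resolved. It is not the master's
implementation: the helper name ``resolve_lifespan_days`` is hypothetical, and the choice to reject
(rather than clamp) a request above the limit is an assumption. Only the two setting names and the
meaning of ``-1`` come from the notes above.

```python
INFINITE = -1  # per the notes above, -1 stands for an infinite lifespan


def resolve_lifespan_days(requested, default_lifespan_days, max_lifespan_days):
    """Pick a token lifespan in days (hypothetical helper, not Determined code).

    Assumption: a request above the configured maximum is rejected, not clamped.
    """
    days = default_lifespan_days if requested is None else requested
    if max_lifespan_days == INFINITE:
        # No upper bound is configured, so any lifespan is accepted.
        return days
    if days == INFINITE or days > max_lifespan_days:
        raise ValueError(f"lifespan exceeds max_lifespan_days={max_lifespan_days}")
    return days


print(resolve_lifespan_days(None, 30, 90))   # falls back to the default
print(resolve_lifespan_days(365, 30, -1))    # allowed because the maximum is infinite
```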

- API: Introduce ``keras.DeterminedCallback``, a new high-level training API for TF Keras that
  integrates Keras training code with Determined through a single :ref:`Keras Callback
  <api-keras-ug>`.

- API: Introduce ``deepspeed.Trainer``, a new high-level training API for DeepSpeedTrial that
  allows for Python-side training loop configurations and includes support for local training.

- Cluster: In the enterprise edition of Determined, add :ref:`config policies <config-policies>` to
  enable administrators to set limits on how users can define workloads (e.g., experiments,
  notebooks, TensorBoards, shells, and commands). Administrators can define two types of
  configurations:

  - **Invariant Configs for Experiments**: Settings applied to all experiments within a specific
    scope (global or workspace). Invariant configs for other tasks (e.g., notebooks, TensorBoards,
    shells, and commands) are not yet supported.

  - **Constraints**: Restrictions that prevent users from exceeding resource limits within a
    scope. Constraints can be set independently for experiments and tasks.

- Helm: Support configuring ``determined_master_host``, ``determined_master_port``, and
  ``determined_master_scheme``. These control how tasks address the Determined API server and are
  useful when installations span multiple Kubernetes clusters or there are proxies in between tasks
  and the master. Also, ``determined_master_host`` now defaults to the service host,
  ``<det_namespace>.<det_service_name>.svc.cluster.local``, instead of the service IP.

- Helm: Add support for capturing and restoring snapshots of the database persistent volume. Visit
  :ref:`helm-config-reference` for more details.

- New RBAC role: In the enterprise edition of Determined, add a ``TokenCreator`` RBAC role, which
  allows users to create, view, and revoke their own :ref:`access tokens <access-tokens>`. This
  role can only be assigned globally.

- Experiments: Add a ``name`` field to ``log_policies``. When a log policy matches, its name shows
  as a label in the WebUI, making it easy to spot specific issues during a run. Labels appear in
  both the run table and run detail views.

  In addition, there is a new format: ``name`` is required, and ``action`` is now a plain string.
  For more details, refer to :ref:`log_policies <config-log-policies>`.
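
The new ``log_policies`` format can be sketched as follows. The entries are written as Python
dictionaries mirroring the YAML experiment config, and the checker enforces only the two rules
stated above (``name`` is required, ``action`` is a plain string). The ``pattern`` field, the
``cancel_retries`` value, the nested legacy shape, and the helper ``validate_log_policy`` are
assumptions made for illustration.

```python
def validate_log_policy(policy):
    """Check one log policy against the two rules stated in the notes (hypothetical helper)."""
    errors = []
    name = policy.get("name")
    if not isinstance(name, str) or not name:
        errors.append("name is required")
    # action is only checked when present; whether it is mandatory is not stated above.
    if "action" in policy and not isinstance(policy["action"], str):
        errors.append("action must be a plain string")
    return errors


new_style = {
    "name": "CUDA OOM",                       # shown as a label in the WebUI on a match
    "pattern": ".*CUDA out of memory.*",      # assumed field name
    "action": "cancel_retries",               # assumed action value
}
legacy_style = {
    "pattern": ".*CUDA out of memory.*",
    "action": {"type": "cancel_retries"},     # assumed legacy nested shape
}

print(validate_log_policy(new_style))
print(validate_log_policy(legacy_style))
```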

**Improvements**

- Master Configuration: Add support for configuring the cryptosystem used for SSH connections.
  ``security.key_type`` now accepts ``RSA``, ``ECDSA``, or ``ED25519``. The default key type has
  changed from ``1024-bit RSA`` to ``ED25519``, since ``ED25519`` keys are faster and more secure
  than the old default, and ``ED25519`` is also the default key type for ``ssh-keygen``.
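
A minimal sketch of how the accepted values and the new default fit together, assuming the
``security`` section of the master configuration has been loaded into a dictionary. The helper
``resolve_key_type`` is hypothetical, and treating the values as case-sensitive is an assumption.

```python
ALLOWED_KEY_TYPES = ("RSA", "ECDSA", "ED25519")  # values listed in the notes above
DEFAULT_KEY_TYPE = "ED25519"                     # new default as of this release


def resolve_key_type(security):
    """Return the SSH key type for a loaded ``security`` section (hypothetical helper)."""
    key_type = security.get("key_type", DEFAULT_KEY_TYPE)
    if key_type not in ALLOWED_KEY_TYPES:
        raise ValueError(f"unsupported security.key_type: {key_type!r}")
    return key_type


print(resolve_key_type({}))                   # no key_type set, so the default applies
print(resolve_key_type({"key_type": "RSA"}))  # explicit opt-in to RSA
```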

**Removed Features**

- WebUI: "Continue Training" no longer supports a configurable number of batches. It now simply
  resumes the trial from the last checkpoint.

**Known Issues**

- PyTorch has `deprecated
  <https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html#use-tensorboard-to-view-results-and-analyze-model-performance>`_
  its Profiler TensorBoard Plugin (``tb_plugin``), so some features may not be compatible with
  PyTorch 2.0 and above. Our current default environment image comes with PyTorch 2.3. If you
  experience issues with this plugin, we suggest using an image with a PyTorch version earlier
  than 2.0.

**Bug Fixes**

- Previously, during a grid search, if a hyperparameter contained an empty nested hyperparameter
  (that is, just an empty map), that hyperparameter would not appear in the hparams passed to the
  trial. Such hyperparameters are now passed through to the trial.

**Deprecations**

- Experiment Config: The ``max_length`` field of the searcher configuration section has been
  deprecated for all experiments and searchers. Users are expected to configure the desired
  training length directly in training code.

- Experiment Config: The ``optimizations`` config has been deprecated. Please see :ref:`Training
  APIs <apis-howto-overview>` to configure supported optimizations through training code directly.

- Experiment Config: The ``scheduling_unit``, ``min_checkpoint_period``, and
  ``min_validation_period`` config fields have been deprecated. Instead, these configuration
  options should be specified in training code.

- Experiment Config: The ``entrypoint`` field no longer accepts ``model_def:TrialClass`` as a
  trial definition. Please invoke your training script directly (``python3 train.py``).

- Core API: The ``SearcherContext`` (``core.searcher``) has been deprecated. Training code no
  longer requires ``core.searcher.operations`` to run, and progress should be reported through
  ``core.train.report_progress``.

- DeepSpeed: The ``num_micro_batches_per_slot`` and ``train_micro_batch_size_per_gpu`` attributes
  on ``DeepSpeedContext`` have been replaced with the methods ``get_num_micro_batches_per_slot()``
  and ``get_train_micro_batch_size_per_gpu()``, respectively.

- Horovod: The Horovod distributed training backend has been deprecated. Users are encouraged to
  migrate to the native distributed backend of their training framework (``torch.distributed`` or
  ``tf.distribute``).

- Trial APIs: ``TFKerasTrial`` has been deprecated. Users are encouraged to migrate to the new
  :ref:`Keras Callback <api-keras-ug>`.

- Launchers: The ``--trial`` argument in Determined launchers has been deprecated. Please invoke
  your training script directly.

- ASHA: The ``stop_once`` field of the ``searcher`` config for ASHA searchers has been deprecated.
  All ASHA searches are now early-stopping based (``stop_once: true``) instead of promotion based.

- CLI: The ``--test`` and ``--local`` flags for ``det experiment create`` have been deprecated. All
  training APIs now support local execution (``python3 train.py``). Please see :ref:`Training APIs
  <apis-howto-overview>` for details specific to your framework.

- WebUI: Previously, trials that reported an ``epoch`` metric enabled an epoch X-axis in the WebUI
  metrics tab. This metric name has been changed to ``epochs``, with ``epoch`` as a fallback
  option.

- Database: After Amazon Aurora V1 reaches End of Life, support for Amazon Aurora V1 in
  ``det deploy aws`` will be removed. Future deployments will default to the ``simple-rds`` type,
  which uses Amazon RDS for PostgreSQL. We recommend that users migrate to Amazon RDS for
  PostgreSQL. For more information, visit the `migration instructions
  <https://gist.github.com/maxrussell/c67f4f7d586d55c4eb2658cc2dd1c290>`_.

- Database: As a follow-up to the earlier notice, PostgreSQL 12 will reach End of Life on November
  14, 2024. Instances still using PostgreSQL 12 or earlier should upgrade to PostgreSQL 13 or later
  to maintain compatibility. The application will log a warning if it detects a connection to any
  PostgreSQL version older than 12, and this warning will be updated to include PostgreSQL 12 once
  it is End of Life.
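
For the Core API deprecation listed above, the following sketch shows the general shape of a
training loop that reports progress itself instead of iterating over searcher operations. The loop
takes the reporting function as a parameter so that the example runs on its own. In a real job the
callable would be the Core API method referred to above as ``core.train.report_progress``; that it
accepts a single fraction between 0 and 1 is an assumption of this sketch, and the ``train``
function itself is hypothetical.

```python
def train(report_progress, max_epochs=3):
    """Run a fixed number of epochs and report fractional progress after each one."""
    for epoch in range(1, max_epochs + 1):
        # ... one epoch of training and validation would go here ...
        report_progress(epoch / max_epochs)


# Stand-in for the real reporting call: collect the reported values in a list.
reported = []
train(reported.append, max_epochs=4)
print(reported)  # [0.25, 0.5, 0.75, 1.0]
```

Because the training length now lives in training code (``max_epochs`` here), nothing in this loop
depends on ``max_length`` or on ``core.searcher.operations``.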

**************
Version 0.37
**************

docs/release-notes/9966-fix-grid.rst

Lines changed: 0 additions & 7 deletions
This file was deleted.

docs/release-notes/add-host-port-scheme-to-helm.rst

Lines changed: 0 additions & 9 deletions
This file was deleted.

docs/release-notes/api-cli-access-token.rst

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/release-notes/config-policies.rst

Lines changed: 0 additions & 15 deletions
This file was deleted.

docs/release-notes/helm-db-snapshot.rst

Lines changed: 0 additions & 6 deletions
This file was deleted.

docs/release-notes/log-signal.rst

Lines changed: 0 additions & 10 deletions
This file was deleted.

docs/release-notes/pytorch-tensorboard-plugin.rst

Lines changed: 0 additions & 10 deletions
This file was deleted.

docs/release-notes/rbac-new-tokenCreator-role.rst

Lines changed: 0 additions & 7 deletions
This file was deleted.

docs/release-notes/remove-custom-searcher.rst

Lines changed: 0 additions & 7 deletions
This file was deleted.

docs/release-notes/searcher-context-removal.rst

Lines changed: 0 additions & 72 deletions
This file was deleted.

docs/release-notes/ssh-crypto-system.rst

Lines changed: 0 additions & 8 deletions
This file was deleted.

docs/release-notes/unsupport-aurora-postgres-reminder.rst

Lines changed: 0 additions & 19 deletions
This file was deleted.
