MaxText supports automatic measurement and upload of workload metrics such as Goodput, Badput Breakdown and Step Time Deviation using the ML Goodput Measurement library.
The ML Goodput Measurement library currently supports monitoring workloads running on Google Cloud Platform. For more details about the library, visit the GitHub page or the ml-goodput-measurement PyPI package documentation.
Goodput is the metric that measures the efficiency of model training jobs, i.e. the productive time spent on training progress as a proportion of the total time spent by the workload. It gives users an actionable way to see where they can improve to get the most value from their accelerators.
Badput is the metric that measures the time a workload spends on anything that is not productive training, as a proportion of the total time spent by the workload. For example, the time spent on accelerator initialization, training preparation, program startup, data loading, portions of checkpointing, disruptions, and wasted progress since the last checkpoint all contribute to Badput.
The ML Goodput Measurement library exposes Badput Breakdown. Further details of each bucket can be found here
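To make the two definitions concrete, here is a minimal sketch of the arithmetic. It is not the library's implementation. The function `summarize_goodput`, the bucket names, and the durations are hypothetical and chosen only for illustration.

```python
def summarize_goodput(productive_seconds, badput_seconds_by_bucket):
    """Return (goodput %, {bucket: badput %}) relative to total workload time.

    Hypothetical helper: total time is assumed to be productive time plus
    the sum of all badput buckets.
    """
    total = productive_seconds + sum(badput_seconds_by_bucket.values())
    if total <= 0:
        raise ValueError("total workload time must be positive")
    goodput_pct = 100.0 * productive_seconds / total
    badput_pct = {
        bucket: 100.0 * seconds / total
        for bucket, seconds in badput_seconds_by_bucket.items()
    }
    return goodput_pct, badput_pct


# Illustrative numbers only: 1800 s of productive training, 200 s of badput.
example_badput = {
    "program_startup": 120.0,
    "data_loading": 60.0,
    "checkpointing": 20.0,
}
goodput_pct, badput_pct = summarize_goodput(1800.0, example_badput)
print(f"Goodput: {goodput_pct:.1f}%")
for bucket, pct in badput_pct.items():
    print(f"Badput[{bucket}]: {pct:.1f}%")
```

Under these assumptions, Goodput and the badput buckets always sum to 100% of the total workload time.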
Step Time Deviation is the metric that measures deviation of step time from ideal step time.
The ML Goodput Measurement library exposes step time deviation by computing ideal step time or allowing users to configure ideal step time.
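The sketch below illustrates the idea of step time deviation with either a configured or a computed ideal step time. It makes two assumptions that are not taken from the library: the computed ideal is the median of the observed step times, and deviation is the signed difference between each step time and the ideal. The function name `step_time_deviations` is hypothetical.

```python
from statistics import median


def step_time_deviations(step_times, configured_ideal_step_time=None):
    """Return (ideal step time, per-step deviations in seconds).

    Assumption: when no ideal step time is configured, the median of the
    observed step times is used as the ideal. The library may compute it
    differently.
    """
    if not step_times:
        raise ValueError("step_times must not be empty")
    if configured_ideal_step_time is not None:
        ideal = configured_ideal_step_time
    else:
        ideal = median(step_times)
    return ideal, [t - ideal for t in step_times]


# Illustrative step times in seconds; the last step is a slow outlier.
observed = [1.0, 1.0, 1.5, 1.0, 3.0]
computed_ideal, computed_dev = step_time_deviations(observed)
configured_ideal, configured_dev = step_time_deviations(observed, 2.0)
```

Passing `configured_ideal_step_time` mirrors the option of letting users supply the ideal value instead of having it computed.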
The usage of this package requires the setup of a Google Cloud project with billing enabled to properly use Google Cloud Logging. If you don't have a Google Cloud project, or if you don't have billing enabled for your Google Cloud project, then do the following:
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project. Instructions can be found here
- Enable the Cloud Logging API.
- To run your training on Cloud accelerator, set up the environment by following instructions here.
- To learn more about Google Cloud Logging, visit this page.
You will need both read and write access scopes for cloud logging on both the GPU or TPU and CPU node pools. Full cloud logging access is granted by the following access scope during node pool creation:
https://www.googleapis.com/auth/cloud-platform
XPK adds this access scope to the GPU, TPU and CPU node pools, so XPK is the recommended method to create clusters and node pools if you intend to run your workloads on GKE.
Instructions on how to create clusters using XPK can be found here and how to create workloads using XPK can be found here.
NOTE: Access Scopes are immutable and workloads can only be migrated to new node pools with required access scopes. Access scopes on already created clusters cannot be updated.
IMPORTANT: Ensure unique run_name for each new experiment or run
Please use a unique workload name, unless you intend to monitor cumulative Goodput/Badput metrics of a previous workload along with your current workload.
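One simple way to keep run names unique is to append a UTC timestamp to a fixed prefix. This is a minimal sketch, not something MaxText provides; the helper `make_run_name` and the prefix are hypothetical. It also assumes that two runs with the same prefix are never launched within the same second.

```python
from datetime import datetime, timezone


def make_run_name(prefix, now=None):
    """Build a run name such as 'goodput-test-20240102-030405'.

    Hypothetical helper: `now` can be injected for reproducibility and
    otherwise defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


run_name = make_run_name("goodput-test")
print(run_name)  # pass this value as run_name=... when launching training
```

To keep accumulating Goodput and Badput for an earlier workload, reuse its existing `run_name` instead of generating a new one.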
MaxText enables Goodput recording and monitoring by default with enable_goodput_recording=True and monitor_goodput=True. You can configure the goodput upload frequency by setting goodput_upload_interval_seconds.
python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=$OUTPUT_PATH dataset_path=$DATA_PATH run_name=goodput-test-run steps=200 goodput_upload_interval_seconds=30
MaxText enables step time deviation monitoring by default with monitor_step_time_deviation=True. You can configure the upload frequency by setting step_deviation_interval_seconds.
python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=$OUTPUT_PATH dataset_path=$DATA_PATH run_name=goodput-test-run steps=200 step_deviation_interval_seconds=30
By default, MaxText disables Goodput measurement for Pathways workloads with enable_pathways_goodput=False. You can enable Pathways Goodput by setting this flag to True.
NOTE: Enabling enable_pathways_goodput turns on Goodput measurement for Pathways workloads, and does not update any Pathways features.
python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=$OUTPUT_PATH dataset_path=$DATA_PATH run_name=goodput-test-run steps=200 goodput_upload_interval_seconds=30 enable_pathways_goodput=True
Checkpoint logging is currently supported through Orbax. The Goodput library reads these logs to compute checkpointing badput.
To enable checkpoint logging, set the enable_checkpoint_cloud_logger MaxText flag to True.
If this flag is turned off, the badput due to checkpointing will incorrectly be computed as 0.
If checkpointing is enabled, please enable the enable_checkpoint_cloud_logger flag for accurate results.
- MaxText installs the required packages on setup: tensorboard-plugin-profile, tensorflow and tensorboard.
- Follow the TensorBoard URL in the MaxText logs to view all metrics in one location.
By default, performance data (goodput, badput, and step deviation) is automatically sent to Google Cloud Monitoring, enabling visualization on dashboards.
This feature leverages Google VM metadata (project ID, location, accelerator type) and supports replica IDs for uniquely identifying workloads in multi-replica deployments.
This feature is enabled by default, and no changes to the Monitoring API call are needed if you want to keep it enabled.
gcp_options = goodput_utils.GCPOptions(
    project_id=None,  # If None, the library will automatically identify from GCE internal metadata
    location=None,  # If None, the library will automatically identify from GCE internal metadata
    replica_id='0',  # Default is '0'
    acc_type=None,  # If None, the library will automatically identify from GCE internal metadata
    enable_gcp_goodput_metrics=True,
    enable_gcp_step_deviation_metrics=True,
)
goodput_monitor = monitoring.GoodputMonitor(
    job_name=config.run_name,
    logger_name=logger_name,
    tensorboard_dir=config.tensorboard_dir,
    upload_interval=config.goodput_upload_interval_seconds,
    monitoring_enabled=True,
    include_badput_breakdown=True,
    include_step_deviation=True,
    configured_ideal_step_time=None,  # Optional, the library will compute ideal step time if it is not provided
    gcp_options=gcp_options,
)
If you do not want to send metrics to Google Cloud Monitoring, set enable_gcp_goodput_metrics to False to disable goodput metrics and enable_gcp_step_deviation_metrics to False to disable step deviation metrics when creating the GCPOptions object.
Setting monitoring_enabled to False will disable both TensorBoard and GCM monitoring.
gcp_options = goodput_utils.GCPOptions(
    project_id=None,  # If None, the library will automatically identify from GCE internal metadata
    location=None,  # If None, the library will automatically identify from GCE internal metadata
    replica_id='0',  # Default is '0'
    acc_type=None,  # If None, the library will automatically identify from GCE internal metadata
    enable_gcp_goodput_metrics=False,
    enable_gcp_step_deviation_metrics=False,
)
goodput_monitor = monitoring.GoodputMonitor(
    job_name=config.run_name,
    logger_name=logger_name,
    tensorboard_dir=config.tensorboard_dir,
    upload_interval=config.goodput_upload_interval_seconds,
    monitoring_enabled=True,
    include_badput_breakdown=True,
    include_step_deviation=True,
    configured_ideal_step_time=None,
    gcp_options=gcp_options,
)
Goodput, Badput and Step Time Deviation metrics can be visualized using GCM dashboards:
- Navigate to your project's Google Cloud Monitoring Dashboards page.
- Create a Custom Dashboard if you do not have one already, and select the metrics you want to monitor.
  - compute.googleapis.com/workload/goodput_time (Goodput)
  - compute.googleapis.com/workload/badput_time (Badput Breakdown)
  - compute.googleapis.com/workload/performance (Step Time Deviation)
Goodput, Badput and Step Time Deviation metrics can be monitored using GCM Metrics Explorer:
- Navigate to your project's Google Cloud Monitoring Metrics Explorer page.
- Select the metrics you want to monitor:
  - compute.googleapis.com/workload/goodput_time (Goodput)
  - compute.googleapis.com/workload/badput_time (Badput Breakdown)
  - compute.googleapis.com/workload/performance (Step Time Deviation)
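If you prefer to query these metrics programmatically rather than through the console, the metric types listed above can be turned into Cloud Monitoring filter expressions. The sketch below only builds the filter strings using the standard `metric.type = "..."` filter syntax; the names `WORKLOAD_METRICS` and `metric_filter` are hypothetical, and any further label filtering (for example by workload or replica) is left out because the label names are not given here.

```python
# Metric types as listed in this guide, keyed by the quantity they report.
WORKLOAD_METRICS = {
    "Goodput": "compute.googleapis.com/workload/goodput_time",
    "Badput Breakdown": "compute.googleapis.com/workload/badput_time",
    "Step Time Deviation": "compute.googleapis.com/workload/performance",
}


def metric_filter(metric_type):
    """Build a Cloud Monitoring filter that selects a single metric type."""
    return f'metric.type = "{metric_type}"'


filters = {name: metric_filter(t) for name, t in WORKLOAD_METRICS.items()}
for name, expression in filters.items():
    print(f"{name}: {expression}")
```

Each resulting string can be supplied as the filter of a Cloud Monitoring time series query.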