ML Goodput Measurement

MaxText supports automatic measurement and upload of workload metrics such as Goodput, Badput Breakdown and Step Time Deviation using the ML Goodput Measurement library.

The ML Goodput Measurement library currently supports monitoring workloads running on Google Cloud Platform. For details of the library, see its GitHub page or the ml-goodput-measurement PyPI package documentation.

What is Goodput

Goodput measures the efficiency of a model training job: the time spent making productive training progress, as a proportion of the total time spent by the workload. It gives users an actionable way to see where they can improve and get the most value from their accelerators.
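As a minimal sketch of the definition above, the ratio can be computed directly from two durations. The function name and the sample numbers are hypothetical and only illustrate the arithmetic; they are not part of the ML Goodput Measurement library.

```python
def goodput_fraction(productive_seconds: float, total_seconds: float) -> float:
    """Return productive training time as a fraction of total workload time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= productive_seconds <= total_seconds:
        raise ValueError("productive_seconds must lie in [0, total_seconds]")
    return productive_seconds / total_seconds


# Hypothetical job: 10 hours of wall-clock time, 8.5 hours of productive training.
print(f"Goodput: {goodput_fraction(8.5 * 3600, 10 * 3600):.1%}")  # prints "Goodput: 85.0%"
```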

What is Badput

Badput measures the time a workload spends on anything other than productive training, as a proportion of the total time spent by the workload. For example, time spent in accelerator initialization, training preparation, program startup, data loading, portions of checkpointing, disruptions, and wasted progress since the last checkpoint all contribute to Badput.

The ML Goodput Measurement library exposes a Badput Breakdown. Further details of each bucket can be found here.
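To illustrate what a breakdown means, the sketch below turns per-bucket durations into proportions of total workload time. The bucket names and durations are hypothetical placeholders chosen for the example, and the function is not the library's API.

```python
def badput_breakdown(bucket_seconds: dict[str, float], total_seconds: float) -> dict[str, float]:
    """Express each unproductive bucket as a fraction of total workload time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if sum(bucket_seconds.values()) > total_seconds:
        raise ValueError("bucket durations exceed total workload time")
    return {name: seconds / total_seconds for name, seconds in bucket_seconds.items()}


# Hypothetical buckets for a 6000-second workload.
buckets = {"program_startup": 120.0, "data_loading": 300.0, "checkpoint_save": 180.0}
breakdown = badput_breakdown(buckets, total_seconds=6000.0)
for name, fraction in breakdown.items():
    print(f"{name}: {fraction:.1%}")
```

Under the definitions in this document, whatever is not Badput is Goodput, so the bucket fractions and the Goodput fraction together account for the whole workload time.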

What is Step Time Deviation

Step Time Deviation measures how far the actual step time deviates from the ideal step time.

The ML Goodput Measurement library exposes Step Time Deviation either by computing the ideal step time itself or by letting users configure it.
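The following sketch shows the idea with plain Python. How the library derives the ideal step time is not stated here, so the median used below is purely an assumption for illustration; the function name is hypothetical as well.

```python
from statistics import median
from typing import Optional, Sequence


def step_time_deviations(
    step_times: Sequence[float],
    configured_ideal_step_time: Optional[float] = None,
) -> list[float]:
    """Return each step's deviation (in seconds) from the ideal step time.

    If no ideal step time is configured, fall back to the median of the
    observed step times (an assumption made for this example only).
    """
    if not step_times:
        raise ValueError("step_times must not be empty")
    if configured_ideal_step_time is not None:
        ideal = configured_ideal_step_time
    else:
        ideal = median(step_times)
    return [t - ideal for t in step_times]


observed = [1.0, 1.0, 1.5, 1.0, 3.0]  # hypothetical step times in seconds
print(step_time_deviations(observed))
print(step_time_deviations(observed, configured_ideal_step_time=0.5))
```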

Prerequisites

This package requires a Google Cloud project with billing enabled in order to use Google Cloud Logging. If you don't have a Google Cloud project, or if billing is not enabled for your project, do the following:

1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
2. Make sure that billing is enabled for your Google Cloud project. Instructions can be found here.
3. Enable the Cloud Logging API.

To run your training on a Cloud accelerator, set up the environment by following the instructions here.

To learn more about Google Cloud Logging, visit this page.

Access Scopes

You will need both read and write access scopes for Cloud Logging on the accelerator (GPU or TPU) node pools as well as on the CPU node pools. The following access scope, set during node pool creation, grants full Cloud Logging access:

https://www.googleapis.com/auth/cloud-platform

XPK adds this access scope to the GPU, TPU and CPU node pools, so XPK is the recommended method for creating clusters and node pools if you intend to run your workloads on GKE.

Instructions for creating clusters with XPK can be found here, and instructions for creating workloads with XPK can be found here.

NOTE: Access scopes are immutable and cannot be updated on clusters that have already been created. To obtain the required access scopes, workloads must be migrated to new node pools created with those scopes.
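Because scopes cannot be changed later, it can help to check them before launching a workload. The sketch below is a simplified check over a list of scope URLs supplied as newline-separated text (assumed here to be how you obtained them, for example from a node's metadata). The function name is hypothetical, and the narrower logging.read and logging.write scopes are included as an assumption about what "read and write" access amounts to; this document itself only names the cloud-platform scope.

```python
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"
LOGGING_READ_SCOPE = "https://www.googleapis.com/auth/logging.read"
LOGGING_WRITE_SCOPE = "https://www.googleapis.com/auth/logging.write"


def has_logging_read_write(scopes_text: str) -> bool:
    """Return True if the given scopes cover both reading and writing logs."""
    scopes = {line.strip() for line in scopes_text.splitlines() if line.strip()}
    if CLOUD_PLATFORM_SCOPE in scopes:
        # The broad cloud-platform scope grants full Cloud Logging access.
        return True
    return {LOGGING_READ_SCOPE, LOGGING_WRITE_SCOPE} <= scopes


# Hypothetical node pool that can write logs but not read them.
write_only = "https://www.googleapis.com/auth/logging.write\n"
print(has_logging_read_write(write_only))            # write-only scope is not enough
print(has_logging_read_write(CLOUD_PLATFORM_SCOPE))  # cloud-platform is sufficient
```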

Monitoring

IMPORTANT: Ensure a unique run_name for each new experiment or run.

Please use a unique run_name for each workload, unless you intend to monitor the cumulative Goodput/Badput metrics of a previous workload together with your current workload.
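One simple way to avoid accidental reuse is to derive the name from a prefix, a UTC timestamp and a short random suffix. This is only a sketch: the helper name and the naming scheme are hypothetical and not something MaxText requires.

```python
import uuid
from datetime import datetime, timezone


def unique_run_name(prefix: str) -> str:
    """Build a run name that is very unlikely to collide with earlier runs."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:6]  # short random component
    return f"{prefix}-{timestamp}-{suffix}"


# The result can be passed to MaxText as run_name=<value>.
print(unique_run_name("goodput-test-run"))
```

To deliberately accumulate metrics with a previous workload, reuse that workload's exact name instead of generating a new one.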

How to Monitor Goodput and Badput

MaxText enables Goodput recording and monitoring by default with enable_goodput_recording=True and monitor_goodput=True. You can configure the goodput upload frequency by setting goodput_upload_interval_seconds.

python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=$OUTPUT_PATH dataset_path=$DATA_PATH run_name=goodput-test-run steps=200 goodput_upload_interval_seconds=30

How to Monitor Step Time Deviation

MaxText enables step time deviation monitoring by default with monitor_step_time_deviation=True. You can configure the upload frequency by setting step_deviation_interval_seconds.

python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=$OUTPUT_PATH dataset_path=$DATA_PATH run_name=goodput-test-run steps=200 step_deviation_interval_seconds=30

How to enable Pathways Goodput

By default, MaxText does not compute Goodput metrics for Pathways workloads (enable_pathways_goodput=False). You can enable Pathways Goodput by setting this flag to True.

NOTE: Enabling enable_pathways_goodput turns on Goodput measurement for Pathways workloads, and does not update any Pathways features.

python3 -m MaxText.train MaxText/configs/base.yml base_output_directory=$OUTPUT_PATH dataset_path=$DATA_PATH run_name=goodput-test-run steps=200 goodput_upload_interval_seconds=30 enable_pathways_goodput=True
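The commands above differ only in their key=value overrides, so a launcher script can assemble them from a dictionary. The sketch below builds the argument list using only the flags shown in this section; the helper name and the bucket paths are hypothetical placeholders.

```python
import shlex
from typing import Mapping


def maxtext_train_command(
    overrides: Mapping[str, object],
    config_path: str = "MaxText/configs/base.yml",
) -> list[str]:
    """Build an argv list for `python3 -m MaxText.train` with key=value overrides."""
    command = ["python3", "-m", "MaxText.train", config_path]
    command.extend(f"{key}={value}" for key, value in overrides.items())
    return command


command = maxtext_train_command(
    {
        "base_output_directory": "gs://my-bucket/output",  # placeholder path
        "dataset_path": "gs://my-bucket/data",  # placeholder path
        "run_name": "goodput-test-run",
        "steps": 200,
        "goodput_upload_interval_seconds": 30,
        "step_deviation_interval_seconds": 30,
        "enable_pathways_goodput": True,
    }
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine where MaxText is installed.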

How to enable Checkpoint Logging

Checkpoint logging is currently supported through Orbax. The Goodput library reads these logs to compute checkpointing badput. To enable checkpoint logging, set the enable_checkpoint_cloud_logger MaxText flag to True.

If this flag is turned off, the badput due to checkpointing will incorrectly be computed as 0.

If checkpointing is enabled, please also enable the enable_checkpoint_cloud_logger flag for accurate results.

Visualize on TensorBoard

MaxText installs the required packages on setup: tensorboard-plugin-profile, tensorflow and tensorboard.
Follow the TensorBoard URL in the MaxText logs to view all metrics in one location.

Visualize Goodput, Badput and Step Deviation on Google Cloud Monitoring

By default, performance data (goodput, badput, and step deviation) is automatically sent to Google Cloud Monitoring, enabling visualization on dashboards.

This feature leverages Google VM metadata (project ID, location, accelerator type) and supports replica IDs for uniquely identifying workloads in multi-replica deployments.

Because the feature is on by default, no changes to the Monitoring API call are needed to keep it enabled. The default configuration looks like this:

gcp_options = goodput_utils.GCPOptions(
    project_id=None,  # If None, the library will automatically identify from GCE internal metadata
    location=None,  # If None, the library will automatically identify from GCE internal metadata
    replica_id='0',  # Default is '0'
    acc_type=None,  # If None, the library will automatically identify from GCE internal metadata
    enable_gcp_goodput_metrics=True,
    enable_gcp_step_deviation_metrics=True,
)

goodput_monitor = monitoring.GoodputMonitor(
    job_name=config.run_name,
    logger_name=logger_name,
    tensorboard_dir=config.tensorboard_dir,
    upload_interval=config.goodput_upload_interval_seconds,
    monitoring_enabled=True,
    include_badput_breakdown=True,
    include_step_deviation=True,
    configured_ideal_step_time=None,  # Optional, the library will compute ideal step time if it is not provided
    gcp_options=gcp_options,
)

If you do not want to send metrics to Google Cloud Monitoring, set enable_gcp_goodput_metrics to False to disable goodput metrics and enable_gcp_step_deviation_metrics to False to disable step deviation metrics when creating the GCPOptions object.

Setting monitoring_enabled to False disables both TensorBoard and GCM monitoring.

gcp_options = goodput_utils.GCPOptions(
    project_id=None,  # If None, the library will automatically identify from GCE internal metadata
    location=None,  # If None, the library will automatically identify from GCE internal metadata
    replica_id='0',  # Default is '0'
    acc_type=None,  # If None, the library will automatically identify from GCE internal metadata
    enable_gcp_goodput_metrics=False,
    enable_gcp_step_deviation_metrics=False,
)

goodput_monitor = monitoring.GoodputMonitor(
    job_name=config.run_name,
    logger_name=logger_name,
    tensorboard_dir=config.tensorboard_dir,
    upload_interval=config.goodput_upload_interval_seconds,
    monitoring_enabled=True,
    include_badput_breakdown=True,
    include_step_deviation=True,
    configured_ideal_step_time=None,
    gcp_options=gcp_options,
)
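The interaction of these flags, as described in this section, can be summarized in a small truth-table helper. This is only a model of the behavior stated above (monitoring_enabled gates everything, and the two GCPOptions flags gate the Google Cloud Monitoring uploads); the function and the dictionary keys are hypothetical and not part of the library.

```python
def active_metric_sinks(
    monitoring_enabled: bool,
    enable_gcp_goodput_metrics: bool,
    enable_gcp_step_deviation_metrics: bool,
) -> dict[str, bool]:
    """Return which destinations receive metrics for a given flag combination."""
    return {
        "tensorboard": monitoring_enabled,
        "gcm_goodput": monitoring_enabled and enable_gcp_goodput_metrics,
        "gcm_step_deviation": monitoring_enabled and enable_gcp_step_deviation_metrics,
    }


# Configuration from the example above: TensorBoard only, no GCM uploads.
print(active_metric_sinks(True, False, False))
```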

Monitoring Dashboards

Goodput, Badput and Step Time Deviation metrics can be visualized using GCM dashboards:

Navigate to your project's Google Cloud Monitoring Dashboards page.
Create a Custom Dashboard if you do not have one already, and select the metrics you want to monitor.
- compute.googleapis.com/workload/goodput_time (Goodput)
- compute.googleapis.com/workload/badput_time (Badput Breakdown)
- compute.googleapis.com/workload/performance (Step Time Deviation)

Monitoring Raw Metrics

Goodput, Badput and Step Time Deviation metrics can be monitored using GCM Metrics Explorer:

Navigate to your project's Google Cloud Monitoring Metrics Explorer page.
Select the metrics you want to monitor:
- compute.googleapis.com/workload/goodput_time (Goodput)
- compute.googleapis.com/workload/badput_time (Badput Breakdown)
- compute.googleapis.com/workload/performance (Step Time Deviation)
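If you prefer to look the metrics up programmatically, the metric types listed above can be turned into Cloud Monitoring filter expressions. The sketch below only builds the filter strings; the dictionary keys and the helper name are hypothetical, and actually querying time series would additionally require a Cloud Monitoring client and credentials, which are not shown.

```python
WORKLOAD_METRICS = {
    "goodput": "compute.googleapis.com/workload/goodput_time",
    "badput": "compute.googleapis.com/workload/badput_time",
    "step_time_deviation": "compute.googleapis.com/workload/performance",
}


def metric_filter(metric_name: str) -> str:
    """Return a Cloud Monitoring filter that selects one workload metric type."""
    metric_type = WORKLOAD_METRICS[metric_name]
    return f'metric.type = "{metric_type}"'


for name in WORKLOAD_METRICS:
    print(metric_filter(name))
```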
