
RegressLM: Easy Text-to-Text Regression


Overview

RegressLM is a library for text-to-text regression. It works with any string input representation and supports pretraining and fine-tuning over multiple regression tasks.

[Figure: RegressLM decoding a numerical performance metric from text.]

Example Application: Directly regressing performance metrics from unstructured, textually represented system states in Google's massive compute clusters.

Usage

There are two main stages: inference and pretraining (optional).

Inference

The intended use case is to import the RegressLM class, which decodes floating-point predictions from a given input and can also be fine-tuned against new data.

from regress_lm import core
from regress_lm import rlm

# Create RegressLM with max input token length.
reg_lm = rlm.RegressLM.from_default(max_input_len=2048)

# Example (x,y) pairs, which can be fine-tuned against.
examples = [core.Example(x='hello', y=0.3), core.Example(x='world', y=-0.3)]
reg_lm.fine_tune(examples)

# Query inputs.
query1, query2 = core.ExampleInput(x='hi'), core.ExampleInput(x='bye')
samples1, samples2 = reg_lm.sample([query1, query2], num_samples=128)
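
Each result holds the decoded samples for one query, which can be aggregated into a point estimate. A minimal sketch, assuming samples1 and samples2 are sequences of 128 floats:

import numpy as np

# Aggregate decoded samples into point estimates; the median is
# robust to occasional outlier decodes.
pred1, pred2 = np.median(samples1), np.median(samples2)
# The sample spread can serve as a rough uncertainty proxy.
uncertainty1 = np.std(samples1)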

Pretraining

To produce better initial checkpoints for transfer learning, we recommend pretraining on large amounts of your own training data. Example pseudocode with PyTorch:

from torch import optim
from regress_lm import core
from regress_lm.models.pytorch import model as torch_model_lib

model = torch_model_lib.PyTorchModel(...)
optimizer = optim.Adafactor(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4
)
for _ in range(...):
  examples = [core.Example(x=..., y=...), ...]
  tensor_examples = model.convert(examples)
  optimizer.zero_grad()
  loss, _ = model.compute_loss_and_metrics(tensor_examples)
  loss.backward()
  optimizer.step()
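
After pretraining, save the weights so later fine-tuning can warm-start from them. A minimal sketch using standard PyTorch serialization (the checkpoint filename is illustrative, not a library convention):

import torch

# Persist the pretrained weights for transfer learning.
torch.save(model.state_dict(), 'pretrained_rlm.pt')

# Later: restore them before fine-tuning on a new task.
model.load_state_dict(torch.load('pretrained_rlm.pt'))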

Boosting Performance

Below, we describe several ways to improve performance.

Train Custom Vocabulary

You can generate a custom vocabulary trained on an offline corpus of data, mydata.txt:

encoder_vocab = SentencePieceVocab.from_corpus(corpus_path='mydata.txt', vocab_size=1024)
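
The corpus is an ordinary text file. A minimal sketch for producing mydata.txt, assuming the usual SentencePiece convention of one raw input string per line (the example inputs here are hypothetical):

# Hypothetical raw inputs; in practice, collect the x-strings from
# your offline training data.
inputs = ['state: cpu=0.91 mem=0.40', 'state: cpu=0.12 mem=0.77']
with open('mydata.txt', 'w') as f:
  for x in inputs:
    f.write(x + '\n')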

Larger Sizes

Larger model sizes may improve performance, albeit at higher computational cost:

model = torch_model_lib.PyTorchModel(num_encoder_layers=12, num_decoder_layers=12)
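
To gauge the added cost before committing to a larger configuration, you can count trainable parameters with plain PyTorch (a generic sketch, not a library utility):

# Compare configurations by their trainable parameter counts.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {num_params:,}')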

Contributors and Citation

The codebase was written by: Xingyou Song, Yash Akhauri, Dara Bahri, Michal Lukasik, Arissa Wongpanich, Adrian N. Reyes, and Bryan Lewandowski.

If you find this project useful, please consider citing our work as follows:

@article{akhauri2025performancepredictionlargesystems,
      title={Performance Prediction for Large Systems via Text-to-Text Regression},
      author={Yash Akhauri and Bryan Lewandowski and Cheng-Hsi Lin and Adrian N. Reyes and Grant C. Forbes and Arissa Wongpanich and Bangding Yang and Mohamed S. Abdelfattah and Sagi Perel and Xingyou Song},
      journal={arXiv preprint arXiv:2506.21718},
      year={2025}
}

Disclaimer: This is not an officially supported Google product.
