
Logging in JSON format #68


Merged: 7 commits, Mar 17, 2025
Changes from 4 commits
32 changes: 19 additions & 13 deletions pyproject.toml
@@ -9,18 +9,21 @@ authors = [

requires-python = ">=3.11,<3.13"
# TODO: split the plotting dependencies into their own dep groups, they are not required.
dependencies = [ 'torch',
'numpy',
'astropy_healpix',
'zarr',
'anemoi-datasets',
'pandas',
'pynvml',
'tqdm',
'matplotlib',
'packaging',
'wheel',
'psutil']
dependencies = [
'torch',
'numpy',
'astropy_healpix',
'zarr',
'anemoi-datasets',
'pandas',
'pynvml',
'tqdm',
'matplotlib',
'packaging',
'wheel',
'psutil',
"flash-attn",
]
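The TODO above about splitting off the plotting dependencies could be addressed with an optional-dependencies group, for instance (a sketch; the group name `plot` and the exact split are hypothetical):

```toml
# Hypothetical split: move the plotting stack out of the core dependencies
# so that training-only installs stay lean.
[project.optional-dependencies]
plot = ['matplotlib']
```

Users who need plotting would then opt in explicitly, e.g. `uv sync --extra plot`.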

[project.urls]
Homepage = "https://www.weathergenerator.eu"
@@ -91,4 +94,7 @@ ignore = [
"F811",
# To ignore, not relevant for us
"E741",
]
]

[tool.uv.sources]
flash-attn = { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl" }
Collaborator (Author):
I had to make this change to use uv on the hpc2020 cluster. I am not sure if this is going to be a breaking change for people. @clessig , do we assume that different HPCs can use different versions of CUDA? That sounds like a nightmare.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not assume it, we know it ;) One can write a script that detects the available CUDA version (and the Python version, if that varies) and then assembles the string that defines the wheel to be downloaded. @tjhunter: to what extent could one integrate this into pyproject.toml?
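Such a script could assemble the URL from the local environment along these lines (a sketch; the wheel-naming scheme is taken from the v2.7.4.post1 release asset pinned in the diff above, and the `cxx11abiFALSE` flag is assumed fixed):

```python
import sys


def flash_attn_wheel_url(cuda: str, torch_ver: str, version: str = "2.7.4.post1") -> str:
    """Assemble a flash-attn wheel URL for the local CUDA/torch/Python combination.

    Follows the flash-attention release asset naming, e.g.
    flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
    """
    py = f"cp{sys.version_info.major}{sys.version_info.minor}"
    wheel = (
        f"flash_attn-{version}+cu{cuda}torch{torch_ver}"
        f"cxx11abiFALSE-{py}-{py}-linux_x86_64.whl"
    )
    return (
        "https://github.com/Dao-AILab/flash-attention/releases/download/"
        f"v{version}/{wheel}"
    )


# The CUDA and torch versions could be detected from the installed torch, e.g.:
#   import torch
#   cuda = torch.version.cuda.split(".")[0]              # e.g. "12"
#   tv = ".".join(torch.__version__.split(".")[:2])      # e.g. "2.6"
print(flash_attn_wheel_url("12", "2.6"))
```

The result could then be written into an environment-specific config or passed to `uv` when resolving the lock file.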

Collaborator:

And could we open an issue to track this? :)

Collaborator (Author):

I have the script in a branch of the private repo, but it is not committed yet:
#57

38 changes: 37 additions & 1 deletion src/weathergen/utils/train_logger.py
@@ -8,6 +8,10 @@
# nor does it submit to any jurisdiction.

import datetime
import json
import math
import os.path
import time

import numpy as np

@@ -19,33 +23,65 @@ class TrainLogger:
def __init__(self, cf, path_run) -> None:
self.cf = cf
self.path_run = path_run
# TODO: add header with col names (loadtxt has an option to skip k header lines)

def log_metrics(self, metrics: dict[str, float]) -> None:
"""
Log metrics to a file.
For now, just scalar values are expected. There is no check.
"""
# Clean all the metrics to convert to float. Any other type (numpy etc.) will trigger a serialization error.
clean_metrics = {
"weathergen.timestamp": time.time_ns() // 1_000_000,
"weathergen.time": int(datetime.datetime.now().strftime("%Y%m%d%H%M%S")),
}
for key, value in metrics.items():
v = float(value)
if math.isnan(v) or math.isinf(v):
v = str(v)
clean_metrics[key] = v

# TODO: performance: we repeatedly open the file for each call. Better for multiprocessing
# but we can probably do better and rely for example on the logging module.
with open(os.path.join(self.path_run, "metrics.json"), "ab") as f:
Collaborator (Author): I suggest that we start with this simple version; we can always improve performance if it turns out to be a bottleneck.

s = json.dumps(clean_metrics) + "\n"
f.write(s.encode("utf-8"))
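Since each call appends one JSON object per line (JSON Lines), the file can be read back incrementally; a minimal reader sketch (the `metrics.json` file name follows the code above, the helper name is hypothetical):

```python
import json


def read_metrics(path: str) -> list[dict]:
    """Read an append-only JSON-lines metrics file into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Skip blank lines; each non-blank line is one metrics record.
            if line.strip():
                records.append(json.loads(line))
    return records
```

Because each record is a flat dict, the result also loads directly into a `pandas.DataFrame` for analysis.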

#######################################
def add_train(self, samples, lr, loss_avg, stddev_avg, perf_gpu=0.0, perf_mem=0.0) -> None:
"""
Log training data
"""

metrics = dict(num_samples=samples)

log_vals = [int(datetime.datetime.now().strftime("%Y%m%d%H%M%S"))]
log_vals += [samples]

metrics["loss_avg_0_mean"] = loss_avg[0].mean()
metrics["learning_rate"] = lr
log_vals += [loss_avg[0].mean()]
log_vals += [lr]

for i_obs, _rt in enumerate(self.cf.streams):
for j, _ in enumerate(self.cf.loss_fcts):
metrics[f"stream_{i_obs}.loss_{j}.loss_avg"] = loss_avg[j, i_obs]
log_vals += [loss_avg[j, i_obs]]
if len(stddev_avg) > 0:
for i_obs, _rt in enumerate(self.cf.streams):
log_vals += [stddev_avg[i_obs]]
metrics[f"stream_{i_obs}.stddev_avg"] = stddev_avg[i_obs]

with open(self.path_run + self.cf.run_id + "_train_log.txt", "ab") as f:
np.savetxt(f, log_vals)

log_vals = []
log_vals += [perf_gpu]
log_vals += [perf_mem]
if perf_gpu > 0.0:
metrics["perf.gpu"] = perf_gpu
if perf_mem > 0.0:
metrics["perf.memory"] = perf_mem
self.log_metrics(metrics)
with open(self.path_run + self.cf.run_id + "_perf_log.txt", "ab") as f:
np.savetxt(f, log_vals)
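The flat-text logs written above follow an append-one-record-per-call pattern; a sketch of how they round-trip (file name and record values hypothetical; the real path is `path_run + run_id + "_train_log.txt"`):

```python
import os
import tempfile

import numpy as np

# Mimic TrainLogger.add_train's append pattern.
path = os.path.join(tempfile.mkdtemp(), "train_log.txt")
record = [20250317120000, 128, 0.42, 1e-4]  # timestamp, samples, loss, lr

for _ in range(2):  # two training steps
    with open(path, "ab") as f:
        np.savetxt(f, record)

# savetxt writes one value per line for a 1-D sequence, so loadtxt returns
# a flat array; reshape with the known record length to recover one row per step.
rows = np.loadtxt(path).reshape(-1, len(record))
print(rows.shape)  # (2, 4)
```

This is also why the TODO in `__init__` about a header line matters: without it, the record length has to be known out of band.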

102 changes: 99 additions & 3 deletions uv.lock

Some generated files are not rendered by default.