
Remove print statements for logging (#421) #439


Merged
12 commits merged into ecmwf:develop on Jul 10, 2025

Conversation

@enssow (Contributor) commented Jul 3, 2025

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Closes #421

Removes print statements from all files except src/weathergen/model/model.py and converts the remaining print statements to logging calls.
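
To illustrate the kind of change (a minimal sketch with illustrative names, not an excerpt from the repository): a bare print is replaced by a module-level logger call, so messages pick up timestamps, levels, and handler routing.

import logging

_logger = logging.getLogger(__name__)


def report_validation(epoch: int, num_epochs: int) -> None:
    # Before (illustrative): print(f"Epoch {epoch} of {num_epochs}: validate.")
    # After: the same message goes through the logging framework.
    _logger.info("Epoch %d of %d: validate.", epoch, num_epochs)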

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

  • The integration pytest failed: uv run pytest ./integration_tests/small1_test.py --verbose returned E ImportError: cannot import name 'inference_from_args' from 'weathergen' (unknown location)

@clessig (Collaborator) commented Jul 4, 2025

@enssow: can you make sure the code is properly ruffed?

@kacpnowak (Contributor) left a comment

Looks good, but it changes the way the loss is output in SLURM mode: instead of going to the output file it is now printed to the error file. Is this intentional?

Personally I'm not a fan, because it makes it impossible to quickly check the training by running tail output_{run_id}.txt

@enssow (Contributor, Author) commented Jul 4, 2025

@kacpnowak Is that in the train/trainer.py? Maybe @tjhunter could give a bit of insight on this one - I'm happy to change it back to print statements in that file if it messes with the current workflow for model training etc.

@clessig (Collaborator) commented Jul 4, 2025

@kacpnowak Is that in the train/trainer.py? Maybe @tjhunter could give a bit of insight on this one - I'm happy to change it back to print statements in that file if it messes with the current workflow for model training etc.

Yes, we should keep the info messages separate from warnings, errors, etc. There is also a PR, #271, to properly redirect the different streams to different files and make sure things stay visible. That would be the ultimate solution.

To which stream does __logger.info go?

@tjhunter (Collaborator) left a comment

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.

Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

if __name__ == "__main__":
    print(get_run_id())
Collaborator

Actually, we can fully remove this main block. It is meant for people who want to create their own run ids. This is also implemented in the private repo, and it has no dependencies.

Contributor (Author)

I can't seem to find any dependencies on this file, so can I go ahead and delete the whole file?

@kacpnowak (Contributor)

@kacpnowak Is that in the train/trainer.py? Maybe @tjhunter could give a bit of insight on this one - I'm happy to change it back to print statements in that file if it messes with the current workflow for model training etc.

Yes, we should keep the info messages separate from warnings, errors, etc. There is also a PR, #271, to properly redirect the different streams to different files and make sure things stay visible. That would be the ultimate solution.

To which stream does __logger.info go?

Due to the default value used in StreamHandler, it always goes to stderr: ch = logging.StreamHandler()
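
For reference, a minimal sketch (not the repository's actual logging setup, and not necessarily what #271 implements) of why INFO lines end up in the error file, and one possible way to split the streams:

import logging
import sys

logger = logging.getLogger("weathergen")
logger.setLevel(logging.INFO)

# StreamHandler() with no argument writes to sys.stderr, which is why the
# INFO lines land in the SLURM --error file.
stderr_handler = logging.StreamHandler()  # same as logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.WARNING)

# One possible split: DEBUG/INFO to stdout, WARNING and above to stderr.
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.setLevel(logging.DEBUG)
stdout_handler.addFilter(lambda record: record.levelno < logging.WARNING)

logger.addHandler(stdout_handler)
logger.addHandler(stderr_handler)

logger.info("goes to stdout")     # captured by the srun > output_*.txt redirect
logger.warning("goes to stderr")  # lands in the #SBATCH --error file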

@kacpnowak (Contributor)

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.

Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

It's really weird. Maybe it's due to different batch scripts? In my case I have set two different files for those streams:

#SBATCH --output=logs/atmorep-%x.%j.out
#SBATCH --error=logs/atmorep-%x.%j.err

I paste part of my err file below:

0: 2025-07-04 16:19:01,299 2510508 trainer.py:735 : INFO     : Saved model to models/r06jenam/r06jenam_latest.chkpt
0: 2025-07-04 16:19:13,119 2510508 trainer.py:786 : INFO     : 012 : 00010/00050 : 000610 : loss = 7.8196E-02 (lr=7.40E-05, s/sec=0.315)
0: 2025-07-04 16:19:13,119 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.8196E-02 	
0: 2025-07-04 16:19:23,380 2510508 trainer.py:786 : INFO     : 012 : 00020/00050 : 000620 : loss = 7.8310E-02 (lr=7.26E-05, s/sec=0.975)
0: 2025-07-04 16:19:23,380 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.8310E-02 	
0: 2025-07-04 16:19:34,697 2510508 trainer.py:786 : INFO     : 012 : 00030/00050 : 000630 : loss = 7.6214E-02 (lr=7.12E-05, s/sec=0.884)
0: 2025-07-04 16:19:34,697 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.6214E-02 	
0: 2025-07-04 16:19:48,463 2510508 trainer.py:786 : INFO     : 012 : 00040/00050 : 000640 : loss = 7.9266E-02 (lr=6.98E-05, s/sec=0.726)
0: 2025-07-04 16:19:48,463 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.9266E-02 	
4: 2025-07-04 16:20:13,751 899079 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
7: 2025-07-04 16:20:13,758 899082 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
6: 2025-07-04 16:20:13,759 899081 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
5: 2025-07-04 16:20:13,759 899080 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
3: 2025-07-04 16:20:13,760 2510509 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
2: 2025-07-04 16:20:13,761 2510507 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
1: 2025-07-04 16:20:13,764 2510510 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
0: 2025-07-04 16:20:13,837 2510508 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
4: 2025-07-04 16:20:13,856 904354 multi_stream_data_sampler.py:426 : INFO     : 4::0 : dataset [100,125) : [100,103)
4: 2025-07-04 16:20:13,857 904358 multi_stream_data_sampler.py:426 : INFO     : 4::1 : dataset [100,125) : [103,106)
4: 2025-07-04 16:20:13,857 904362 multi_stream_data_sampler.py:426 : INFO     : 4::2 : dataset [100,125) : [106,109)
4: 2025-07-04 16:20:13,857 904358 multi_stream_data_sampler.py:289 : INFO     : iter_start=103, iter_end=106, len=25
4: 2025-07-04 16:20:13,857 904354 multi_stream_data_sampler.py:289 : INFO     : iter_start=100, iter_end=103, len=25
4: 2025-07-04 16:20:13,857 904366 multi_stream_data_sampler.py:426 : INFO     : 4::3 : dataset [100,125) : [109,112)
4: 2025-07-04 16:20:13,857 904362 multi_stream_data_sampler.py:289 : INFO     : iter_start=106, iter_end=109, len=25
4: 2025-07-04 16:20:13,857 904370 multi_stream_data_sampler.py:426 : INFO     : 4::4 : dataset [100,125) : [112,115)
4: 2025-07-04 16:20:13,858 904374 multi_stream_data_sampler.py:426 : INFO     : 4::5 : dataset [100,125) : [115,118)
4: 2025-07-04 16:20:13,858 904366 multi_stream_data_sampler.py:289 : INFO     : iter_start=109, iter_end=112, len=25
4: 2025-07-04 16:20:13,858 904378 multi_stream_data_sampler.py:426 : INFO     : 4::6 : dataset [100,125) : [118,121)
4: 2025-07-04 16:20:13,858 904370 multi_stream_data_sampler.py:289 : INFO     : iter_start=112, iter_end=115, len=25
4: 2025-07-04 16:20:13,858 904374 multi_stream_data_sampler.py:289 : INFO     : iter_start=115, iter_end=118, len=25
4: 2025-07-04 16:20:13,858 904378 multi_stream_data_sampler.py:289 : INFO     : iter_start=118, iter_end=121, len=25
7: 2025-07-04 16:20:13,860 904355 multi_stream_data_sampler.py:426 : INFO     : 7::0 : dataset [175,200) : [175,178)
7: 2025-07-04 16:20:13,860 904359 multi_stream_data_sampler.py:426 : INFO     : 7::1 : dataset [175,200) : [178,181)
7: 2025-07-04 16:20:13,860 904363 multi_stream_data_sampler.py:426 : INFO     : 7::2 : dataset [175,200) : [181,184)
7: 2025-07-04 16:20:13,860 904355 multi_stream_data_sampler.py:289 : INFO     : iter_start=175, iter_end=178, len=25
7: 2025-07-04 16:20:13,861 904359 multi_stream_data_sampler.py:289 : INFO     : iter_start=178, iter_end=181, len=25
7: 2025-07-04 16:20:13,861 904363 multi_stream_data_sampler.py:289 : INFO     : iter_start=181, iter_end=184, len=25
7: 2025-07-04 16:20:13,861 904371 multi_stream_data_sampler.py:426 : INFO     : 7::4 : dataset [175,200) : [187,190)
7: 2025-07-04 16:20:13,861 904367 multi_stream_data_sampler.py:426 : INFO     : 7::3 : dataset [175,200) : [184,187)
7: 2025-07-04 16:20:13,861 904375 multi_stream_data_sampler.py:426 : INFO     : 7::5 : dataset [175,200) : [190,193)
7: 2025-07-04 16:20:13,861 904367 multi_stream_data_sampler.py:289 : INFO     : iter_start=184, iter_end=187, len=25
7: 2025-07-04 16:20:13,861 904371 multi_stream_data_sampler.py:289 : INFO     : iter_start=187, iter_end=190, len=25
7: 2025-07-04 16:20:13,861 904379 multi_stream_data_sampler.py:426 : INFO     : 7::6 : dataset [175,200) : [193,196)
7: 2025-07-04 16:20:13,862 904375 multi_stream_data_sampler.py:289 : INFO     : iter_start=190, iter_end=193, len=25
7: 2025-07-04 16:20:13,862 904379 multi_stream_data_sampler.py:289 : INFO     : iter_start=193, iter_end=196, len=25
4: 2025-07-04 16:20:13,864 904381 multi_stream_data_sampler.py:426 : INFO     : 4::7 : dataset [100,125) : [121,125)
4: 2025-07-04 16:20:13,865 904381 multi_stream_data_sampler.py:289 : INFO     : iter_start=121, iter_end=125, len=25
6: 2025-07-04 16:20:13,867 904356 multi_stream_data_sampler.py:426 : INFO     : 6::0 : dataset [150,175) : [150,153)
6: 2025-07-04 16:20:13,867 904360 multi_stream_data_sampler.py:426 : INFO     : 6::1 : dataset [150,175) : [153,156)
7: 2025-07-04 16:20:13,868 904391 multi_stream_data_sampler.py:426 : INFO     : 7::7 : dataset [175,200) : [196,200)
6: 2025-07-04 16:20:13,868 904364 multi_stream_data_sampler.py:426 : INFO     : 6::2 : dataset [150,175) : [156,159)
6: 2025-07-04 16:20:13,868 904368 multi_stream_data_sampler.py:426 : INFO     : 6::3 : dataset [150,175) : [159,162)
6: 2025-07-04 16:20:13,868 904360 multi_stream_data_sampler.py:289 : INFO     : iter_start=153, iter_end=156, len=25
6: 2025-07-04 16:20:13,868 904356 multi_stream_data_sampler.py:289 : INFO     : iter_start=150, iter_end=153, len=25
6: 2025-07-04 16:20:13,868 904372 multi_stream_data_sampler.py:426 : INFO     : 6::4 : dataset [150,175) : [162,165)
7: 2025-07-04 16:20:13,868 904391 multi_stream_data_sampler.py:289 : INFO     : iter_start=196, iter_end=200, len=25
6: 2025-07-04 16:20:13,868 904368 multi_stream_data_sampler.py:289 : INFO     : iter_start=159, iter_end=162, len=25
6: 2025-07-04 16:20:13,868 904364 multi_stream_data_sampler.py:289 : INFO     : iter_start=156, iter_end=159, len=25
6: 2025-07-04 16:20:13,868 904376 multi_stream_data_sampler.py:426 : INFO     : 6::5 : dataset [150,175) : [165,168)
6: 2025-07-04 16:20:13,869 904372 multi_stream_data_sampler.py:289 : INFO     : iter_start=162, iter_end=165, len=25
6: 2025-07-04 16:20:13,869 904380 multi_stream_data_sampler.py:426 : INFO     : 6::6 : dataset [150,175) : [168,171)
6: 2025-07-04 16:20:13,869 904376 multi_stream_data_sampler.py:289 : INFO     : iter_start=165, iter_end=168, len=25
5: 2025-07-04 16:20:13,869 904357 multi_stream_data_sampler.py:426 : INFO     : 5::0 : dataset [125,150) : [125,128)
5: 2025-07-04 16:20:13,869 904361 multi_stream_data_sampler.py:426 : INFO     : 5::1 : dataset [125,150) : [128,131)
6: 2025-07-04 16:20:13,869 904380 multi_stream_data_sampler.py:289 : INFO     : iter_start=168, iter_end=171, len=25
5: 2025-07-04 16:20:13,870 904357 multi_stream_data_sampler.py:289 : INFO     : iter_start=125, iter_end=128, len=25
5: 2025-07-04 16:20:13,869 904365 multi_stream_data_sampler.py:426 : INFO     : 5::2 : dataset [125,150) : [131,134)
5: 2025-07-04 16:20:13,870 904361 multi_stream_data_sampler.py:289 : INFO     : iter_start=128, iter_end=131, len=25
5: 2025-07-04 16:20:13,870 904369 multi_stream_data_sampler.py:426 : INFO     : 5::3 : dataset [125,150) : [134,137)
5: 2025-07-04 16:20:13,870 904373 multi_stream_data_sampler.py:426 : INFO     : 5::4 : dataset [125,150) : [137,140)
5: 2025-07-04 16:20:13,870 904365 multi_stream_data_sampler.py:289 : INFO     : iter_start=131, iter_end=134, len=25
5: 2025-07-04 16:20:13,870 904377 multi_stream_data_sampler.py:426 : INFO     : 5::5 : dataset [125,150) : [140,143)
5: 2025-07-04 16:20:13,870 904369 multi_stream_data_sampler.py:289 : INFO     : iter_start=134, iter_end=137, len=25
5: 2025-07-04 16:20:13,870 904373 multi_stream_data_sampler.py:289 : INFO     : iter_start=137, iter_end=140, len=25
5: 2025-07-04 16:20:13,870 904382 multi_stream_data_sampler.py:426 : INFO     : 5::6 : dataset [125,150) : [143,146)
5: 2025-07-04 16:20:13,871 904377 multi_stream_data_sampler.py:289 : INFO     : iter_start=140, iter_end=143, len=25
5: 2025-07-04 16:20:13,871 904382 multi_stream_data_sampler.py:289 : INFO     : iter_start=143, iter_end=146, len=25
6: 2025-07-04 16:20:13,875 904402 multi_stream_data_sampler.py:426 : INFO     : 6::7 : dataset [150,175) : [171,175)
6: 2025-07-04 16:20:13,876 904402 multi_stream_data_sampler.py:289 : INFO     : iter_start=171, iter_end=175, len=25
5: 2025-07-04 16:20:13,878 904406 multi_stream_data_sampler.py:426 : INFO     : 5::7 : dataset [125,150) : [146,150)
5: 2025-07-04 16:20:13,879 904406 multi_stream_data_sampler.py:289 : INFO     : iter_start=146, iter_end=150, len=25
1: 2025-07-04 16:20:13,892 2516027 multi_stream_data_sampler.py:426 : INFO     : 1::0 : dataset [25,50) : [25,28)
1: 2025-07-04 16:20:13,892 2516030 multi_stream_data_sampler.py:426 : INFO     : 1::1 : dataset [25,50) : [28,31)
1: 2025-07-04 16:20:13,892 2516031 multi_stream_data_sampler.py:426 : INFO     : 1::2 : dataset [25,50) : [31,34)
1: 2025-07-04 16:20:13,892 2516034 multi_stream_data_sampler.py:426 : INFO     : 1::3 : dataset [25,50) : [34,37)
1: 2025-07-04 16:20:13,893 2516027 multi_stream_data_sampler.py:289 : INFO     : iter_start=25, iter_end=28, len=25
1: 2025-07-04 16:20:13,893 2516030 multi_stream_data_sampler.py:289 : INFO     : iter_start=28, iter_end=31, len=25
1: 2025-07-04 16:20:13,893 2516037 multi_stream_data_sampler.py:426 : INFO     : 1::4 : dataset [25,50) : [37,40)
1: 2025-07-04 16:20:13,893 2516031 multi_stream_data_sampler.py:289 : INFO     : iter_start=31, iter_end=34, len=25
1: 2025-07-04 16:20:13,893 2516034 multi_stream_data_sampler.py:289 : INFO     : iter_start=34, iter_end=37, len=25
1: 2025-07-04 16:20:13,893 2516040 multi_stream_data_sampler.py:426 : INFO     : 1::5 : dataset [25,50) : [40,43)
1: 2025-07-04 16:20:13,893 2516037 multi_stream_data_sampler.py:289 : INFO     : iter_start=37, iter_end=40, len=25
1: 2025-07-04 16:20:13,893 2516044 multi_stream_data_sampler.py:426 : INFO     : 1::6 : dataset [25,50) : [43,46)
1: 2025-07-04 16:20:13,894 2516040 multi_stream_data_sampler.py:289 : INFO     : iter_start=40, iter_end=43, len=25
1: 2025-07-04 16:20:13,894 2516044 multi_stream_data_sampler.py:289 : INFO     : iter_start=43, iter_end=46, len=25
2: 2025-07-04 16:20:13,898 2516026 multi_stream_data_sampler.py:426 : INFO     : 2::0 : dataset [50,75) : [50,53)
2: 2025-07-04 16:20:13,898 2516029 multi_stream_data_sampler.py:426 : INFO     : 2::1 : dataset [50,75) : [53,56)
3: 2025-07-04 16:20:13,898 2516025 multi_stream_data_sampler.py:426 : INFO     : 3::0 : dataset [75,100) : [75,78)
2: 2025-07-04 16:20:13,899 2516033 multi_stream_data_sampler.py:426 : INFO     : 2::2 : dataset [50,75) : [56,59)
3: 2025-07-04 16:20:13,899 2516028 multi_stream_data_sampler.py:426 : INFO     : 3::1 : dataset [75,100) : [78,81)
1: 2025-07-04 16:20:13,899 2516048 multi_stream_data_sampler.py:426 : INFO     : 1::7 : dataset [25,50) : [46,50)
2: 2025-07-04 16:20:13,899 2516029 multi_stream_data_sampler.py:289 : INFO     : iter_start=53, iter_end=56, len=25
2: 2025-07-04 16:20:13,899 2516026 multi_stream_data_sampler.py:289 : INFO     : iter_start=50, iter_end=53, len=25
2: 2025-07-04 16:20:13,899 2516035 multi_stream_data_sampler.py:426 : INFO     : 2::3 : dataset [50,75) : [59,62)
2: 2025-07-04 16:20:13,899 2516033 multi_stream_data_sampler.py:289 : INFO     : iter_start=56, iter_end=59, len=25
3: 2025-07-04 16:20:13,899 2516025 multi_stream_data_sampler.py:289 : INFO     : iter_start=75, iter_end=78, len=25
3: 2025-07-04 16:20:13,899 2516032 multi_stream_data_sampler.py:426 : INFO     : 3::2 : dataset [75,100) : [81,84)
2: 2025-07-04 16:20:13,899 2516038 multi_stream_data_sampler.py:426 : INFO     : 2::4 : dataset [50,75) : [62,65)
3: 2025-07-04 16:20:13,899 2516036 multi_stream_data_sampler.py:426 : INFO     : 3::3 : dataset [75,100) : [84,87)
1: 2025-07-04 16:20:13,899 2516048 multi_stream_data_sampler.py:289 : INFO     : iter_start=46, iter_end=50, len=25
3: 2025-07-04 16:20:13,899 2516028 multi_stream_data_sampler.py:289 : INFO     : iter_start=78, iter_end=81, len=25
2: 2025-07-04 16:20:13,899 2516042 multi_stream_data_sampler.py:426 : INFO     : 2::5 : dataset [50,75) : [65,68)
3: 2025-07-04 16:20:13,899 2516039 multi_stream_data_sampler.py:426 : INFO     : 3::4 : dataset [75,100) : [87,90)
2: 2025-07-04 16:20:13,900 2516045 multi_stream_data_sampler.py:426 : INFO     : 2::6 : dataset [50,75) : [68,71)
2: 2025-07-04 16:20:13,900 2516035 multi_stream_data_sampler.py:289 : INFO     : iter_start=59, iter_end=62, len=25
2: 2025-07-04 16:20:13,900 2516038 multi_stream_data_sampler.py:289 : INFO     : iter_start=62, iter_end=65, len=25
3: 2025-07-04 16:20:13,900 2516043 multi_stream_data_sampler.py:426 : INFO     : 3::5 : dataset [75,100) : [90,93)
3: 2025-07-04 16:20:13,900 2516032 multi_stream_data_sampler.py:289 : INFO     : iter_start=81, iter_end=84, len=25
3: 2025-07-04 16:20:13,900 2516036 multi_stream_data_sampler.py:289 : INFO     : iter_start=84, iter_end=87, len=25
3: 2025-07-04 16:20:13,900 2516039 multi_stream_data_sampler.py:289 : INFO     : iter_start=87, iter_end=90, len=25
2: 2025-07-04 16:20:13,900 2516042 multi_stream_data_sampler.py:289 : INFO     : iter_start=65, iter_end=68, len=25
3: 2025-07-04 16:20:13,900 2516046 multi_stream_data_sampler.py:426 : INFO     : 3::6 : dataset [75,100) : [93,96)
2: 2025-07-04 16:20:13,900 2516045 multi_stream_data_sampler.py:289 : INFO     : iter_start=68, iter_end=71, len=25
3: 2025-07-04 16:20:13,900 2516043 multi_stream_data_sampler.py:289 : INFO     : iter_start=90, iter_end=93, len=25
3: 2025-07-04 16:20:13,901 2516046 multi_stream_data_sampler.py:289 : INFO     : iter_start=93, iter_end=96, len=25
2: 2025-07-04 16:20:13,905 2516058 multi_stream_data_sampler.py:426 : INFO     : 2::7 : dataset [50,75) : [71,75)
2: 2025-07-04 16:20:13,906 2516058 multi_stream_data_sampler.py:289 : INFO     : iter_start=71, iter_end=75, len=25
3: 2025-07-04 16:20:13,906 2516059 multi_stream_data_sampler.py:426 : INFO     : 3::7 : dataset [75,100) : [96,100)
3: 2025-07-04 16:20:13,907 2516059 multi_stream_data_sampler.py:289 : INFO     : iter_start=96, iter_end=100, len=25
0: 2025-07-04 16:20:14,006 2516041 multi_stream_data_sampler.py:426 : INFO     : 0::0 : dataset [0,25) : [0,3)
0: 2025-07-04 16:20:14,007 2516047 multi_stream_data_sampler.py:426 : INFO     : 0::1 : dataset [0,25) : [3,6)
0: 2025-07-04 16:20:14,007 2516078 multi_stream_data_sampler.py:426 : INFO     : 0::2 : dataset [0,25) : [6,9)
0: 2025-07-04 16:20:14,007 2516041 multi_stream_data_sampler.py:289 : INFO     : iter_start=0, iter_end=3, len=25
0: 2025-07-04 16:20:14,007 2516047 multi_stream_data_sampler.py:289 : INFO     : iter_start=3, iter_end=6, len=25
0: 2025-07-04 16:20:14,007 2516079 multi_stream_data_sampler.py:426 : INFO     : 0::3 : dataset [0,25) : [9,12)
0: 2025-07-04 16:20:14,007 2516080 multi_stream_data_sampler.py:426 : INFO     : 0::4 : dataset [0,25) : [12,15)
0: 2025-07-04 16:20:14,008 2516078 multi_stream_data_sampler.py:289 : INFO     : iter_start=6, iter_end=9, len=25
0: 2025-07-04 16:20:14,008 2516081 multi_stream_data_sampler.py:426 : INFO     : 0::5 : dataset [0,25) : [15,18)
0: 2025-07-04 16:20:14,008 2516082 multi_stream_data_sampler.py:426 : INFO     : 0::6 : dataset [0,25) : [18,21)
0: 2025-07-04 16:20:14,008 2516079 multi_stream_data_sampler.py:289 : INFO     : iter_start=9, iter_end=12, len=25
0: 2025-07-04 16:20:14,008 2516080 multi_stream_data_sampler.py:289 : INFO     : iter_start=12, iter_end=15, len=25
0: 2025-07-04 16:20:14,008 2516082 multi_stream_data_sampler.py:289 : INFO     : iter_start=18, iter_end=21, len=25
0: 2025-07-04 16:20:14,008 2516081 multi_stream_data_sampler.py:289 : INFO     : iter_start=15, iter_end=18, len=25
0: 2025-07-04 16:20:14,016 2516083 multi_stream_data_sampler.py:426 : INFO     : 0::7 : dataset [0,25) : [21,25)
0: 2025-07-04 16:20:14,017 2516083 multi_stream_data_sampler.py:289 : INFO     : iter_start=21, iter_end=25, len=25
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
0: slurmstepd: error: *** STEP 18136338.0 ON l50060 CANCELLED AT 2025-07-04T16:20:40 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 18136338 ON l50060 CANCELLED AT 2025-07-04T16:20:40 DUE TO TIME LIMIT ***

@clessig (Collaborator) commented Jul 7, 2025

That's what the default SLURM setup looks like:

#SBATCH --output=logs/obs-%x.%j.out
#SBATCH --error=logs/obs-%x.%j.err

srun --label train > output/output_${SLURM_JOBID}.txt

Maybe the mix of sbatch redirection and piping explains the behavior.
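
A minimal sketch of that interplay (paths and commands taken from the lines above, everything else illustrative): the #SBATCH directives capture what the batch script itself writes to stdout/stderr, while the explicit shell redirection on srun sends the training command's stdout to output_${SLURM_JOBID}.txt; a logger writing to stderr therefore ends up in the --error file instead.

#!/bin/bash
#SBATCH --output=logs/obs-%x.%j.out   # stdout of the batch script
#SBATCH --error=logs/obs-%x.%j.err    # stderr, including logging via StreamHandler()

# stdout of the training command is piped to a separate file,
# so print() output lands in output_${SLURM_JOBID}.txt ...
srun --label train > output/output_${SLURM_JOBID}.txt
# ... while anything written to stderr falls through to the --error file above.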

@enssow enssow marked this pull request as draft July 7, 2025 15:29
@enssow (Contributor, Author) commented Jul 8, 2025

Following yesterday's meeting I've reverted this to a draft, as there are still some questions I have to sort out.

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.

Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

So I had been using the uv run --offline train command in an interactive GPU session, but this wasn't creating the output and log folders. I'm pretty sure this is a problem on my side, as I set up my forked repo with a different name. I'm fixing this now and will follow up on the question about how the log/output is being directed.

@clessig (Collaborator) commented Jul 8, 2025

Following yesterday's meeting I've reverted this to a draft, as there are still some questions I have to sort out.

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.
Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

So I had been using the uv run --offline train command in an interactive GPU session, but this wasn't creating the output and log folders. I'm pretty sure this is a problem on my side, as I set up my forked repo with a different name. I'm fixing this now and will follow up on the question about how the log/output is being directed.

The content of the log and output folders comes from the above lines in a batch SLURM script. In an interactive session everything is just written to the terminal.

@enssow (Contributor, Author) commented Jul 9, 2025

I've done the ruff fixes and removed src/weathergen/utils/run_id.py. I have done a limited run on an interactive GPU using the SLURM commands in the wiki, and everything was directed to output_xxx.txt as expected. Are those all the changes necessary now?

@@ -1,3 +1,5 @@
# ruff: noqa: T201
Collaborator

this breaks the script
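
For context on the directive in the diff above: T201 is ruff's flake8-print rule that flags print() calls, and a file-level # ruff: noqa: T201 comment exempts the whole file; a minimal, illustrative example (not the project's actual script):

#!/usr/bin/env python3
# ruff: noqa: T201
# File-level exemption: T201 ("print found") is suppressed for this script,
# so intentional print() calls are not flagged by ruff.
print("prints in this file are intentional")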

@tjhunter tjhunter marked this pull request as ready for review July 10, 2025 10:53
@tjhunter (Collaborator) left a comment
Approved

@tjhunter tjhunter merged commit ca09ca7 into ecmwf:develop Jul 10, 2025
3 checks passed
Successfully merging this pull request may close these issues:

Change the print() statements to logger statements (except in the train loop)