
Remove print statements for logging (#421) #439


Merged
12 commits merged into ecmwf:develop on Jul 10, 2025

Conversation

@enssow (Contributor) commented Jul 3, 2025

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Closes #421

Removes print statements from all files except src/weathergen/model/model.py and converts the remaining print statements to logging calls.
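
To illustrate the kind of change (a minimal sketch with illustrative names, not an excerpt from the repository): a bare print is replaced by a module-level logger call, so messages pick up timestamps, levels, and handler routing.

import logging

_logger = logging.getLogger(__name__)


def report_validation(epoch: int, num_epochs: int) -> None:
    # Before (illustrative): print(f"Epoch {epoch} of {num_epochs}: validate.")
    # After: the same message goes through the logging framework.
    _logger.info("Epoch %d of %d: validate.", epoch, num_epochs)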

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to have notified the other software developers through Mattermost and updated the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

  • The integration pytest failed: uv run pytest ./integration_tests/small1_test.py --verbose returned E ImportError: cannot import name 'inference_from_args' from 'weathergen' (unknown location)

@clessig (Collaborator) commented Jul 4, 2025

@enssow: can you make sure the code is properly ruffed?

@kacpnowak (Contributor) left a comment

Looks good, but it changes the way the loss is output in SLURM mode: instead of going to the output file it is now printed to the error file. Is this intentional?

Personally I'm not a fan, because it makes it impossible to quickly check the training by running tail output_{run_id}.txt

@enssow (Contributor, Author) commented Jul 4, 2025

@kacpnowak Is that in the train/trainer.py? Maybe @tjhunter could give a bit of insight on this one - I'm happy to change it back to print statements in that file if it messes with the current workflow for model training etc.

@clessig (Collaborator) commented Jul 4, 2025

@kacpnowak Is that in the train/trainer.py? Maybe @tjhunter could give a bit of insight on this one - I'm happy to change it back to print statements in that file if it messes with the current workflow for model training etc.

Yes, we should keep the info messages separate from warnings, errors, etc. There is also a PR, #271, to properly redirect the different streams to different files and make sure things stay visible. That would be the ultimate solution.

To which stream does __logger.info go?

@tjhunter (Collaborator) left a comment

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.

Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

if __name__ == "__main__":
    print(get_run_id())
Collaborator

Actually, we can fully remove this main block. It is meant for people who want to create their own run ids. This is also implemented in the private repo, and it has no dependencies.

Contributor (Author)

I can't seem to find any dependencies on this file, so can I go ahead and delete the whole file?

@kacpnowak (Contributor)

@kacpnowak Is that in the train/trainer.py? Maybe @tjhunter could give a bit of insight on this one - I'm happy to change it back to print statements in that file if it messes with the current workflow for model training etc.

Yes, we should keep the info messages separate from warnings, errors, etc. There is also a PR, #271, to properly redirect the different streams to different files and make sure things stay visible. That would be the ultimate solution.

To which stream does __logger.info go?

Due to the default value used in StreamHandler, it always goes to stderr: ch = logging.StreamHandler()
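
For reference, a minimal sketch (not the repository's actual logging setup, and not necessarily what #271 implements) of why INFO lines end up in the error file, and one possible way to split the streams:

import logging
import sys

logger = logging.getLogger("weathergen")
logger.setLevel(logging.INFO)

# StreamHandler() with no argument writes to sys.stderr, which is why the
# INFO lines land in the SLURM --error file.
stderr_handler = logging.StreamHandler()  # same as logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.WARNING)

# One possible split: DEBUG/INFO to stdout, WARNING and above to stderr.
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.setLevel(logging.DEBUG)
stdout_handler.addFilter(lambda record: record.levelno < logging.WARNING)

logger.addHandler(stdout_handler)
logger.addHandler(stderr_handler)

logger.info("goes to stdout")     # captured by the srun > output_*.txt redirect
logger.warning("goes to stderr")  # lands in the #SBATCH --error file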

@kacpnowak (Contributor)

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.

Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

It's really weird. Maybe it's due to different batch scripts? In my case I have set two different files for those streams:

#SBATCH --output=logs/atmorep-%x.%j.out
#SBATCH --error=logs/atmorep-%x.%j.err

I paste part of my err file below:

0: 2025-07-04 16:19:01,299 2510508 trainer.py:735 : INFO     : Saved model to models/r06jenam/r06jenam_latest.chkpt
0: 2025-07-04 16:19:13,119 2510508 trainer.py:786 : INFO     : 012 : 00010/00050 : 000610 : loss = 7.8196E-02 (lr=7.40E-05, s/sec=0.315)
0: 2025-07-04 16:19:13,119 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.8196E-02 	
0: 2025-07-04 16:19:23,380 2510508 trainer.py:786 : INFO     : 012 : 00020/00050 : 000620 : loss = 7.8310E-02 (lr=7.26E-05, s/sec=0.975)
0: 2025-07-04 16:19:23,380 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.8310E-02 	
0: 2025-07-04 16:19:34,697 2510508 trainer.py:786 : INFO     : 012 : 00030/00050 : 000630 : loss = 7.6214E-02 (lr=7.12E-05, s/sec=0.884)
0: 2025-07-04 16:19:34,697 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.6214E-02 	
0: 2025-07-04 16:19:48,463 2510508 trainer.py:786 : INFO     : 012 : 00040/00050 : 000640 : loss = 7.9266E-02 (lr=6.98E-05, s/sec=0.726)
0: 2025-07-04 16:19:48,463 2510508 trainer.py:801 : INFO     : 	ERA5 : 7.9266E-02 	
4: 2025-07-04 16:20:13,751 899079 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
7: 2025-07-04 16:20:13,758 899082 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
6: 2025-07-04 16:20:13,759 899081 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
5: 2025-07-04 16:20:13,759 899080 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
3: 2025-07-04 16:20:13,760 2510509 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
2: 2025-07-04 16:20:13,761 2510507 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
1: 2025-07-04 16:20:13,764 2510510 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
0: 2025-07-04 16:20:13,837 2510508 trainer.py:321 : INFO     : Epoch 12 of 24: validate.
4: 2025-07-04 16:20:13,856 904354 multi_stream_data_sampler.py:426 : INFO     : 4::0 : dataset [100,125) : [100,103)
4: 2025-07-04 16:20:13,857 904358 multi_stream_data_sampler.py:426 : INFO     : 4::1 : dataset [100,125) : [103,106)
4: 2025-07-04 16:20:13,857 904362 multi_stream_data_sampler.py:426 : INFO     : 4::2 : dataset [100,125) : [106,109)
4: 2025-07-04 16:20:13,857 904358 multi_stream_data_sampler.py:289 : INFO     : iter_start=103, iter_end=106, len=25
4: 2025-07-04 16:20:13,857 904354 multi_stream_data_sampler.py:289 : INFO     : iter_start=100, iter_end=103, len=25
4: 2025-07-04 16:20:13,857 904366 multi_stream_data_sampler.py:426 : INFO     : 4::3 : dataset [100,125) : [109,112)
4: 2025-07-04 16:20:13,857 904362 multi_stream_data_sampler.py:289 : INFO     : iter_start=106, iter_end=109, len=25
4: 2025-07-04 16:20:13,857 904370 multi_stream_data_sampler.py:426 : INFO     : 4::4 : dataset [100,125) : [112,115)
4: 2025-07-04 16:20:13,858 904374 multi_stream_data_sampler.py:426 : INFO     : 4::5 : dataset [100,125) : [115,118)
4: 2025-07-04 16:20:13,858 904366 multi_stream_data_sampler.py:289 : INFO     : iter_start=109, iter_end=112, len=25
4: 2025-07-04 16:20:13,858 904378 multi_stream_data_sampler.py:426 : INFO     : 4::6 : dataset [100,125) : [118,121)
4: 2025-07-04 16:20:13,858 904370 multi_stream_data_sampler.py:289 : INFO     : iter_start=112, iter_end=115, len=25
4: 2025-07-04 16:20:13,858 904374 multi_stream_data_sampler.py:289 : INFO     : iter_start=115, iter_end=118, len=25
4: 2025-07-04 16:20:13,858 904378 multi_stream_data_sampler.py:289 : INFO     : iter_start=118, iter_end=121, len=25
7: 2025-07-04 16:20:13,860 904355 multi_stream_data_sampler.py:426 : INFO     : 7::0 : dataset [175,200) : [175,178)
7: 2025-07-04 16:20:13,860 904359 multi_stream_data_sampler.py:426 : INFO     : 7::1 : dataset [175,200) : [178,181)
7: 2025-07-04 16:20:13,860 904363 multi_stream_data_sampler.py:426 : INFO     : 7::2 : dataset [175,200) : [181,184)
7: 2025-07-04 16:20:13,860 904355 multi_stream_data_sampler.py:289 : INFO     : iter_start=175, iter_end=178, len=25
7: 2025-07-04 16:20:13,861 904359 multi_stream_data_sampler.py:289 : INFO     : iter_start=178, iter_end=181, len=25
7: 2025-07-04 16:20:13,861 904363 multi_stream_data_sampler.py:289 : INFO     : iter_start=181, iter_end=184, len=25
7: 2025-07-04 16:20:13,861 904371 multi_stream_data_sampler.py:426 : INFO     : 7::4 : dataset [175,200) : [187,190)
7: 2025-07-04 16:20:13,861 904367 multi_stream_data_sampler.py:426 : INFO     : 7::3 : dataset [175,200) : [184,187)
7: 2025-07-04 16:20:13,861 904375 multi_stream_data_sampler.py:426 : INFO     : 7::5 : dataset [175,200) : [190,193)
7: 2025-07-04 16:20:13,861 904367 multi_stream_data_sampler.py:289 : INFO     : iter_start=184, iter_end=187, len=25
7: 2025-07-04 16:20:13,861 904371 multi_stream_data_sampler.py:289 : INFO     : iter_start=187, iter_end=190, len=25
7: 2025-07-04 16:20:13,861 904379 multi_stream_data_sampler.py:426 : INFO     : 7::6 : dataset [175,200) : [193,196)
7: 2025-07-04 16:20:13,862 904375 multi_stream_data_sampler.py:289 : INFO     : iter_start=190, iter_end=193, len=25
7: 2025-07-04 16:20:13,862 904379 multi_stream_data_sampler.py:289 : INFO     : iter_start=193, iter_end=196, len=25
4: 2025-07-04 16:20:13,864 904381 multi_stream_data_sampler.py:426 : INFO     : 4::7 : dataset [100,125) : [121,125)
4: 2025-07-04 16:20:13,865 904381 multi_stream_data_sampler.py:289 : INFO     : iter_start=121, iter_end=125, len=25
6: 2025-07-04 16:20:13,867 904356 multi_stream_data_sampler.py:426 : INFO     : 6::0 : dataset [150,175) : [150,153)
6: 2025-07-04 16:20:13,867 904360 multi_stream_data_sampler.py:426 : INFO     : 6::1 : dataset [150,175) : [153,156)
7: 2025-07-04 16:20:13,868 904391 multi_stream_data_sampler.py:426 : INFO     : 7::7 : dataset [175,200) : [196,200)
6: 2025-07-04 16:20:13,868 904364 multi_stream_data_sampler.py:426 : INFO     : 6::2 : dataset [150,175) : [156,159)
6: 2025-07-04 16:20:13,868 904368 multi_stream_data_sampler.py:426 : INFO     : 6::3 : dataset [150,175) : [159,162)
6: 2025-07-04 16:20:13,868 904360 multi_stream_data_sampler.py:289 : INFO     : iter_start=153, iter_end=156, len=25
6: 2025-07-04 16:20:13,868 904356 multi_stream_data_sampler.py:289 : INFO     : iter_start=150, iter_end=153, len=25
6: 2025-07-04 16:20:13,868 904372 multi_stream_data_sampler.py:426 : INFO     : 6::4 : dataset [150,175) : [162,165)
7: 2025-07-04 16:20:13,868 904391 multi_stream_data_sampler.py:289 : INFO     : iter_start=196, iter_end=200, len=25
6: 2025-07-04 16:20:13,868 904368 multi_stream_data_sampler.py:289 : INFO     : iter_start=159, iter_end=162, len=25
6: 2025-07-04 16:20:13,868 904364 multi_stream_data_sampler.py:289 : INFO     : iter_start=156, iter_end=159, len=25
6: 2025-07-04 16:20:13,868 904376 multi_stream_data_sampler.py:426 : INFO     : 6::5 : dataset [150,175) : [165,168)
6: 2025-07-04 16:20:13,869 904372 multi_stream_data_sampler.py:289 : INFO     : iter_start=162, iter_end=165, len=25
6: 2025-07-04 16:20:13,869 904380 multi_stream_data_sampler.py:426 : INFO     : 6::6 : dataset [150,175) : [168,171)
6: 2025-07-04 16:20:13,869 904376 multi_stream_data_sampler.py:289 : INFO     : iter_start=165, iter_end=168, len=25
5: 2025-07-04 16:20:13,869 904357 multi_stream_data_sampler.py:426 : INFO     : 5::0 : dataset [125,150) : [125,128)
5: 2025-07-04 16:20:13,869 904361 multi_stream_data_sampler.py:426 : INFO     : 5::1 : dataset [125,150) : [128,131)
6: 2025-07-04 16:20:13,869 904380 multi_stream_data_sampler.py:289 : INFO     : iter_start=168, iter_end=171, len=25
5: 2025-07-04 16:20:13,870 904357 multi_stream_data_sampler.py:289 : INFO     : iter_start=125, iter_end=128, len=25
5: 2025-07-04 16:20:13,869 904365 multi_stream_data_sampler.py:426 : INFO     : 5::2 : dataset [125,150) : [131,134)
5: 2025-07-04 16:20:13,870 904361 multi_stream_data_sampler.py:289 : INFO     : iter_start=128, iter_end=131, len=25
5: 2025-07-04 16:20:13,870 904369 multi_stream_data_sampler.py:426 : INFO     : 5::3 : dataset [125,150) : [134,137)
5: 2025-07-04 16:20:13,870 904373 multi_stream_data_sampler.py:426 : INFO     : 5::4 : dataset [125,150) : [137,140)
5: 2025-07-04 16:20:13,870 904365 multi_stream_data_sampler.py:289 : INFO     : iter_start=131, iter_end=134, len=25
5: 2025-07-04 16:20:13,870 904377 multi_stream_data_sampler.py:426 : INFO     : 5::5 : dataset [125,150) : [140,143)
5: 2025-07-04 16:20:13,870 904369 multi_stream_data_sampler.py:289 : INFO     : iter_start=134, iter_end=137, len=25
5: 2025-07-04 16:20:13,870 904373 multi_stream_data_sampler.py:289 : INFO     : iter_start=137, iter_end=140, len=25
5: 2025-07-04 16:20:13,870 904382 multi_stream_data_sampler.py:426 : INFO     : 5::6 : dataset [125,150) : [143,146)
5: 2025-07-04 16:20:13,871 904377 multi_stream_data_sampler.py:289 : INFO     : iter_start=140, iter_end=143, len=25
5: 2025-07-04 16:20:13,871 904382 multi_stream_data_sampler.py:289 : INFO     : iter_start=143, iter_end=146, len=25
6: 2025-07-04 16:20:13,875 904402 multi_stream_data_sampler.py:426 : INFO     : 6::7 : dataset [150,175) : [171,175)
6: 2025-07-04 16:20:13,876 904402 multi_stream_data_sampler.py:289 : INFO     : iter_start=171, iter_end=175, len=25
5: 2025-07-04 16:20:13,878 904406 multi_stream_data_sampler.py:426 : INFO     : 5::7 : dataset [125,150) : [146,150)
5: 2025-07-04 16:20:13,879 904406 multi_stream_data_sampler.py:289 : INFO     : iter_start=146, iter_end=150, len=25
1: 2025-07-04 16:20:13,892 2516027 multi_stream_data_sampler.py:426 : INFO     : 1::0 : dataset [25,50) : [25,28)
1: 2025-07-04 16:20:13,892 2516030 multi_stream_data_sampler.py:426 : INFO     : 1::1 : dataset [25,50) : [28,31)
1: 2025-07-04 16:20:13,892 2516031 multi_stream_data_sampler.py:426 : INFO     : 1::2 : dataset [25,50) : [31,34)
1: 2025-07-04 16:20:13,892 2516034 multi_stream_data_sampler.py:426 : INFO     : 1::3 : dataset [25,50) : [34,37)
1: 2025-07-04 16:20:13,893 2516027 multi_stream_data_sampler.py:289 : INFO     : iter_start=25, iter_end=28, len=25
1: 2025-07-04 16:20:13,893 2516030 multi_stream_data_sampler.py:289 : INFO     : iter_start=28, iter_end=31, len=25
1: 2025-07-04 16:20:13,893 2516037 multi_stream_data_sampler.py:426 : INFO     : 1::4 : dataset [25,50) : [37,40)
1: 2025-07-04 16:20:13,893 2516031 multi_stream_data_sampler.py:289 : INFO     : iter_start=31, iter_end=34, len=25
1: 2025-07-04 16:20:13,893 2516034 multi_stream_data_sampler.py:289 : INFO     : iter_start=34, iter_end=37, len=25
1: 2025-07-04 16:20:13,893 2516040 multi_stream_data_sampler.py:426 : INFO     : 1::5 : dataset [25,50) : [40,43)
1: 2025-07-04 16:20:13,893 2516037 multi_stream_data_sampler.py:289 : INFO     : iter_start=37, iter_end=40, len=25
1: 2025-07-04 16:20:13,893 2516044 multi_stream_data_sampler.py:426 : INFO     : 1::6 : dataset [25,50) : [43,46)
1: 2025-07-04 16:20:13,894 2516040 multi_stream_data_sampler.py:289 : INFO     : iter_start=40, iter_end=43, len=25
1: 2025-07-04 16:20:13,894 2516044 multi_stream_data_sampler.py:289 : INFO     : iter_start=43, iter_end=46, len=25
2: 2025-07-04 16:20:13,898 2516026 multi_stream_data_sampler.py:426 : INFO     : 2::0 : dataset [50,75) : [50,53)
2: 2025-07-04 16:20:13,898 2516029 multi_stream_data_sampler.py:426 : INFO     : 2::1 : dataset [50,75) : [53,56)
3: 2025-07-04 16:20:13,898 2516025 multi_stream_data_sampler.py:426 : INFO     : 3::0 : dataset [75,100) : [75,78)
2: 2025-07-04 16:20:13,899 2516033 multi_stream_data_sampler.py:426 : INFO     : 2::2 : dataset [50,75) : [56,59)
3: 2025-07-04 16:20:13,899 2516028 multi_stream_data_sampler.py:426 : INFO     : 3::1 : dataset [75,100) : [78,81)
1: 2025-07-04 16:20:13,899 2516048 multi_stream_data_sampler.py:426 : INFO     : 1::7 : dataset [25,50) : [46,50)
2: 2025-07-04 16:20:13,899 2516029 multi_stream_data_sampler.py:289 : INFO     : iter_start=53, iter_end=56, len=25
2: 2025-07-04 16:20:13,899 2516026 multi_stream_data_sampler.py:289 : INFO     : iter_start=50, iter_end=53, len=25
2: 2025-07-04 16:20:13,899 2516035 multi_stream_data_sampler.py:426 : INFO     : 2::3 : dataset [50,75) : [59,62)
2: 2025-07-04 16:20:13,899 2516033 multi_stream_data_sampler.py:289 : INFO     : iter_start=56, iter_end=59, len=25
3: 2025-07-04 16:20:13,899 2516025 multi_stream_data_sampler.py:289 : INFO     : iter_start=75, iter_end=78, len=25
3: 2025-07-04 16:20:13,899 2516032 multi_stream_data_sampler.py:426 : INFO     : 3::2 : dataset [75,100) : [81,84)
2: 2025-07-04 16:20:13,899 2516038 multi_stream_data_sampler.py:426 : INFO     : 2::4 : dataset [50,75) : [62,65)
3: 2025-07-04 16:20:13,899 2516036 multi_stream_data_sampler.py:426 : INFO     : 3::3 : dataset [75,100) : [84,87)
1: 2025-07-04 16:20:13,899 2516048 multi_stream_data_sampler.py:289 : INFO     : iter_start=46, iter_end=50, len=25
3: 2025-07-04 16:20:13,899 2516028 multi_stream_data_sampler.py:289 : INFO     : iter_start=78, iter_end=81, len=25
2: 2025-07-04 16:20:13,899 2516042 multi_stream_data_sampler.py:426 : INFO     : 2::5 : dataset [50,75) : [65,68)
3: 2025-07-04 16:20:13,899 2516039 multi_stream_data_sampler.py:426 : INFO     : 3::4 : dataset [75,100) : [87,90)
2: 2025-07-04 16:20:13,900 2516045 multi_stream_data_sampler.py:426 : INFO     : 2::6 : dataset [50,75) : [68,71)
2: 2025-07-04 16:20:13,900 2516035 multi_stream_data_sampler.py:289 : INFO     : iter_start=59, iter_end=62, len=25
2: 2025-07-04 16:20:13,900 2516038 multi_stream_data_sampler.py:289 : INFO     : iter_start=62, iter_end=65, len=25
3: 2025-07-04 16:20:13,900 2516043 multi_stream_data_sampler.py:426 : INFO     : 3::5 : dataset [75,100) : [90,93)
3: 2025-07-04 16:20:13,900 2516032 multi_stream_data_sampler.py:289 : INFO     : iter_start=81, iter_end=84, len=25
3: 2025-07-04 16:20:13,900 2516036 multi_stream_data_sampler.py:289 : INFO     : iter_start=84, iter_end=87, len=25
3: 2025-07-04 16:20:13,900 2516039 multi_stream_data_sampler.py:289 : INFO     : iter_start=87, iter_end=90, len=25
2: 2025-07-04 16:20:13,900 2516042 multi_stream_data_sampler.py:289 : INFO     : iter_start=65, iter_end=68, len=25
3: 2025-07-04 16:20:13,900 2516046 multi_stream_data_sampler.py:426 : INFO     : 3::6 : dataset [75,100) : [93,96)
2: 2025-07-04 16:20:13,900 2516045 multi_stream_data_sampler.py:289 : INFO     : iter_start=68, iter_end=71, len=25
3: 2025-07-04 16:20:13,900 2516043 multi_stream_data_sampler.py:289 : INFO     : iter_start=90, iter_end=93, len=25
3: 2025-07-04 16:20:13,901 2516046 multi_stream_data_sampler.py:289 : INFO     : iter_start=93, iter_end=96, len=25
2: 2025-07-04 16:20:13,905 2516058 multi_stream_data_sampler.py:426 : INFO     : 2::7 : dataset [50,75) : [71,75)
2: 2025-07-04 16:20:13,906 2516058 multi_stream_data_sampler.py:289 : INFO     : iter_start=71, iter_end=75, len=25
3: 2025-07-04 16:20:13,906 2516059 multi_stream_data_sampler.py:426 : INFO     : 3::7 : dataset [75,100) : [96,100)
3: 2025-07-04 16:20:13,907 2516059 multi_stream_data_sampler.py:289 : INFO     : iter_start=96, iter_end=100, len=25
0: 2025-07-04 16:20:14,006 2516041 multi_stream_data_sampler.py:426 : INFO     : 0::0 : dataset [0,25) : [0,3)
0: 2025-07-04 16:20:14,007 2516047 multi_stream_data_sampler.py:426 : INFO     : 0::1 : dataset [0,25) : [3,6)
0: 2025-07-04 16:20:14,007 2516078 multi_stream_data_sampler.py:426 : INFO     : 0::2 : dataset [0,25) : [6,9)
0: 2025-07-04 16:20:14,007 2516041 multi_stream_data_sampler.py:289 : INFO     : iter_start=0, iter_end=3, len=25
0: 2025-07-04 16:20:14,007 2516047 multi_stream_data_sampler.py:289 : INFO     : iter_start=3, iter_end=6, len=25
0: 2025-07-04 16:20:14,007 2516079 multi_stream_data_sampler.py:426 : INFO     : 0::3 : dataset [0,25) : [9,12)
0: 2025-07-04 16:20:14,007 2516080 multi_stream_data_sampler.py:426 : INFO     : 0::4 : dataset [0,25) : [12,15)
0: 2025-07-04 16:20:14,008 2516078 multi_stream_data_sampler.py:289 : INFO     : iter_start=6, iter_end=9, len=25
0: 2025-07-04 16:20:14,008 2516081 multi_stream_data_sampler.py:426 : INFO     : 0::5 : dataset [0,25) : [15,18)
0: 2025-07-04 16:20:14,008 2516082 multi_stream_data_sampler.py:426 : INFO     : 0::6 : dataset [0,25) : [18,21)
0: 2025-07-04 16:20:14,008 2516079 multi_stream_data_sampler.py:289 : INFO     : iter_start=9, iter_end=12, len=25
0: 2025-07-04 16:20:14,008 2516080 multi_stream_data_sampler.py:289 : INFO     : iter_start=12, iter_end=15, len=25
0: 2025-07-04 16:20:14,008 2516082 multi_stream_data_sampler.py:289 : INFO     : iter_start=18, iter_end=21, len=25
0: 2025-07-04 16:20:14,008 2516081 multi_stream_data_sampler.py:289 : INFO     : iter_start=15, iter_end=18, len=25
0: 2025-07-04 16:20:14,016 2516083 multi_stream_data_sampler.py:426 : INFO     : 0::7 : dataset [0,25) : [21,25)
0: 2025-07-04 16:20:14,017 2516083 multi_stream_data_sampler.py:289 : INFO     : iter_start=21, iter_end=25, len=25
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
0: slurmstepd: error: *** STEP 18136338.0 ON l50060 CANCELLED AT 2025-07-04T16:20:40 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 18136338 ON l50060 CANCELLED AT 2025-07-04T16:20:40 DUE TO TIME LIMIT ***

@clessig (Collaborator) commented Jul 7, 2025

That's what the default SLURM setup looks like:

#SBATCH --output=logs/obs-%x.%j.out
#SBATCH --error=logs/obs-%x.%j.err

srun --label train > output/output_${SLURM_JOBID}.txt

Maybe the mix of sbatch redirection and piping explains the behavior.
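
A minimal sketch of that interplay (paths and commands taken from the lines above, everything else illustrative): the #SBATCH directives capture what the batch script itself writes to stdout/stderr, while the explicit shell redirection on srun sends the training command's stdout to output_${SLURM_JOBID}.txt; a logger writing to stderr therefore ends up in the --error file instead.

#!/bin/bash
#SBATCH --output=logs/obs-%x.%j.out   # stdout of the batch script
#SBATCH --error=logs/obs-%x.%j.err    # stderr, including logging via StreamHandler()

# stdout of the training command is piped to a separate file,
# so print() output lands in output_${SLURM_JOBID}.txt ...
srun --label train > output/output_${SLURM_JOBID}.txt
# ... while anything written to stderr falls through to the --error file above.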

@enssow enssow marked this pull request as draft July 7, 2025 15:29
@enssow (Contributor, Author) commented Jul 8, 2025

Following yesterday's meeting I've reverted this to a draft, as there are still some questions I have to sort out.

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.

Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

So I had been using the uv run --offline train command in an interactive GPU session, but this wasn't creating the output and log folders. I'm pretty sure this is a problem on my side, as I set up my forked repo with a different name. I'm fixing this now and will follow up on the question about how the log/output is being directed.

@clessig (Collaborator) commented Jul 8, 2025

Following yesterday's meeting I've reverted this to a draft, as there are still some questions I have to sort out.

Thank you Sorcha for the change. Regarding the failing integration test: indeed, a fix had been reverted; let's handle it separately.
Have you been able to make a limited run with real data? I am a bit surprised by the comment from @kacpnowak, since before this PR both the print and logging statements are redirected to output_xxx.txt:

0: 023 : 01020/01024 : 024572 : loss = 2.9625E-02 (lr=9.77E-10, s/sec=0.746)
0: 	ERA5 : 2.9625E-02
0:
1: 2025-07-05 05:01:09,709 183795 trainer.py:321 : INFO     : Epoch 23 of 24: validate.
3: 2025-07-05 05:01:09,709 183797 trainer.py:321 : INFO     : Epoch 23 of 24: validate.

So I had been using the uv run --offline train command in an interactive GPU session, but this wasn't creating the output and log folders. I'm pretty sure this is a problem on my side, as I set up my forked repo with a different name. I'm fixing this now and will follow up on the question about how the log/output is being directed.

The content of the log and output folders comes from the above lines in a batch SLURM script. In an interactive session everything is just written to the terminal.

@enssow (Contributor, Author) commented Jul 9, 2025

I've done the ruff fixes and removed src/weathergen/utils/run_id.py. I have done a limited run on an interactive GPU using the SLURM commands in the wiki, and everything was directed to output_xxx.txt as expected. Are those all the changes necessary now?

@@ -1,3 +1,5 @@
# ruff: noqa: T201
Collaborator

this breaks the script
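
For context on the directive in the diff above: T201 is ruff's flake8-print rule that flags print() calls, and a file-level # ruff: noqa: T201 comment exempts the whole file; a minimal, illustrative example (not the project's actual script):

#!/usr/bin/env python3
# ruff: noqa: T201
# File-level exemption: T201 ("print found") is suppressed for this script,
# so intentional print() calls are not flagged by ruff.
print("prints in this file are intentional")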

@tjhunter tjhunter marked this pull request as ready for review July 10, 2025 10:53
@tjhunter (Collaborator) left a comment
Approved

@tjhunter tjhunter merged commit ca09ca7 into ecmwf:develop Jul 10, 2025
3 checks passed
Successfully merging this pull request may close these issues:

Change the print() statements to logger statements (except in the train loop)