Warning message regarding free surface when using checkpoint is not consistent #4516

roxyboy · 2025-05-16T15:28:11Z

Hi Oceananigans,

I would like to report a warning message that could be confusing.
Continuing the discussion, I am getting the following error when restarting the simulation from a pickup file:

julia> chkpt = 67500
julia> run!(simulation, pickup=chkpt)
------------------------------------
┌ Warning: Tendencies for η do not exist in checkpoint and could not be restored.
└ @ Oceananigans.OutputWriters [~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl:277](http://127.0.0.1:8989/lab/tree/Oceananigans/ocean_wind_mixing_and_convection/~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl#line=276)

The set up I have is the following:

using Oceananigans
using Oceananigans.Units

Lx = 100kilometers # east-west extent [m]
Ly = 100kilometers # north-south extent [m]
Lz = 500    # depth [m]

grid = RectilinearGrid(GPU(), size = (512, 512, 128),
                       x = (0, Lx),
                       y = (-Ly/2, Ly/2),
                       z = (-Lz, 0),
                       topology = (Periodic, Periodic, Bounded)
)

model = HydrostaticFreeSurfaceModel(; grid, 
            			    buoyancy = BuoyancyTracer(),
                                    coriolis = FPlane(f=1e-4),
                                    closure = CATKEVerticalDiffusivity(),
                                    tracers = (:b, :e), 
            			    boundary_conditions = (u=u_bcs, b=b_bcs),
                                    momentum_advection = WENO(order=5),
                                    tracer_advection = WENO(order=5)
)

The model seems to run despite the warning message.

The text was updated successfully, but these errors were encountered:

glwagner · 2025-05-16T19:33:59Z

@navidcy has been looking at the checkpointer so he might have some info

navidcy · 2025-05-18T06:04:51Z

Hm... What's confusing about this message? I think the message it's clear.

The Quasi-Adams-Bashforth time stepper requires the tendencies of the previous time step. So what happens here (if I am deducing from the warning and assuming that there is no bug or something) is that the pickup from checkpointer occurs but the tendencies from the previous timestep of the free surface are zero. This must result in slight discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations, right?

roxyboy · 2025-05-18T07:28:00Z

The Quasi-Adams-Bashforth time stepper requires the tendencies of the previous time step. So what happens here (if I am deducing from the warning and assuming that there is no bug or something) is that the pickup from checkpointer occurs but the tendencies from the previous timestep of the free surface are zero. This must result in slight discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations, right?

This seems to imply that I should explicitly save the tendencies of free surface in the pickup files. I've been saving the pickups by

simulation.output_writers[:checkpointer] = Checkpointer(model, 
    schedule=IterationInterval(22500), 
    prefix="buoyancy_case_model_checkpoint"
)

but how should I change this so that it saves the free surface tendency?
As a user, I think it would better if the "discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations" doesn't happen.

navidcy · 2025-05-18T09:49:08Z

Of course. I don’t disagree. I think the warning was trying to communicate that you shouldn’t expect bitwise reproducibility. Perhaps we can rephrase it?

navidcy · 2025-05-18T09:50:17Z

I’ll have a look into that. There’s a good chance that it’s an old warning because saving a reduced field wasn’t easy then. Perhaps saving the tendencies is straightforward now.

glwagner · 2025-05-18T16:06:15Z

Yes the question in my mind is more: "why don't we save the tendencies of the free surface displacement?"

simone-silvestri · 2025-05-18T16:41:25Z

In this case, the model uses a SplitExplicitFreeSurface, so we don't need to save down / store tendencies for η.

Oceananigans.jl/src/Models/HydrostaticFreeSurfaceModels/SplitExplicitFreeSurfaces/split_explicit_free_surface.jl

Lines 163 to 178 in e611d00

    
           function hydrostatic_tendency_fields(velocities, free_surface::SplitExplicitFreeSurface, grid, tracer_names, bcs) 
        
               u = XFaceField(grid, boundary_conditions=bcs.u) 
        
               v = YFaceField(grid, boundary_conditions=bcs.v) 
        
               @apply_regionally U_bcs = barotropic_velocity_boundary_conditions(velocities.u) 
        
               @apply_regionally V_bcs = barotropic_velocity_boundary_conditions(velocities.v) 
        
               free_surface_grid = free_surface.η.grid 
        
               U = Field{Face, Center, Nothing}(free_surface_grid, boundary_conditions=U_bcs) 
        
               V = Field{Center, Face, Nothing}(free_surface_grid, boundary_conditions=V_bcs) 
        
               tracers = TracerFields(tracer_names, grid, bcs) 
        
               return merge((u=u, v=v, U=U, V=V), tracers) 
        
           end

I think here

Oceananigans.jl/src/OutputWriters/checkpointer.jl

Line 264 in e611d00

    
           if string(name) ∈ keys(file["$addr/timestepper/Gⁿ"]) # Test if variable tendencies exist in checkpoint

we should also check if a property like that exists in the timestepper:

if string(name) ∈ keys(file["$addr/timestepper/Gⁿ"]) && hasproperty(timestepper.Gⁿ, name) # Test if variable tendencies exist in checkpoint _and_ in the timestepper

glwagner · 2025-05-18T17:14:01Z

Hm, so we only need tendencies for ExplicitFreeSurface -- is that right?

glwagner · 2025-05-18T17:16:40Z

The follow-up question for @roxyboy is whether bitwise checkpointing succeeds or not. To test this, do something like

# control run
simulation1.stop_time = T
run!(simulation1)

# checkpointed run
simulation2.stop_time = T/2
run!(simulation2)
simulation2.stop_time = T
run!(simulation2, pickup=true)

u1 = simulation1.model.velocities.u
u2 = simulation2.model.velocities.u
@show parent(u1) == parent(u2)

this can be done at low resolution on a laptop with the setup being developed to test

roxyboy · 2025-05-19T18:08:53Z

> # control run
> simulation1.stop_time = T
> run!(simulation1)
> 
> # checkpointed run
> simulation2.stop_time = T/2
> run!(simulation2)
> simulation2.stop_time = T
> run!(simulation2, pickup=true)
> 
> u1 = simulation1.model.velocities.u
> u2 = simulation2.model.velocities.u
> @show parent(u1) == parent(u2)
> this can be done at low resolution on a laptop with the setup being developed to test

I'm testing this now but the time stepping based on the CFL condition changes every time I pick up from the same checkpoint so I'm skeptical that bit-wise reproducibility is being achieved...

simone-silvestri · 2025-05-19T18:22:29Z

@roxyboy, you shouldn't use a CFL-based timestepping with the default timestepper (AB2) because with a variable timestep AB2 is not valid, and the timestepping reduces to forward Euler, which is much worse than AB2. Try setting a fixed timestep.

roxyboy · 2025-05-19T18:29:08Z

@roxyboy, you shouldn't use a CFL-based timestepping with the default timestepper (AB2) because with a variable timestep AB2 is not valid, and the timestepping reduces to forward Euler, which is much worse than AB2. Try setting a fixed timestep.

This is very good to know, thank you! But still, (I think) the fact that CFL-based timestepping gives different time steps every time when the run is restarted from a pickup file is not ideal.

glwagner · 2025-05-19T22:05:51Z

I think you're right that is shows the states are not identical

glwagner · 2025-05-19T23:23:29Z

@simone-silvestri couldn't we still update the time-step, say, every 100 iterations without doing too much damage to the state? It does not reduce the time-stepping to forward Euler every time step just because we change the time-step once, does it?

roxyboy · 2025-05-20T07:16:05Z

> # control run
> simulation1.stop_time = T
> run!(simulation1)
> 
> # checkpointed run
> simulation2.stop_time = T/2
> run!(simulation2)
> simulation2.stop_time = T
> run!(simulation2, pickup=true)
> 
> u1 = simulation1.model.velocities.u
> u2 = simulation2.model.velocities.u
> @show parent(u1) == parent(u2)
> this can be done at low resolution on a laptop with the setup being developed to test
I'm testing this now but the time stepping based on the CFL condition changes every time I pick up from the same checkpoint so I'm skeptical that bit-wise reproducibility is being achieved...

So, it seems that @show parent(u1) == parent(u2) gives true so I may have been wrong. I think bit-wise reproducibility is being achieved without the free-surface tendencies being saved in the pick up files under constant time stepping. Below is a reproducible example taken from the tutorial:

using Oceananigans
using Oceananigans.Units

Lx = 100kilometers # east-west extent [m]
Ly = 100kilometers # north-south extent [m]
Lz = 500    # depth [m]

grid = RectilinearGrid(GPU(), size = (64, 64, 32),
                       x = (0, Lx),
                       y = (-Ly/2, Ly/2),
                       z = (-Lz, 0),
                       topology = (Periodic, Bounded, Bounded)
)

model = HydrostaticFreeSurfaceModel(; grid, 
            			    buoyancy = BuoyancyTracer(),
                                    coriolis = BetaPlane(latitude = -45),
                                    closure = CATKEVerticalDiffusivity(),
                                    tracers = (:b, :e), 
                                    momentum_advection = WENO(order=5),
                                    tracer_advection = WENO(order=5)
)

ramp(y, Δy) = min(max(0, y/Δy + 1/2), 1)

N² = 1e-5 # [s⁻²] buoyancy frequency / stratification
M² = 1e-7 # [s⁻²] horizontal buoyancy gradient

Δy = 100kilometers # width of the region of the front
Δb = Δy * M²       # buoyancy jump associated with the front
ϵb = 1e-2 * Δb     # noise amplitude

bᵢ(x, y, z) = N² * z + Δb * ramp(y, Δy) + ϵb * randn()

set!(model, b=bᵢ)

simulation1 = Simulation(model, Δt=10seconds)
T = 1day
simulation1.stop_time = T
run!(simulation1)

simulation2 = Simulation(model, Δt=10seconds)

simulation2.output_writers[:checkpointer] = Checkpointer(model, 
    schedule=IterationInterval(4320), 
    prefix="bitwise_checkpoint"
)

# checkpointed run
# simulation2.stop_time = T/2
# run!(simulation2)
simulation2.stop_time = T
run!(simulation2, pickup=4320)

u1 = simulation1.model.velocities.u
u2 = simulation2.model.velocities.u
@show parent(u1) == parent(u2)
---------------------------------
┌ Warning: This simulation will run forever as stop iteration = stop time = wall time limit = Inf.
└ @ Oceananigans.Simulations [~/.julia/packages/Oceananigans/YRhAo/src/Simulations/simulation.jl:64](http://127.0.0.1:8989/lab/tree/Oceananigans/ocean_wind_mixing_and_convection/~/.julia/packages/Oceananigans/YRhAo/src/Simulations/simulation.jl#line=63)
┌ Warning: Tendencies for η do not exist in checkpoint and could not be restored.
└ @ Oceananigans.OutputWriters [~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl:277](http://127.0.0.1:8989/lab/tree/Oceananigans/ocean_wind_mixing_and_convection/~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl#line=276)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (24.418 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (3.061 ms).
[ Info: Simulation is stopping after running for 13.521 seconds.
[ Info: Simulation time 1 day equals or exceeds stop time 1 day.
Iteration: 8640, time: 1 day, Δt: 10 seconds, max(|w|) = 3.7e-03 ms⁻¹, wall time: 13.521 seconds
parent(u1) == parent(u2) = true
true

simone-silvestri · 2025-05-20T08:07:19Z

@simone-silvestri couldn't we still update the time-step, say, every 100 iterations without doing too much damage to the state? It does not reduce the time-stepping to forward Euler every time step just because we change the time-step once, does it?

I am unsure how much damage 1 in a 100 timestep can do. I guess it is a matter of experimentation. I gave up trying to use a wizard with AB2 because I found out that in most practical cases (talking only about large scale ocean modeling here...) it is more detrimental than anything, (even on 100 timesteps) it creates velocity extrema that, to be cleaned up, require reducing the time step in the next wizard pass, hence effectively running at a lower timestep that what would be necessary without wizard. For the cases I typically ran, a wizard does not necessarily destabilize the simulation (because the following 99 AB2 steps clean up the errors in the 1 FE step), but it slows down the computation, defeating the purpose of having a wizard.

A problem of the wizard limiting to FE is that, typically, you want to set the wizard to the typical CFL limit of AB2 (not the one of FE, which is roughly 50% lower than what we can achieve with AB2), so that one FE step will be by definition outside of the recommended stability limit and create problems. The effect on the simulation is, for sure, on a case-to-case basis, and probably more rigorous testing needs to be done. If a user is willing to try and tune the CFL limit and the update schedule of the wizard, he might find good results; it is something to be aware of, though.

Anecdotally, I remember some time ago @sandreza told me that speedyweather used FE at the first time step to restart, and that single FE timestep was enough to make some simulations completely diverge and crash when compared to their counterpart without checkpointing. To be fair, I think those were simulations at the limit of what the numerics could handle.

simone-silvestri · 2025-05-20T08:37:23Z

Just whipped up an example for a 1D tracer advection solution, probably representative of advection-dominated problems. I am running with RK3 at CFL=0.5 (below the required timestep size, RK3 can handle CFL 0.7). I interject a FE step every 100 timesteps to see how the solution evolves. This is without adaptive timestep, but the velocity is constant (u=1), so the timestep size would not change anyway. The presence of false extrema is quite striking, even at 1 in 100 timesteps

evolution.mp4

navidcy · 2025-05-20T08:42:36Z

Wow. I wouldn't expect to have such a long-lasting effect...

simone-silvestri · 2025-05-20T08:42:44Z

This is the same test case with CFL 0.3 and FE steps every 20 iterations.

evolution.mp4

glwagner · 2025-05-20T11:14:37Z

So, it seems that @show parent(u1) == parent(u2) gives true so I may have been wrong. I think bit-wise reproducibility is being achieved without the free-surface tendencies being saved in the pick up files under constant time stepping. Below is a reproducible example taken from the tutorial:

Very, very helpful! I am actually somewhat surprised, because I believe there is some history that CATKE needs which is not recorded right now, so this is useful to know. I think CATKE's memory terms may prevent bitwise reproducibility (something we are going to try to fix this week), but I do not think it would affect any science conclusions (parameterization or god forbid advection scheme errors like the ones @simone-silvestri is illustrating greatly exceed checkpointing differences, probably).

@simone-silvestri that is illuminating --- it highlights the urgency to switch to RK3 for the hydrostatic model ASAP.

roxyboy changed the title ~~Warning message regarding free surface with using checkpoint is not consistent~~ Warning message regarding free surface when using checkpoint is not consistent May 16, 2025

navidcy added output 💾 user interface/experience 💻 labels May 18, 2025

simone-silvestri mentioned this issue May 20, 2025

Fix warning message for set timestepper #4528

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warning message regarding free surface when using checkpoint is not consistent #4516

Warning message regarding free surface when using checkpoint is not consistent #4516

roxyboy commented May 16, 2025 •

edited by navidcy

Loading

glwagner commented May 16, 2025

navidcy commented May 18, 2025

roxyboy commented May 18, 2025 •

edited

Loading

navidcy commented May 18, 2025

navidcy commented May 18, 2025

glwagner commented May 18, 2025

simone-silvestri commented May 18, 2025 •

edited

Loading

glwagner commented May 18, 2025

glwagner commented May 18, 2025 •

edited

Loading

roxyboy commented May 19, 2025 •

edited

Loading

simone-silvestri commented May 19, 2025

roxyboy commented May 19, 2025

glwagner commented May 19, 2025

glwagner commented May 19, 2025

roxyboy commented May 20, 2025 •

edited

Loading

simone-silvestri commented May 20, 2025 •

edited

Loading

simone-silvestri commented May 20, 2025 •

edited

Loading

navidcy commented May 20, 2025

simone-silvestri commented May 20, 2025 •

edited

Loading

glwagner commented May 20, 2025

Warning message regarding free surface when using checkpoint is not consistent #4516

Warning message regarding free surface when using checkpoint is not consistent #4516

Comments

roxyboy commented May 16, 2025 • edited by navidcy Loading

glwagner commented May 16, 2025

navidcy commented May 18, 2025

roxyboy commented May 18, 2025 • edited Loading

navidcy commented May 18, 2025

navidcy commented May 18, 2025

glwagner commented May 18, 2025

simone-silvestri commented May 18, 2025 • edited Loading

glwagner commented May 18, 2025

glwagner commented May 18, 2025 • edited Loading

roxyboy commented May 19, 2025 • edited Loading

simone-silvestri commented May 19, 2025

roxyboy commented May 19, 2025

glwagner commented May 19, 2025

glwagner commented May 19, 2025

roxyboy commented May 20, 2025 • edited Loading

simone-silvestri commented May 20, 2025 • edited Loading

simone-silvestri commented May 20, 2025 • edited Loading

navidcy commented May 20, 2025

simone-silvestri commented May 20, 2025 • edited Loading

glwagner commented May 20, 2025

roxyboy commented May 16, 2025 •

edited by navidcy

Loading

roxyboy commented May 18, 2025 •

edited

Loading

simone-silvestri commented May 18, 2025 •

edited

Loading

glwagner commented May 18, 2025 •

edited

Loading

roxyboy commented May 19, 2025 •

edited

Loading

roxyboy commented May 20, 2025 •

edited

Loading

simone-silvestri commented May 20, 2025 •

edited

Loading

simone-silvestri commented May 20, 2025 •

edited

Loading

simone-silvestri commented May 20, 2025 •

edited

Loading