Skip to content

Warning message regarding free surface when using checkpoint is not consistent #4516

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
roxyboy opened this issue May 16, 2025 · 20 comments
Open

Comments

@roxyboy
Copy link

roxyboy commented May 16, 2025

Hi Oceananigans,

I would like to report a warning message that could be confusing.
Continuing the discussion, I am getting the following error when restarting the simulation from a pickup file:

julia> chkpt = 67500
julia> run!(simulation, pickup=chkpt)
------------------------------------
┌ Warning: Tendencies for η do not exist in checkpoint and could not be restored.
└ @ Oceananigans.OutputWriters [~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl:277](http://127.0.0.1:8989/lab/tree/Oceananigans/ocean_wind_mixing_and_convection/~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl#line=276)

The set up I have is the following:

using Oceananigans
using Oceananigans.Units

Lx = 100kilometers # east-west extent [m]
Ly = 100kilometers # north-south extent [m]
Lz = 500    # depth [m]

grid = RectilinearGrid(GPU(), size = (512, 512, 128),
                       x = (0, Lx),
                       y = (-Ly/2, Ly/2),
                       z = (-Lz, 0),
                       topology = (Periodic, Periodic, Bounded)
)

model = HydrostaticFreeSurfaceModel(; grid, 
            			    buoyancy = BuoyancyTracer(),
                                    coriolis = FPlane(f=1e-4),
                                    closure = CATKEVerticalDiffusivity(),
                                    tracers = (:b, :e), 
            			    boundary_conditions = (u=u_bcs, b=b_bcs),
                                    momentum_advection = WENO(order=5),
                                    tracer_advection = WENO(order=5)
)

The model seems to run despite the warning message.

@roxyboy roxyboy changed the title Warning message regarding free surface with using checkpoint is not consistent Warning message regarding free surface when using checkpoint is not consistent May 16, 2025
@glwagner
Copy link
Member

@navidcy has been looking at the checkpointer so he might have some info

@navidcy
Copy link
Collaborator

navidcy commented May 18, 2025

Hm... What's confusing about this message? I think the message it's clear.

The Quasi-Adams-Bashforth time stepper requires the tendencies of the previous time step. So what happens here (if I am deducing from the warning and assuming that there is no bug or something) is that the pickup from checkpointer occurs but the tendencies from the previous timestep of the free surface are zero. This must result in slight discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations, right?

@roxyboy
Copy link
Author

roxyboy commented May 18, 2025

The Quasi-Adams-Bashforth time stepper requires the tendencies of the previous time step. So what happens here (if I am deducing from the warning and assuming that there is no bug or something) is that the pickup from checkpointer occurs but the tendencies from the previous timestep of the free surface are zero. This must result in slight discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations, right?

This seems to imply that I should explicitly save the tendencies of free surface in the pickup files. I've been saving the pickups by

simulation.output_writers[:checkpointer] = Checkpointer(model, 
    schedule=IterationInterval(22500), 
    prefix="buoyancy_case_model_checkpoint"
)

but how should I change this so that it saves the free surface tendency?
As a user, I think it would better if the "discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations" doesn't happen.

@navidcy
Copy link
Collaborator

navidcy commented May 18, 2025

Of course. I don’t disagree. I think the warning was trying to communicate that you shouldn’t expect bitwise reproducibility. Perhaps we can rephrase it?

@navidcy
Copy link
Collaborator

navidcy commented May 18, 2025

I’ll have a look into that. There’s a good chance that it’s an old warning because saving a reduced field wasn’t easy then. Perhaps saving the tendencies is straightforward now.

@glwagner
Copy link
Member

Yes the question in my mind is more: "why don't we save the tendencies of the free surface displacement?"

@simone-silvestri
Copy link
Collaborator

simone-silvestri commented May 18, 2025

In this case, the model uses a SplitExplicitFreeSurface, so we don't need to save down / store tendencies for η.

function hydrostatic_tendency_fields(velocities, free_surface::SplitExplicitFreeSurface, grid, tracer_names, bcs)
u = XFaceField(grid, boundary_conditions=bcs.u)
v = YFaceField(grid, boundary_conditions=bcs.v)
@apply_regionally U_bcs = barotropic_velocity_boundary_conditions(velocities.u)
@apply_regionally V_bcs = barotropic_velocity_boundary_conditions(velocities.v)
free_surface_grid = free_surface.η.grid
U = Field{Face, Center, Nothing}(free_surface_grid, boundary_conditions=U_bcs)
V = Field{Center, Face, Nothing}(free_surface_grid, boundary_conditions=V_bcs)
tracers = TracerFields(tracer_names, grid, bcs)
return merge((u=u, v=v, U=U, V=V), tracers)
end

I think here
if string(name) keys(file["$addr/timestepper/Gⁿ"]) # Test if variable tendencies exist in checkpoint

we should also check if a property like that exists in the timestepper:

if string(name) ∈ keys(file["$addr/timestepper/Gⁿ"]) && hasproperty(timestepper.Gⁿ, name) # Test if variable tendencies exist in checkpoint _and_ in the timestepper

@glwagner
Copy link
Member

Hm, so we only need tendencies for ExplicitFreeSurface -- is that right?

@glwagner
Copy link
Member

glwagner commented May 18, 2025

The follow-up question for @roxyboy is whether bitwise checkpointing succeeds or not. To test this, do something like

# control run
simulation1.stop_time = T
run!(simulation1)

# checkpointed run
simulation2.stop_time = T/2
run!(simulation2)
simulation2.stop_time = T
run!(simulation2, pickup=true)

u1 = simulation1.model.velocities.u
u2 = simulation2.model.velocities.u
@show parent(u1) == parent(u2)

this can be done at low resolution on a laptop with the setup being developed to test

@roxyboy
Copy link
Author

roxyboy commented May 19, 2025

> # control run
> simulation1.stop_time = T
> run!(simulation1)
> 
> # checkpointed run
> simulation2.stop_time = T/2
> run!(simulation2)
> simulation2.stop_time = T
> run!(simulation2, pickup=true)
> 
> u1 = simulation1.model.velocities.u
> u2 = simulation2.model.velocities.u
> @show parent(u1) == parent(u2)
> this can be done at low resolution on a laptop with the setup being developed to test

I'm testing this now but the time stepping based on the CFL condition changes every time I pick up from the same checkpoint so I'm skeptical that bit-wise reproducibility is being achieved...

@simone-silvestri
Copy link
Collaborator

@roxyboy, you shouldn't use a CFL-based timestepping with the default timestepper (AB2) because with a variable timestep AB2 is not valid, and the timestepping reduces to forward Euler, which is much worse than AB2. Try setting a fixed timestep.

@roxyboy
Copy link
Author

roxyboy commented May 19, 2025

@roxyboy, you shouldn't use a CFL-based timestepping with the default timestepper (AB2) because with a variable timestep AB2 is not valid, and the timestepping reduces to forward Euler, which is much worse than AB2. Try setting a fixed timestep.

This is very good to know, thank you! But still, (I think) the fact that CFL-based timestepping gives different time steps every time when the run is restarted from a pickup file is not ideal.

@glwagner
Copy link
Member

I think you're right that is shows the states are not identical

@glwagner
Copy link
Member

@simone-silvestri couldn't we still update the time-step, say, every 100 iterations without doing too much damage to the state? It does not reduce the time-stepping to forward Euler every time step just because we change the time-step once, does it?

@roxyboy
Copy link
Author

roxyboy commented May 20, 2025

> # control run
> simulation1.stop_time = T
> run!(simulation1)
> 
> # checkpointed run
> simulation2.stop_time = T/2
> run!(simulation2)
> simulation2.stop_time = T
> run!(simulation2, pickup=true)
> 
> u1 = simulation1.model.velocities.u
> u2 = simulation2.model.velocities.u
> @show parent(u1) == parent(u2)
> this can be done at low resolution on a laptop with the setup being developed to test

I'm testing this now but the time stepping based on the CFL condition changes every time I pick up from the same checkpoint so I'm skeptical that bit-wise reproducibility is being achieved...

So, it seems that @show parent(u1) == parent(u2) gives true so I may have been wrong. I think bit-wise reproducibility is being achieved without the free-surface tendencies being saved in the pick up files under constant time stepping. Below is a reproducible example taken from the tutorial:

using Oceananigans
using Oceananigans.Units

Lx = 100kilometers # east-west extent [m]
Ly = 100kilometers # north-south extent [m]
Lz = 500    # depth [m]

grid = RectilinearGrid(GPU(), size = (64, 64, 32),
                       x = (0, Lx),
                       y = (-Ly/2, Ly/2),
                       z = (-Lz, 0),
                       topology = (Periodic, Bounded, Bounded)
)

model = HydrostaticFreeSurfaceModel(; grid, 
            			    buoyancy = BuoyancyTracer(),
                                    coriolis = BetaPlane(latitude = -45),
                                    closure = CATKEVerticalDiffusivity(),
                                    tracers = (:b, :e), 
                                    momentum_advection = WENO(order=5),
                                    tracer_advection = WENO(order=5)
)

ramp(y, Δy) = min(max(0, y/Δy + 1/2), 1)

N² = 1e-5 # [s⁻²] buoyancy frequency / stratification
M² = 1e-7 # [s⁻²] horizontal buoyancy gradient

Δy = 100kilometers # width of the region of the front
Δb = Δy * M²       # buoyancy jump associated with the front
ϵb = 1e-2 * Δb     # noise amplitude

bᵢ(x, y, z) = N² * z + Δb * ramp(y, Δy) + ϵb * randn()

set!(model, b=bᵢ)

simulation1 = Simulation(model, Δt=10seconds)
T = 1day
simulation1.stop_time = T
run!(simulation1)

simulation2 = Simulation(model, Δt=10seconds)

simulation2.output_writers[:checkpointer] = Checkpointer(model, 
    schedule=IterationInterval(4320), 
    prefix="bitwise_checkpoint"
)

# checkpointed run
# simulation2.stop_time = T/2
# run!(simulation2)
simulation2.stop_time = T
run!(simulation2, pickup=4320)

u1 = simulation1.model.velocities.u
u2 = simulation2.model.velocities.u
@show parent(u1) == parent(u2)
---------------------------------
┌ Warning: This simulation will run forever as stop iteration = stop time = wall time limit = Inf.
└ @ Oceananigans.Simulations [~/.julia/packages/Oceananigans/YRhAo/src/Simulations/simulation.jl:64](http://127.0.0.1:8989/lab/tree/Oceananigans/ocean_wind_mixing_and_convection/~/.julia/packages/Oceananigans/YRhAo/src/Simulations/simulation.jl#line=63)
┌ Warning: Tendencies for η do not exist in checkpoint and could not be restored.
└ @ Oceananigans.OutputWriters [~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl:277](http://127.0.0.1:8989/lab/tree/Oceananigans/ocean_wind_mixing_and_convection/~/.julia/packages/Oceananigans/YRhAo/src/OutputWriters/checkpointer.jl#line=276)
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (24.418 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (3.061 ms).
[ Info: Simulation is stopping after running for 13.521 seconds.
[ Info: Simulation time 1 day equals or exceeds stop time 1 day.
Iteration: 8640, time: 1 day, Δt: 10 seconds, max(|w|) = 3.7e-03 ms⁻¹, wall time: 13.521 seconds
parent(u1) == parent(u2) = true
true

@simone-silvestri
Copy link
Collaborator

simone-silvestri commented May 20, 2025

@simone-silvestri couldn't we still update the time-step, say, every 100 iterations without doing too much damage to the state? It does not reduce the time-stepping to forward Euler every time step just because we change the time-step once, does it?

I am unsure how much damage 1 in a 100 timestep can do. I guess it is a matter of experimentation. I gave up trying to use a wizard with AB2 because I found out that in most practical cases (talking only about large scale ocean modeling here...) it is more detrimental than anything, (even on 100 timesteps) it creates velocity extrema that, to be cleaned up, require reducing the time step in the next wizard pass, hence effectively running at a lower timestep that what would be necessary without wizard. For the cases I typically ran, a wizard does not necessarily destabilize the simulation (because the following 99 AB2 steps clean up the errors in the 1 FE step), but it slows down the computation, defeating the purpose of having a wizard.

A problem of the wizard limiting to FE is that, typically, you want to set the wizard to the typical CFL limit of AB2 (not the one of FE, which is roughly 50% lower than what we can achieve with AB2), so that one FE step will be by definition outside of the recommended stability limit and create problems. The effect on the simulation is, for sure, on a case-to-case basis, and probably more rigorous testing needs to be done. If a user is willing to try and tune the CFL limit and the update schedule of the wizard, he might find good results; it is something to be aware of, though.

Anecdotally, I remember some time ago @sandreza told me that speedyweather used FE at the first time step to restart, and that single FE timestep was enough to make some simulations completely diverge and crash when compared to their counterpart without checkpointing. To be fair, I think those were simulations at the limit of what the numerics could handle.

@simone-silvestri
Copy link
Collaborator

simone-silvestri commented May 20, 2025

Just whipped up an example for a 1D tracer advection solution, probably representative of advection-dominated problems. I am running with RK3 at CFL=0.5 (below the required timestep size, RK3 can handle CFL 0.7). I interject a FE step every 100 timesteps to see how the solution evolves. This is without adaptive timestep, but the velocity is constant (u=1), so the timestep size would not change anyway. The presence of false extrema is quite striking, even at 1 in 100 timesteps

evolution.mp4

@navidcy
Copy link
Collaborator

navidcy commented May 20, 2025

Wow. I wouldn't expect to have such a long-lasting effect...

@simone-silvestri
Copy link
Collaborator

simone-silvestri commented May 20, 2025

This is the same test case with CFL 0.3 and FE steps every 20 iterations.

evolution.mp4

@glwagner
Copy link
Member

So, it seems that @show parent(u1) == parent(u2) gives true so I may have been wrong. I think bit-wise reproducibility is being achieved without the free-surface tendencies being saved in the pick up files under constant time stepping. Below is a reproducible example taken from the tutorial:

Very, very helpful! I am actually somewhat surprised, because I believe there is some history that CATKE needs which is not recorded right now, so this is useful to know. I think CATKE's memory terms may prevent bitwise reproducibility (something we are going to try to fix this week), but I do not think it would affect any science conclusions (parameterization or god forbid advection scheme errors like the ones @simone-silvestri is illustrating greatly exceed checkpointing differences, probably).

@simone-silvestri that is illuminating --- it highlights the urgency to switch to RK3 for the hydrostatic model ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants