-
Notifications
You must be signed in to change notification settings - Fork 235
Warning message regarding free surface when using checkpoint is not consistent #4516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@navidcy has been looking at the checkpointer so he might have some info |
Hm... What's confusing about this message? I think the message it's clear. The Quasi-Adams-Bashforth time stepper requires the tendencies of the previous time step. So what happens here (if I am deducing from the warning and assuming that there is no bug or something) is that the pickup from checkpointer occurs but the tendencies from the previous timestep of the free surface are zero. This must result in slight discrepancy between eg if you run up to iteration 1000 or if you run up to 500 and then pick up and run for another 500 iterations, right? |
This seems to imply that I should explicitly save the tendencies of free surface in the pickup files. I've been saving the pickups by
but how should I change this so that it saves the free surface tendency? |
Of course. I don’t disagree. I think the warning was trying to communicate that you shouldn’t expect bitwise reproducibility. Perhaps we can rephrase it? |
I’ll have a look into that. There’s a good chance that it’s an old warning because saving a reduced field wasn’t easy then. Perhaps saving the tendencies is straightforward now. |
Yes the question in my mind is more: "why don't we save the tendencies of the free surface displacement?" |
In this case, the model uses a Lines 163 to 178 in e611d00
I think here
we should also check if a property like that exists in the timestepper:
|
Hm, so we only need tendencies for ExplicitFreeSurface -- is that right? |
The follow-up question for @roxyboy is whether bitwise checkpointing succeeds or not. To test this, do something like # control run
simulation1.stop_time = T
run!(simulation1)
# checkpointed run
simulation2.stop_time = T/2
run!(simulation2)
simulation2.stop_time = T
run!(simulation2, pickup=true)
u1 = simulation1.model.velocities.u
u2 = simulation2.model.velocities.u
@show parent(u1) == parent(u2) this can be done at low resolution on a laptop with the setup being developed to test |
I'm testing this now but the time stepping based on the CFL condition changes every time I pick up from the same checkpoint so I'm skeptical that bit-wise reproducibility is being achieved... |
@roxyboy, you shouldn't use a CFL-based timestepping with the default timestepper (AB2) because with a variable timestep AB2 is not valid, and the timestepping reduces to forward Euler, which is much worse than AB2. Try setting a fixed timestep. |
This is very good to know, thank you! But still, (I think) the fact that CFL-based timestepping gives different time steps every time when the run is restarted from a pickup file is not ideal. |
I think you're right that is shows the states are not identical |
@simone-silvestri couldn't we still update the time-step, say, every 100 iterations without doing too much damage to the state? It does not reduce the time-stepping to forward Euler every time step just because we change the time-step once, does it? |
So, it seems that
|
I am unsure how much damage 1 in a 100 timestep can do. I guess it is a matter of experimentation. I gave up trying to use a wizard with AB2 because I found out that in most practical cases (talking only about large scale ocean modeling here...) it is more detrimental than anything, (even on 100 timesteps) it creates velocity extrema that, to be cleaned up, require reducing the time step in the next wizard pass, hence effectively running at a lower timestep that what would be necessary without wizard. For the cases I typically ran, a wizard does not necessarily destabilize the simulation (because the following 99 AB2 steps clean up the errors in the 1 FE step), but it slows down the computation, defeating the purpose of having a wizard. A problem of the wizard limiting to FE is that, typically, you want to set the wizard to the typical CFL limit of AB2 (not the one of FE, which is roughly 50% lower than what we can achieve with AB2), so that one FE step will be by definition outside of the recommended stability limit and create problems. The effect on the simulation is, for sure, on a case-to-case basis, and probably more rigorous testing needs to be done. If a user is willing to try and tune the CFL limit and the update schedule of the wizard, he might find good results; it is something to be aware of, though. Anecdotally, I remember some time ago @sandreza told me that speedyweather used FE at the first time step to restart, and that single FE timestep was enough to make some simulations completely diverge and crash when compared to their counterpart without checkpointing. To be fair, I think those were simulations at the limit of what the numerics could handle. |
Just whipped up an example for a 1D tracer advection solution, probably representative of advection-dominated problems. I am running with RK3 at CFL=0.5 (below the required timestep size, RK3 can handle CFL 0.7). I interject a FE step every 100 timesteps to see how the solution evolves. This is without adaptive timestep, but the velocity is constant (u=1), so the timestep size would not change anyway. The presence of false extrema is quite striking, even at 1 in 100 timesteps evolution.mp4 |
Wow. I wouldn't expect to have such a long-lasting effect... |
This is the same test case with CFL 0.3 and FE steps every 20 iterations. evolution.mp4 |
Very, very helpful! I am actually somewhat surprised, because I believe there is some history that CATKE needs which is not recorded right now, so this is useful to know. I think CATKE's memory terms may prevent bitwise reproducibility (something we are going to try to fix this week), but I do not think it would affect any science conclusions (parameterization or god forbid advection scheme errors like the ones @simone-silvestri is illustrating greatly exceed checkpointing differences, probably). @simone-silvestri that is illuminating --- it highlights the urgency to switch to RK3 for the hydrostatic model ASAP. |
Hi Oceananigans,
I would like to report a warning message that could be confusing.
Continuing the discussion, I am getting the following error when restarting the simulation from a pickup file:
The set up I have is the following:
The model seems to run despite the warning message.
The text was updated successfully, but these errors were encountered: