-
Notifications
You must be signed in to change notification settings - Fork 767
FUNCTION FLUSH re-create lua VM, fix flush not gc, fix flush async + load crash #1826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
FUNCTION FLUSH re-create lua VM, fix flush not gc, fix flush async + load crash #1826
Conversation
…load crash There will be two issues in this test: ``` test {FUNCTION - test function flush} { for {set i 0} {$i < 10000} {incr i} { r function load [get_function_code LUA test_$i test_$i {return 'hello'}] } set before_flush_memory [s used_memory_vm_functions] r function flush sync set after_flush_memory [s used_memory_vm_functions] puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory" for {set i 0} {$i < 10000} {incr i} { r function load [get_function_code LUA test_$i test_$i {return 'hello'}] } set before_flush_memory [s used_memory_vm_functions] r function flush async set after_flush_memory [s used_memory_vm_functions] puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory" for {set i 0} {$i < 10000} {incr i} { r function load [get_function_code LUA test_$i test_$i {return 'hello'}] } puts "Test done" } ``` The first one is the test output, we can see that after executing FUNCTION FLUSH, used_memory_vm_functions has not changed at all: ``` flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432 flush async, before_flush_memory: 4504576, after_flush_memory: 4504576` ``` The second one is there is a crash when loading the functions during the async flush: ``` === VALKEY BUG REPORT START: Cut & paste starting from here === # valkey 255.255.255 crashed by signal: 11, si_code: 2 # Accessing address: 0xe0429b7100000a3c # Crashed running the instruction at: 0x102e0b09c ------ STACK TRACE ------ EIP: 0 valkey-server 0x0000000102e0b09c luaH_getstr + 52 Backtrace: 0 libsystem_platform.dylib 0x000000018b066584 _sigtramp + 56 1 valkey-server 0x0000000102e01054 luaD_precall + 96 2 valkey-server 0x0000000102e01b10 luaD_call + 104 3 valkey-server 0x0000000102e00d1c luaD_rawrunprotected + 76 4 valkey-server 0x0000000102e01e3c luaD_pcall + 60 5 valkey-server 0x0000000102dfc630 lua_pcall + 300 6 valkey-server 0x0000000102f77770 luaEngineCompileCode + 708 7 valkey-server 0x0000000102f71f50 scriptingEngineCallCompileCode + 104 8 valkey-server 0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088 9 valkey-server 0x0000000102f70898 functionLoadCommand + 312 10 valkey-server 0x0000000102e3978c call + 416 11 valkey-server 0x0000000102e3b5b8 processCommand + 3340 12 valkey-server 0x0000000102e563cc processInputBuffer + 520 13 valkey-server 0x0000000102e55808 readQueryFromClient + 92 14 valkey-server 0x0000000102f696e0 connSocketEventHandler + 180 15 valkey-server 0x0000000102e20480 aeProcessEvents + 372 16 valkey-server 0x0000000102e4aad0 main + 26412 17 dyld 0x000000018acab154 start + 2476 ------ STACK TRACE DONE ------ ``` The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc, it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH. The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH, loading the functions during a FUNCTION FLUSH ASYNC will result a crash because lua engine state is not thread-safe. The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH. Signed-off-by: Binbin <[email protected]>
3454da4
to
ae694e0
Compare
This problem has existed since 7.0, which means it affects our 7.2 and 8.0. So should we backport it? This code cannot be backported due to the refactoring related to the script module. If we want to backport it, i can re-write the code based on the old branchs. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1826 +/- ##
============================================
- Coverage 71.04% 71.04% -0.01%
============================================
Files 123 123
Lines 65665 65683 +18
============================================
+ Hits 46651 46662 +11
- Misses 19014 19021 +7
🚀 New features to boost your workflow:
|
@enjoy-binbin thanks for fixing these bugs! I'll do a review later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the problem of the struct used to represent the compiled function that I mentioned in the inline comment, everything else looks good.
Signed-off-by: Binbin <[email protected]>
void functionsLibCtxFree(functionsLibCtx *functions_lib_ctx) { | ||
functionsLibCtxClear(functions_lib_ctx, NULL); | ||
void functionsLibCtxFree(functionsLibCtx *functions_lib_ctx, void(callback)(dict *), list *engine_callbacks) { | ||
functionsLibCtxClear(functions_lib_ctx, callback); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is clear needed if the data structure (including its dictionaries) is already being released and freed?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this was here before the change, is it actually needed or just redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before, i guess it is a redundant when introduced it. And now we can pass the callback to it.
curr_functions_lib_ctx = functionsLibCtxCreate(); | ||
freeFunctionsAsync(old_l_ctx); | ||
list *engine_callbacks = listCreate(); | ||
scriptingEngineManagerForEachEngine(resetEngineFunctionEnvCallback, engine_callbacks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resetEngineFunctionEnvCallback
is misleading, it doesn't reset or act as a callback. It collects reset callbacks. Suggest renaming to collectEngineResetCallbacks
for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is a bit misleading, i copy the name from resetEngineEvalEnvCallback
. @rjd15372 thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the name is not the best because the function behaves differently whether the reset is async or sync. In sync mode the resetEngineEvalEnvCallback
will actually reset the lua environment, while in async mode it will just collect callbacks for reseting lua in the background.
So calling it just collectEngineResetCallbacks
is also not accurate in sync mode.
Maybe call it resetEngineOrCollectResetCallbacks
?
while ((node = listNext(iter)) != NULL) { | ||
callableLazyEnvReset *engine_callback = listNodeValue(node); | ||
if (engine_callback != NULL) { | ||
engine_callback->engineLazyEnvResetCallback(engine_callback->context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this callback responsible for lua-gc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it actually reset the lua env
* Callback function used for resetting the EVAL/FUNCTION context implemented by an
* engine. This callback will be called by a background thread when it's
* ready for resetting the context.
static void resetLuaContext(void *context) {
lua_State *lua = context;
lua_gc(lua, LUA_GCCOLLECT, 0);
lua_close(lua);
#if !defined(USE_LIBC)
/* The lua interpreter may hold a lot of memory internally, and lua is
* using libc. libc may take a bit longer to return the memory to the OS,
* so after lua_close, we call malloc_trim try to purge it earlier.
*
* We do that only when the server itself does not use libc. When Lua and the server
* use different allocators, one won't use the fragmentation holes of the
* other, and released memory can take a long time until it is returned to
* the OS. */
zlibc_trim();
#endif
}
if (async) { | ||
functionsLibCtx *old_l_ctx = curr_functions_lib_ctx; | ||
curr_functions_lib_ctx = functionsLibCtxCreate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering why the engine is crashing if new curr_functions_lib_ctx
synchronously created ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't call lua_close before, both old ctx and the new ctx are using the same lua vm, one ctx is deleting the function in the bio thread and the other one ctx is loading the functions in the main thread, and look like lua vm is not thread-safe so it will crash.
There will be two issues in this test:
The first one is the test output, we can see that after executing FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
The second one is there is a crash when loading the functions during the async
flush:
The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc,
it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH.
The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash because
lua engine state is not thread-safe.
The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH.