Skip to content

FUNCTION FLUSH re-create lua VM, fix flush not gc, fix flush async + load crash #1826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: unstable
Choose a base branch
from

Conversation

enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Mar 6, 2025

There will be two issues in this test:

test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}

The first one is the test output, we can see that after executing FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:

flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576

The second one is there is a crash when loading the functions during the async
flush:

=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------

The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc,
it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH.

@enjoy-binbin enjoy-binbin requested a review from rjd15372 March 6, 2025 11:58
…load crash

There will be two issues in this test:
```
test {FUNCTION - test function flush} {
    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush sync
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush sync, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    set before_flush_memory [s used_memory_vm_functions]
    r function flush async
    set after_flush_memory [s used_memory_vm_functions]
    puts "flush async, before_flush_memory: $before_flush_memory, after_flush_memory: $after_flush_memory"

    for {set i 0} {$i < 10000} {incr i} {
        r function load [get_function_code LUA test_$i test_$i {return 'hello'}]
    }
    puts "Test done"
}
```

The first one is the test output, we can see that after executing FUNCTION FLUSH,
used_memory_vm_functions has not changed at all:
```
flush sync, before_flush_memory: 2962432, after_flush_memory: 2962432
flush async, before_flush_memory: 4504576, after_flush_memory: 4504576`
```

The second one is there is a crash when loading the functions during the async
flush:
```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
 # valkey 255.255.255 crashed by signal: 11, si_code: 2
 # Accessing address: 0xe0429b7100000a3c
 # Crashed running the instruction at: 0x102e0b09c

------ STACK TRACE ------
EIP:
0   valkey-server                       0x0000000102e0b09c luaH_getstr + 52

Backtrace:
0   libsystem_platform.dylib            0x000000018b066584 _sigtramp + 56
1   valkey-server                       0x0000000102e01054 luaD_precall + 96
2   valkey-server                       0x0000000102e01b10 luaD_call + 104
3   valkey-server                       0x0000000102e00d1c luaD_rawrunprotected + 76
4   valkey-server                       0x0000000102e01e3c luaD_pcall + 60
5   valkey-server                       0x0000000102dfc630 lua_pcall + 300
6   valkey-server                       0x0000000102f77770 luaEngineCompileCode + 708
7   valkey-server                       0x0000000102f71f50 scriptingEngineCallCompileCode + 104
8   valkey-server                       0x0000000102f700b0 functionsCreateWithLibraryCtx + 2088
9   valkey-server                       0x0000000102f70898 functionLoadCommand + 312
10  valkey-server                       0x0000000102e3978c call + 416
11  valkey-server                       0x0000000102e3b5b8 processCommand + 3340
12  valkey-server                       0x0000000102e563cc processInputBuffer + 520
13  valkey-server                       0x0000000102e55808 readQueryFromClient + 92
14  valkey-server                       0x0000000102f696e0 connSocketEventHandler + 180
15  valkey-server                       0x0000000102e20480 aeProcessEvents + 372
16  valkey-server                       0x0000000102e4aad0 main + 26412
17  dyld                                0x000000018acab154 start + 2476

------ STACK TRACE DONE ------
```

The reason is that, in the old implementation (introduced in 7.0), FUNCTION FLUSH
use lua_unref to remove the script from lua VM. lua_unref does not trigger the gc,
it causes us to not be able to effectively reclaim memory after the FUNCTION FLUSH.

The other issue is that, since we don't re-create the lua VM in FUNCTION FLUSH,
loading the functions during a FUNCTION FLUSH ASYNC will result a crash because
lua engine state is not thread-safe.

The correct solution is to re-create a new Lua VM to use, just like SCRIPT FLUSH.

Signed-off-by: Binbin <[email protected]>
@enjoy-binbin enjoy-binbin force-pushed the function_flush_async branch from 3454da4 to ae694e0 Compare March 6, 2025 11:59
@enjoy-binbin
Copy link
Member Author

enjoy-binbin commented Mar 6, 2025

This problem has existed since 7.0, which means it affects our 7.2 and 8.0. So should we backport it? This code cannot be backported due to the refactoring related to the script module. If we want to backport it, i can re-write the code based on the old branchs.

Copy link

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 97.46835% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.04%. Comparing base (bb944c7) to head (a001a8d).
Report is 12 commits behind head on unstable.

Files with missing lines Patch % Lines
src/functions.c 96.87% 1 Missing ⚠️
src/replication.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1826      +/-   ##
============================================
- Coverage     71.04%   71.04%   -0.01%     
============================================
  Files           123      123              
  Lines         65665    65683      +18     
============================================
+ Hits          46651    46662      +11     
- Misses        19014    19021       +7     
Files with missing lines Coverage Δ
src/db.c 89.49% <100.00%> (ø)
src/eval.c 88.10% <100.00%> (ø)
src/lazyfree.c 92.41% <100.00%> (+6.30%) ⬆️
src/lua/engine_lua.c 88.05% <100.00%> (+0.39%) ⬆️
src/lua/function_lua.c 98.23% <100.00%> (-0.46%) ⬇️
src/scripting_engine.c 75.86% <100.00%> (ø)
src/server.h 100.00% <ø> (ø)
src/functions.c 94.52% <96.87%> (+0.20%) ⬆️
src/replication.c 87.18% <0.00%> (+0.08%) ⬆️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@rjd15372
Copy link
Member

rjd15372 commented Mar 6, 2025

@enjoy-binbin thanks for fixing these bugs! I'll do a review later today.

Copy link
Member

@rjd15372 rjd15372 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the problem of the struct used to represent the compiled function that I mentioned in the inline comment, everything else looks good.

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 10, 2025
@madolson madolson moved this to In Progress in Valkey 8.1 Mar 17, 2025
@ranshid ranshid moved this to To be backported in Valkey 8.0 Mar 18, 2025
@ranshid ranshid moved this from In Progress to To be backported in Valkey 8.1 Mar 18, 2025
@ranshid ranshid moved this from To be backported to In Progress in Valkey 8.1 Mar 18, 2025
@ranshid ranshid moved this to To be backported in Valkey 7.2 Mar 18, 2025
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Mar 19, 2025
void functionsLibCtxFree(functionsLibCtx *functions_lib_ctx) {
functionsLibCtxClear(functions_lib_ctx, NULL);
void functionsLibCtxFree(functionsLibCtx *functions_lib_ctx, void(callback)(dict *), list *engine_callbacks) {
functionsLibCtxClear(functions_lib_ctx, callback);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is clear needed if the data structure (including its dictionaries) is already being released and freed?"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this was here before the change, is it actually needed or just redundant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before, i guess it is a redundant when introduced it. And now we can pass the callback to it.

curr_functions_lib_ctx = functionsLibCtxCreate();
freeFunctionsAsync(old_l_ctx);
list *engine_callbacks = listCreate();
scriptingEngineManagerForEachEngine(resetEngineFunctionEnvCallback, engine_callbacks);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetEngineFunctionEnvCallback is misleading, it doesn't reset or act as a callback. It collects reset callbacks. Suggest renaming to collectEngineResetCallbacks for clarity.

Copy link
Member Author

@enjoy-binbin enjoy-binbin Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is a bit misleading, i copy the name from resetEngineEvalEnvCallback. @rjd15372 thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the name is not the best because the function behaves differently whether the reset is async or sync. In sync mode the resetEngineEvalEnvCallback will actually reset the lua environment, while in async mode it will just collect callbacks for reseting lua in the background.

So calling it just collectEngineResetCallbacks is also not accurate in sync mode.

Maybe call it resetEngineOrCollectResetCallbacks?

while ((node = listNext(iter)) != NULL) {
callableLazyEnvReset *engine_callback = listNodeValue(node);
if (engine_callback != NULL) {
engine_callback->engineLazyEnvResetCallback(engine_callback->context);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this callback responsible for lua-gc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it actually reset the lua env

     * Callback function used for resetting the EVAL/FUNCTION context implemented by an
     * engine. This callback will be called by a background thread when it's
     * ready for resetting the context.
static void resetLuaContext(void *context) {
    lua_State *lua = context;
    lua_gc(lua, LUA_GCCOLLECT, 0);
    lua_close(lua);

#if !defined(USE_LIBC)
    /* The lua interpreter may hold a lot of memory internally, and lua is
     * using libc. libc may take a bit longer to return the memory to the OS,
     * so after lua_close, we call malloc_trim try to purge it earlier.
     *
     * We do that only when the server itself does not use libc. When Lua and the server
     * use different allocators, one won't use the fragmentation holes of the
     * other, and released memory can take a long time until it is returned to
     * the OS. */
    zlibc_trim();
#endif
}

if (async) {
functionsLibCtx *old_l_ctx = curr_functions_lib_ctx;
curr_functions_lib_ctx = functionsLibCtxCreate();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering why the engine is crashing if new curr_functions_lib_ctx synchronously created ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't call lua_close before, both old ctx and the new ctx are using the same lua vm, one ctx is deleting the function in the bio thread and the other one ctx is loading the functions in the main thread, and look like lua vm is not thread-safe so it will crash.

@madolson madolson self-assigned this Apr 7, 2025
@zuiderkwast zuiderkwast moved this from To be backported to In Progress in Valkey 7.2 Apr 15, 2025
@zuiderkwast zuiderkwast moved this from To be backported to In Progress in Valkey 8.0 Apr 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: In Progress
Status: In Progress
Status: In Progress
Development

Successfully merging this pull request may close these issues.

4 participants