
Commit 8fe47c7

[Rhi] [bug] Fix the Unified Allocator to no longer return first two allocations as dupes (#8705)

Issue: #

### Brief Summary

The unified allocator returned the same memory address for the first two allocations, so the two allocations always clobbered each other.

# New walkthrough

Looking at the unified allocator code, we can see that when we first allocate a memory chunk, we do not add `size` to `head`, so the next allocation receives the exact same address. Two structs (or similar) then sit at the same place in memory and clobber each other, plausibly leading to a plethora of hard-to-debug crashes.

## High level overview of how the allocator works

The allocator handles two types of request:

- `exclusive`
- `not exclusive`

For exclusive requests:

- a buffer is allocated from the system:
  - https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L74-L75
- the size of the buffer matches the requested bytes (to within alignment bytes)
- a new chunk is created:
  - `chunk.data` is set to the start of this buffer
  - `chunk.head` is too, but it's not really used for exclusive access
  - `chunk.tail` is set to the end of the buffer, but again not really used for exclusive access

Exclusive access requests are thus fairly straightforward.

For non-exclusive requests, it is slightly more complex:

- for the first request, we allocate a much larger buffer than the request - by default 1GB
  - https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L9-L10
- we create a new chunk:
  - `chunk.data` is set to the start of this buffer
  - `chunk.head` is set to the start of the unused space in this buffer
    - it should be set to `chunk.data + size`
    - prior to this PR, it was incorrectly set to `chunk.data`, meaning that the next request would incorrectly return the start of this chunk again
  - then we return `chunk.head`
- for subsequent requests, we look for a chunk that has enough space available (the aligned `head` plus the requested size must not exceed `tail`)
- when we find such a chunk:
  - we add `size` to `head` (to within alignment)
  - we return the old `head` (to within alignment)

## Proposed fix

The proposed fix is to set `head` to `data + size` for newly allocated chunks. Thinking about it, an alternative fix would be to split the function into two parts:

- the first part searches for an existing chunk, or makes a new one
  - it does not return the allocated address
  - it does not update `head`, etc.
- the second part is always executed
  - it updates `head`
  - it returns the old `head`

I don't have a strong opinion on which fix we prefer. The second approach seems mildly cleaner, since it decouples 'finding/creating a chunk' from 'updating the chunk and returning the requested memory pointer'. A simplified sketch of the non-exclusive path, and of the fix, follows below.
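
The following is a minimal, self-contained sketch of the bump-pointer logic described above. It is not the actual Taichi code: `BumpAllocator`, `MemoryChunk` and the 1 MB chunk size are simplified stand-ins. The commented-out line shows the pre-fix behaviour; the line after it shows the fix of advancing `head` past the first allocation in a new chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Simplified stand-in for the allocator's per-chunk bookkeeping.
struct MemoryChunk {
  std::uint8_t *data;  // start of the backing buffer
  std::uint8_t *head;  // start of the unused region
  std::uint8_t *tail;  // one past the end of the backing buffer
};

struct BumpAllocator {
  std::vector<MemoryChunk> chunks;
  static constexpr std::size_t kChunkSize = 1 << 20;  // 1 MB per chunk in this sketch

  void *allocate(std::size_t size, std::size_t alignment) {
    // Reuse path: round head up to the alignment, then check the request fits.
    for (auto &chunk : chunks) {
      std::uintptr_t head = reinterpret_cast<std::uintptr_t>(chunk.head);
      std::uintptr_t ret = head + alignment - 1 - (head + alignment - 1) % alignment;
      if (ret + size <= reinterpret_cast<std::uintptr_t>(chunk.tail)) {
        chunk.head = reinterpret_cast<std::uint8_t *>(ret + size);
        return reinterpret_cast<void *>(ret);
      }
    }
    // No existing chunk fits: allocate a new, much larger backing buffer.
    MemoryChunk chunk;
    chunk.data = static_cast<std::uint8_t *>(std::malloc(kChunkSize));
    // Pre-fix bug: head stayed at data, so the *next* request was handed the
    // same address as this one:
    //   chunk.head = chunk.data;
    chunk.head = chunk.data + size;  // the fix: reserve space for this allocation
    chunk.tail = chunk.data + kChunkSize;
    chunks.push_back(chunk);
    return chunk.data;  // (the sketch never frees its chunks)
  }
};

int main() {
  BumpAllocator allocator;
  void *a = allocator.allocate(1024, 16);
  void *b = allocator.allocate(1024, 16);
  assert(a != b);  // fails if head is not advanced past the first allocation
  return 0;
}
```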

## Low level details

In more detail, and assuming non-exclusive mode:

- let's say the client requests `size` bytes
- we allocate a chunk much larger than that, of `default_allocator_size` bytes
  - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L65-L75
  - the address of this chunk is stored in `ptr`
- we create a `chunk` structure to store information about the chunk we just allocated
  - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L63
  - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L76-L79
  - `ptr` is stored in `chunk.data`
  - `head` is set to `ptr` too, via `chunk.data`
  - `tail` is set to `ptr` plus the allocation size, via `chunk.data`
- we return `ptr`
  - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L85
- we should have added the allocation size to `chunk.head`
- we can look at what happens when we re-use this chunk later, to confirm this

When we re-use a chunk:

- we loop over all allocated chunks, looking for non-exclusive chunks
  - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L39-L45
- we round `head` up to the alignment, store that in `ret`, add the requested size to get the new `head`, and check whether the new `head` still fits within `tail` (a standalone worked example of this round-up appears just before the detailed walkthrough below)
  - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L49-L53
- if the new `head` is less than or equal to `tail`, then we:
  - update `head` (so it is now the old `head` plus the allocation size, adjusted for alignment)
  - return `ret`
  - (and break out of the loop, by virtue of the return)
- otherwise, we ignore this chunk and keep looping over the available chunks
- (if no suitable chunk is found, we allocate a brand new chunk)

# Original Walkthrough

High level summary:
===================

- both the LLVMRuntime and the result_buffer are allocated to the same memory address
- this results in the return error code from running a kernel overwriting the address of snode tree 19
- this results in subsequent access to any field on snode tree 19 crashing Taichi

Reproducing the bug
===================

This bug was initially reproduced in #8569, but knowing what the bug is, we can reproduce it using the following much simpler code:

```python
import taichi as ti

ti.init(arch=ti.arm64, debug=True)

fields = []
for i in range(20):
    fields.append(ti.field(float, shape=()))
    ti.sync()


@ti.kernel
def foo():
    fields[19][None] = 1


foo()
foo()
```

What this code does:

- allocates snode trees 0 through 19, by creating fields indexed 0 through 19 and immediately calling `ti.sync()`, to materialize each snode tree
  - you can optionally print out the snode tree ids, as long as you have a version of master that includes the PR at #8697, to verify this assertion
- following the creation of snode trees 0 through 19, we call a kernel twice
  - the first kernel call runs without issue
  - however, the address of snode tree 19 will be set to 0 following this kernel call, because it is overwritten by the return code of the call
  - the second kernel call then uses the address of snode tree 19 - which is now 0 - to access values from snode tree 19, causing a segmentation fault: `[E 04/30/25 19:00:30.022 3136495] Received signal 11 (Segmentation fault: 11)`
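
Before the detailed walkthrough, here is a standalone worked example of the alignment round-up used in the chunk-reuse path described above. This is an illustration only, not Taichi code; the example numbers are arbitrary.

```cpp
#include <cstdint>
#include <cstdio>

// The allocator's round-up expression, as used in the chunk-reuse path:
//   ret = head + alignment - 1 - (head + alignment - 1) % alignment
static std::uintptr_t align_up(std::uintptr_t head, std::uintptr_t alignment) {
  return head + alignment - 1 - (head + alignment - 1) % alignment;
}

int main() {
  // An already-aligned head stays where it is: 1024 -> 1024.
  std::printf("%llu\n", static_cast<unsigned long long>(align_up(1024, 16)));
  // An unaligned head is rounded up to the next 16-byte boundary: 1025 -> 1040.
  std::printf("%llu\n", static_cast<unsigned long long>(align_up(1025, 16)));
  // The reuse check is then: ret + size <= tail, i.e. the new head still fits.
  return 0;
}
```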

Detailed walkthrough
====================

1. LLVMRuntime and result_buffer are allocated the same memory address

When we first initialize the LLVMRuntime, we:

- allocate a result_buffer from the unified allocator, via the host allocator
  - result_buffer allocated here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/llvm_runtime_executor.cpp#L699-L700
- call runtime_initialize
  - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/llvm_runtime_executor.cpp#L706-L711
  - passing in the result_buffer
  - and the host allocator
- inside runtime_initialize, we:
  - allocate the LLVMRuntime, using the same allocator
    - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L932-L933

Interestingly, the address allocated for the LLVMRuntime memory is identical to the address of the result_buffer memory:

- this is verifiable by printing out the two addresses: over multiple runs, they consistently have the same address as each other (though the exact addresses vary between runs)
- these are both allocated from the exact same allocator
  - if you print out the address of the allocator in each location, the addresses are identical
- and no deallocations take place between the two allocations
- so, how is this possible?
- looking at the unified allocator, there is a concept of 'exclusive'
  - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L32
- if a request for memory is not marked as exclusive, previously allocated buffers can be re-used, and handed out to new requests
  - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L57
- the default is exclusive = false
  - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.h#L31
- therefore, by default, memory chunks can be re-used/returned across multiple requests

Let's first walk through the effects of LLVMRuntime and result_buffer occupying the same space.
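
Before the details, here is a small standalone sketch - not Taichi code - of what it means for two unrelated structures to be handed the same memory block: a write through one silently clobbers the other. The types and values are made up, and the type-punned access is deliberate, to mirror the failure mode described in this walkthrough.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Stand-in for the runtime struct: an array of root pointers is all we need.
struct MockRuntime {
  void *roots[32];
};

int main() {
  // Simulate a buggy allocator handing the same block out twice: one user
  // treats it as a buffer of i64 return values, the other as the runtime.
  void *block = std::malloc(sizeof(MockRuntime));
  std::int64_t *result_buffer = static_cast<std::int64_t *>(block);
  MockRuntime *runtime = static_cast<MockRuntime *>(block);

  // The "runtime" records a (fake) root address for snode tree 19.
  runtime->roots[19] = reinterpret_cast<void *>(0x1234);

  // Writing an error code of 0 into the "result buffer" lands on the same
  // bytes, silently zeroing the recorded root address.
  result_buffer[19] = 0;

  std::printf("roots[19] = %p\n", runtime->roots[19]);  // now a null pointer
  std::free(block);
  return 0;
}
```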

2. The return code of a kernel overwrites the address of snode tree 19

- following a kernel launch, the method runtime_retrieve_and_reset_error_code is run in runtime.cpp
  - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L727-L730
- this method calls `runtime->set_result(taichi_result_buffer_error_id, runtime->error_code);`
- the first parameter is a constant
  - defined here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/inc/constants.h#L21
  - `constexpr std::size_t taichi_max_num_ret_value = 30;`
- `set_result`:
  - is here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L600-L604
  - sets result_buffer[i] to t
    - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L602
  - in this case, i is taichi_max_num_ret_value, which is 30
  - t is the return code; empirically this has a value of 0 in the test cases described above
- i is used to index into an array of i64
  - here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L602
  - therefore each element of the array is 8 bytes
  - therefore, to get the offset of the element which will be set to 0, we multiply the index, 30, by 8
  - thus, we will zero out 8 bytes at byte offset 30 * 8 = 240
- the base address for this offset is result_buffer
  - however, result_buffer has the same address as LLVMRuntime (as discussed in the first section)
  - so we are going to clobber 8 bytes in LLVMRuntime with zeros, at offset 240
- let's now look at where byte offset 240 is in LLVMRuntime
- the LLVMRuntime struct:
  - is here: https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L552-L562
  - there are two PreallocatedMemoryChunks, each of which contains two pointers and a size_t
    - https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L546-L550
    - each pointer is 8 bytes, and size_t is likely 8 bytes, for 24 bytes each, so 48 bytes total
  - host_allocator_type is a pointer to a function -> 8 more bytes
  - assert_failed_type, host_printf_type, host_vsnprintf_type, and Ptr are also all pointers, so 8 bytes each, for a total of 32 bytes
  - now we arrive at roots, which is the array of snode tree root addresses
    - at this point, we are at an offset of 48 + 8 + 32 = 88
    - so our offset into roots will be 240 - 88 = 152
    - each element of roots is also a pointer, so 8 bytes
    - 152 bytes / 8 bytes = 19
- thus, when we write the return code of 0 to result_buffer[30], we clobber the address of snode tree 19 with 0
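
The offset arithmetic above can be double-checked at compile time with a mock layout. This is only a sketch of the layout as described in this section: the member names and the roots array size are stand-ins rather than the real LLVMRuntime fields, and it assumes 8-byte pointers and an 8-byte `std::size_t`, as the walkthrough does.

```cpp
#include <cstddef>

// Mock of the layout described above - NOT the real LLVMRuntime definition.
struct MockPreallocatedMemoryChunk {
  void *ptr_a;       // 8 bytes
  void *ptr_b;       // 8 bytes
  std::size_t size;  // 8 bytes -> 24 bytes per chunk record
};

struct MockLLVMRuntime {
  MockPreallocatedMemoryChunk chunk0;  // offset 0
  MockPreallocatedMemoryChunk chunk1;  // offset 24 -> 48 bytes in total
  void *host_allocator;                // offset 48
  void *assert_failed;                 // offset 56
  void *host_printf;                   // offset 64
  void *host_vsnprintf;                // offset 72
  void *ptr_member;                    // offset 80
  void *roots[32];                     // offset 88; roots[i] at 88 + 8 * i
};

// roots starts at byte offset 48 + 8 + 32 = 88 ...
static_assert(offsetof(MockLLVMRuntime, roots) == 88, "layout assumption broken");
// ... so result_buffer[30], i.e. byte offset 30 * 8 = 240, lands on roots[19].
static_assert(offsetof(MockLLVMRuntime, roots) + 19 * sizeof(void *) == 240,
              "the error-code write clobbers roots[19]");

int main() {
  return 0;
}
```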

3. Kernel access of snode tree 19

- when a kernel is initialized, and that kernel uses a field that is allocated on snode tree 19:
  - the lowered kernel calls `%10 = call ptr @LLVMRuntime_get_roots(ptr %9, i32 19)` (the exact statement index varies depending on the kernel, of course)
  - this will return 0
  - then, when we access a field based on offset 0, we crash

~Proposed fix~
==============

~We need to ensure that the allocator does not allocate the same memory block twice, both to the LLVMRuntime and to the result_buffer~

- ~my proposed fix is to expose the `exclusive` parameter to the LLVMRuntime~
- ~and to set this parameter to `true`, when used from the runtime~

~Questions~
===========

~A question in my mind is: why would we ever want exclusive to not be true? And by default, it is in fact set to false. I feel like there is some knowledge or insight that I am missing.~

1 parent 3b9bdd8 commit 8fe47c7

File tree

4 files changed: +44 −5 lines changed


cmake/TaichiTests.cmake

Lines changed: 1 addition & 0 deletions

```diff
@@ -17,6 +17,7 @@ file(GLOB_RECURSE TAICHI_TESTS_SOURCE
         "tests/cpp/common/*.cpp"
         "tests/cpp/ir/*.cpp"
         "tests/cpp/program/*.cpp"
+        "tests/cpp/rhi/common/*.cpp"
         "tests/cpp/struct/*.cpp"
         "tests/cpp/transforms/*.cpp"
         "tests/cpp/offline_cache/*.cpp")
```

taichi/rhi/common/unified_allocator.cpp

Lines changed: 4 additions & 4 deletions

```diff
@@ -6,7 +6,7 @@
 
 namespace taichi::lang {
 
-const std::size_t UnifiedAllocator::default_allocator_size =
+std::size_t UnifiedAllocator::default_allocator_size =
     1 << 30;  // 1 GB per allocator
 
 template <typename T>
@@ -35,7 +35,6 @@ void *UnifiedAllocator::allocate(std::size_t size,
 
   // Note: put mutex on MemoryPool instead of Allocator, since Allocators are
   // transparent to user code
-  std::size_t allocation_size = size;
   if (!chunks_.empty() && !exclusive) {
     // Search for a non-exclusive chunk that has enough space
     for (size_t chunk_id = 0; chunk_id < chunks_.size(); chunk_id++) {
@@ -49,7 +48,7 @@ void *UnifiedAllocator::allocate(std::size_t size,
       auto ret = head + alignment - 1 - (head + alignment - 1) % alignment;
       TI_TRACE("UM [data={}] allocate() request={} remain={}", (intptr_t)data,
                size, (tail - head));
-      head = ret + allocation_size;
+      head = ret + size;
       if (head <= tail) {
         // success
         TI_ASSERT(ret % alignment == 0);
@@ -62,6 +61,7 @@ void *UnifiedAllocator::allocate(std::size_t size,
   // Allocate a new chunk
   MemoryChunk chunk;
 
+  std::size_t allocation_size = size;
   if (!exclusive) {
     // Do not allocate large memory chunks for "exclusive" allocation
     // to increate memory & performance efficiency
@@ -74,7 +74,7 @@ void *UnifiedAllocator::allocate(std::size_t size,
   void *ptr =
       HostMemoryPool::get_instance().allocate_raw_memory(allocation_size);
   chunk.data = ptr;
-  chunk.head = chunk.data;
+  chunk.head = (void *)((std::size_t)chunk.data + size);
   chunk.tail = (void *)((std::size_t)chunk.head + allocation_size);
   chunk.is_exclusive = exclusive;
 
```

taichi/rhi/common/unified_allocator.h

Lines changed: 2 additions & 1 deletion

```diff
@@ -22,7 +22,7 @@ class UnifiedAllocator {
   };
 
  private:
-  static const std::size_t default_allocator_size;
+  static std::size_t default_allocator_size;
 
   UnifiedAllocator();
 
@@ -35,6 +35,7 @@ class UnifiedAllocator {
   std::vector<MemoryChunk> chunks_;
 
   friend class HostMemoryPool;
+  friend class HostMemoryPoolTestHelper;
 };
 
 }  // namespace taichi::lang
```

tests/cpp/rhi/common/ (new test file)

Lines changed: 37 additions & 0 deletions

```diff
@@ -0,0 +1,37 @@
+#include "gtest/gtest.h"
+
+#include "taichi/rhi/common/host_memory_pool.h"
+
+namespace taichi::lang {
+
+class HostMemoryPoolTestHelper {
+ public:
+  static void setDefaultAllocatorSize(std::size_t size) {
+    UnifiedAllocator::default_allocator_size = size;
+  }
+  static size_t getDefaultAllocatorSize() {
+    return UnifiedAllocator::default_allocator_size;
+  }
+};
+
+TEST(HostMemoryPool, AllocateMemory) {
+  auto oldAllocatorSize = HostMemoryPoolTestHelper::getDefaultAllocatorSize();
+  HostMemoryPoolTestHelper::setDefaultAllocatorSize(102400);  // 100KB
+
+  HostMemoryPool pool;
+
+  void *ptr1 = pool.allocate(1024, 16);
+  void *ptr2 = pool.allocate(1024, 16);
+  void *ptr3 = pool.allocate(1024, 16);
+
+  EXPECT_NE(ptr1, ptr2);
+  EXPECT_NE(ptr1, ptr3);
+  EXPECT_NE(ptr2, ptr3);
+
+  EXPECT_EQ((std::size_t)ptr2, (std::size_t)ptr1 + 1024);
+  EXPECT_EQ((std::size_t)ptr3, (std::size_t)ptr2 + 1024);
+
+  HostMemoryPoolTestHelper::setDefaultAllocatorSize(oldAllocatorSize);
+}
+
+}  // namespace taichi::lang
```
