feat: eliminate unnecessary memory allocations #2892


Merged 12 commits into tracel-ai:main on Apr 8, 2025

Conversation

@BjornTheProgrammer (Contributor) commented Mar 11, 2025

Pull Request Template

Checklist

  • Confirmed that the `run-checks all` script has been executed.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

This PR builds off of @HerrMuellerluedenscheid's fantastic work in pull request #2881, although it has a slightly different approach.
#2871

Changes

This PR makes it so that args are passed back to the Recorder when there is an error, removing the need to clone the data unnecessarily. Additionally, this PR adds the ability to use different LoadArgs, at least for the BinBytesRecorder.

Testing

Compiled all of the examples, and added some additional tests for other LoadArgs such as `&'static [u8]`.

@BjornTheProgrammer marked this pull request as draft on March 11, 2025 01:19
codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.90%. Comparing base (1965796) to head (da9740d).
Report is 1 commit behind head on main.

Files with missing lines                     Patch %   Lines
crates/burn-import/src/pytorch/recorder.rs   0.00%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2892      +/-   ##
==========================================
- Coverage   80.90%   80.90%   -0.01%     
==========================================
  Files         872      872              
  Lines      121039   121066      +27     
==========================================
+ Hits        97932    97953      +21     
- Misses      23107    23113       +6     

@BjornTheProgrammer force-pushed the main branch 2 times, most recently from 10ebf1b to e24eb4c, on March 11, 2025 04:12
@BjornTheProgrammer (Contributor, Author)

After hours of testing to figure out how much memory this PR saves, the results are inconclusive; I don't believe the .clone()s were the cause of the large memory usage. From my testing, sine.onnx actually required about 2 kB more heap allocation than the original implementation, and my own custom model required about 1 kB less. This could easily be variance caused by compilation differences.

This doesn't make any sense to me unless the real issue with the model not fitting is actually the states or something similar. @HerrMuellerluedenscheid, I would appreciate it greatly if you could try out this fork on the esp32.

Does anyone else have any ideas what could be the actual issue? Or something else that I'm missing?

@HerrMuellerluedenscheid

I quickly gave it a try running the squeezenet example using the changes from your main branch on an esp32. It did not work out of the box, as the API changed and the example has to be modified. As soon as I have a little time I will dive deeper into that. But I think this is a very promising approach 👍

@BjornTheProgrammer force-pushed the main branch 2 times, most recently from f2b5fd5 to 0bbde7b, on March 11, 2025 20:08
@HerrMuellerluedenscheid

After your last force push, @BjornTheProgrammer, I tried again running squeezenet on the esp32. It compiles, but panics on a memory allocation of a stunning 700 MB, apparently rooted in burn-tensor/src/tensor/bytes.rs:140. Strange. I will try to do some debugging ASAP.

memory allocation of 758360262 bytes failed

Backtrace:

0x42040a27
0x42040a27 - __rdl_oom
    at /Users/mariuskriegerowski/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/alloc.rs:441
0x4205de64
0x4205de64 - __rust_alloc_error_handler
    at ??:??
0x420408a0
0x420408a0 - alloc::alloc::handle_alloc_error::rt_error
    at /Users/mariuskriegerowski/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/alloc.rs:406
0x42040894
0x42040894 - alloc::alloc::handle_alloc_error
    at /Users/mariuskriegerowski/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/alloc.rs:412
0x4204a11d
0x4204a11d - burn_tensor::tensor::bytes::buffer_alloc::{{closure}}
    at /Users/mariuskriegerowski/src/rust/burn/crates/burn-tensor/src/tensor/bytes.rs:203
0x4203a4e7
0x4203a4e7 - burn_tensor::tensor::bytes::Allocation::new
    at /Users/mariuskriegerowski/src/rust/burn/crates/burn-tensor/src/tensor/bytes.rs:140
0x42028048
0x42028048 - <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
    at /Users/mariuskriegerowski/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.218/src/de/mod.rs:800


@BjornTheProgrammer (Contributor, Author)

700 MB?! Something is really wrong then. I'll have to debug it alongside you. What do the allocations look like on v0.16?

@BjornTheProgrammer (Contributor, Author)

To be honest, after triple-checking my changes, I think this is probably not an issue from this PR. I can't see anything that would possibly allocate more memory other than the struct used to pass back the args, and even then it would only be a couple of bytes at a time.

Maybe it has something to do with how burn uses the loaded model.

I should mention that in my testing the model I was using was 4.3 kB, and it started to work without panicking after allowing around 40 kB to be allocated on the heap. That's a 1:10 ratio. It was the same for both my changes and v0.16.

@HerrMuellerluedenscheid

I'm also pretty sure the problem is rooted elsewhere. I will try to evaluate the memory footprint with a smaller example ASAP.

@ivila (Contributor) commented Mar 19, 2025

Just a suggestion—the changes can be easily simplified and reduced to just a few lines:

  1. Remove Clone from the trait bound of Recorder.
  2. Modify the load_item method in Recorder from load_item<I>(&self, args: Self::LoadArgs) to load_item<I>(&self, args: &Self::LoadArgs), allowing you to remove the extra RecorderErrorWithParameter error type.
  3. Set the default type of the trait bounds in BytesRecorder to Vec<u8>, which eliminates many duplicate changes.

For more details, you can refer to: ivila/burn@main...ivila:burn:feat_memory_zc

You can ignore it if you think it is unnecessary.

@BjornTheProgrammer (Contributor, Author)

Thank you for the suggestions! I'll be implementing this and some other changes to this PR soon!

@BjornTheProgrammer (Contributor, Author)

I think this is ready for a merge. I'm not super happy that it doesn't reduce memory usage much, but I think other approaches in different PRs would be more useful for that.

@laggui (Member) left a comment

> I'm not super happy with how this doesn't reduce memory usage much

In your case, the difference is essentially going from the bytes record to a vec when .to_vec() was called? Might not be a huge difference if the record is small. But still a nice little improvement.

I have a couple of minor comments 🙂

Also, the CI failure should be fixed if you merge main into your branch!

@BjornTheProgrammer force-pushed the main branch 3 times, most recently from 05daaaf to 897e528, on April 8, 2025 09:10
@BjornTheProgrammer (Contributor, Author)

> In your case, the difference is essentially going from the bytes record to a vec when .to_vec() was called? Might not be a huge difference if the record is small. But still a nice little improvement.

That's true, but the larger issue is that on embedded devices, fluctuations in RAM usage don't matter as much as the peak RAM used. I think my next PR will create some tooling to inspect memory usage more in-depth to find areas for optimization.

@BjornTheProgrammer (Contributor, Author) commented Apr 8, 2025

I don't think it's related to this PR, but some tests are failing when running ./run-checks.sh all, for some reason, on Pop!_OS Linux with an x86_64 Intel Core Ultra 9 185H CPU.

---- tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine stdout ----

thread 'tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine' panicked at crates/burn-wgpu/src/lib.rs:129:5:
Tensors are not eq:
  => Position 4: -127 != -128
  => Position 6: 7 != 6
  => Position 7: 39 != 38
  => Position 8: 39 != 38
  => Position 9: 7 != 6
1 more errors...
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: burn_tensor::tensor::data::TensorData::assert_eq_elem
   3: burn_tensor::tensor::data::TensorData::assert_eq
   4: burn_wgpu::tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine
   5: burn_wgpu::tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine::{{closure}}
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder stdout ----

thread 'tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder' panicked at crates/burn-wgpu/src/lib.rs:129:5:
Tensors are not approx eq:
  => Position 4: 0.5034000277519226 != 0 | difference 0.5034000277519226 > tolerance 0.0010000000000000002
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: burn_tensor::tensor::data::TensorData::assert_approx_eq_diff
   3: burn_tensor::tensor::data::TensorData::assert_approx_eq
   4: burn_wgpu::tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder
   5: burn_wgpu::tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder::{{closure}}
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder stdout ----

thread 'tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder' panicked at crates/burn-wgpu/src/lib.rs:129:5:
Tensors are not approx eq:
  => Position 4: 0.5034000277519226 != 0 | difference 0.5034000277519226 > tolerance 0.0010000000000000002
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: burn_tensor::tensor::data::TensorData::assert_approx_eq_diff
   3: burn_tensor::tensor::data::TensorData::assert_approx_eq
   4: burn_wgpu::tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder
   5: burn_wgpu::tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder::{{closure}}
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine
    tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder
    tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder

test result: FAILED. 2576 passed; 3 failed; 16 ignored; 0 measured; 0 filtered out; finished in 102.71s

error: test failed, to rerun pass `-p burn-wgpu --lib`
Error: Workspace Unit Tests failed

@laggui (Member) commented Apr 8, 2025

Ahh yes, the remainder issue. IIRC this has been broken at the driver level for some time. The quantization test failure is probably due to a precision or rounding issue. But it's totally unrelated to this PR.

@laggui (Member) left a comment

LGTM! Thanks for addressing the changes 🙂

Cleaned up some minor left-overs

@laggui merged commit 32f474d into tracel-ai:main on Apr 8, 2025
11 checks passed