feat: eliminate unnecessary memory allocations #2892


Merged 12 commits into tracel-ai:main on Apr 8, 2025

Conversation

@BjornTheProgrammer (Contributor) commented Mar 11, 2025

Pull Request Template

Checklist

  • Confirmed that the `run-checks all` script has been executed.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

This PR builds off of @HerrMuellerluedenscheid's fantastic work in pull request #2881, although it has a slightly different approach.
#2871

Changes

This PR makes it so that args are passed back to the Recorder when there is an error, removing the need to clone the data unnecessarily. Additionally, this PR adds the ability to use different LoadArgs, at least for the BinBytesRecorder.

Testing

Compiled all of the examples, and added some additional tests for other LoadArgs such as `&'static [u8]`.

@BjornTheProgrammer marked this pull request as draft on March 11, 2025 01:19
codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 90.90909% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.90%. Comparing base (1965796) to head (da9740d).
Report is 1 commit behind head on main.

Files with missing lines                     Patch %   Lines
crates/burn-import/src/pytorch/recorder.rs   0.00%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2892      +/-   ##
==========================================
- Coverage   80.90%   80.90%   -0.01%     
==========================================
  Files         872      872              
  Lines      121039   121066      +27     
==========================================
+ Hits        97932    97953      +21     
- Misses      23107    23113       +6     

@BjornTheProgrammer force-pushed the main branch 2 times, most recently from 10ebf1b to e24eb4c, on March 11, 2025 04:12
@BjornTheProgrammer (Contributor, Author)

After hours of testing to figure out how much memory this PR saves, the results are inconclusive; I don't believe the .clone()s were the cause of the large memory usage. From my testing, sine.onnx actually required about 2 kB more heap allocation than the original implementation, and my own custom model required about 1 kB less. This could easily be variance caused by compilation differences.

This doesn't make any sense to me unless the real issue with the model not fitting is actually the states or something similar. @HerrMuellerluedenscheid, I would appreciate it greatly if you could try out this fork on the esp32.

Does anyone else have any ideas what could be the actual issue? Or something else that I'm missing?

@HerrMuellerluedenscheid

I quickly gave it a try running the squeezenet example using the changes from your main branch on an esp32. It did not work out of the box, as the API changed and the example has to be modified. As soon as I have a little time I will dive deeper into that. But I think this is a very promising approach 👍

@BjornTheProgrammer force-pushed the main branch 2 times, most recently from f2b5fd5 to 0bbde7b, on March 11, 2025 20:08
@HerrMuellerluedenscheid

After your last force push, @BjornTheProgrammer, I tried again running squeezenet on the esp32. It compiles, but panics on a memory allocation of a stunning 700 MB, apparently rooted in burn-tensor/src/tensor/bytes.rs:140. Strange. I will try to do some debugging ASAP.

memory allocation of 758360262 bytes failed

Backtrace:

0x42040a27
0x42040a27 - __rdl_oom
    at /Users/mariuskriegerowski/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/alloc.rs:441
0x4205de64
0x4205de64 - __rust_alloc_error_handler
    at ??:??
0x420408a0
0x420408a0 - alloc::alloc::handle_alloc_error::rt_error
    at /Users/mariuskriegerowski/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/alloc.rs:406
0x42040894
0x42040894 - alloc::alloc::handle_alloc_error
    at /Users/mariuskriegerowski/.rustup/toolchains/esp/lib/rustlib/src/rust/library/alloc/src/alloc.rs:412
0x4204a11d
0x4204a11d - burn_tensor::tensor::bytes::buffer_alloc::{{closure}}
    at /Users/mariuskriegerowski/src/rust/burn/crates/burn-tensor/src/tensor/bytes.rs:203
0x4203a4e7
0x4203a4e7 - burn_tensor::tensor::bytes::Allocation::new
    at /Users/mariuskriegerowski/src/rust/burn/crates/burn-tensor/src/tensor/bytes.rs:140
0x42028048
0x42028048 - <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
    at /Users/mariuskriegerowski/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.218/src/de/mod.rs:800


@BjornTheProgrammer (Contributor, Author)

700 MB?! Something is really wrong then. I'll have to debug it alongside you. What do the allocations look like on v0.16?

@BjornTheProgrammer (Contributor, Author)

To be honest, after triple-checking my changes, I think this is probably not an issue from this PR. I can't see anything that would possibly allocate more memory other than the struct used to pass back the args, and even then it would only be a couple of bytes at a time.

Maybe it has something to do with how burn uses the loaded model.

I should mention that in my testing the model I was using was 4.3 kB, and it started to work without panicking after allowing around 40 kB to be allocated on the heap. That's a 1:10 ratio. It was the same for both my changes and v0.16.

@HerrMuellerluedenscheid

I'm also pretty sure the problem is rooted elsewhere. I will try to evaluate the memory footprint with a smaller example ASAP.

@ivila (Contributor) commented Mar 19, 2025

Just a suggestion—the changes can be easily simplified and reduced to just a few lines:

  1. Remove Clone from the trait bound of Recorder.
  2. Modify the load_item method in Recorder from load_item<I>(&self, args: Self::LoadArgs) to load_item<I>(&self, args: &Self::LoadArgs), allowing you to remove the extra RecorderErrorWithParameter error type.
  3. Set the default type of the trait bounds in BytesRecorder to Vec<u8>, which eliminates many duplicate changes.

For more details, you can refer to: ivila/burn@main...ivila:burn:feat_memory_zc

You can ignore it if you think it is unnecessary.

@BjornTheProgrammer (Contributor, Author)

Thank you for the suggestions! I'll be implementing this and some other changes to this PR soon!

@BjornTheProgrammer (Contributor, Author)

I think this is ready for a merge. I'm not super happy that it doesn't reduce memory usage much, but I think other approaches in different PRs would be more useful for that.

@laggui (Member) left a comment

> I'm not super happy with how this doesn't reduce memory usage much

In your case, the difference is essentially going from the bytes record to a vec when .to_vec() was called? Might not be a huge difference if the record is small. But still a nice little improvement.

I have a couple of minor comments 🙂

Also, the CI failure should be fixed if you merge main into your branch!

@BjornTheProgrammer force-pushed the main branch 3 times, most recently from 05daaaf to 897e528, on April 8, 2025 09:10
@BjornTheProgrammer (Contributor, Author)

> In your case, the difference is essentially going from the bytes record to a vec when .to_vec() was called? Might not be a huge difference if the record is small. But still a nice little improvement.

That's true, but the larger issue is that on embedded devices, fluctuations in RAM usage don't matter as much as the peak RAM used. I think my next PR will create some tooling to inspect memory usage more in-depth to find areas for optimization.

@BjornTheProgrammer (Contributor, Author) commented Apr 8, 2025

I don't think it's related to this PR, but some tests are failing when running ./run-checks.sh all, for some reason, on Pop!_OS Linux with an x86_64 Intel Core Ultra 9 185H CPU.

---- tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine stdout ----

thread 'tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine' panicked at crates/burn-wgpu/src/lib.rs:129:5:
Tensors are not eq:
  => Position 4: -127 != -128
  => Position 6: 7 != 6
  => Position 7: 39 != 38
  => Position 8: 39 != 38
  => Position 9: 7 != 6
1 more errors...
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: burn_tensor::tensor::data::TensorData::assert_eq_elem
   3: burn_tensor::tensor::data::TensorData::assert_eq
   4: burn_wgpu::tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine
   5: burn_wgpu::tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine::{{closure}}
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder stdout ----

thread 'tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder' panicked at crates/burn-wgpu/src/lib.rs:129:5:
Tensors are not approx eq:
  => Position 4: 0.5034000277519226 != 0 | difference 0.5034000277519226 > tolerance 0.0010000000000000002
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: burn_tensor::tensor::data::TensorData::assert_approx_eq_diff
   3: burn_tensor::tensor::data::TensorData::assert_approx_eq
   4: burn_wgpu::tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder
   5: burn_wgpu::tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder::{{closure}}
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

---- tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder stdout ----

thread 'tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder' panicked at crates/burn-wgpu/src/lib.rs:129:5:
Tensors are not approx eq:
  => Position 4: 0.5034000277519226 != 0 | difference 0.5034000277519226 > tolerance 0.0010000000000000002
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: burn_tensor::tensor::data::TensorData::assert_approx_eq_diff
   3: burn_tensor::tensor::data::TensorData::assert_approx_eq
   4: burn_wgpu::tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder
   5: burn_wgpu::tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder::{{closure}}
   6: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    tests::cube::kernel::quantization::tests::should_quantize_dequantize_per_block_affine
    tests::cube::tensor::f32_ty::remainder::tests::should_have_no_remainder
    tests::cube_fusion::tensor::f32_ty::remainder::tests::should_have_no_remainder

test result: FAILED. 2576 passed; 3 failed; 16 ignored; 0 measured; 0 filtered out; finished in 102.71s

error: test failed, to rerun pass `-p burn-wgpu --lib`
Error: Workspace Unit Tests failed

@laggui (Member) commented Apr 8, 2025

Ahh yes, the remainder issue. IIRC this has been broken at the driver level for some time. The quantization test failure is probably due to a precision or rounding issue. But it's totally unrelated to this PR.

@laggui (Member) left a comment

LGTM! Thanks for addressing the changes 🙂

Cleaned up some minor left-overs

@laggui merged commit 32f474d into tracel-ai:main on Apr 8, 2025
11 checks passed