Description
Research Stage
- Background Research (Let's try to avoid reinventing the wheel)
- Hypothesis Formed (How do you think this will work and what is its effect?)
- Strategy / Implementation Forming
- Analysis of results
- Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
No response
Hypothesis
I'm loading a large model into GPU memory with some CPU offload. The model size exceeds system (CPU) memory.
GPU Memory: 196 GB
CPU Memory: 148 GB
Model Size: 220 GB
I've noticed that when the model size exceeds system memory, mmap seemingly has no effect on load times, whereas when the model fits within system memory, subsequent loads are nearly immediate.
I suspect that since the model is being read deterministically/sequentially, the kernel is also evicting the mapped pages deterministically, just before they are needed for the upload to the GPU.
I suspect that loading the large weights in reverse inference order would significantly alleviate this, sidestepping the deterministic page-cache eviction in the kernel.
I'm looking for some confirmation from a maintainer that my hypothesis may be correct.
Implementation
No response
Analysis
No response