Research: mmap eviction #14154

Open
@koush

Research Stage

  • Background Research (Let's try to avoid reinventing the wheel)
  • Hypothesis Formed (How do you think this will work, and what is its effect?)
  • Strategy / Implementation Forming
  • Analysis of results
  • Debrief / Documentation (So people in the future can learn from us)

Previous existing literature and research

No response

Hypothesis

I'm loading a large model mostly into GPU memory, with the remainder offloaded to the CPU. Total GPU memory exceeds system memory, and the model is larger than either:

GPU Memory: 196 GB
CPU Memory: 148 GB
Model Size: 220 GB

I've noticed that when the model size exceeds system memory, mmap seemingly has no effect on load times, whereas when the model fits within system memory, the load is nearly immediate.

I suspect that because the model is loaded deterministically and sequentially, the mapped file's pages are also evicted from the page cache in a deterministic order, so each page is evicted just before it is needed again for the load onto the GPU.

I suspect that loading the large weights in reverse inference order would significantly alleviate this by avoiding the deterministic eviction of mmap'd pages in the kernel.
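
As a rough illustration of the hypothesis, here is a toy simulation rather than a measurement. It assumes a strict LRU page cache, that all 148 GB of system memory is available for caching, and that 1 GB counts as one "page"; none of these assumptions hold exactly on a real kernel, and the function names are made up for this sketch. It counts how many pages of a second load are still cached after a first sequential load.

```python
from collections import OrderedDict

def scan_hits(cache, capacity, pages):
    """Touch pages in order through a strict-LRU cache; return the hit count."""
    hits = 0
    for page in pages:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used page
    return hits

def second_pass_hits(file_pages, capacity, reverse_second_pass):
    """Do one cold sequential load, then return the hits of a second load."""
    cache = OrderedDict()
    scan_hits(cache, capacity, range(file_pages))  # first load, cold cache
    if reverse_second_pass:
        order = range(file_pages - 1, -1, -1)
    else:
        order = range(file_pages)
    return scan_hits(cache, capacity, order)

# Sizes in GB from this issue, treating 1 GB as one page (an assumption).
FILE_PAGES, CACHE_PAGES = 220, 148

print(second_pass_hits(FILE_PAGES, CACHE_PAGES, reverse_second_pass=False))  # 0
print(second_pass_hits(FILE_PAGES, CACHE_PAGES, reverse_second_pass=True))   # 148
```

In this toy model, a second load in the same order gets no cache hits at all, while a second load in the opposite order hits every page that is still cached. The gain comes from reading in the opposite direction to the previous pass; always loading in reverse would be just as deterministic as always loading forward. A real kernel's page replacement is only approximately LRU, so the actual numbers may differ.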

I'm looking for some confirmation from a maintainer that my hypothesis may be correct.

Implementation

No response

Analysis

No response

Relevant log output
