Replies: 4 comments 2 replies
-
I have seen that this is not currently possible, so I have changed the category to Ideas. I have also changed the title from "Is it possible to offload layers to an SSD".
-
I also wonder what the effects would be if the model were split vertically across layers instead of one layer at a time, or split diagonally across several layers in various ways. Treating the nodes of the net as elements of a 2D matrix, the partition line could run fully along the main diagonal, or be only partially diagonal with the rest of the line horizontal or vertical, or diagonal at a different angle, and so on for different partition angles. A better approach, though, might be something like PowerInfer, modified so that part of the model is kept in RAM and the rest on an SSD; a rough sketch of that split follows below.
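A minimal sketch of that RAM/SSD split, assuming a hypothetical layout where a weight matrix is stored row-major as float32 and some rows have already been classified as "hot": hot rows are copied into RAM once, cold rows stay in a read-only memory-mapped file on the SSD and are only read when needed. The struct and function names are made up for illustration; PowerInfer's actual predictor and file formats are not reproduced here.

```cpp
// Sketch: keep frequently used ("hot") weight rows in RAM, leave the rest
// memory-mapped on the SSD. All names, sizes and the hot/cold classification
// are hypothetical.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

struct HybridMatrix {
    const float *mapped   = nullptr;  // whole matrix, mmap'd read-only from the SSD
    size_t       n_rows   = 0, n_cols = 0;
    std::vector<float>   hot_rows;    // RAM copies of hot rows
    std::vector<int32_t> hot_index;   // row -> slot in hot_rows, -1 if cold
};

// Map the file read-only; clean pages are dropped (never written back) under memory pressure.
static bool hybrid_open(HybridMatrix &m, const char *path,
                        size_t n_rows, size_t n_cols,
                        const std::vector<size_t> &hot) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    size_t bytes = n_rows * n_cols * sizeof(float);
    void *p = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return false;
    m.mapped = (const float *) p;
    m.n_rows = n_rows; m.n_cols = n_cols;
    m.hot_index.assign(n_rows, -1);
    m.hot_rows.resize(hot.size() * n_cols);
    for (size_t i = 0; i < hot.size(); ++i) {
        std::memcpy(&m.hot_rows[i * n_cols], m.mapped + hot[i] * n_cols, n_cols * sizeof(float));
        m.hot_index[hot[i]] = (int32_t) i;
    }
    return true;
}

// Fetch a row: from RAM if hot, otherwise straight from the mapped SSD pages.
static const float *hybrid_row(const HybridMatrix &m, size_t row) {
    int32_t slot = m.hot_index[row];
    return slot >= 0 ? &m.hot_rows[slot * m.n_cols] : m.mapped + row * m.n_cols;
}
```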
-
I also wonder if it would be possible to do something inspired by RAID, in which a single GGUF file would be split into several files, one per SSD, to increase read speed without actually having to set up RAID; a rough sketch of the parallel-read side of that idea is below.
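A rough sketch of the read path under that idea, assuming the GGUF has already been split RAID-0 style into equally sized stripe files, one per SSD. The paths, stripe size, and the splitting step itself are hypothetical; this only shows reassembling the stripes with one reader thread per file.

```cpp
// Sketch: read N stripe files (one per SSD) concurrently and reassemble them
// into one buffer, RAID-0 style. Paths and stripe size are hypothetical.
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// stripes[i] holds every i-th chunk of `stripe_size` bytes of the original file.
static std::vector<uint8_t> read_striped(const std::vector<std::string> &stripes,
                                         size_t stripe_size, size_t total_size) {
    std::vector<uint8_t> out(total_size);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < stripes.size(); ++i) {
        workers.emplace_back([&, i]() {
            std::ifstream f(stripes[i], std::ios::binary);
            size_t src_off = 0;               // offset inside this stripe file
            size_t dst_off = i * stripe_size; // offset inside the reassembled buffer
            while (dst_off < total_size) {
                size_t n = std::min(stripe_size, total_size - dst_off);
                f.seekg(static_cast<std::streamoff>(src_off));
                f.read((char *) &out[dst_off], (std::streamsize) n);
                src_off += n;
                dst_off += stripe_size * stripes.size(); // this stripe's next chunk
            }
        });
    }
    for (auto &t : workers) t.join();
    return out;
}
```

Whether this actually helps depends on where the bottleneck is: if the stripe files sit on the same device or behind the same PCIe lanes, the parallel reads mostly contend with each other.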
-
Something bypassing the CPU/RAM, like NVIDIA's GPUDirect Storage?
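For reference, a direct NVMe-to-GPU read with NVIDIA's cuFile API (GPUDirect Storage) looks roughly like the sketch below. It needs a GDS-capable driver and filesystem, error handling is omitted, and the file path and read size are placeholders.

```cpp
// Sketch: read a chunk of a weights file directly into GPU memory with
// GPUDirect Storage (cuFile), bypassing a bounce buffer in host RAM.
// Requires a GDS-enabled setup; compile with -lcufile -lcudart.
#define _GNU_SOURCE 1          // for O_DIRECT on glibc
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char  *path = "model.gguf";          // placeholder path
    const size_t sz   = 64ull * 1024 * 1024;   // placeholder read size (64 MiB)

    cuFileDriverOpen();

    int fd = open(path, O_RDONLY | O_DIRECT);  // O_DIRECT is required for GDS
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type      = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    void *dev = nullptr;
    cudaMalloc(&dev, sz);
    cuFileBufRegister(dev, sz, 0);

    // file offset 0 -> device buffer offset 0, no staging through CPU RAM
    ssize_t n = cuFileRead(fh, dev, sz, 0 /*file off*/, 0 /*dev off*/);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileBufDeregister(dev);
    cudaFree(dev);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```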
-
It would be slow, yes. But would it be possible to offload layers to an SSD or disk using llama.cpp without using swap? Ideally by implementing the LLM in a Flash paper: https://arxiv.org/abs/2312.11514. Even without implementing that paper, an algorithm that only reads sequentially from the SSD would allow offloading to the SSD without swap, which could reduce the number of writes to the SSD compared to using swap.
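For context, a read-only mmap of the weights file already behaves this way on the write side: pages backed by a read-only file mapping are dropped and later re-read from the SSD under memory pressure, never written to swap. A minimal sketch with a placeholder path follows; llama.cpp can memory-map model files, but the code below only illustrates the mechanism, not its loader.

```cpp
// Sketch: map a weights file read-only so evicted pages are re-read from the
// SSD instead of being written to swap; hint the kernel about sequential access.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char *path = "model.gguf";   // placeholder path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Read-only, private mapping: pages stay clean, so eviction = drop + later re-read.
    void *p = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Ask for aggressive readahead, matching a mostly sequential layer-by-layer pass.
    madvise(p, (size_t) st.st_size, MADV_SEQUENTIAL);

    // ... run inference reading weights through `p`; the kernel pages data in
    // from the SSD on demand and never writes these pages to swap.

    munmap(p, (size_t) st.st_size);
    close(fd);
    return 0;
}
```

MADV_SEQUENTIAL is only a hint; what actually makes the reads sequential is laying the weights out on disk in the order the layers are evaluated.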