Feature Req: Add Importance Matrix / RAM avail calculations to ISQ #377

Closed
@psyv282j9d

Description

Looking over ISQ (based on your previous ask), I found a few things missing that I've learned, through trial and error, are helpful.

imatrix: if you look at the discussion here, you can see that calculating an importance matrix prior to quantization can offset some of the negative effects of quantization. In particular, this comment gives a great walkthrough of which tools to use to calculate the imatrix, and then how to use it when quantizing.
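The core idea, as I understand it from the linked discussion, is to run a calibration set through the model and record how strongly each weight column is activated, so the quantizer can spend precision where it matters. A minimal sketch in Python (function and variable names are mine, not llama.cpp's; the real tooling operates on GGUF models and per-tensor statistics):

```python
def accumulate_imatrix(batches):
    """Accumulate a per-column importance score as the mean squared
    activation each weight column sees over a calibration run.

    batches: list of activation batches, each a list of rows
             (one row per token, one value per hidden column).
    Returns one importance value per column.
    """
    n_cols = len(batches[0][0])
    sums = [0.0] * n_cols
    n_rows = 0
    for batch in batches:
        for row in batch:
            for j, a in enumerate(row):
                sums[j] += a * a  # squared activation magnitude
            n_rows += 1
    return [s / n_rows for s in sums]

# Toy calibration set: 3 tokens, 2 hidden columns.
imatrix = accumulate_imatrix([[[1.0, 2.0], [3.0, 0.0]], [[1.0, 2.0]]])
print(imatrix)  # column 0 carries more importance than column 1
```

The resulting per-column weights are then used to bias the quantization error metric, so columns with large activations are quantized more faithfully.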

Also, one of the key benefits of Ollama (a Go wrapper around llama.cpp) lives in llm/memory.go. Its EstimateGPULayers function calculates, based on available VRAM (or system RAM for Metal), how many layers can be offloaded to the GPU. That number is then passed to llama.cpp's --n_gpu_layers option.
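The estimation boils down to: reserve some fixed overhead (KV cache, scratch buffers), then offload as many whole layers as fit in what's left. A rough sketch of that idea (the sizes, overhead, and function name here are illustrative assumptions, not Ollama's actual accounting):

```python
def estimate_gpu_layers(free_vram_bytes, layer_size_bytes, n_layers,
                        overhead_bytes=512 * 1024**2):
    """Sketch of the idea behind Ollama's EstimateGPULayers:
    after reserving a fixed overhead for the KV cache and scratch
    buffers, offload as many whole layers as fit in remaining VRAM.
    """
    usable = free_vram_bytes - overhead_bytes
    if usable <= 0 or layer_size_bytes <= 0:
        return 0
    return min(n_layers, usable // layer_size_bytes)

# e.g. 8 GiB free, ~200 MiB per quantized layer, 32-layer model:
print(estimate_gpu_layers(8 * 1024**3, 200 * 1024**2, 32))  # 32 (all fit)
# With only 1 GiB free, just a couple of layers fit:
print(estimate_gpu_layers(1 * 1024**3, 200 * 1024**2, 32))  # 2
```

The real implementation also accounts for per-GPU memory on multi-GPU setups and the output layer's size, but the clamp-to-what-fits shape is the same.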

What are the chances of incorporating these ideas into ISQ? It would be great to go from safetensors / bf16 on disk to automagically optimal memory loading for inference. :-)


    Labels

models (Additions to model or architectures), new feature (New feature or request)
