Feature Req: Add Importance Matrix / RAM avail calculations to ISQ #377

Closed
@psyv282j9d

Description

Looking over ISQ (based on your previous ask), I found a few things missing that I've learned, through trial and error, are helpful.

imatrix: if you look at the discussion here, you can see that calculating an importance matrix prior to quantization can offset some of the negative effects of quantization. In particular, this comment gives a great walkthrough of which tools to use to calculate the imatrix, and then how to use it when quantizing.
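The core idea, as I understand it from the linked discussion, is to run a calibration set through the model and record how strongly each weight column is activated, so the quantizer can spend precision where it matters. A minimal sketch in Python (function and variable names are mine, not llama.cpp's; the real tooling operates on GGUF models and per-tensor statistics):

```python
def accumulate_imatrix(batches):
    """Accumulate a per-column importance score as the mean squared
    activation each weight column sees over a calibration run.

    batches: list of activation batches, each a list of rows
             (one row per token, one value per hidden column).
    Returns one importance value per column.
    """
    n_cols = len(batches[0][0])
    sums = [0.0] * n_cols
    n_rows = 0
    for batch in batches:
        for row in batch:
            for j, a in enumerate(row):
                sums[j] += a * a  # squared activation magnitude
            n_rows += 1
    return [s / n_rows for s in sums]

# Toy calibration set: 3 tokens, 2 hidden columns.
imatrix = accumulate_imatrix([[[1.0, 2.0], [3.0, 0.0]], [[1.0, 2.0]]])
print(imatrix)  # column 0 carries more importance than column 1
```

The resulting per-column weights are then used to bias the quantization error metric, so columns with large activations are quantized more faithfully.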

Also, one of the key benefits of Ollama (a Go wrapper around llama.cpp) lives in llm/memory.go. Its EstimateGPULayers function calculates, based on available VRAM (or system RAM for Metal), how many layers can be offloaded to the GPU. That number is then passed to llama.cpp's --n_gpu_layers option.
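The estimation boils down to: reserve some fixed overhead (KV cache, scratch buffers), then offload as many whole layers as fit in what's left. A rough sketch of that idea (the sizes, overhead, and function name here are illustrative assumptions, not Ollama's actual accounting):

```python
def estimate_gpu_layers(free_vram_bytes, layer_size_bytes, n_layers,
                        overhead_bytes=512 * 1024**2):
    """Sketch of the idea behind Ollama's EstimateGPULayers:
    after reserving a fixed overhead for the KV cache and scratch
    buffers, offload as many whole layers as fit in remaining VRAM.
    """
    usable = free_vram_bytes - overhead_bytes
    if usable <= 0 or layer_size_bytes <= 0:
        return 0
    return min(n_layers, usable // layer_size_bytes)

# e.g. 8 GiB free, ~200 MiB per quantized layer, 32-layer model:
print(estimate_gpu_layers(8 * 1024**3, 200 * 1024**2, 32))  # 32 (all fit)
# With only 1 GiB free, just a couple of layers fit:
print(estimate_gpu_layers(1 * 1024**3, 200 * 1024**2, 32))  # 2
```

The real implementation also accounts for per-GPU memory on multi-GPU setups and the output layer's size, but the clamp-to-what-fits shape is the same.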

What are the chances of incorporating these ideas into ISQ? It would be great to go from safetensors / bf16 on disk to automagically optimal memory loading for inference. :-)


    Labels

models (Additions to model or architectures), new feature (New feature or request)
