Motivation
Fix #13128
My residence currently has no internet (temporarily), and I'm using a slow 4G connection to upload this PR.
This PR allows using `-hf` and `-mu` without internet access, given you already downloaded the model. If the model is not yet downloaded, or the manifest file is not yet generated (manifest files did not exist before this PR), you will see an error instead.
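For illustration only, here is a minimal sketch of what the offline path could look like. The `manifest=` naming, the manifest content, and the helper names are assumptions for this sketch, not the actual implementation:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical sketch of resolving a model passed via -hf / -mu while offline.
// Assumption: a previous (online) run wrote a small manifest file named
// "manifest=<id>" into the cache dir, whose first line is the local path of
// the downloaded GGUF.
static std::string resolve_model_offline(const std::string & cache_dir,
                                         const std::string & model_id) {
    const std::string manifest_path = cache_dir + "/manifest=" + model_id;

    std::ifstream manifest(manifest_path);
    if (!manifest) {
        // no manifest -> the model was never downloaded (or was fetched before
        // manifests existed), so offline usage is not possible
        throw std::runtime_error("model not cached; connect once to download it");
    }

    std::string local_gguf;
    std::getline(manifest, local_gguf);

    if (!std::ifstream(local_gguf, std::ios::binary).good()) {
        throw std::runtime_error("manifest exists but the cached model file is missing");
    }
    return local_gguf; // load this file as if it had been passed via -m
}
```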
Behavior change
Two noticeable changes:
- The `HEAD` request no longer retries. If we forced the user to sit through 3 retries, it would be a bad UX for offline usage. Not sure if this will impact anyone, but I hope it won't be a big problem (see next point).
- If the `HEAD` request fails but the file already exists on disk, we won't re-download it. The argument is that if the server does not support `ETag` on `HEAD` requests, there is no point in forcing the user to re-download the file every time (see the sketch below).
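A rough sketch of that decision, assuming the probe is done with libcurl (function and variable names are illustrative, not the actual download code):

```cpp
#include <curl/curl.h>
#include <fstream>
#include <stdexcept>
#include <string>

// Sketch only: probe the server with a single HEAD request (no retry loop).
static bool head_request_ok(const std::string & url) {
    CURL * curl = curl_easy_init();
    if (!curl) {
        return false;
    }
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);         // HEAD instead of GET
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 10L);       // fail fast, no retries
    const CURLcode res = curl_easy_perform(curl);       // single attempt only
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}

// Returns true when the file should be (re-)downloaded.
static bool should_download(const std::string & url, const std::string & local_path) {
    const bool have_local_file = std::ifstream(local_path).good();
    if (!head_request_ok(url)) {
        // HEAD failed (offline, or a server without proper HEAD/ETag support):
        // keep using the cached file instead of forcing a re-download
        if (have_local_file) {
            return false;
        }
        throw std::runtime_error("file is not cached and the server is unreachable");
    }
    // HEAD succeeded: the real code would compare the returned ETag against the
    // stored one; for this sketch we only download when the file is missing
    return !have_local_file;
}
```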
Idea for the future
While making this PR, I intentionally added a `manifest=` prefix to the cached manifest file. In the future, we can have a flag like `--list-cached-models` to show the list of cached models that the user can pick from.
Further in the future, we could also allow `llama-server` to swap models (not necessarily running 2 or more in parallel). Think of the LM Studio use case, where you load one model at a time. The manifest file introduced by this PR makes it possible to list the models available in the cache, ready to be loaded.
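As a rough illustration, `--list-cached-models` could be as simple as scanning the cache directory for the `manifest=` prefix; the directory layout and the way the cache path is obtained are assumptions here, not taken from the actual code:

```cpp
#include <cstdio>
#include <filesystem>
#include <string>

// Sketch: enumerate cached models by looking for the "manifest=" prefix that
// this PR adds to cached manifest files. The cache directory is passed in;
// the real code would resolve it from llama.cpp's own cache path handling.
static void list_cached_models(const std::string & cache_dir) {
    namespace fs = std::filesystem;
    const std::string prefix = "manifest=";
    for (const auto & entry : fs::directory_iterator(cache_dir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind(prefix, 0) == 0) {
            // strip the prefix to print a human-readable model identifier
            printf("%s\n", name.substr(prefix.size()).c_str());
        }
    }
}
```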