Update doc for server arguments #2742
Merged: zhaochenyang20 merged 23 commits into sgl-project:main from simveit:feature/server-arguments-docs on Jan 23, 2025.
Commits (23):

452a766 Added model arguments (simveit)
0a288d7 Added sections on tensor and data parallelism (simveit)
b939c56 Shortened existing texts, added more parameters (simveit)
9084cfe Merge remote-tracking branch 'upstream/main' into feature/server-argu… (simveit)
4c0ae15 Added arguments (simveit)
17819d2 Merge remote-tracking branch 'upstream/main' into feature/server-argu… (simveit)
49b0815 Merge remote-tracking branch 'origin/feature/server-arguments-docs' i… (simveit)
5cf32be Merge branch 'main' into feature/server-arguments-docs (zhaochenyang20)
15a3e87 Merge branch 'sgl-project:main' into feature/server-arguments-docs (simveit)
ce845c4 Adjusted descriptions (simveit)
efa4e9c Merge branch 'sgl-project:main' into feature/server-arguments-docs (simveit)
d190c6e Merge branch 'feature/server-arguments-docs' of github.com:simveit/sg… (simveit)
30e818e Completed all arguments (simveit)
8759299 Remove argument (simveit)
342aa20 Refined argument description (simveit)
95ce466 Merge branch 'main' into feature/server-arguments-docs (simveit)
452acac Adjusted descriptions. (simveit)
9e4788b Merge branch 'main' into feature/server-arguments-docs (zhaochenyang20)
9d20887 Merge branch 'main' into feature/server-arguments-docs (zhyncs)
e5ef947 Merge branch 'main' into feature/server-arguments-docs (simveit)
3e64a3f Added common launch commands from previous doc. (simveit)
93ae630 Merge branch 'feature/server-arguments-docs' of github.com:simveit/sg… (simveit)
4425c28 Linted (simveit)
@@ -13,7 +13,7 @@ In this document we aim to give an overview of the possible arguments when deplo
 * `kv_cache_dtype`: Dtype of the kv cache, defaults to the `dtype`.
 * `context_length`: The number of tokens our model can process *including the input*. Note that extending the default might lead to strange behavior.
 * `device`: The device the model is placed on, defaults to `cuda`.
-* `chat_template`: The chat template to use. Deviating from the default might lead to unexpected responses.
+* `chat_template`: The chat template to use. Deviating from the default might lead to unexpected responses. For multi-modal chat templates, refer to [here](https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template).
 * `is_embedding`: Set to true to perform [embedding](https://docs.sglang.ai/backend/openai_api_embeddings.html) / [encode](https://docs.sglang.ai/backend/native_api.html#Encode-(embedding-model)) and [reward](https://docs.sglang.ai/backend/native_api.html#Classify-(reward-model)) tasks.
 * `revision`: Adjust if a specific version of the model should be used.
 * `skip_tokenizer_init`: Set to true to provide the tokens to the engine and get the output tokens directly, typically used in RLHF.
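The model arguments above can be combined into a single launch command. The sketch below is illustrative, not a verified invocation: the flags are assumed to be the hyphenated forms of the documented argument names, and the model path and values are placeholders.

```shell
# Sketch: launch with explicit model arguments.
# Flag names are assumed to mirror the underscore argument names above;
# the model path and values are placeholders.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --context-length 8192 \
  --device cuda \
  --revision main
```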
@@ -28,7 +28,7 @@ In this document we aim to give an overview of the possible arguments when deplo

 ### API configuration

-* `api_key`: Sets an API key for the server or the OpenAI-compatible API.
+* `api_key`: Sets an API key for the server and the OpenAI-compatible API.
 * `file_storage_pth`: Directory for storing uploaded or generated files from API calls.
 * `enable_cache_report`: If set, includes detailed usage of cached tokens in the response usage.
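As a sketch of the API configuration above (the key value is a placeholder, and the flag names are assumed hyphenated forms of the argument names):

```shell
# Sketch: protect the server with an API key and report cached-token usage.
# The key value is a placeholder.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --api-key sk-my-secret-key \
  --enable-cache-report
```

Clients would then authenticate by sending the same key, e.g. as an `Authorization: Bearer sk-my-secret-key` header or as the `api_key` of an OpenAI-compatible client.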
@@ -49,10 +49,10 @@ In this document we aim to give an overview of the possible arguments when deplo

 ## Memory and scheduling

-* `mem_fraction_static`: Fraction of the free GPU memory used for static memory like model weights and KV cache. If build KV cache failed, should be increased. In case of OOM should be decreased.
+* `mem_fraction_static`: Fraction of the free GPU memory used for static memory like model weights and KV cache. If building the KV cache fails, it should be increased. If CUDA runs out of memory, it should be decreased.
 * `max_running_requests`: The maximum number of requests to run concurrently.
 * `max_total_tokens`: The maximum number of tokens that can be stored in the KV cache. Used mainly for debugging.
-* `chunked_prefill_size`: Perform the prefill in chunks of these size. Larger chunk size speeds up the prefill phase but increases the VRAM consumption. In case of OOM should be decreased.
+* `chunked_prefill_size`: Perform the prefill in chunks of this size. A larger chunk size speeds up the prefill phase but increases VRAM consumption. If CUDA runs out of memory, it should be decreased.
 * `max_prefill_tokens`: Token budget for how many tokens to accept in one prefill batch. The actual number is the max of this parameter and the `context_length`.
 * `schedule_policy`: The scheduling policy to control the processing order of waiting prefill requests in a single engine.
 * `schedule_conservativeness`: Can be used to decrease/increase the conservativeness of the server when taking new requests. Highly conservative behavior leads to starvation, while low conservativeness leads to slowed-down performance.
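The memory and scheduling knobs above interact: if CUDA runs out of memory, lower `mem_fraction_static` and `chunked_prefill_size` first. A sketch with illustrative values (flags assumed to be the hyphenated argument names):

```shell
# Sketch: memory/scheduling tuning; the values are illustrative, not tuned.
# Lower --mem-fraction-static and --chunked-prefill-size on CUDA OOM.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.8 \
  --max-running-requests 256 \
  --chunked-prefill-size 4096 \
  --schedule-conservativeness 1.0
```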
@@ -66,7 +66,7 @@ In this document we aim to give an overview of the possible arguments when deplo
 * `watchdog_timeout`: Adjusts the watchdog thread's timeout before killing the server if batch generation takes too long.
 * `download_dir`: Use to override the default Hugging Face cache directory for model weights.
 * `base_gpu_id`: Use to adjust the first GPU used to distribute the model across available GPUs.
 * `allow_auto_truncate`: Automatically truncate requests that exceed the maximum input length.

 ## Logging
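The operational arguments above might be combined as follows; this is a sketch, with the cache path and timeout value as placeholders and the flags assumed to be the hyphenated argument names:

```shell
# Sketch: operational settings; path and values are placeholders.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --watchdog-timeout 300 \
  --download-dir /data/hf-cache \
  --base-gpu-id 1 \
  --allow-auto-truncate
```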
@@ -86,7 +86,7 @@ In this document we aim to give an overview of the possible arguments when deplo

 ## LoRA

-* `lora_paths`: You may provide a list of adapters to your model as a list. Each batch element will get model response with the corresponding lora adapter applied. Currently `cuda_graph` and `radix_attention` are not supportet with this option so you need to disable them manually. We are still working on through these [issues](https://github.com/sgl-project/sglang/issues?q=is%3Aissue%20lora%20).
+* `lora_paths`: You may provide a list of adapters to apply to your model. Each batch element will get the model response with the corresponding LoRA adapter applied. Currently `cuda_graph` and `radix_attention` are not supported with this option, so you need to disable them manually. We are still working through these [issues](https://github.com/sgl-project/sglang/issues/2929).
 * `max_loras_per_batch`: Maximum number of LoRAs in a running batch, including the base model.

 ## Kernel backend
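A sketch of serving multiple LoRA adapters per the note above; the adapter names and paths are placeholders, and the `name=path` syntax and disable flags are assumptions about the CLI:

```shell
# Sketch: serve two LoRA adapters (names and paths are placeholders).
# cuda_graph and radix_attention must be disabled manually, as noted above.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-hf \
  --lora-paths lora0=/path/to/adapter_a lora1=/path/to/adapter_b \
  --max-loras-per-batch 4 \
  --disable-cuda-graph \
  --disable-radix-cache
```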
@@ -97,6 +97,7 @@ In this document we aim to give an overview of the possible arguments when deplo
 ## Constrained Decoding

 * `grammar_backend`: The grammar backend for constrained decoding. Detailed usage can be found in this [document](https://docs.sglang.ai/backend/structured_outputs.html).
+* `constrained_json_whitespace_pattern`: Use with the `Outlines` grammar backend to allow JSON with syntactic newlines, tabs or multiple spaces. Details can be found [here](https://dottxt-ai.github.io/outlines/latest/reference/generation/json/#using-pydantic).

 ## Speculative decoding
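The two constrained-decoding arguments above might be used together like this (a sketch; the whitespace pattern is an illustrative regex, and the flags are assumed hyphenated forms of the argument names):

```shell
# Sketch: Outlines backend with relaxed JSON whitespace.
# The pattern permits newlines, tabs and repeated spaces between tokens.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --grammar-backend outlines \
  --constrained-json-whitespace-pattern "[\n\t ]*"
```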