forked from ggml-org/llama.cpp
[pull] master from ggerganov:master #175
Closed
Conversation
typo: `\` -> `/`. Replace the Windows-style `\` path separator with the UNIX-style `/`.
Technically the fixed-width types come only from the iostream and cstdint/stdint.h headers; the memory and vector headers should not be expected to provide them. In GCC 15 the standard library headers were cleaned up, so the proper header, cstdint, is now required:

```
src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
   26 |     uint32_t read_u32() const;
      |     ^~~~~~~~
```
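A minimal sketch of the fix: include cstdint explicitly rather than relying on it leaking in through other headers. The struct body shown here is illustrative, not the verbatim header.

```cpp
// src/llama-mmap.h (sketch): include <cstdint> directly instead of
// relying on it being pulled in transitively by <memory> or <vector>
#include <cstdint>  // uint32_t -- required explicitly under GCC 15
#include <memory>
#include <vector>

struct llama_file {
    // previously failed under GCC 15 with:
    //   error: 'uint32_t' does not name a type
    uint32_t read_u32() const;
};
```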
* server : (webui) introduce conversation branching + idb storage
* mark old conv as "migrated" instead of deleting them
* improve migration
* add more comments
* more clarification
* Update ggml.c
* Update arg.cpp
* Update speculative.h
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <[email protected]>
* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with common_token_to_piece in populate_token_probs. The motivation for this change is to avoid an issue where common_detokenize would remove the word boundary character for tokens, which caused a regression in the server-generated token probabilities (see the sketch below).

Resolves: #11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
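A hedged sketch of the substitution: the helper signatures below follow llama.cpp's common library but should be treated as assumptions rather than the verbatim server code.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct llama_context;          // opaque llama.cpp context
using llama_token = int32_t;

// assumed signatures of the common helpers referenced above
std::string common_detokenize(llama_context * ctx, const std::vector<llama_token> & tokens, bool special);
std::string common_token_to_piece(llama_context * ctx, llama_token token, bool special);

// before: detokenizing a single-token vector could strip the token's
// word-boundary character (e.g. the leading space of a piece), so the
// text attached to each reported probability was wrong
std::string token_text_before(llama_context * ctx, llama_token tok) {
    return common_detokenize(ctx, {tok}, /*special=*/false);
}

// after: converting the token directly to its piece keeps the
// boundary character intact
std::string token_text_after(llama_context * ctx, llama_token tok) {
    return common_token_to_piece(ctx, tok, /*special=*/false);
}
```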
Signed-off-by: Weizhao Ouyang <[email protected]>
* Bug fix for clamp_f32

For tensors larger than 1-D the clamp operation did not work, because the kernel returned early whenever ith was not 0 (see the sketch below).

* Bug fix for clamp_f32
* Bug fix for clamp_f32
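A sketch of the pattern behind the bug, assuming ggml's usual ith/nth row-striping across threads; this is illustrative, not the verbatim ggml source.

```cpp
#include <algorithm>
#include <cstddef>

// before (buggy shape): `if (ith != 0) return;` meant only thread 0 ran,
// yet the row loop still strode by nth, so most rows of a >1-D tensor
// were never clamped

// after: every thread clamps its own stripe of rows
static void clamp_f32_rows(float * data, int nrows, int ncols,
                           float min, float max, int ith, int nth) {
    for (int r = ith; r < nrows; r += nth) {            // stripe rows over threads
        float * row = data + (std::size_t) r * ncols;
        for (int c = 0; c < ncols; ++c) {
            row[c] = std::clamp(row[c], min, max);      // elementwise clamp
        }
    }
}
```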
* All messages get the copy button
* Update index.html.gz
* ggml : x2 speed for WASM by optimizing SIMD
* fix bad merging
* rm trailing spaces
* rm redundant clamp
* better quantize_row_q8_K

Co-authored-by: camel-cdr <[email protected]>

* remove memset that causes buffer overflow

Co-authored-by: camel-cdr <[email protected]>

---------

Co-authored-by: camel-cdr <[email protected]>
* readme : add notice about new package registry
* cont : fix whitespace
* simple typo fixed
* Update examples/imatrix/README.md

---------

Co-authored-by: Tobias Bergmann <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* docker : attempt fixing arm64 build on ci
* qemu v7.0.0-28
This patch fixes a typo in the command help: prefx -> prefix.

Signed-off-by: Masanari Iida <[email protected]>
* vulkan: support memset_tensor
* vulkan: support GGML_OP_SUM
* vulkan: implement GGML_OP_ARGMAX
* vulkan: implement GGML_OP_SUB
* vulkan: implement GGML_OP_COUNT_EQUAL
* vulkan: implement GGML_OP_OPT_STEP_ADAMW
* vulkan: fix check_results RWKV_WKV6 crash and memory leaks
* vulkan: implement GGML_OP_REPEAT_BACK
* tests: remove invalid test-backend-ops REPEAT_BACK tests
* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <[email protected]>
This commit fixes an issue in the llama.cpp project where the command for testing the llama-server object contained a duplicated file extension. The original command was:

```console
./tests.sh unit/test_chat_completion.py.py -v -x
```

It has been corrected to:

```console
./tests.sh unit/test_chat_completion.py -v -x
```

This change ensures that the test script correctly locates and executes the intended test file, preventing test failures due to an incorrect file name.
Signed-off-by: MoonRide303 <[email protected]>
* server : add TEI API format for /rerank endpoint
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* fix
* also gitignore examples/server/*.gz.hpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…1900)

* tool-call refactoring: moved common_chat_* to chat.h; common_chat_templates_init returns a unique_ptr to an opaque type (sketched below)
* addressed clang-tidy lints in [test-]chat.*
* rm minja deps from util & common & move it to common/minja/
* add name & tool_call_id to common_chat_msg
* add common_chat_tool
* added json <-> tools, msgs conversions to chat.h
* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)
* fix deepseek r1 slow test (no longer <think> opening w/ new template)
* allow empty tools w/ auto + grammar
* fix & test server grammar & json_schema params w/ & w/o --jinja
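The unique_ptr-to-opaque-type change follows a standard C++ idiom. A minimal sketch, with the member layout and deleter details as assumptions rather than the actual chat.h contents:

```cpp
#include <memory>
#include <string>

// chat.h (sketch): callers see only a forward declaration
struct common_chat_templates;                       // opaque type

struct common_chat_templates_deleter {
    void operator()(common_chat_templates * t);     // defined out of line
};
using common_chat_templates_ptr =
    std::unique_ptr<common_chat_templates, common_chat_templates_deleter>;

common_chat_templates_ptr common_chat_templates_init(/* model, template overrides, ... */);

// chat.cpp (sketch): the full definition stays private to one translation unit
struct common_chat_templates {
    std::string source;                             // e.g. the raw jinja template text
};

void common_chat_templates_deleter::operator()(common_chat_templates * t) { delete t; }
```

This keeps the template machinery (e.g. the minja dependency moved to common/minja/) out of every header that merely passes the handle around.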
…n iframe) (#11940)

* Webui: Enable communication with the parent html (if the webui is in an iframe):
  - Listens for the "setText" command from the parent, with "text" and "context" fields. "text" is set in inputMsg; "context" is used as hidden context in the following requests to the llama.cpp server.
  - On pressing the Escape button, sends the command "escapePressed" to the parent.

  Example handling from the parent html side:

  - Send the "setText" command from the parent html to the webui in the iframe:

    ```js
    const iframe = document.getElementById('askAiIframe');
    if (iframe) {
        iframe.contentWindow.postMessage({ command: 'setText', text: text, context: context }, '*');
    }
    ```

  - Listen for the Escape key from the webui on the parent html:

    ```js
    // Listen for escape key event in the iframe
    window.addEventListener('keydown', (event) => {
        if (event.key === 'Escape') {
            // Process case when Escape is pressed inside webui
        }
    });
    ```

* Move the extraContext from storage to app.context.
* Fix formatting.
* add Message.extra
* format + build
* MessageExtraContext
* build
* fix display
* rm console.log

---------

Co-authored-by: igardev <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
This commit adjusts the indentation of the functions `parse_sequence` and `parse_rule` in src/llama-grammar.cpp. The motivation is consistency and improved readability.
* speculative : update default params
* speculative : do not discard the last drafted token
This commit adds a preset for llama.vim to use the default Qwen 2.5 Coder models. The motivation for this change is to make it easier to start a server suitable for use with the llama.vim plugin. For example, the server can be started with a command like the following:

```console
$ llama-server --fim-qwen-1.5b-default
```

Refs: #10932
Relates to: #11178

Added the --chat-template-file CLI option to llama-run. If specified, the file is read and its content is passed to common_chat_templates_from_model, overriding the model's chat template.

Signed-off-by: Michael Engel <[email protected]>
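A minimal sketch of the described flow, with hypothetical helper names; the call into common_chat_templates_from_model is shown as a comment because the surrounding llama-run plumbing is not reproduced here.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// read the whole template file; an empty result means "no override"
static std::string read_chat_template_file(const std::string & path) {
    std::ifstream file(path);
    if (!file) {
        return "";                 // fall back to the model's built-in template
    }
    std::ostringstream ss;
    ss << file.rdbuf();            // slurp the file contents
    return ss.str();
}

// usage (sketch): a non-empty string overrides the model's chat template
//   auto tmpls = common_chat_templates_from_model(model, read_chat_template_file(opt.chat_template_file));
```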
* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file
* Improved Formatting of code in ggml-cpu-quants.c file
* style : minor fixes
* style : less whitespaces
* style : ptr spacing

---------

Co-authored-by: vithulep <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environmental variable GGML_KLEIDIAI_SME
* Add support for multithread LHS conversion
* Switch kernel selection order to dotprod and i8mm
* updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
* fix skip ime composing
* fix npm rebuild
* fix warn

---------

Co-authored-by: momonga <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Labels
android
Apple Metal
⤵️ pull
devops
documentation
examples
ggml
nix
Nvidia GPU
python
script
server
SYCL
testing
Vulkan
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.1)
Can you help keep this open source service alive? 💖 Please sponsor : )