Please note that following the `prefill` and `decode` method conventions is required for easy integration with the MediaPipe LLM Inference API.
<br/>
The user needs to implement the entire LLM pipeline themselves, and call the TFLite runtime directly for inference.
This approach provides users with the most control. For example, they can implement streaming, gain finer control over system memory, or implement advanced features such as constrained grammar decoding, speculative decoding, etc.
A very simple text generation pipeline based on a decoder-only transformer is provided [here](https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/c%2B%2B/text_generator_main.cc) for reference. Note that this example serves as a starting point, and users are expected to implement their own pipelines based on their model's specific requirements.
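To make the pipeline structure concrete, below is a minimal greedy-decoding sketch in Python against TF Lite's signature-runner API. The signature names, tensor names, the `logits` output key, and the example token ids are assumptions based on the conventions described in this document, not the exact interface of any particular converted model; a real pipeline would add a tokenizer, sampling, and stop conditions.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="tiny_llama.tflite")
prefill = interpreter.get_signature_runner("prefill")
decode = interpreter.get_signature_runner("decode")

PREFILL_SEQ_LEN = 512          # must match the shape baked in at conversion time
prompt = [1, 450, 6593, 310]   # example token ids from your tokenizer

# Right-pad the prompt to the static prefill length.
tokens = np.zeros((1, PREFILL_SEQ_LEN), dtype=np.int32)
tokens[0, : len(prompt)] = prompt
positions = np.arange(PREFILL_SEQ_LEN, dtype=np.int32)

# 1) Prefill: run the whole prompt once to populate the KV cache.
prefill(prefill_tokens=tokens, prefill_input_pos=positions)

# 2) Decode: generate autoregressively, one token per step.
generated = []
next_token, next_pos = prompt[-1], len(prompt)
for _ in range(32):
    outputs = decode(
        tokens=np.array([[next_token]], dtype=np.int32),
        input_pos=np.array([next_pos], dtype=np.int32),
    )
    next_token = int(np.argmax(outputs["logits"]))  # greedy; swap in your sampler
    generated.append(next_token)
    next_pos += 1
```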
For an end-to-end example showing how to author, convert, quantize and execute, please refer to the steps [here](https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/README.md).
With the `TensorNames` defined, a user can simply use the loading utils to load an instance of the mapped model. For instance:
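The following is a minimal sketch of that flow; it assumes the `ModelLoader` utility lives in `ai_edge_torch.generative.utilities.loader`, and that `TinyLlama`, `get_model_config`, and `TENSOR_NAMES` are the ones defined alongside the re-authored model:

```python
from ai_edge_torch.generative.utilities import loader as loading_utils

def build_model(checkpoint_path: str):
    # TENSOR_NAMES (defined above) maps checkpoint tensor names onto the
    # re-authored model's parameters; ModelLoader performs the actual copy.
    config = get_model_config()
    model = TinyLlama(config)
    loader = loading_utils.ModelLoader(checkpoint_path, TENSOR_NAMES)
    loader.load(model)
    model.eval()
    return model

model = build_model("path/to/tiny_llama.ckpt")
```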
It is recommended to verify the numerical correctness of the re-authored model using a few input samples before proceeding to the conversion step.
### Model conversion
In this step, we use `ai_edge_torch`'s standard multi-signature conversion API to convert the PyTorch `nn.Module` into a single TFLite flatbuffer for on-device execution. For example, `tiny_llama/convert_to_tflite.py` converts the `TinyLLama` model to a multi-signature TFLite model:
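The conversion code is not reproduced here, but a minimal sketch of its general shape follows, assuming the chainable `ai_edge_torch.signature(...).convert()` API and the hypothetical `build_model` helper from the loading step (shapes, dtypes, and the output path are illustrative):

```python
import torch
import ai_edge_torch

def convert_tiny_llama_to_tflite(checkpoint_path: str):
    pytorch_model = build_model(checkpoint_path)

    # Sample inputs fix the static shapes of each signature.
    prefill_seq_len = 512
    prefill_tokens = torch.zeros((1, prefill_seq_len), dtype=torch.int32)
    prefill_input_pos = torch.arange(0, prefill_seq_len, dtype=torch.int32)
    decode_token = torch.zeros((1, 1), dtype=torch.int32)
    decode_input_pos = torch.tensor([0], dtype=torch.int32)

    # Multi-signature conversion: one flatbuffer, two entrypoints.
    edge_model = (
        ai_edge_torch.signature("prefill", pytorch_model, (prefill_tokens, prefill_input_pos))
        .signature("decode", pytorch_model, (decode_token, decode_input_pos))
        .convert()
    )
    edge_model.export("/tmp/tiny_llama_seq512.tflite")
```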
Once converted, you will get a `.tflite` model ready for on-device execution. Note that the generated `.tflite` model uses static shapes, and defines two signatures (two entrypoints into the model):
1) `prefill`: takes two tensor inputs, `prefill_tokens` and `prefill_input_pos`, with shapes `(BATCH_SIZE, PREFILL_SEQ_LEN)` and `(PREFILL_SEQ_LEN)`.
2) `decode`: takes two tensor inputs, `tokens` and `input_pos`, with shapes `(BATCH_SIZE, 1)` and `(1)`.
## High-Level function boundary for performance
We introduce the High-Level Function Boundary (HLFB) as a way of annotating performance-critical pieces of the model (e.g. `scaled_dot_product_attention`, or `KVCache`). HLFB allows the converter to lower the annotated blocks to performant TFLite custom ops. The following is an example of applying HLFB to `SDPA`:
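This is a sketch of the annotation pattern, assuming the `StableHLOCompositeBuilder` helper exposed by `ai_edge_torch.hlfb`; the composite name and the exact call site are illustrative:

```python
import torch.nn.functional as F
from ai_edge_torch.hlfb import StableHLOCompositeBuilder

def scaled_dot_product_attention(q, k, v):
    # Everything executed between mark_inputs() and mark_outputs() is
    # fenced into one composite, which the converter can lower to a
    # single performant TFLite op instead of many small ones.
    builder = StableHLOCompositeBuilder("odml.scaled_dot_product_attention")
    q, k, v = builder.mark_inputs(q, k, v)
    out = F.scaled_dot_product_attention(q, k, v)
    out = builder.mark_outputs(out)
    return out
```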