![performance-comparison](data/performance-comparison.png)
We updated the Rust RWKV world tokenizer to support multithreading for batch encoding. We ran the same comparison
[script](tools/test_tiktoken-huggingface-rwkv.py), which is based on [Huggingface Tokenizers](https://github.com/huggingface/tokenizers),
with the rwkv tokenizer added. The results show that the rwkv world tokenizer is significantly faster than
the Tiktoken and Huggingface tokenizers across all thread counts and document sizes, on average about ten times faster.
![performance-comparison](data/performance-comparison-multithreading.png)
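
The idea behind multithreaded batch encoding can be sketched in a few lines: split the batch of documents across worker threads, encode each chunk independently, and concatenate the results in order. The sketch below uses a trivial stand-in tokenizer (each word is mapped to its length), since the real rwkv world tokenizer does the encoding in Rust; note that a pure-Python tokenizer would gain little from threads because of the GIL, whereas the Rust implementation can encode chunks truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def encode(text):
    # Stand-in tokenizer: maps each whitespace-separated word to its length.
    # The real rwkv world / Tiktoken / Huggingface tokenizer would go here.
    return [len(w) for w in text.split()]

def encode_batch(texts, num_threads=4):
    # Split the batch into one chunk per thread and encode chunks in parallel;
    # ThreadPoolExecutor.map preserves the original document order.
    chunk = max(1, len(texts) // num_threads)
    chunks = [texts[i:i + chunk] for i in range(0, len(texts), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(lambda c: [encode(t) for t in c], chunks)
    return [ids for part in parts for ids in part]

docs = ["hello world"] * 1000
ids = encode_batch(docs, num_threads=4)
print(len(ids))  # 1000 encoded documents, in input order
```

This is only an illustration of the chunking strategy, not the tokenizer's actual implementation.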
* The simple English Wikipedia dataset can be downloaded as a jsonl file from
https://huggingface.co/datasets/cahya/simple-wikipedia/resolve/main/simple-wikipedia.jsonl?download=true
## Tools using this tokenizer

We also created the [json2bin](https://github.com/cahya-wirawan/json2bin) application to convert datasets from JSONL format
into binidx format, a data format used for training RWKV models. It supports batch encoding with multithreading and
can convert a dataset more than 70 times faster than the original json2binidx program written in Python.
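
A JSONL-to-binary conversion of this kind boils down to: read one JSON document per line, encode its text field, append the token ids to a flat binary file, and keep an index of document boundaries. The sketch below illustrates that pipeline with a placeholder tokenizer and a simplified offset index; the actual binidx layout used by RWKV training tools differs, so this is not a drop-in replacement for json2bin.

```python
import json
import os
import struct
import tempfile

def encode(text):
    # Placeholder tokenizer; json2bin would call the rwkv world tokenizer here.
    return [ord(c) % 256 for c in text]

def jsonl_to_bin(jsonl_path, bin_path, idx_path):
    # Append every document's token ids (as little-endian uint16) to one flat
    # binary file, and record each document's start offset in a separate index.
    offsets = [0]
    with open(jsonl_path) as src, open(bin_path, "wb") as dst:
        for line in src:
            ids = encode(json.loads(line)["text"])
            dst.write(struct.pack(f"<{len(ids)}H", *ids))
            offsets.append(offsets[-1] + len(ids))
    with open(idx_path, "wb") as idx:
        idx.write(struct.pack(f"<{len(offsets)}Q", *offsets))

# Tiny demo dataset: two documents with 2 and 3 characters of text.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "data.jsonl")
    with open(src, "w") as f:
        f.write('{"text": "ab"}\n{"text": "cde"}\n')
    jsonl_to_bin(src, os.path.join(d, "data.bin"), os.path.join(d, "data.idx"))
    bin_size = os.path.getsize(os.path.join(d, "data.bin"))
print(bin_size)  # (2 + 3) tokens * 2 bytes each = 10
```

Batch encoding with multithreading would slot in at the `encode` call, processing many lines per chunk instead of one at a time.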
## Changelog
- Version 0.9.0
  - Added multithreading for the function `encode_batch()`