Commit a2bc33f

committed
updated README.md
1 parent fce3f71 commit a2bc33f

1 file changed: +11 -2 lines changed

README.md

Lines changed: 11 additions & 2 deletions
@@ -56,13 +56,22 @@ tokenizer is around 17x faster than the original tokenizer and 9.6x faster than
 
 ![performance-comparison](data/performance-comparison.png)
 
-We compared also the multithreading/batch encoding performance using a [script](tools/test_tiktoken-huggingface-rwkv.py)
-which based on the [Huggingface Tokenizers](https://github.com/huggingface/tokenizers):
+We updated the Rust RWKV world tokenizer to support multithreading for batch encoding. We ran the same comparison
+[script](tools/test_tiktoken-huggingface-rwkv.py) from the [Huggingface Tokenizers](https://github.com/huggingface/tokenizers)
+with the additional rwkv tokenizer. The result shows that the rwkv world tokenizer is significantly faster than
+the Tiktoken and Huggingface tokenizers in all numbers of threads and document sizes (on average, its speed is ten times faster).
+
 ![performance-comparison](data/performance-comparison-multithreading.png)
 
 *The simple English Wikipedia dataset can be downloaded as jsonl file from
 https://huggingface.co/datasets/cahya/simple-wikipedia/resolve/main/simple-wikipedia.jsonl?download=true
 
+## Tools using this tokenizer
+
+We also created the [json2bin](https://github.com/cahya-wirawan/json2bin) application to convert datasets from JSONL format
+into binidx format, a data format used for training RWKV models. It supports batch encoding with multithreading and
+can convert a dataset more than 70 times faster than the original json2binidx program written in Python.
+
 ## Changelog
 - Version 0.9.0
 - Added multithreading for the function encode_batch()
