
Commit 45b56be

runk and Dmitry Shirokov authored
Strings (#103)
* Development snapshot
* Development snapshot
* Development snapshot

---------

Co-authored-by: Dmitry Shirokov <[email protected]>
1 parent cc03faf commit 45b56be


2 files changed, +14 -7 lines changed


README.md

Lines changed: 14 additions & 5 deletions
@@ -55,21 +55,30 @@ chardet.analyse(new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f]));
 
 ## Working with large data sets
 
-Sometimes, when data set is huge and you want to optimize performance (with a tradeoff of less accuracy),
+Sometimes, when data set is huge and you want to optimize performance (with a trade off of less accuracy),
 you can sample only the first N bytes of the buffer:
 
 ```javascript
-const encoding = await chardet
-  .detectFile('/path/to/file', { sampleSize: 32 });
+const encoding = await chardet.detectFile('/path/to/file', { sampleSize: 32 });
 ```
 
 You can also specify where to begin reading from in the buffer:
 
 ```javascript
-const encoding = await chardet
-  .detectFile('/path/to/file', { sampleSize: 32, offset: 128 });
+const encoding = await chardet.detectFile('/path/to/file', {
+  sampleSize: 32,
+  offset: 128,
+});
 ```
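
As a rough illustration of what the `sampleSize` and `offset` options in the README text select, the sketch below takes the same window from an in-memory buffer. It assumes the options behave as the README describes (read `sampleSize` bytes starting at `offset`). `sampleWindow` is a hypothetical helper written for this note, not part of chardet.

```javascript
// Hypothetical helper, not part of chardet: returns the window of bytes
// that a sampleSize/offset pair would select from an in-memory buffer.
function sampleWindow(bytes, { sampleSize, offset = 0 } = {}) {
  // Without a sample size, the whole buffer is used.
  if (sampleSize === undefined) return bytes;
  // subarray() creates a view over the same memory, so nothing is copied.
  return bytes.subarray(offset, offset + sampleSize);
}

// 256 bytes whose value equals their index, so the window is easy to verify.
const data = Uint8Array.from({ length: 256 }, (_, i) => i);
const sample = sampleWindow(data, { sampleSize: 32, offset: 128 });

console.log(sample.length, sample[0], sample[31]); // 32 128 159
```

The trade-off stated in the README applies here too: a 32-byte window is cheap to analyse but gives a detector far less evidence than the full buffer.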
 
+## Working with strings
+
+In both Node.js and browsers, all strings in memory are represented in UTF-16 encoding. This is a fundamental aspect of the JavaScript language specification. Therefore, you cannot use plain strings directly as input for `chardet.analyse()` or `chardet.detect()`. Instead, you need the original string data in the form of a Buffer or Uint8Array.
+
+In other words, if you receive a piece of data over the network and want to detect its encoding, use the original data payload, not its string representation. By the time you convert data to a string, it will be in UTF-16 encoding.
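
A minimal sketch of why the string form is not enough, assuming Node.js (it relies on the global `Buffer`). The byte values are illustrative and not taken from the commit. The same text stored in two different encodings decodes to identical strings, so nothing in the string reveals which encoding the payload arrived in.

```javascript
// "café" as raw bytes in two different encodings (illustrative values).
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);     // ISO-8859-1: "é" is 0xE9
const utf8Bytes = Buffer.from([0x63, 0x61, 0x66, 0xc3, 0xa9]); // UTF-8: "é" is 0xC3 0xA9

// Decoding each payload with its own encoding yields a JavaScript string.
const fromLatin1 = latin1Bytes.toString('latin1');
const fromUtf8 = utf8Bytes.toString('utf8');

// Both strings hold the same UTF-16 code units; the source encoding is gone.
console.log(fromLatin1 === fromUtf8);               // true
console.log(fromLatin1.charCodeAt(3).toString(16)); // e9
```

Only `latin1Bytes` and `utf8Bytes` still carry the information an encoding detector needs, which is why the raw payload should be kept for detection.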
+
+Note on [TextEncoder](https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder/TextEncoder): By default, it returns a UTF-8 encoded buffer, which means the buffer will not be in the original encoding of the string.
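
The `TextEncoder` caveat can be checked with a short round trip. This is a sketch assuming Node.js, where `Buffer` and `TextEncoder` are both globals, and the sample bytes are made up for illustration. Re-encoding a decoded string produces UTF-8 bytes, not the bytes that were originally received.

```javascript
// "café" as it might arrive in ISO-8859-1 (Latin-1): "é" is the single byte 0xE9.
const original = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// Decoding to a string discards the original byte layout.
const text = Buffer.from(original).toString('latin1');

// TextEncoder always emits UTF-8, so "é" becomes the two bytes 0xC3 0xA9.
const reencoded = new TextEncoder().encode(text);

console.log(original.join(' '));  // 99 97 102 233
console.log(reencoded.join(' ')); // 99 97 102 195 169
```

Running a detector on `reencoded` would therefore describe the output of `TextEncoder`, not the encoding of the data that was actually received.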
+
 ## Supported Encodings:
 
 - UTF-8

src/index.test.ts

Lines changed: 0 additions & 2 deletions
@@ -75,6 +75,4 @@ describe('chardet', () => {
 expect(matches).toEqual(expectedEncodingsFromPath);
 });
 });
-
-
 });
