ASCII File Detected as ISO-8859-1 Due to Control Characters #109

Open
@mpytel-cksource

Description

When chardet detects the character set of a plain ASCII file that contains a newline (\n, ASCII 10) or other control characters, it incorrectly reports ISO-8859-1 instead of ASCII. The presence of control characters (ASCII 0-31, 127) seems to influence the detection process.

Since ASCII includes both printable characters (32-126) and control characters (0-31, 127), their presence should not change the classification to a different encoding such as ISO-8859-1.

Expected Behavior

Files containing only bytes 0-127 (including control characters like \n) should be correctly detected as ASCII, not ISO-8859-1.
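The expected rule can be sketched as a byte-range check. This is only an illustration of the behavior requested here, not chardet's actual implementation: `isPureAscii` and `detectWithAsciiCheck` are hypothetical names, and the fallback detector is supplied by the caller (for example, a function that calls chardet).

```javascript
// Hypothetical helper, not part of chardet: returns true when every byte
// is in the 7-bit range 0x00-0x7F, control characters included.
function isPureAscii(bytes) {
  for (const byte of bytes) {
    if (byte > 0x7f) return false;
  }
  return true;
}

// Hypothetical wrapper: report 'ASCII' for 7-bit-only input and otherwise
// defer to whatever detector the caller passes in.
function detectWithAsciiCheck(bytes, fallbackDetect) {
  return isPureAscii(bytes) ? 'ASCII' : fallbackDetect(bytes);
}

// The reproduction content from this issue, with a stand-in fallback
// that mimics the reported result.
const sample = new TextEncoder().encode('Hello, World!\nThis is a test.\n');
console.log(detectWithAsciiCheck(sample, () => 'ISO-8859-1')); // prints 'ASCII'
```

A check like this could also serve as a caller-side workaround until the detection itself is fixed, assuming the input is available as a byte array.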

Steps to Reproduce

  1. Create a file (ascii.txt) with the following content:
Hello, World!
This is a test.

(Ensure there is a newline at the end of the file.)

  2. Run the following script:
import chardet from 'chardet';

console.log(chardet.detectFileSync('ascii.txt')); // Expected: 'ASCII', but gets 'ISO-8859-1'
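To rule out stray non-ASCII bytes in the test file, the content from step 1 can be tallied by byte class. A minimal sketch, assuming the file holds exactly the two lines shown, each terminated by \n; `summarizeBytes` is a hypothetical helper and does not use chardet.

```javascript
// Hypothetical helper: encode the text as UTF-8 and count bytes per class.
function summarizeBytes(text) {
  const bytes = new TextEncoder().encode(text);
  let control = 0;
  let printable = 0;
  let nonAscii = 0;
  for (const byte of bytes) {
    if (byte > 0x7f) nonAscii++;                            // outside ASCII
    else if (byte < 0x20 || byte === 0x7f) control++;       // ASCII control
    else printable++;                                       // ASCII printable
  }
  return { total: bytes.length, control, printable, nonAscii };
}

// Assumed content of ascii.txt from step 1.
const fileContent = 'Hello, World!\nThis is a test.\n';
console.log(summarizeBytes(fileContent));
// { total: 30, control: 2, printable: 28, nonAscii: 0 }
```

The only control bytes are the two newlines, and no byte exceeds 127, so the input is pure ASCII.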
