Commit 8b79131

back out support for cp850, it doesn't work well

Author: Elia Robyn Lake
1 parent 1c185ef commit 8b79131

5 files changed: 16 additions, 12 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 - Switched packaging from poetry to uv.
 - Uses modern Python packaging exclusively (no setup.py).
-- Added support for mojibake in Windows-1257 (Baltic) and codepage 850 (MS-DOS in Western Europe).
+- Added support for mojibake in Windows-1257 (Baltic).
 - Detects mojibake for "Ü" in an uppercase word, such as "ZURÜCK".
 - Expanded a heuristic that notices improbable punctuation.
 - Fixed a false positive involving two concatenated strings, one of which began with the § sign.

docs/encodings.rst

Lines changed: 3 additions & 2 deletions
@@ -14,8 +14,7 @@ ftfy can understand text that was decoded as any of these single-byte encodings:
 - Windows-1257 (cp1257 -- used in Microsoft products in Baltic countries)
 - ISO-8859-2 (which is not quite the same as Windows-1250)
 - MacRoman (used on Mac OS 9 and earlier)
-- cp437 (used in MS-DOS, and some versions of the Windows command prompt, in the Americas)
-- cp850 (used in MS-DOS, and some versions of the Windows command prompt, in Western Europe)
+- cp437 (it's the "text mode" in your video card firmware)
 
 when it was actually intended to be decoded as one of these variable-length encodings:
 
@@ -28,6 +27,8 @@ However, ftfy cannot understand other mixups between single-byte encodings, beca
 
 We also can't handle the legacy encodings used for Chinese, Japanese, and Korean, such as ``shift-jis`` and ``gb18030``. See `issue #34`_ for why this is so hard.
 
+I tried adding support for cp850, the cp437-workalike that supported European languages, but I couldn't find any real examples that it fixed, and it introduced some false positives.
+
 .. _`issue #34`: https://github.com/rspeer/python-ftfy/issues/34
 
 Remember that the input to ftfy is Unicode, so it handles actual CJK *text* just fine. It just can't discover that a CJK *encoding* introduced mojibake into the text.
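
The kind of mix-up this documentation change describes (UTF-8 bytes decoded with a DOS code page) can be reproduced with Python's standard codecs. The following is only a minimal sketch, not ftfy's implementation: `garble` and `ungarble` are hypothetical helper names, and the sample strings are borrowed from the test labels in this commit. It shows that cp437 and cp850 produce identical mojibake for some inputs and different mojibake for others.

```python
# Hypothetical helpers for illustration only; they are not part of ftfy's API.

def garble(text: str, wrong_encoding: str) -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def ungarble(text: str, wrong_encoding: str) -> str:
    """Undo the mix-up, assuming we already know which codec was misused."""
    return text.encode(wrong_encoding).decode("utf-8")


czech = "Čeština"
japanese = "日本語"

# The UTF-8 bytes of the Czech sample map to the same characters in both
# code pages, so the garbled text cannot tell cp437 and cp850 apart.
same_for_czech = garble(czech, "cp437") == garble(czech, "cp850")

# The Japanese sample uses bytes on which the two code pages disagree.
differs_for_japanese = garble(japanese, "cp437") != garble(japanese, "cp850")

# Both codecs define all 256 byte values, so the round trip is lossless
# once the misused codec is known.
round_trip_ok = all(
    ungarble(garble(sample, codec), codec) == sample
    for sample in (czech, japanese)
    for codec in ("cp437", "cp850")
)

print(same_for_czech, differs_for_japanese, round_trip_ok)
```

Undoing the damage is trivial once the codec is known; the hard part, which the sketch deliberately skips, is deciding from the garbled text alone which codec (if any) was misused without introducing false positives.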

ftfy/chardata.py

Lines changed: 0 additions & 1 deletion
@@ -23,7 +23,6 @@
     "iso-8859-2",
     "macroman",
     "cp437",
-    "cp850",
 ]
 
 SINGLE_QUOTE_RE = re.compile("[\u02bc\u2018-\u201b]")

notes/mysteries.txt

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+on https://www.nipette.com/article-6358031.html, a comment is signed 'MÃ\x83©Ã\x82¬Ã\x82¡nie'.
+This happens to be triple-UTF-8 for 'M鬡nie', but that's probably not the name they meant.
+
+What exactly did https://www.horoskopy-horoskop.cz/clanek/431-numerologicky-vyznam-jmena-jaromir
+mean when they said 'TadeÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂáÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂá' ?
+
+https://mtlurb.com/tags/arbres/
+'montrã©al' probably isn't in cp850, but what is it?
+
+

tests/test_cases.json

Lines changed: 2 additions & 8 deletions
@@ -112,23 +112,17 @@
         "expect": "pass"
     },
     {
-        "label": "Synthetic: Messy language name in cp850: Czech",
+        "label": "Synthetic: Messy language name in cp437: Czech",
         "original": "Čeština",
         "fixed": "Čeština",
         "expect": "pass"
     },
     {
-        "label": "Synthetic: Messy language name in cp850: Vietnamese",
+        "label": "Synthetic: Messy language name in cp437: Vietnamese",
         "original": "Tiếng Việt",
         "fixed": "Tiếng Việt",
         "expect": "pass"
     },
-    {
-        "label": "Synthetic: Messy language name in cp850: Japanese",
-        "original": "µùѵ£¼Þ¬×",
-        "fixed": "日本語",
-        "expect": "pass"
-    },
     {
         "label": "Low-codepoint emoji",
         "comment": "From the ancient era before widespread emoji support on Twitter",
