back out support for cp850, it doesn't work well

Elia Robyn Lake · Elia Robyn Lake · commit 8b791317cab3 · 2024-10-10T17:11:36.000-04:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,7 +2,7 @@
 
 - Switched packaging from poetry to uv.
 - Uses modern Python packaging exclusively (no setup.py).
-- Added support for mojibake in Windows-1257 (Baltic) and codepage 850 (MS-DOS in Western Europe).
+- Added support for mojibake in Windows-1257 (Baltic).
 - Detects mojibake for "Ü" in an uppercase word, such as "ZURÜCK".
 - Expanded a heuristic that notices improbable punctuation.
 - Fixed a false positive involving two concatenated strings, one of which began with the § sign.
diff --git a/docs/encodings.rst b/docs/encodings.rst
@@ -14,8 +14,7 @@ ftfy can understand text that was decoded as any of these single-byte encodings:
 - Windows-1257 (cp1257 -- used in Microsoft products in Baltic countries)
 - ISO-8859-2 (which is not quite the same as Windows-1250)
 - MacRoman (used on Mac OS 9 and earlier)
-- cp437 (used in MS-DOS, and some versions of the Windows command prompt, in the Americas)
-- cp850 (used in MS-DOS, and some versions of the Windows command prompt, in Western Europe)
+- cp437 (it's the "text mode" in your video card firmware)
 
 when it was actually intended to be decoded as one of these variable-length encodings:
 
@@ -28,6 +27,8 @@ However, ftfy cannot understand other mixups between single-byte encodings, beca
 
 We also can't handle the legacy encodings used for Chinese, Japanese, and Korean, such as ``shift-jis`` and ``gb18030``.  See `issue #34`_ for why this is so hard.
 
+I tried adding support for cp850, the cp437-workalike that supported European languages, but I couldn't find any real examples that it fixed, and it introduced some false positives.
+
 .. _`issue #34`: https://github.com/rspeer/python-ftfy/issues/34
 
 Remember that the input to ftfy is Unicode, so it handles actual CJK *text* just fine. It just can't discover that a CJK *encoding* introduced mojibake into the text.
diff --git a/ftfy/chardata.py b/ftfy/chardata.py
@@ -23,7 +23,6 @@
     "iso-8859-2",
     "macroman",
     "cp437",
-    "cp850",
 ]
 
 SINGLE_QUOTE_RE = re.compile("[\u02bc\u2018-\u201b]")
diff --git a/notes/mysteries.txt b/notes/mysteries.txt
@@ -0,0 +1,10 @@
+on https://www.nipette.com/article-6358031.html, a comment is signed 'MÃ\x83Â©Ã\x82Â¬Ã\x82Â¡nie'.
+This happens to be triple-UTF-8 for 'M鬡nie', but that's probably not the name they meant.
+
+What exactly did https://www.horoskopy-horoskop.cz/clanek/431-numerologicky-vyznam-jmena-jaromir
+mean when they said 'TadeÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ¡ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ¡' ?
+
+https://mtlurb.com/tags/arbres/
+'montrã©al' probably isn't in cp850, but what is it?
+
+
diff --git a/tests/test_cases.json b/tests/test_cases.json
@@ -112,23 +112,17 @@
         "expect": "pass"
     },
     {
-        "label": "Synthetic: Messy language name in cp850: Czech",
+        "label": "Synthetic: Messy language name in cp437: Czech",
         "original": "─îe┼ítina",
         "fixed": "Čeština",
         "expect": "pass"
     },
     {
-        "label": "Synthetic: Messy language name in cp850: Vietnamese",
+        "label": "Synthetic: Messy language name in cp437: Vietnamese",
         "original": "Tiß║┐ng Viß╗çt",
         "fixed": "Tiếng Việt",
         "expect": "pass"
     },
-    {
-        "label": "Synthetic: Messy language name in cp850: Japanese",
-        "original": "µùÑµ£¼Þ¬×",
-        "fixed": "日本語",
-        "expect": "pass"
-    },
     {
         "label": "Low-codepoint emoji",
         "comment": "From the ancient era before widespread emoji support on Twitter",

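
The test-case changes above relabel the Czech and Vietnamese examples from cp850 to cp437 and drop the Japanese one. A minimal sketch of why that is consistent, using only Python's built-in codecs rather than ftfy itself: the helper names `make_mojibake` and `undo_mojibake` are hypothetical and exist only for this illustration. The sketch encodes the correct text as UTF-8 and decodes it with the wrong single-byte codec, which is the kind of mixup these test cases simulate.

```python
# Illustration only: these helpers are hypothetical and are not part of ftfy.

def make_mojibake(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def undo_mojibake(garbled: str, wrong_codec: str) -> str:
    """Reverse the mixup when the wrong codec is already known."""
    return garbled.encode(wrong_codec).decode("utf-8")


# The Czech and Vietnamese strings garble identically under both codepages,
# so the existing "original" values remain valid as cp437 test cases.
czech_437 = make_mojibake("Čeština", "cp437")
czech_850 = make_mojibake("Čeština", "cp850")
viet_437 = make_mojibake("Tiếng Việt", "cp437")
viet_850 = make_mojibake("Tiếng Việt", "cp850")

# The Japanese string only produces the removed "original" value under cp850,
# because cp437 and cp850 map some of its UTF-8 bytes to different characters.
japanese_437 = make_mojibake("日本語", "cp437")
japanese_850 = make_mojibake("日本語", "cp850")

print(czech_437 == czech_850)        # same garbling in both codepages
print(viet_437 == viet_850)          # same garbling in both codepages
print(japanese_437 == japanese_850)  # differs between the two codepages
print(undo_mojibake(czech_437, "cp437"))
```

Note that `undo_mojibake` needs to be told which codec was misapplied; ftfy's job is to detect that on its own, which this sketch does not attempt.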