
Scan Report: All Open statements are lacking specification of argument encoding="utf-8" - which can cause the output methods to crash with UnicodeEncodeError for users running Windows OS #2083


Closed
CAW-nz opened this issue Dec 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@CAW-nz

CAW-nz commented Dec 6, 2024

Issue Type

Bug

Source

source

Giskard Library Version

2.16.0

OS Platform and Distribution

Windows 10, version 22H2

Python version

3.12.4

Installed python packages

Removed, as the list was far too long; I can provide it if needed.

Current Behaviour?

On Windows, the ScanReport output methods such as to_html() are unreliable and can crash the script with a UnicodeEncodeError. Refer to the dump log below.

I came across this issue in late October but didn't know how to fix it at the time. Now I know exactly what the problem is and how it can be fixed.

The root cause is in giskard/scanner/report.py: the open statements in all the output methods (to_json, to_html, to_markdown, to_avid and generate_rails) do not specify a value for encoding, so they fall back to the system default. On most systems that default is "utf-8", but on Windows it is "cp1252" (see locale.getpreferredencoding()), even though sys.getfilesystemencoding() returns "utf-8".

=> The solution is to add an explicit encoding="utf-8" argument to the open statements. There are 5 open statements in report.py, and it is likely that all 5 need to be updated. I assume there are also other files with open statements that lack this argument, so please review all the code when fixing this issue.
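
For illustration, the change to to_html() would look roughly like this (the surrounding lines are reconstructed from the traceback further down; the same one-argument addition applies to the other four open statements):

# Sketch only - based on the ScanReport.to_html() code visible in the traceback below
if filename is not None:
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    return

return html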

Additionally, adding this argument alone is not enough for the to_json() method: the json.dump calls also need the ensure_ascii=False argument. See my scripts below, which reproduce both issues using standard function calls and sample strings.
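
To make the combined change concrete, here is a hedged sketch of what the library-side to_json() fix might look like (the method body and helper name below are assumed purely for illustration and are not the actual report.py code):

import json

def to_json(self, filename=None):            # signature assumed for illustration
    results = self._results_as_dict()        # hypothetical helper standing in for the real serialisation
    if filename is not None:
        with open(filename, "w", encoding="utf-8") as f:           # fix 1: explicit encoding
            json.dump(results, f, indent=4, ensure_ascii=False)    # fix 2: keep non-ASCII characters as-is
        return
    return json.dumps(results, indent=4, ensure_ascii=False)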


Additional info:
For the open statement problem, the issue is that the default encoding is not always "utf-8".
See https://docs.python.org/3/library/locale.html#locale.getpreferredencoding and https://docs.python.org/3/glossary.html#term-locale-encoding, which state that on Windows the default locale encoding is "cp1252". Without the explicit encoding argument, the code is not truly cross-platform.
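
A quick way to check what your own platform uses (both calls are in the standard library):

import locale
import sys

print(locale.getpreferredencoding())   # typically "cp1252" on Windows, "UTF-8" on most Linux/macOS systems
print(sys.getfilesystemencoding())     # "utf-8" even on Windows, which makes the cp1252 default easy to miss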

A workaround does exist (though not for the additional json issue) while the code lacks explicit encoding="utf-8" arguments: set PYTHONUTF8=1 as an environment variable before Python starts. See https://docs.python.org/3/library/os.html#utf8-mode
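
With PYTHONUTF8=1 set before the interpreter starts (e.g. set PYTHONUTF8=1 in the shell that launches Jupyter, or via the Windows environment variable settings), you can verify from inside Python that UTF-8 mode is active:

import locale
import sys

print(sys.flags.utf8_mode)              # 1 when UTF-8 mode is enabled
print(locale.getpreferredencoding())    # reports UTF-8 instead of cp1252 once UTF-8 mode is active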

Standalone code OR list down the steps to reproduce the issue

This issue was reproduced today (on the latest version of Giskard) simply by using the Quickstart script provided at https://github.com/Giskard-AI/giskard and executing sections 1 and 2 in a Jupyter Notebook session on Windows. The only thing I did differently was to add the following immediately before the script (as well as setting the OpenAI API key):

import giskard
giskard.llm.set_llm_model("gpt-4o-mini")
giskard.llm.set_embedding_model("text-embedding-3-small")

Example to demonstrate the open encoding='utf-8' issue:

# This script works
with open("OpenIssue_specified.txt", "w", encoding="utf-8") as f2:
   f2.write("This was written with encoding utf-8 specifically included.\n")
   f2.write("I'm seeing issues when a PDF has been read in and encoded, then the data is output with content such as this word 'finance'.\n")
   f2.write("NB: The word has a single 'character' fi instead of two characters fi.\n")
   f2.close()
# But this script gives unicode error:
#   UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 115: character maps to <undefined>
# (However it executes properly after PYTHONUTF8=1 was set before startup of the Python server)
with open("OpenIssue_notspecified.txt", "w") as f1:
   f1.write("This was written with no default specified.\n")
   f1.write("I'm seeing issues when a PDF has been read in and encoded, then the data is output with content such as this word 'finance'.\n")
   f1.write("NB: The word has a single 'character' fi instead of two characters fi.\n")
   f1.close()

This second one crashes with the following:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[6], line 6
      4 with open("OpenIssue_notspecified.txt", "w") as f1:
      5    f1.write("This was written with no default specified.\n")
----> 6    f1.write("I'm seeing issues when a PDF has been read in and encoded, then the data is output with content such as this word 'ﬁnance'.\n")
      7    f1.write("NB: The word has a single 'character' fi instead of two characters fi.\n")
      8    f1.close()

File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 115: character maps to <undefined>

Example to demonstrate the additional issue for json output:

import json
results = [{"text_string": "The temperature was 1.5°C", "test": "Pass", "extra_string": "Also testing the word 'ﬁnance'"}]

# Current script arguments
with open("JsonIssue_CurrentArgs.txt", "w") as json_file1:
   json.dump(results, json_file1, indent=4)
   json_file1.close()

# Added fix open(encoding="utf-8") but didn't add anything further.
# NB: This gives the same incorrectly encoded output as the previous file.
with open("JsonIssue_NoEnsure_asciiSpecified.txt", "w", encoding="utf-8") as json_file2:
   json.dump(results, json_file2, indent=4)
   json_file2.close()

# Added fix open(encoding="utf-8") AND json.dump(ensure_ascii=False)
# NB: This is the only combination that works properly!!
with open("JsonIssue_InclEnsure_ascii.txt", "w", encoding="utf-8") as json_file3:
   json.dump(results, json_file3, indent=4, ensure_ascii=False)
   json_file3.close()

# FYI the other combo - just to show ensure_ascii=False is not ok on its own
# NB: I first removed the 'ﬁ' character, otherwise this crashes with the same UnicodeEncodeError as highlighted for the other file formats when encoding="utf-8" is missing
# This simplified output string works...
results = [{"text_string": "The temperature was 1.5°C", "test": "Pass"}]
with open("JsonIssue_JustEnsure_ascii_ShortText.txt", "w") as json_file4:
   json.dump(results, json_file4, indent=4, ensure_ascii=False)
   json_file4.close()

# Reinstate the original results string.  This fails to execute...
results = [{"text_string": "The temperature was 1.5°C", "test": "Pass", "extra_string": "Also testing the word 'ﬁnance'"}]
with open("JsonIssue_JustEnsure_ascii_FullText.txt", "w") as json_file5:
   json.dump(results, json_file5, indent=4, ensure_ascii=False)
   json_file5.close()

Running this last script produces a traceback on the final json.dump call.

The contents of each file are as follows:
JsonIssue_CurrentArgs.txt

[
    {
        "text_string": "The temperature was 1.5\u00b0C",
        "test": "Pass",
        "extra_string": "Also testing the word '\ufb01nance'"
    }
]

JsonIssue_NoEnsure_asciiSpecified.txt

[
    {
        "text_string": "The temperature was 1.5\u00b0C",
        "test": "Pass",
        "extra_string": "Also testing the word '\ufb01nance'"
    }
]

JsonIssue_InclEnsure_ascii.txt

[
    {
        "text_string": "The temperature was 1.5°C",
        "test": "Pass",
        "extra_string": "Also testing the word 'finance'"
    }
]

JsonIssue_JustEnsure_ascii_ShortText.txt

[
    {
        "text_string": "The temperature was 1.5°C",
        "test": "Pass"
    }
]

JsonIssue_JustEnsure_ascii_FullText.txt

[
    {
        "text_string": "The temperature was 1.5°C",
        "test": "Pass",
        "extra_string": 

That last run also crashed, with the following log:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[6], line 4
      2 results = [{"text_string": "The temperature was 1.5°C", "test": "Pass", "extra_string": "Also testing the word 'ﬁnance'"}]
      3 with open("JsonIssue_JustEnsure_ascii_FullText.txt", "w") as json_file5:
----> 4    json.dump(results, json_file5, indent=4, ensure_ascii=False)
      5    json_file5.close()

File C:\ProgramData\anaconda3\Lib\json\__init__.py:180, in dump(obj, fp, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    177 # could accelerate with writelines in some versions of Python, at
    178 # a debuggability cost
    179 for chunk in iterable:
--> 180     fp.write(chunk)

File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 24: character maps to <undefined>

Relevant log output

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[9], line 2
      1 # Or save it to a file
----> 2 scan_results.to_html("scan_results.html")

File ~\AppData\Roaming\Python\Python312\site-packages\giskard\scanner\report.py:119, in ScanReport.to_html(self, filename, embed)
    117 if filename is not None:
    118     with open(filename, "w") as f:
--> 119         f.write(html)
    120     return
    122 return html

File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f512' in position 81530: character maps to <undefined>
@henchaves henchaves self-assigned this Dec 6, 2024
@henchaves henchaves added the bug Something isn't working label Dec 6, 2024
@henchaves
Member

Hey @CAW-nz! Thanks a lot for reporting this issue. It will be fixed and merged as soon as possible.
I will reference this issue when we open a PR solving it.

@henchaves
Member

henchaves commented Dec 12, 2024

Hello @CAW-nz, I've opened a PR (#2088) implementing the modifications you suggested. Could you install giskard from this branch and see if it works well in your environment now?

This is the command for installing it with pip:

pip install git+https://github.com/Giskard-AI/giskard.git@feature/gsk-4014-set-encoding-as-utf-8-for-all-open-statements#egg=giskard[llm]

Thanks!

@CAW-nz
Author

CAW-nz commented Dec 16, 2024

@henchaves thanks for making the changes. I've successfully installed the branch in my environment. FYI, I needed to uninstall giskard first: when I ran the pip install command you provided, it pulled the branch but didn't install it because it was flagged as the 'same' version I already had installed (2.16.0). After I uninstalled giskard and repeated your command, the branch installed successfully (and I could verify that my version then had the additional encoding="utf-8" parameters specific to your branch).
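
For anyone else verifying the branch, one quick (unofficial) way to check that the new argument is present is to inspect the installed source:

import inspect

from giskard.scanner.report import ScanReport

# Prints True if the to_html() open call now spells the argument exactly as encoding="utf-8"
# (adjust the search string if the branch uses single quotes)
print('encoding="utf-8"' in inspect.getsource(ScanReport.to_html))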

I can confirm that the Quickstart scripts (Sections 1 through 3) now run without generating any errors. It looks good from my perspective. Thanks very much.
