
Scan Report: All Open statements are lacking specification of argument encoding="utf-8" - which can cause the output methods to crash with UnicodeEncodeError for users running Windows OS #2083


Closed
CAW-nz opened this issue Dec 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@CAW-nz

CAW-nz commented Dec 6, 2024

Issue Type

Bug

Source

source

Giskard Library Version

2.16.0

OS Platform and Distribution

Windows 10, version 22H2

Python version

3.12.4

Installed python packages

Removed, as the list was far too long; I can provide it if needed.

Current Behaviour?

On Windows, the ScanReport output methods such as to_html() are unreliable and can crash the script with a UnicodeEncodeError. Refer to the dump log below.

I came across this issue in late October but didn't know how to fix it at the time. Now I know exactly what the problem is and how it can be fixed.

The root cause is in giskard/scanner/report.py: the open statements in all the output methods (to_json, to_html, to_markdown, to_avid and generate_rails) do not specify a value for encoding, so they fall back to the system default. On most systems that default is "utf-8", but on Windows it is "cp1252" (see locale.getpreferredencoding()), even though sys.getfilesystemencoding() returns "utf-8".

=> The solution is to add an explicit encoding="utf-8" argument to the open statements. There are 5 open statements in report.py, and it is likely that all 5 need to be updated. I assume there are also other files with open statements that lack this argument, so please review all the code when fixing this issue.
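
For illustration, the change to to_html() would look roughly like this (the surrounding lines are reconstructed from the traceback further down; the same one-argument addition applies to the other four open statements):

# Sketch only - based on the ScanReport.to_html() code visible in the traceback below
if filename is not None:
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    return

return html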

Additionally, adding this argument alone is not enough for the to_json() method: the json.dump calls also need the ensure_ascii=False argument. See my scripts below, which reproduce both issues using standard function calls and sample strings.
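
To make the combined change concrete, here is a hedged sketch of what the library-side to_json() fix might look like (the method body and helper name below are assumed purely for illustration and are not the actual report.py code):

import json

def to_json(self, filename=None):            # signature assumed for illustration
    results = self._results_as_dict()        # hypothetical helper standing in for the real serialisation
    if filename is not None:
        with open(filename, "w", encoding="utf-8") as f:           # fix 1: explicit encoding
            json.dump(results, f, indent=4, ensure_ascii=False)    # fix 2: keep non-ASCII characters as-is
        return
    return json.dumps(results, indent=4, ensure_ascii=False)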


Additional info:
For the open statement problem, the issue is that the default encoding is not always "utf-8".
See https://docs.python.org/3/library/locale.html#locale.getpreferredencoding and https://docs.python.org/3/glossary.html#term-locale-encoding, which state that on Windows the default locale encoding is "cp1252". Without the explicit encoding argument, the code is not truly cross-platform.
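
A quick way to check what your own platform uses (both calls are in the standard library):

import locale
import sys

print(locale.getpreferredencoding())   # typically "cp1252" on Windows, "UTF-8" on most Linux/macOS systems
print(sys.getfilesystemencoding())     # "utf-8" even on Windows, which makes the cp1252 default easy to miss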

A workaround does exist (though not for the additional json issue) while the code lacks explicit encoding="utf-8" arguments: set PYTHONUTF8=1 as an environment variable before Python starts. See https://docs.python.org/3/library/os.html#utf8-mode
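
With PYTHONUTF8=1 set before the interpreter starts (e.g. set PYTHONUTF8=1 in the shell that launches Jupyter, or via the Windows environment variable settings), you can verify from inside Python that UTF-8 mode is active:

import locale
import sys

print(sys.flags.utf8_mode)              # 1 when UTF-8 mode is enabled
print(locale.getpreferredencoding())    # reports UTF-8 instead of cp1252 once UTF-8 mode is active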

Standalone code OR list down the steps to reproduce the issue

This issue was reproduced today (on the latest version of Giskard) simply by using the Quickstart script provided at https://github.com/Giskard-AI/giskard and executing sections 1 and 2 in a Jupyter Notebook session on Windows. The only thing I did differently was to add the following immediately before the script (as well as setting the OpenAI API key):

import giskard
giskard.llm.set_llm_model("gpt-4o-mini")
giskard.llm.set_embedding_model("text-embedding-3-small")

Example to demonstrate the open encoding='utf-8' issue:

# This script works
with open("OpenIssue_specified.txt", "w", encoding="utf-8") as f2:
   f2.write("This was written with encoding utf-8 specifically included.\n")
   f2.write("I'm seeing issues when a PDF has been read in and encoded, then the data is output with content such as this word 'finance'.\n")
   f2.write("NB: The word has a single 'character' fi instead of two characters fi.\n")
   f2.close()
# But this script gives unicode error:
#   UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 115: character maps to <undefined>
# (However it executes properly after PYTHONUTF8=1 was set before startup of the Python server)
with open("OpenIssue_notspecified.txt", "w") as f1:
   f1.write("This was written with no default specified.\n")
   f1.write("I'm seeing issues when a PDF has been read in and encoded, then the data is output with content such as this word 'finance'.\n")
   f1.write("NB: The word has a single 'character' fi instead of two characters fi.\n")
   f1.close()

This second one crashes with the following:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[6], line 6
      4 with open("OpenIssue_notspecified.txt", "w") as f1:
      5    f1.write("This was written with no default specified.\n")
----> 6    f1.write("I'm seeing issues when a PDF has been read in and encoded, then the data is output with content such as this word 'ﬁnance'.\n")
      7    f1.write("NB: The word has a single 'character' fi instead of two characters fi.\n")
      8    f1.close()

File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 115: character maps to <undefined>

Example to demonstrate the additional issue for json output:

import json
results = [{"text_string": "The temperature was 1.5°C", "test": "Pass", "extra_string": "Also testing the word 'ﬁnance'"}]

# Current script arguments
with open("JsonIssue_CurrentArgs.txt", "w") as json_file1:
   json.dump(results, json_file1, indent=4)
   json_file1.close()

# Added fix open(encoding="utf-8") but didn't add anything further.
# NB: This gives the same incorrectly encoded output as the previous file.
with open("JsonIssue_NoEnsure_asciiSpecified.txt", "w", encoding="utf-8") as json_file2:
   json.dump(results, json_file2, indent=4)
   json_file2.close()

# Added fix open(encoding="utf-8") AND json.dump(ensure_ascii=False)
# NB: This is the only combination that works properly!!
with open("JsonIssue_InclEnsure_ascii.txt", "w", encoding="utf-8") as json_file3:
   json.dump(results, json_file3, indent=4, ensure_ascii=False)
   json_file3.close()

# FYI the other combo - just to show ensure_ascii=False is not ok on its own
# NB: I first removed the 'ﬁ' character, otherwise this crashes with the same UnicodeEncodeError as highlighted for the other file formats when encoding="utf-8" is missing
# This simplified output string works...
results = [{"text_string": "The temperature was 1.5°C", "test": "Pass"}]
with open("JsonIssue_JustEnsure_ascii_ShortText.txt", "w") as json_file4:
   json.dump(results, json_file4, indent=4, ensure_ascii=False)
   json_file4.close()

# Reinstate the original results string.  This fails to execute...
results = [{"text_string": "The temperature was 1.5°C", "test": "Pass", "extra_string": "Also testing the word 'ﬁnance'"}]
with open("JsonIssue_JustEnsure_ascii_FullText.txt", "w") as json_file5:
   json.dump(results, json_file5, indent=4, ensure_ascii=False)
   json_file5.close()

Running this last script produces a traceback on the final json.dump call.

The contents of each file are as follows:
JsonIssue_CurrentArgs.txt

[
    {
        "text_string": "The temperature was 1.5\u00b0C",
        "test": "Pass",
        "extra_string": "Also testing the word '\ufb01nance'"
    }
]

JsonIssue_NoEnsure_asciiSpecified.txt

[
    {
        "text_string": "The temperature was 1.5\u00b0C",
        "test": "Pass",
        "extra_string": "Also testing the word '\ufb01nance'"
    }
]

JsonIssue_InclEnsure_ascii.txt

[
    {
        "text_string": "The temperature was 1.5°C",
        "test": "Pass",
        "extra_string": "Also testing the word 'finance'"
    }
]

JsonIssue_JustEnsure_ascii_ShortText.txt

[
    {
        "text_string": "The temperature was 1.5°C",
        "test": "Pass"
    }
]

JsonIssue_JustEnsure_ascii_FullText.txt

[
    {
        "text_string": "The temperature was 1.5°C",
        "test": "Pass",
        "extra_string": 

That last run also crashed, with the following log:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[6], line 4
      2 results = [{"text_string": "The temperature was 1.5°C", "test": "Pass", "extra_string": "Also testing the word 'ﬁnance'"}]
      3 with open("JsonIssue_JustEnsure_ascii_FullText.txt", "w") as json_file5:
----> 4    json.dump(results, json_file5, indent=4, ensure_ascii=False)
      5    json_file5.close()

File C:\ProgramData\anaconda3\Lib\json\__init__.py:180, in dump(obj, fp, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    177 # could accelerate with writelines in some versions of Python, at
    178 # a debuggability cost
    179 for chunk in iterable:
--> 180     fp.write(chunk)

File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 24: character maps to <undefined>

Relevant log output

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[9], line 2
      1 # Or save it to a file
----> 2 scan_results.to_html("scan_results.html")

File ~\AppData\Roaming\Python\Python312\site-packages\giskard\scanner\report.py:119, in ScanReport.to_html(self, filename, embed)
    117 if filename is not None:
    118     with open(filename, "w") as f:
--> 119         f.write(html)
    120     return
    122 return html

File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f512' in position 81530: character maps to <undefined>
@henchaves henchaves self-assigned this Dec 6, 2024
@henchaves henchaves added the bug Something isn't working label Dec 6, 2024
@henchaves
Member

Hey @CAW-nz! Thanks a lot for reporting this issue. It will be fixed and merged as soon as possible.
I will reference this issue when we open a PR solving it.

@henchaves
Member

henchaves commented Dec 12, 2024

Hello @CAW-nz, I've opened a PR (#2088) implementing the modifications you suggested. Could you install giskard from this branch and see if it works well in your environment now?

This is the command for installing it with pip:

pip install git+https://github.com/Giskard-AI/giskard.git@feature/gsk-4014-set-encoding-as-utf-8-for-all-open-statements#egg=giskard[llm]

Thanks!

@CAW-nz
Author

CAW-nz commented Dec 16, 2024

@henchaves thanks for making the changes. I've successfully installed the branch in my environment. FYI, I needed to uninstall giskard first: when I ran the pip install command you provided, it pulled the branch but didn't install it because it was flagged as the 'same' version I already had installed (2.16.0). After I uninstalled giskard and repeated your command, the branch installed successfully (and I could verify that my version then had the additional encoding="utf-8" parameters specific to your branch).
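
For anyone else verifying the branch, one quick (unofficial) way to check that the new argument is present is to inspect the installed source:

import inspect

from giskard.scanner.report import ScanReport

# Prints True if the to_html() open call now spells the argument exactly as encoding="utf-8"
# (adjust the search string if the branch uses single quotes)
print('encoding="utf-8"' in inspect.getsource(ScanReport.to_html))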

I can confirm that the Quickstart scripts (Sections 1 through 3) now run without generating any errors. It looks good from my perspective. Thanks very much.
