Scan Report: All Open statements are lacking specification of argument encoding="utf-8" - which can cause the output methods to crash with UnicodeEncodeError for users running Windows OS #2083
Labels
bug
Something isn't working
Issue Type
Bug
Source
source
Giskard Library Version
2.16.0
OS Platform and Distribution
Windows 10, version 22H2
Python version
3.12.4
Installed python packages
removed as was far too long - can provide if needed
Current Behaviour?
On Windows, the ScanReport output methods such as `to_html()` are unreliable and can crash the script with a UnicodeEncodeError. Refer to the dump log below.
I came across this issue in late October but didn't know how to fix it then. Now I know exactly what the issue is and how you can fix it.
The root cause is in `giskard/scanner/report.py`: the `open` statements in all the output methods (`to_json`, `to_html`, `to_markdown`, `to_avid` and `generate_rails`) do not specify an encoding and therefore use the system default. On most systems this is "utf-8", but on Windows it defaults to "cp1252" (refer to `locale.getpreferredencoding()`), even though `sys.getfilesystemencoding()` returns "utf-8".
=> The solution is to add an explicit `encoding="utf-8"` argument to the `open` statements. There are 5 `open` statements in `report.py` and it's likely that all 5 need to be updated. I'm assuming that there will also be multiple other files that similarly have `open` statements without this argument. Please review all the code when fixing this issue.
Additionally, it is not enough to just add this argument for the `to_json()` method. The `json.dump` calls also need the `ensure_ascii=False` argument. Refer to my scripts below, which reproduce both issues using standard function calls and sample strings.
Additional info:
For the open statement problem, the issue is that the default encoding is not always "utf-8".
Refer to https://docs.python.org/3/library/locale.html#locale.getpreferredencoding and https://docs.python.org/3/glossary.html#term-locale-encoding, which say that on Windows the default locale encoding is "cp1252". Without the explicit encoding argument the code is not truly cross-platform.
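To make both points concrete, here is a minimal sketch. It is not taken from Giskard's code: the helper names `try_write` and `try_dump` and the sample string are made up for illustration, and the Windows default is simulated by passing `encoding="cp1252"` explicitly so the behaviour is the same on any platform.

```python
import json
import os
import tempfile

# Hypothetical sample text: U+2265 and U+2713 have no cp1252 representation.
SAMPLE = "score \u2265 0.5 \u2713"


def try_write(path, text, encoding):
    """Write text with the given encoding; return None on success, else the error name."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        return None
    except UnicodeEncodeError as exc:
        return type(exc).__name__


def try_dump(path, obj, encoding, ensure_ascii):
    """json.dump obj with the given settings; return None on success, else the error name."""
    try:
        with open(path, "w", encoding=encoding) as f:
            json.dump(obj, f, ensure_ascii=ensure_ascii)
        return None
    except UnicodeEncodeError as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    results = {
        # Simulated Windows default: plain text write fails.
        "write_cp1252": try_write(path, SAMPLE, "cp1252"),
        # Explicit UTF-8: plain text write succeeds.
        "write_utf8": try_write(path, SAMPLE, "utf-8"),
        # Default ensure_ascii=True escapes to \uXXXX, so cp1252 can hold it.
        "json_escaped_cp1252": try_dump(path, {"msg": SAMPLE}, "cp1252", True),
        # Raw characters need an encoding that can represent them.
        "json_raw_cp1252": try_dump(path, {"msg": SAMPLE}, "cp1252", False),
        "json_raw_utf8": try_dump(path, {"msg": SAMPLE}, "utf-8", False),
    }
    # The last successful dump used UTF-8, so read it back the same way.
    with open(path, encoding="utf-8") as f:
        round_trip = json.load(f)

print(results)
```

In this sketch, `ensure_ascii=False` only works together with an encoding that can represent the characters, which is why the two arguments are best added as a pair.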
A workaround exists while the code remains without explicit `encoding="utf-8"` arguments (though it does not cover the extra json issue): set `PYTHONUTF8=1` as an environment variable before Python starts. Refer to https://docs.python.org/3/library/os.html#utf8-mode
Standalone code OR list down the steps to reproduce the issue
This issue was reproduced today (on the latest version of Giskard) using the Quickstart script provided at https://github.com/Giskard-AI/giskard, executing sections 1 and 2 in a Jupyter Notebook session on Windows. The only thing I did differently was to add the following immediately before your script (as well as setting the OpenAI API key):
Example to demonstrate the open encoding='utf-8' issue:
this second one crashes with the following:
Example to demonstrate the additional issue for json output:
Running this last script produces an error dump on the last json.dump call.
The contents of each file are as follows:
JsonIssue_CurrentArgs.txt
JsonIssue_NoEnsure_asciiSpecified.txt
JsonIssue_InclEnsure_ascii.txt
JsonIssue_JustEnsure_ascii_ShortText.txt
JsonIssue_JustEnsure_ascii_FullText.txt
which also crashed with the following log:
Relevant log output