-
-
Notifications
You must be signed in to change notification settings - Fork 74
license file gathering for non-utf8 files #868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
possible solution: auto-detect encoding like cyclonedx-python/cyclonedx_py/_internal/utils/io.py Lines 25 to 31 in b4c01c4
important: don't replace undecodable characters with other characters. keep the |
possible solution: do not read files as text, but as binary - and base64-encode the content in the CycloneDX. this would be a breaking change -- just a remark; not a stopper |
@schlenk this issue is for you :) |
After looking into the actual issue some more, i wonder, what the correct way forward is for such cases. https://peps.python.org/pep-0639/#add-license-file-field declares that:
So technically, the package metadata is invalid. This can happen, due to for example setuptools not validating the content it puts there, it sometimes just picks up some file that matches a regular expression, without specifying a license file in the pyproject.yaml. So one can have bogus or wrongly encoded license information in an installed package. The SBOM generation from the environment should list the installed package. I think the only sane way forward would be to encode the file as base64 and let downstream review it and hopefully communicate with its upstream to fix the incorrect metadata for the future. I would also declare the content-type as application/octet-stream as we do not know anything about the actual encoding. Auto-detection would not be my choice, as the PEP639 spec clearly states it MUST be UTF-8. The CycloneDX standard says:
What does that mean in this context? For example, my issue was a python file called 'license_checker.py' that was picked up by setuptools. But it could be malware or other malicious data too, which might have been injected by accident or on purpose. Do we need to sanitize things here? Should there be some property in the cdx:python namespace to flag such malformed entities for review or is an 'acknowledgment: declared' sufficient for this issue? |
nope. not needed. we dont collect arbitrary evidence here. we collect declared license texts, and we treat them as this: text files. my preferred solution: just add text encoding auto-detect when reading the text, and all is good. |
Ok, i will give it a try. |
Struggling a bit with some of the test data setup for an integration test, but have some things nearly working by now. Some questions:
|
I also filed a bug report with setuptools (pypa/setuptools#4936 ) so they will hopefully prevent bad files to be embedded in packages in the future. |
at the time pep639 was implemented in this very tool, the PEP was not finalized, so that most of the features were best-effort/experimental. see #843.
this is a feature that was not requested/implemented so far. see also #843.
If possible, I would simply use the existing function There is no need to solve all possible issues - just solve what you can and what is worth the effort 👍
why? the test is expected to be of mime-type
my expectation: just skip the broken one. log on level "DEBUG".
my expectation: just skip the broken one, use all the others. log on level "DEBUG".
my expectation: silently skip the thing. log on level "DEBUG". |
I've revisited parts of PEP630, and given the goal of this ticket - "don't crash when a license file is not UTF8" - I'd say:
if a license file is not readable, or not convertible to utf8, or whatever issue happens: this means: no need for base64 encoding the text, |
Yes, this would solve the ticket. My main goal is to not abort the SBOM generation due to one misdeclared license file, so that would fix it. |
feel free to open a PR. you could use |
How is it going? Do you need help? |
I do have some working code, just test failed to get the test cases all working properly. I'll upload some PR later today. |
sure, even an in-dev/early/draft PullRequest is welcome. |
Working tests are added. So i'd consider the PR ready. |
this was released via https://github.com/CycloneDX/cyclonedx-python/releases/tag/v6.0.0 |
Thanks for the help with getting this fixed. |
problem
gathering license files (PEP638) may detect files that are not UTF8.
python's default character encoding is UTF8. it causes an exception, when decoding fails.
we need resilience when collecting license texts.
REMARK FROM MAINTAINERS:
@jkowalleck: PEP638 clearly defines to expect UTF8 text only. This means, all text that is not proper UTF8 can be skipped, as it does not adhere to the spec.
The only exception are Windows systems, where the encoding might be off, since the default encoding of the OS/FS is non-UTF8 so that files are "migrated" when put to disc.
goal
don't crash when a license file is not UTF8.
expected outcome
except UnicodeDecodeError
) but dont happen in the first placebinary
via gitattributes-file might be required, to keep the intended non-UTF8 encoding)solution
being discussed in the comments
The text was updated successfully, but these errors were encountered: