Skip to content

license file gathering for non-utf8 files #868

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jkowalleck opened this issue Apr 2, 2025 · 18 comments · Fixed by #884
Closed

license file gathering for non-utf8 files #868

jkowalleck opened this issue Apr 2, 2025 · 18 comments · Fixed by #884
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@jkowalleck
Copy link
Member

jkowalleck commented Apr 2, 2025

problem

gathering license files (PEP638) may detect files that are not UTF8.
python's default character encoding is UTF8. it causes an exception, when decoding fails.

we need resilience when collecting license texts.

REMARK FROM MAINTAINERS:

@jkowalleck: PEP638 clearly defines to expect UTF8 text only. This means, all text that is not proper UTF8 can be skipped, as it does not adhere to the spec.
The only exception are Windows systems, where the encoding might be off, since the default encoding of the OS/FS is non-UTF8 so that files are "migrated" when put to disc.

goal

don't crash when a license file is not UTF8.

expected outcome

  • encoding issues are not ignored (except UnicodeDecodeError) but dont happen in the first place
  • have tests in place. (set test files to binary via gitattributes-file might be required, to keep the intended non-UTF8 encoding)

solution

being discussed in the comments

@jkowalleck
Copy link
Member Author

jkowalleck commented Apr 2, 2025

possible solution: auto-detect encoding like

def io2str(io: BinaryIO, *, errors: str = 'strict') -> str:
data = io.read()
# see https://docs.python.org/3/library/codecs.html#standard-encodings
encoding = (chardetect(data)['encoding'] or getdefaultencoding()).replace(
# replace Windows-encoding with code-page
'Windows-', 'cp')
return data.decode(encoding, errors)

important: don't replace undecodable characters with other characters. keep the errors to strict.

@jkowalleck
Copy link
Member Author

jkowalleck commented Apr 2, 2025

possible solution: do not read files as text, but as binary - and base64-encode the content in the CycloneDX.
probably the most simple way, which shifts issues right (downstream)

this would be a breaking change -- just a remark; not a stopper

@jkowalleck jkowalleck added the bug Something isn't working label Apr 2, 2025
@jkowalleck
Copy link
Member Author

@schlenk this issue is for you :)

@jkowalleck jkowalleck added the help wanted Extra attention is needed label Apr 2, 2025
@schlenk
Copy link
Contributor

schlenk commented Apr 2, 2025

After looking into the actual issue some more, i wonder, what the correct way forward is for such cases.

https://peps.python.org/pep-0639/#add-license-file-field declares that:

License file content MUST be UTF-8 encoded text.

So technically, the package metadata is invalid. This can happen, due to for example setuptools not validating the content it puts there, it sometimes just picks up some file that matches a regular expression, without specifying a license file in the pyproject.yaml.
It seems there are some other issues with broken metadata for PEP639 as well (e.g. pypa/setuptools#4759) but the package is still installable.

So one can have bogus or wrongly encoded license information in an installed package. The SBOM generation from the environment should list the installed package.

I think the only sane way forward would be to encode the file as base64 and let downstream review it and hopefully communicate with its upstream to fix the incorrect metadata for the future. I would also declare the content-type as application/octet-stream as we do not know anything about the actual encoding. Auto-detection would not be my choice, as the PEP639 spec clearly states it MUST be UTF-8.

The CycloneDX standard says:

The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text.

What does that mean in this context? For example, my issue was a python file called 'license_checker.py' that was picked up by setuptools. But it could be malware or other malicious data too, which might have been injected by accident or on purpose. Do we need to sanitize things here?

Should there be some property in the cdx:python namespace to flag such malformed entities for review or is an 'acknowledgment: declared' sufficient for this issue?

@jkowalleck
Copy link
Member Author

Do we need to sanitize things here?

nope. not needed.

we dont collect arbitrary evidence here. we collect declared license texts, and we treat them as this: text files.
we do not care for "malware" or anything, dont overthink this to much. this is text - nothing to be evaluated, and XML/JSON serializers will properly escape critical characters.

my preferred solution: just add text encoding auto-detect when reading the text, and all is good.
this is PEP639 - we expect text, and we try to get text. if this fails, just crash so the user knowns something is not according to spec.
do not put the burden of decoding binary on downstream tooling.

@schlenk
Copy link
Contributor

schlenk commented Apr 2, 2025

Ok, i will give it a try.

@schlenk
Copy link
Contributor

schlenk commented Apr 4, 2025

Struggling a bit with some of the test data setup for an integration test, but have some things nearly working by now.

Some questions:

  • Is it by design, that the Project Root-Component does not use the --pep639 flag and omits its own License-File contents in the output? So --gather-license-files fails to capture a license file for the root component, which complicated the test setup a bit, as i needed two packages with a dependency relation.
  • It seems chardet is pretty bad in detecting common (windows) encodings like 'UTF-16', which is usually trivial to detect. So there might not be all that much left to auto-detect (at least not with chardet, Linux file handles that file just fine). I will try with some other encodings to see if its worth the trouble.
  • I can stuff a License.rtf file into a License-Files entry, because its technically 7-bit ASCII, so utf-8 transparent. That gets encoded as application/msword (technically it should be application/rtf or text/rtf according to Wikipedia). This gets base64 encoded in the output, which is probably fine. But might weaken the argument about binary file handling downstream, as the code already pipes UTF-8 files that have non text/ mimetypes through Base64.
  • What is the proper way to handle the error when no fallback is possible? Some specific exception or just throw a RuntimeError or ValueError or UnicodeDecodeError?
  • What should happen if multiple license files are specified for a component and just one is totally broken?
  • What should happen if a license file is specified but is is empty/non existing? That case currently just skips the file without error.

@schlenk
Copy link
Contributor

schlenk commented Apr 4, 2025

I also filed a bug report with setuptools (pypa/setuptools#4936 ) so they will hopefully prevent bad files to be embedded in packages in the future.

@jkowalleck
Copy link
Member Author

jkowalleck commented Apr 5, 2025

at the time pep639 was implemented in this very tool, the PEP was not finalized, so that most of the features were best-effort/experimental. see #843.


Some questions:

  • Is it by design, that the Project Root-Component does not use the --pep639 flag and omits its own License-File contents in the output? So --gather-license-files fails to capture a license file for the root component, which complicated the test setup a bit, as i needed two packages with a dependency relation.

this is a feature that was not requested/implemented so far. see also #843.

  • It seems chardet is pretty bad in detecting common (windows) encodings like 'UTF-16', which is usually trivial to detect. So there might not be all that much left to auto-detect (at least not with chardet, Linux file handles that file just fine). I will try with some other encodings to see if its worth the trouble.

If possible, I would simply use the existing function io2str() and hope for the best - if it crashes, then the license file is skipped.
PEP clearly defines to expect UTF8 text only - all "features" this very ticket tries to solve are good-will.
If the use of io2str() is not possible for technical reasons, then just skip the license file and print a debug message.

There is no need to solve all possible issues - just solve what you can and what is worth the effort 👍

  • I can stuff a License.rtf file into a License-Files entry, because its technically 7-bit ASCII, so utf-8 transparent. That gets encoded as application/msword (technically it should be application/rtf or text/rtf according to Wikipedia). This gets base64 encoded in the output, which is probably fine. But might weaken the argument about binary file handling downstream, as the code already pipes UTF-8 files that have non text/ mimetypes through Base64.

This gets base64 encoded in the output [...]

why? the test is expected to be of mime-type text/* - so no need to base64encode, right?

  • What is the proper way to handle the error when no fallback is possible? Some specific exception or just throw a RuntimeError or ValueError or UnicodeDecodeError?

my expectation: just skip the broken one. log on level "DEBUG".

  • What should happen if multiple license files are specified for a component and just one is totally broken?

my expectation: just skip the broken one, use all the others. log on level "DEBUG".

  • What should happen if a license file is specified but is is empty/non existing? That case currently just skips the file without error.

my expectation: silently skip the thing. log on level "DEBUG".

@jkowalleck
Copy link
Member Author

jkowalleck commented Apr 5, 2025

I've revisited parts of PEP630, and given the goal of this ticket - "don't crash when a license file is not UTF8" - I'd say:

  • all license files are expected to be text
  • encoding is expected to be UTF8 - we might detect the encoding on best effort

if a license file is not readable, or not convertible to utf8, or whatever issue happens:
just skip the file and log a debug message.
this would clearly solve this very thicket, right?

this means: no need for base64 encoding the text,
no need to go extra-miles to detect uncommon encodings, or convert back and forth.

@schlenk
Copy link
Contributor

schlenk commented Apr 6, 2025

Yes, this would solve the ticket.

My main goal is to not abort the SBOM generation due to one misdeclared license file, so that would fix it.

@jkowalleck
Copy link
Member Author

feel free to open a PR.

you could use locate_file() to craft the file paths, and try to read it with our own io2str().
see https://importlib-metadata.readthedocs.io/en/latest/api.html#importlib_metadata.Distribution.locate_file

@jkowalleck
Copy link
Member Author

Yes, this would solve the ticket.

My main goal is to not abort the SBOM generation due to one misdeclared license file, so that would fix it.

How is it going? Do you need help?

@schlenk
Copy link
Contributor

schlenk commented Apr 23, 2025

I do have some working code, just test failed to get the test cases all working properly. I'll upload some PR later today.

@jkowalleck
Copy link
Member Author

sure, even an in-dev/early/draft PullRequest is welcome.

@schlenk
Copy link
Contributor

schlenk commented Apr 23, 2025

Working tests are added. So i'd consider the PR ready.

@jkowalleck
Copy link
Member Author

@schlenk
Copy link
Contributor

schlenk commented Apr 24, 2025

Thanks for the help with getting this fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants