license file gathering for non-utf8 files #868

jkowalleck · 2025-04-02T07:39:08Z

problem

gathering license files (PEP638) may detect files that are not UTF8.
python's default character encoding is UTF8. it causes an exception, when decoding fails.

we need resilience when collecting license texts.

REMARK FROM MAINTAINERS:

@jkowalleck: PEP638 clearly defines to expect UTF8 text only. This means, all text that is not proper UTF8 can be skipped, as it does not adhere to the spec.
The only exception are Windows systems, where the encoding might be off, since the default encoding of the OS/FS is non-UTF8 so that files are "migrated" when put to disc.

see feat: finalize and enable PEP-639 der default #843

goal

don't crash when a license file is not UTF8.

expected outcome

encoding issues are not ignored (except UnicodeDecodeError) but dont happen in the first place
have tests in place. (set test files to binary via gitattributes-file might be required, to keep the intended non-UTF8 encoding)

solution

being discussed in the comments

The text was updated successfully, but these errors were encountered:

jkowalleck · 2025-04-02T07:40:58Z

possible solution: auto-detect encoding like

cyclonedx-python/cyclonedx_py/_internal/utils/io.py

Lines 25 to 31 in b4c01c4

    
           def io2str(io: BinaryIO, *, errors: str = 'strict') -> str: 
        
               data = io.read() 
        
               # see https://docs.python.org/3/library/codecs.html#standard-encodings 
        
               encoding = (chardetect(data)['encoding'] or getdefaultencoding()).replace( 
        
                   # replace Windows-encoding with code-page 
        
                   'Windows-', 'cp') 
        
               return data.decode(encoding, errors)

important: don't replace undecodable characters with other characters. keep the errors to strict.

jkowalleck · 2025-04-02T07:44:16Z

possible solution: do not read files as text, but as binary - and base64-encode the content in the CycloneDX.
probably the most simple way, which shifts issues right (downstream)

this would be a breaking change -- just a remark; not a stopper

jkowalleck · 2025-04-02T07:49:49Z

@schlenk this issue is for you :)

schlenk · 2025-04-02T08:37:32Z

After looking into the actual issue some more, i wonder, what the correct way forward is for such cases.

https://peps.python.org/pep-0639/#add-license-file-field declares that:

License file content MUST be UTF-8 encoded text.

So technically, the package metadata is invalid. This can happen, due to for example setuptools not validating the content it puts there, it sometimes just picks up some file that matches a regular expression, without specifying a license file in the pyproject.yaml.
It seems there are some other issues with broken metadata for PEP639 as well (e.g. pypa/setuptools#4759) but the package is still installable.

So one can have bogus or wrongly encoded license information in an installed package. The SBOM generation from the environment should list the installed package.

I think the only sane way forward would be to encode the file as base64 and let downstream review it and hopefully communicate with its upstream to fix the incorrect metadata for the future. I would also declare the content-type as application/octet-stream as we do not know anything about the actual encoding. Auto-detection would not be my choice, as the PEP639 spec clearly states it MUST be UTF-8.

The CycloneDX standard says:

The attachment data. Proactive controls such as input validation and sanitization should be employed to prevent misuse of attachment text.

What does that mean in this context? For example, my issue was a python file called 'license_checker.py' that was picked up by setuptools. But it could be malware or other malicious data too, which might have been injected by accident or on purpose. Do we need to sanitize things here?

Should there be some property in the cdx:python namespace to flag such malformed entities for review or is an 'acknowledgment: declared' sufficient for this issue?

jkowalleck · 2025-04-02T15:15:12Z

Do we need to sanitize things here?

nope. not needed.

we dont collect arbitrary evidence here. we collect declared license texts, and we treat them as this: text files.
we do not care for "malware" or anything, dont overthink this to much. this is text - nothing to be evaluated, and XML/JSON serializers will properly escape critical characters.

my preferred solution: just add text encoding auto-detect when reading the text, and all is good.
this is PEP639 - we expect text, and we try to get text. if this fails, just crash so the user knowns something is not according to spec.
do not put the burden of decoding binary on downstream tooling.

schlenk · 2025-04-02T15:17:48Z

Ok, i will give it a try.

schlenk · 2025-04-04T16:15:03Z

Struggling a bit with some of the test data setup for an integration test, but have some things nearly working by now.

Some questions:

Is it by design, that the Project Root-Component does not use the --pep639 flag and omits its own License-File contents in the output? So --gather-license-files fails to capture a license file for the root component, which complicated the test setup a bit, as i needed two packages with a dependency relation.
It seems chardet is pretty bad in detecting common (windows) encodings like 'UTF-16', which is usually trivial to detect. So there might not be all that much left to auto-detect (at least not with chardet, Linux file handles that file just fine). I will try with some other encodings to see if its worth the trouble.
I can stuff a License.rtf file into a License-Files entry, because its technically 7-bit ASCII, so utf-8 transparent. That gets encoded as application/msword (technically it should be application/rtf or text/rtf according to Wikipedia). This gets base64 encoded in the output, which is probably fine. But might weaken the argument about binary file handling downstream, as the code already pipes UTF-8 files that have non text/ mimetypes through Base64.
What is the proper way to handle the error when no fallback is possible? Some specific exception or just throw a RuntimeError or ValueError or UnicodeDecodeError?
What should happen if multiple license files are specified for a component and just one is totally broken?
What should happen if a license file is specified but is is empty/non existing? That case currently just skips the file without error.

schlenk · 2025-04-04T16:17:48Z

I also filed a bug report with setuptools (pypa/setuptools#4936 ) so they will hopefully prevent bad files to be embedded in packages in the future.

jkowalleck · 2025-04-05T16:41:02Z

at the time pep639 was implemented in this very tool, the PEP was not finalized, so that most of the features were best-effort/experimental. see #843.

Some questions:

Is it by design, that the Project Root-Component does not use the --pep639 flag and omits its own License-File contents in the output? So --gather-license-files fails to capture a license file for the root component, which complicated the test setup a bit, as i needed two packages with a dependency relation.

this is a feature that was not requested/implemented so far. see also #843.

It seems chardet is pretty bad in detecting common (windows) encodings like 'UTF-16', which is usually trivial to detect. So there might not be all that much left to auto-detect (at least not with chardet, Linux file handles that file just fine). I will try with some other encodings to see if its worth the trouble.

If possible, I would simply use the existing function io2str() and hope for the best - if it crashes, then the license file is skipped.
PEP clearly defines to expect UTF8 text only - all "features" this very ticket tries to solve are good-will.
If the use of io2str() is not possible for technical reasons, then just skip the license file and print a debug message.

There is no need to solve all possible issues - just solve what you can and what is worth the effort 👍

I can stuff a License.rtf file into a License-Files entry, because its technically 7-bit ASCII, so utf-8 transparent. That gets encoded as application/msword (technically it should be application/rtf or text/rtf according to Wikipedia). This gets base64 encoded in the output, which is probably fine. But might weaken the argument about binary file handling downstream, as the code already pipes UTF-8 files that have non text/ mimetypes through Base64.

This gets base64 encoded in the output [...]

why? the test is expected to be of mime-type text/* - so no need to base64encode, right?

What is the proper way to handle the error when no fallback is possible? Some specific exception or just throw a RuntimeError or ValueError or UnicodeDecodeError?

my expectation: just skip the broken one. log on level "DEBUG".

What should happen if multiple license files are specified for a component and just one is totally broken?

my expectation: just skip the broken one, use all the others. log on level "DEBUG".

What should happen if a license file is specified but is is empty/non existing? That case currently just skips the file without error.

my expectation: silently skip the thing. log on level "DEBUG".

jkowalleck · 2025-04-05T16:50:02Z

I've revisited parts of PEP630, and given the goal of this ticket - "don't crash when a license file is not UTF8" - I'd say:

all license files are expected to be text
encoding is expected to be UTF8 - we might detect the encoding on best effort

if a license file is not readable, or not convertible to utf8, or whatever issue happens:
just skip the file and log a debug message.
this would clearly solve this very thicket, right?

this means: no need for base64 encoding the text,
no need to go extra-miles to detect uncommon encodings, or convert back and forth.

schlenk · 2025-04-06T13:33:29Z

Yes, this would solve the ticket.

My main goal is to not abort the SBOM generation due to one misdeclared license file, so that would fix it.

jkowalleck · 2025-04-08T06:59:16Z

feel free to open a PR.

you could use locate_file() to craft the file paths, and try to read it with our own io2str().
see https://importlib-metadata.readthedocs.io/en/latest/api.html#importlib_metadata.Distribution.locate_file

jkowalleck · 2025-04-23T10:36:44Z

Yes, this would solve the ticket.

My main goal is to not abort the SBOM generation due to one misdeclared license file, so that would fix it.

How is it going? Do you need help?

schlenk · 2025-04-23T11:05:46Z

I do have some working code, just test failed to get the test cases all working properly. I'll upload some PR later today.

jkowalleck · 2025-04-23T11:09:51Z

sure, even an in-dev/early/draft PullRequest is welcome.

schlenk · 2025-04-23T17:15:44Z

Working tests are added. So i'd consider the PR ready.

jkowalleck · 2025-04-24T15:44:24Z

this was released via https://github.com/CycloneDX/cyclonedx-python/releases/tag/v6.0.0

schlenk · 2025-04-24T16:16:14Z

Thanks for the help with getting this fixed.

jkowalleck mentioned this issue Apr 2, 2025

Continue with PEP639 gathering even when encountering encoding errors. #867

Closed

jkowalleck added the bug Something isn't working label Apr 2, 2025

jkowalleck added the help wanted Extra attention is needed label Apr 2, 2025

schlenk mentioned this issue Apr 23, 2025

Handle misencoded license text files graceful. #884

Merged

jkowalleck linked a pull request Apr 23, 2025 that will close this issue

Handle misencoded license text files graceful. #884

Merged

jkowalleck closed this as completed in #884 Apr 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

license file gathering for non-utf8 files #868

license file gathering for non-utf8 files #868

jkowalleck commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 2, 2025

schlenk commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 2, 2025

schlenk commented Apr 2, 2025

schlenk commented Apr 4, 2025

schlenk commented Apr 4, 2025

jkowalleck commented Apr 5, 2025 •

edited

Loading

jkowalleck commented Apr 5, 2025 •

edited

Loading

schlenk commented Apr 6, 2025

jkowalleck commented Apr 8, 2025

jkowalleck commented Apr 23, 2025

schlenk commented Apr 23, 2025

jkowalleck commented Apr 23, 2025

schlenk commented Apr 23, 2025

jkowalleck commented Apr 24, 2025

schlenk commented Apr 24, 2025

license file gathering for non-utf8 files #868

license file gathering for non-utf8 files #868

Comments

jkowalleck commented Apr 2, 2025 • edited Loading

problem

REMARK FROM MAINTAINERS:

goal

expected outcome

solution

jkowalleck commented Apr 2, 2025 • edited Loading

jkowalleck commented Apr 2, 2025 • edited Loading

jkowalleck commented Apr 2, 2025

schlenk commented Apr 2, 2025 • edited Loading

jkowalleck commented Apr 2, 2025

schlenk commented Apr 2, 2025

schlenk commented Apr 4, 2025

schlenk commented Apr 4, 2025

jkowalleck commented Apr 5, 2025 • edited Loading

jkowalleck commented Apr 5, 2025 • edited Loading

schlenk commented Apr 6, 2025

jkowalleck commented Apr 8, 2025

jkowalleck commented Apr 23, 2025

schlenk commented Apr 23, 2025

jkowalleck commented Apr 23, 2025

schlenk commented Apr 23, 2025

jkowalleck commented Apr 24, 2025

schlenk commented Apr 24, 2025

jkowalleck commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 2, 2025 •

edited

Loading

schlenk commented Apr 2, 2025 •

edited

Loading

jkowalleck commented Apr 5, 2025 •

edited

Loading

jkowalleck commented Apr 5, 2025 •

edited

Loading