feat: Improve matching for language parsers (avoid name collisions, use purl)

I've now hit two cases where `find_vendor` is finding a product with the same name but different version numbers:

- https://github.com/intel/cve-bin-tool/issues/3179
- https://github.com/intel/cve-bin-tool/issues/3152

I think it's time for us to build in some de-duplication in cases like this where we're clearly generating false positives for folk.

Since these are currently all coming from the language parsers, I think the logical place to start is in extending the language parser's `find_version()` function, found here:

https://github.com/intel/cve-bin-tool/blob/main/cve_bin_tool/parsers/__init__.py

Right now, it uses cvedb's `get_vendor_product_pairs` to search for a product name and return all matches. That works pretty well in a lot of cases, and people can always mark the ones that aren't correct as false positives using triage. But that's a pain, and in both these cases we know that we're not finding the right thing because we know we're looking for a Python package. So it would be really nice if we could have `find_version()` say "look, here's a list of known duplicate product names, let's discard them before the user even sees them."

I'm imagining a file per language, so you'd have a set of files like `python-dedupe.json` and `rust-dedupe.json`, each with different entries. (I don't love those filenames, but something that includes the language and is in a human-editable format would be good. JSON is probably the best balance of human-editable and machine-readable for our user base.)

An entry is going to need the following data:
- the "productname" that we'd expect the language parser to find
- a list of {vendor, product} pairs that are NOT this package (so they should be discarded if `get_vendor_product_pairs` finds them, as in the issues linked at the top of this post)
- a list of {vendor, product} pairs that ARE this package.  (not needed for the bugs above but I can easily imagine cases where this might be useful, especially given that many of our binary checkers have multiple `{vendor, product}` pairs associated with them)
- optionally, the language/parser being used (e.g. "python"), which we might need if we're eventually going to load these all into a big shared/searchable table (though that could be added during loading)

Presumably we'd have some entries with only NOT lists and some with only ARE lists, so we wouldn't require both.
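
To make the entry shape concrete, here is a rough sketch of what one of these files and a loader for it might look like. Everything in it is an assumption for discussion: the key names (`productname`, `not_pairs`, `are_pairs`), the sample vendor/product values, and the `load_dedupe_entries` helper do not exist in cve-bin-tool today.

```python
import json

# Hypothetical contents of a file such as python-dedupe.json.
# Key names and vendor/product values are illustrative placeholders only.
SAMPLE_DEDUPE_JSON = """
[
  {
    "productname": "examplepkg",
    "not_pairs": [{"vendor": "unrelated_vendor", "product": "examplepkg"}]
  },
  {
    "productname": "otherpkg",
    "are_pairs": [
      {"vendor": "otherpkg_project", "product": "otherpkg"},
      {"vendor": "python", "product": "otherpkg"}
    ]
  }
]
"""


def load_dedupe_entries(text, language):
    """Parse dedupe entries and index them by the parser's product name.

    The language is attached at load time rather than stored in the file.
    An entry may have a NOT list, an ARE list, or both, but not neither.
    """
    entries = {}
    for raw in json.loads(text):
        not_pairs = {(p["vendor"], p["product"]) for p in raw.get("not_pairs", [])}
        are_pairs = {(p["vendor"], p["product"]) for p in raw.get("are_pairs", [])}
        if not not_pairs and not are_pairs:
            raise ValueError(
                f"entry {raw['productname']!r} needs a NOT list or an ARE list"
            )
        entries[raw["productname"]] = {
            "language": language,
            "not_pairs": not_pairs,
            "are_pairs": are_pairs,
        }
    return entries


python_entries = load_dedupe_entries(SAMPLE_DEDUPE_JSON, "python")
```

Storing the pairs as sets of `(vendor, product)` tuples keeps the later membership checks cheap, but a db table keyed on language and product name would serve the same purpose.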

You'd load this structure somewhere (right into the db for easy lookup? I don't think we want to load and parse it on every `find_version` call) and use it to filter what `find_version()` then returns.
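
As a sketch of the filtering step, assuming the dedupe data has already been loaded into a dict keyed by product name: the `filter_pairs` helper, the dict layout, and the use of plain `(vendor, product)` tuples are all hypothetical simplifications, and the real return shape of `get_vendor_product_pairs` is not assumed here. The precedence rule (drop NOT matches first, then restrict to the ARE list if one exists) is also just one possible choice.

```python
def filter_pairs(productname, candidate_pairs, dedupe_entries):
    """Drop known-wrong (vendor, product) candidates for a parsed package.

    candidate_pairs: iterable of (vendor, product) tuples from the db lookup.
    dedupe_entries: dict mapping productname to a dict that may contain
        "not_pairs" and/or "are_pairs" sets (hypothetical layout).
    """
    candidates = list(candidate_pairs)
    entry = dedupe_entries.get(productname)
    if entry is None:
        # No dedupe data for this package: behave exactly as today.
        return candidates

    not_pairs = entry.get("not_pairs", set())
    are_pairs = entry.get("are_pairs", set())

    # Discard anything explicitly listed as NOT this package.
    kept = [pair for pair in candidates if pair not in not_pairs]
    # If an ARE list exists, keep only the candidates it names.
    if are_pairs:
        kept = [pair for pair in kept if pair in are_pairs]
    return kept


# Placeholder data: one name collision that should be discarded.
example_entries = {
    "examplepkg": {"not_pairs": {("unrelated_vendor", "examplepkg")}},
}
example_candidates = [
    ("unrelated_vendor", "examplepkg"),
    ("examplepkg_project", "examplepkg"),
]
filtered = filter_pairs("examplepkg", example_candidates, example_entries)
```

With the placeholder data above, only the `("examplepkg_project", "examplepkg")` candidate survives, and packages with no dedupe entry pass through unchanged.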

Thoughts? Better ideas?  I'm going to tag @XDRAGON2002 specifically since he laid the groundwork for our current parser API, but everyone's thoughts are welcome.
