feat: Improve matching for language parsers (avoid name collisions, use purl) #3180

Closed
@terriko

I've now hit two cases where find_vendor is finding a product with the same name but different version numbers:

I think it's time for us to build in some de-duplication in cases like this where we're clearly generating false positives for folk.

Since these are currently all coming from the language parsers, I think the logical place to start is in extending the language parser's find_version() function, found here:

https://github.com/intel/cve-bin-tool/blob/main/cve_bin_tool/parsers/__init__.py

Right now, it uses cvedb's get_vendor_product_pairs to search for a product name and return all matches. That works pretty well in a lot of cases, and people can always mark the ones that aren't correct as false positives using triage. But that's a pain, and in both these cases we know that we're not finding the right thing because we know we're looking for a Python package. So it would be really nice if we could have find_version() say "look, here's a list of known duplicate product names, let's discard them before the user even sees them."

I'm imagining a file per language, so you'd have a set of files like python-dedupe.json and rust-dedupe.json, each with different entries. (I don't love those filenames, but something that included the language and was in a human-editable format would be good. JSON is probably the best balance of human-editable and machine-readable for our user base.)

An entry is going to need the following data:

  • the "productname" that we'd expect the language parser to find
  • a list of {vendor, product} pairs that are NOT this package (so they should be discarded if get_vendor_product_pairs finds them, as in the issues linked at the top of this post)
  • a list of {vendor, product} pairs that ARE this package. (not needed for the bugs above but I can easily imagine cases where this might be useful, especially given that many of our binary checkers have multiple {vendor, product} pairs associated with them)
  • If we're eventually going to load these all into a big shared/searchable table, we might also need the language/parser being used (e.g. "python") but that could be added during loading or something.

Presumably we'd have some entries with only NOT lists and some would have only ARE lists, so you wouldn't require both.
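To make the entry format concrete, here is a minimal sketch of what one of these files and the filtering step could look like. Everything in it is an assumption for illustration: the key names ("productname", "not", "are"), the package and vendor names, the helper functions, and the idea that pairs arrive as a list of {vendor, product} dicts. It also picks one possible reading of the ARE list (add those pairs if the lookup missed them); other readings are equally plausible.

```python
import json

# Hypothetical contents of a per-language file such as python-dedupe.json.
# Key names and all package/vendor names are made up for illustration.
PYTHON_DEDUPE_JSON = """
[
  {
    "productname": "examplepkg",
    "not": [{"vendor": "othervendor", "product": "examplepkg"}],
    "are": [{"vendor": "examplevendor", "product": "examplepkg"}]
  },
  {
    "productname": "onlynot",
    "not": [{"vendor": "unrelated", "product": "onlynot"}]
  }
]
"""


def load_dedupe_entries(text):
    """Index entries by product name; the NOT and ARE lists are both optional."""
    index = {}
    for entry in json.loads(text):
        index[entry["productname"]] = {
            "not": {(p["vendor"], p["product"]) for p in entry.get("not", [])},
            "are": {(p["vendor"], p["product"]) for p in entry.get("are", [])},
        }
    return index


def filter_pairs(productname, pairs, index):
    """Drop known-wrong pairs and append known-right pairs for productname.

    `pairs` is assumed to be a list of {"vendor": ..., "product": ...} dicts.
    """
    rules = index.get(productname)
    if rules is None:
        # No entry for this product: leave the lookup results untouched.
        return list(pairs)
    kept = [p for p in pairs if (p["vendor"], p["product"]) not in rules["not"]]
    seen = {(p["vendor"], p["product"]) for p in kept}
    # Sorted so the output order is deterministic.
    for vendor, product in sorted(rules["are"] - seen):
        kept.append({"vendor": vendor, "product": product})
    return kept
```

With this shape, a product that has no entry passes through unchanged, so the files only need to list the packages that are known to collide.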

You'd load this structure somewhere (Right into the db for easy lookup? I don't think we want to load/parse on every find_version call.) and use it to streamline what find_version() then returns.
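If the db route is the one we take, a rough sketch of the load-once, query-many idea using the standard library's sqlite3 might look like the following. The table name dedupe_rules, its columns, and the helper functions are all hypothetical and not part of the existing cvedb schema; the entry shape ("productname", "not", "are") is likewise assumed.

```python
import sqlite3


def create_dedupe_table(conn):
    """Create a hypothetical table holding both NOT and ARE pairs."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS dedupe_rules (
            language TEXT NOT NULL,
            productname TEXT NOT NULL,
            vendor TEXT NOT NULL,
            product TEXT NOT NULL,
            is_match INTEGER NOT NULL,  -- 1 = ARE list, 0 = NOT list
            PRIMARY KEY (language, productname, vendor, product)
        )
        """
    )


def load_rules(conn, language, entries):
    """Insert parsed entries once; the language is attached at load time."""
    rows = []
    for entry in entries:
        for key, is_match in (("not", 0), ("are", 1)):
            for pair in entry.get(key, []):
                rows.append(
                    (language, entry["productname"], pair["vendor"], pair["product"], is_match)
                )
    conn.executemany("INSERT OR REPLACE INTO dedupe_rules VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def rejected_pairs(conn, language, productname):
    """Return the set of (vendor, product) pairs known NOT to be this package."""
    cur = conn.execute(
        "SELECT vendor, product FROM dedupe_rules "
        "WHERE language = ? AND productname = ? AND is_match = 0",
        (language, productname),
    )
    return set(cur.fetchall())
```

Keying on language as well as product name means the same product name can carry different rules for the Python and Rust parsers, and each lookup is a single indexed query rather than a file parse.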

Thoughts? Better ideas? I'm going to tag @XDRAGON2002 specifically since he laid the groundwork for our current parser API, but everyone's thoughts are welcome.
