GSoC 2024 Project Idea: Product Mapping using PURLs #3771

* Related #3550

# cve-bin-tool: Product Mapping using PURLs

## Project description

CVE Binary Tool needs to identify components in order to scan for vulnerabilities, but uniquely identifying software is not always easy. Some examples:

- Some projects have very common dictionary words as names, so multiple projects with the same name exist (e.g. #3193 and #3152)
- Some projects are wrappers around a popular library, but the wrapper may have its own, different set of version numbers (e.g. #3179)
- A single product may have many names/identifiers depending on who packaged it and other context
   -  For example: the Python package [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) can be known as ...
      - `beautifulsoup4` when installed from pip
      - `python3-bs4` on Debian and Ubuntu systems
      - `python3-beautifulsoup4` on Fedora-based systems (and thus Red Hat-based ones)
      - `bs4`, `python-bs4` or `python-beautifulsoup` in some automated tools intended to list software (e.g. SBOM tools, Yocto)
- Don't assume any identifier will be unique, or that the same identifier will be used in all databases.
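
To make the naming problem concrete, here is a minimal sketch of how the BeautifulSoup names above look once each is written as a package URL. The `pkg:pypi`, `pkg:deb` and `pkg:rpm` types come from the PURL spec; the `ALIASES` table and the `to_purl` helper are hypothetical and are not part of cve-bin-tool.

```python
# Hypothetical alias table: (type, namespace, name) triples that all refer
# to the same upstream project, BeautifulSoup.
ALIASES = [
    ("pypi", None, "beautifulsoup4"),
    ("deb", "debian", "python3-bs4"),
    ("deb", "ubuntu", "python3-bs4"),
    ("rpm", "fedora", "python3-beautifulsoup4"),
]


def to_purl(pkg_type, namespace, name):
    """Build a version-less PURL string: pkg:type/namespace/name."""
    parts = [pkg_type]
    if namespace:
        parts.append(namespace)
    parts.append(name)
    return "pkg:" + "/".join(parts)


purls = [to_purl(*alias) for alias in ALIASES]
for purl in purls:
    print(purl)
```

Each name becomes a distinct identifier that also records where it came from, which is the context a bare name like `bs4` loses.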

CVE Binary Tool currently has explicit, pre-defined mappings between our binary signatures and a list of CPE identifiers for that "product."  This works pretty well (although it does need to be updated somewhat regularly as different groups handle filing of known vulnerabilities).  

Where we struggle is matching arbitrary product names found in component lists such as Python's requirements.txt files. We have enough information to do better, but we need tooling that puts that information into a mapping.
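
As a rough illustration of the information already available, a pinned line in a requirements.txt file tells us the ecosystem (PyPI), the name and the version, which is enough to build a package URL (PURL). The sketch below is an assumption about how that could look, not the existing parser code: `requirement_to_purl` is a hypothetical helper, it handles only simple `name==version` lines, and it applies no percent-encoding (a real implementation would more likely use a library such as packageurl-python).

```python
import re

# Matches simple pinned requirements such as "arrow==1.3.0".
# Extras, environment markers, URLs and other specifiers are ignored here.
PINNED = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)")


def requirement_to_purl(line):
    """Return a pkg:pypi PURL for a pinned requirement line, or None."""
    match = PINNED.match(line)
    if match is None:
        return None
    name, version = match.groups()
    # PURL rules for pypi: lowercase the name and replace "_" with "-".
    name = name.lower().replace("_", "-")
    return f"pkg:pypi/{name}@{version}"


for line in ["arrow==1.3.0", "Typing_Extensions == 4.9.0  # pinned", "requests>=2.0"]:
    print(line, "->", requirement_to_purl(line))
```

Unpinned lines return `None` in this sketch; how to treat them is an open design question.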

This project is intended to improve our product mapping and reduce false positives (like in the bugs linked above).  We've noodled around on ideas and my current plan is:

1. Generate internal [PURL identifiers](https://github.com/package-url/purl-spec) within our "language" parsers, found here: https://github.com/intel/cve-bin-tool/tree/main/cve_bin_tool/parsers. PURL would let us say the equivalent of "this is python arrow." This may happen before GSoC starts (e.g. someone may make a pull request to do this in February to get us started).
2. Integrate [purl2cpe](https://github.com/scanoss/purl2cpe) to provide direct mapping between our PURLs and known CPEs
3. Some things won't have CPE entries and thus won't be in purl2cpe.  But we may know (from bug reports) that there's a product with the same name that is absolutely not the same thing.  So we'll need to provide an "is not" database to reduce false positives.
   -  I suggest using a similar setup to what purl2cpe does -- allow humans to submit pull requests, make all the data readable, provide a way to load it into a queryable database.  We could/should spend some time arguing out those details before you begin, but give us your best guess of how this will work as part of your project proposal.
4. Once we have these integrated for language parsers, see if we can also integrate them into our SBOM matching routines (currently our SBOM handling can and does read PURLs but doesn't use the purl2cpe database)
5. Make our "is not" de-duplication database available to the general public similar to purl2cpe -- i.e. make a library so that anyone else can use the data easily.  I'd expect the library initially to be released as part of cve-bin-tool itself, but we could consider packaging it separately if that turns out to be useful to folk (it would certainly have a lot fewer dependencies).
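
The lookup order implied by steps 2 and 3 could be sketched roughly as follows: try a direct PURL-to-CPE mapping first, and if there is none, fall back to name-based matches with known "is not" pairs filtered out. Everything here is an assumption for illustration: the table names, columns and rows are made up, they do not reflect the real purl2cpe schema, and `candidate_products` is a hypothetical helper.

```python
import sqlite3

# Illustrative in-memory database; schema and rows are invented examples.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE purl2cpe (purl TEXT, vendor TEXT, product TEXT);
    CREATE TABLE is_not (purl TEXT, vendor TEXT, product TEXT);
    INSERT INTO purl2cpe VALUES
        ('pkg:pypi/mapped-lib', 'mapped_vendor', 'mapped_lib');
    INSERT INTO is_not VALUES
        ('pkg:pypi/common-name', 'unrelated_vendor', 'common-name');
    """
)


def candidate_products(conn, purl, name_matches):
    """Resolve a version-less PURL to (vendor, product) pairs.

    name_matches holds (vendor, product) pairs found by a plain name
    search; it is used only when no direct PURL-to-CPE mapping exists.
    """
    direct = conn.execute(
        "SELECT vendor, product FROM purl2cpe WHERE purl = ?", (purl,)
    ).fetchall()
    if direct:
        return direct
    excluded = set(
        conn.execute(
            "SELECT vendor, product FROM is_not WHERE purl = ?", (purl,)
        ).fetchall()
    )
    # Drop pairs that humans have recorded as "not this package".
    return [pair for pair in name_matches if pair not in excluded]


name_hits = [("unrelated_vendor", "common-name"), ("right_vendor", "common-name")]
print(candidate_products(conn, "pkg:pypi/mapped-lib", name_hits))
print(candidate_products(conn, "pkg:pypi/common-name", name_hits))
```

Keying the "is not" entries on the PURL rather than the bare name means an exclusion for one ecosystem does not hide a genuine match in another. A proposal could reasonably choose a different layout.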

If there's more time after that, I think we may want to consider pulling out other ideas and other sources of data to use.  Do some brainstorming and include those as stretch goals.

## Related reading
* https://github.com/intel/cve-bin-tool/issues/3180
* [PURL spec](https://github.com/package-url/purl-spec)
* [CPE](https://nvd.nist.gov/products/cpe)

## Skills
* Python
* An understanding of software identifiers such as CPE, PURL and SWID would be helpful (you can learn this as you write your application)

## Difficulty level 
* medium/hard.  

## Project Length
* 350 hours (e.g. full-time for 10 weeks or part-time for longer)
* It would be possible to do part of this project as a 175-hour project, but we may prefer candidates who have the time to do more, assuming similar levels of ability

## Mentor
* The primary mentor for this project will likely be @terriko.  Please DO NOT EMAIL TERRI DIRECTLY; ask all questions on this issue instead so you can benefit from the expertise of other contributors and mentors.

# GSoC Participants Only

This issue is a potential project idea for GSoC 2024, and is reserved for completion by a selected GSoC contributor.  Please do not work on it outside of that program.  If you'd like to apply to do it through GSoC, please start by reading #3550.

