NVD mirror preparation tasks #3181

Closed
@terriko

I've got a few things I'd like in cve-bin-tool in conjunction with the NVD mirror work:

  1. Minor directory structure changes for mirrors
    • As @warthog9 mentioned in today's meeting: the micro mirror project may eventually want to also mirror OSV, Red Hat and others if licensing allows. I don't think we're actually too excited about doing that yet, since NVD is the only service that has a history of not working when we need it, but future-proofing for it basically just means "make some directories", so let's do it.
    • Current setup is here: https://github.com/sec-data/mirror-sandbox but I propose:
      • /nvd/json - holds all NVD JSON files (currently in json_feed). These will be exact copies of the JSON feeds from https://nvd.nist.gov/vuln/data-feeds for now; we may eventually need to generate these from the API if they turn off the feeds as planned in September.
      • /cve-bin-tool - holds all our pre-parsed files. So we'd move cve_severity and the others into the cve-bin-tool directory.
      • Then if we ever set up mirrors for other data sources, we'd have /osv, /redhat etc.
  2. Ability to use NVD JSON files from a mirror
    • Currently we expect our pre-parsed files on a mirror, but we already have the code to use the JSON from NVD, so it should be a matter of wiring these two abilities together.
    • Probably the workflow should go: "If there's pre-parsed stuff, use that. If there's not, load the full JSON files and do the parsing. If neither of those exists, fail with a message about the mirror not looking right."
  3. Make sure we're sending an appropriate User-Agent header with requests
    • Right now I imagine we're sending whatever Python's requests library does by default, but it might be nice to send cve-bin-tool and a version number, which would allow us to gauge usage on the mirrors if we wanted.
    • It looks pretty easy to change this using requests: https://docs.python-requests.org/en/latest/user/advanced/#request-and-response-objects
    • Our json code may be old enough that it's not using requests but rather urllib directly.
    • We should also figure out how to document this so people can have reasonable privacy expectations and understanding.
  4. Provide a mirroring script for "grab the json and validate it" since we have the code for that but it's hard to trigger on its own.
    • Validation will mostly mean checking the metadata they provide and making sure the file size and sha256 sum match. Might have to watch for race conditions, since a feed can be updated between fetching the metadata and fetching the file.
    • We can also provide the ability to validate the jsonschema, but we know from experience that it doesn't always validate, so we probably want to make sure this doesn't block anything.
    • @b31ngd3v may have this working on https://github.com/sec-data/mirror-sandbox/ already (I know the files are there, but I don't know how we're doing validation off the top of my head)
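
To make the validation idea in item 4 concrete, here is a minimal sketch, not the existing cve-bin-tool code. It assumes the metadata file is a set of `key:value` lines containing `size` and `sha256` fields that describe the uncompressed JSON, and that the feed itself is gzip-compressed. The function names `parse_meta` and `validate_feed` are hypothetical.

```python
import gzip
import hashlib


def parse_meta(text):
    """Parse 'key:value' lines into a dict, splitting on the first colon only."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def validate_feed(gz_bytes, meta_text):
    """Return a list of problems; an empty list means the file matches its metadata.

    Assumption: 'size' and 'sha256' describe the uncompressed JSON.
    """
    meta = parse_meta(meta_text)
    data = gzip.decompress(gz_bytes)
    problems = []

    # Compare the uncompressed size, if the metadata declares one.
    if "size" in meta and int(meta["size"]) != len(data):
        problems.append(f"size mismatch: expected {meta['size']}, got {len(data)}")

    # Compare the checksum case-insensitively.
    expected = meta.get("sha256")
    actual = hashlib.sha256(data).hexdigest()
    if expected is None:
        problems.append("metadata has no sha256 field")
    elif expected.lower() != actual:
        problems.append("sha256 mismatch")

    return problems
```

One possible way to handle the race condition is to fetch the metadata both before and after the JSON download and retry if the two copies differ. Schema validation could return its findings in the same list form but be treated as warnings so that it never blocks a mirror run.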

I think we're pretty close to turning on the replication once we firm up a directory structure.
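
As a rough illustration of the proposed layout and the fallback workflow from item 2, the sketch below treats the mirror as a local directory for simplicity (a real mirror would be reached over HTTP). The directory names come from the proposal above; the function names, the `*.json.gz` file pattern, and the return values are assumptions made for the example.

```python
from pathlib import Path

# Proposed directories; "osv", "redhat" etc. could be added later.
MIRROR_DIRS = ("nvd/json", "cve-bin-tool")


def create_layout(root):
    """Create the proposed mirror directories under root (idempotent)."""
    for rel in MIRROR_DIRS:
        (Path(root) / rel).mkdir(parents=True, exist_ok=True)


def choose_source(root):
    """Pick which data to load from a mirror.

    Prefer pre-parsed files, fall back to raw NVD JSON,
    and fail with a clear message if neither is present.
    """
    root = Path(root)
    preparsed = root / "cve-bin-tool"
    nvd_json = root / "nvd" / "json"

    if preparsed.is_dir() and any(preparsed.iterdir()):
        return "pre-parsed"
    # Assumption: raw feeds are stored as gzip-compressed JSON files.
    if nvd_json.is_dir() and any(nvd_json.glob("*.json.gz")):
        return "nvd-json"
    raise FileNotFoundError(
        f"{root} does not look like a mirror: no pre-parsed files or NVD JSON found"
    )
```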

I'm going to finish reading up on requests and user-agent setting and see if I can get a PR out shortly, but I need to do some PR review first and I may fall down a rathole of old code that doesn't use requests.
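
For the older code paths that may use urllib directly, a User-Agent could be set along these lines. This is a sketch only: the `cve-bin-tool/<version>` format, the placeholder version string, and the helper names are assumptions, not what the tool currently sends.

```python
import urllib.request

TOOL_NAME = "cve-bin-tool"
TOOL_VERSION = "0.0.0"  # placeholder; the real value would come from the package


def user_agent(name=TOOL_NAME, version=TOOL_VERSION):
    """Build a 'product/version' User-Agent string."""
    return f"{name}/{version}"


def build_request(url, agent=None):
    """Create a urllib Request carrying our User-Agent (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": agent or user_agent()})
```

With requests, the same string can be passed via the `headers=` argument of a call, or set once on a session with `session.headers.update({"User-Agent": ...})`.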

@b31ngd3v -- you've done most of the work on the mirrors, did you want to work on the directory changes or any of the rest of this?
