I've got a few things I'd like in cve-bin-tool in conjunction with the NVD mirror work:
- Minor directory structure changes for mirrors
- As @warthog9 mentioned in today's meeting: the micro mirror project may eventually want to also mirror OSV, Red Hat, and others if licensing allows. I don't think we're actually too excited about doing that yet, since NVD is the only service that has a history of not working when we need it, but future-proofing for it basically just means "make some directories", so let's do it.
- Current setup is here: https://github.com/sec-data/mirror-sandbox but I propose:
- /nvd/json holds all NVD json files (currently in json_feed). These will be exact copies of the json feeds from https://nvd.nist.gov/vuln/data-feeds for now; we may eventually need to generate these from the API if they turn off the feeds as planned in September.
- /cve-bin-tool holds all our pre-parsed files. So we'd move cve_severity and the others into the cve-bin-tool directory.
- Then if we ever set up mirrors for other data sources, we'd have /osv, /redhat, etc.
- Ability to use NVD JSON files from a mirror
- Currently we expect our pre-parsed files to be on a mirror, but we also have the code to use the JSON from NVD, so it should be a matter of wiring these two abilities together.
- Probably the workflow should go: "If there's pre-parsed stuff, use that. If there's not, load the full JSON files and do the parsing. If neither exists, fail with a message about the mirror not looking right."
- Make sure we're sending an appropriate User-Agent header with requests
- Right now I imagine we're sending whatever Python requests sends by default, but it might be nice to send cve-bin-tool and a version number, which would allow us to gauge usage on the mirrors if we wanted.
- It looks pretty easy to change this using requests https://docs.python-requests.org/en/latest/user/advanced/#request-and-response-objects
- Our json code may be old enough that it's not using requests but rather urllib directly.
- We should also figure out how to document this so people can have reasonable privacy expectations and understanding.
- Provide a mirroring script for "grab the json and validate it" since we have the code for that but it's hard to trigger on its own.
- Validation will mostly mean checking the metadata they provide and making sure the file size and sha256 sum match. Might have to watch for race conditions, since a feed could be updated between fetching the metadata and fetching the file.
- We can also provide the ability to validate the jsonschema, but we know from experience that it doesn't always validate, so we probably want to make sure this doesn't block anything.
- @b31ngd3v may have this working on https://github.com/sec-data/mirror-sandbox/ already (I know the files are there, but I don't know how we're doing validation off the top of my head)
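For the "grab the json and validate it" item, here is a rough sketch of what the metadata check could look like. It assumes the `.meta` files are plain `key:value` lines with `size`, `gzSize` and `sha256` fields, and that `size` and `sha256` describe the uncompressed JSON; the function names are made up for illustration and are not existing cve-bin-tool code.

```python
import gzip
import hashlib
import zlib


def parse_meta(meta_text: str) -> dict[str, str]:
    """Parse 'key:value' lines from a feed .meta file into a dict."""
    meta = {}
    for line in meta_text.splitlines():
        # Split on the first colon only, since timestamps contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def validate_feed(gz_bytes: bytes, meta_text: str) -> list[str]:
    """Return a list of problems; an empty list means the feed matches its metadata."""
    meta = parse_meta(meta_text)
    try:
        data = gzip.decompress(gz_bytes)
    except (OSError, EOFError, zlib.error) as exc:
        return [f"could not decompress feed: {exc}"]

    problems = []
    # Assumption: gzSize is the size of the compressed download.
    if "gzSize" in meta and int(meta["gzSize"]) != len(gz_bytes):
        problems.append("compressed size does not match gzSize")
    # Assumption: size and sha256 refer to the uncompressed JSON.
    if "size" in meta and int(meta["size"]) != len(data):
        problems.append("uncompressed size does not match size")
    if "sha256" in meta:
        actual = hashlib.sha256(data).hexdigest()
        if actual.lower() != meta["sha256"].lower():
            problems.append("sha256 mismatch")
    return problems
```

A mismatch may only mean the feed was regenerated between the two downloads, so a mirroring script would probably want to re-fetch both the metadata and the file once before treating it as a real failure.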
I think we're pretty close to turning on the replication once we firm up a directory structure.
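Once the directory layout is settled, the "pre-parsed first, then JSON, else fail" order from the second item could be sketched roughly like this. The directory constants follow the proposal above, and `has_entries` is a hypothetical callback standing in for however we end up probing the mirror; none of these names exist in cve-bin-tool today.

```python
from enum import Enum
from typing import Callable

# Assumed paths, taken from the proposed layout in this issue.
PREPARSED_DIR = "/cve-bin-tool"
JSON_DIR = "/nvd/json"


class MirrorSource(Enum):
    PREPARSED = "preparsed"
    NVD_JSON = "nvd_json"


class MirrorLayoutError(RuntimeError):
    """Raised when a mirror has neither pre-parsed files nor NVD JSON."""


def choose_mirror_source(has_entries: Callable[[str], bool]) -> MirrorSource:
    """Pick which data to load from a mirror.

    has_entries(path) is a hypothetical probe that returns True when the
    mirror serves usable files under the given directory.
    """
    if has_entries(PREPARSED_DIR):
        return MirrorSource.PREPARSED
    if has_entries(JSON_DIR):
        # No pre-parsed files: fall back to the raw feeds and parse locally.
        return MirrorSource.NVD_JSON
    raise MirrorLayoutError(
        f"mirror does not look right: nothing under {PREPARSED_DIR} or {JSON_DIR}"
    )
```

Keeping the probe as a callback means the same decision logic would work whether the mirror is checked over HTTP or on a local path.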
I'm going to finish reading up on requests and user-agent setting and see if I can get a PR out shortly, but I need to do some PR review first and I may fall down a rathole of old code that doesn't use requests.
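For the older code paths that may still use urllib directly, setting the header could look something like the sketch below. The `cve-bin-tool/<version>` format and the helper names are only a suggestion, and the version would need to come from wherever we already track it.

```python
import urllib.request


def user_agent(version: str) -> str:
    """Build a User-Agent string; the 'name/version' format is an assumption."""
    return f"cve-bin-tool/{version}"


def build_request(url: str, version: str) -> urllib.request.Request:
    """Create a urllib request that identifies itself as cve-bin-tool."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent(version)})


# With requests, the same header dict can be passed per call, e.g.
# requests.get(url, headers={"User-Agent": user_agent(version)}),
# or set once on a requests.Session via session.headers.update(...).
```

The resulting object can be passed to `urllib.request.urlopen` as usual, so existing call sites would only need to swap a bare URL for `build_request(url, version)`.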
@b31ngd3v -- you've done most of the work on the mirrors, did you want to work on the directory changes or any of the rest of this?