9440/augument more promise items with bookworm metadata if author or publish date missing #9587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When an import record is "incomplete" per internetarchive#9440, this commit will look to the `import_item` table to supplement that `rec` with data loaded by BookWorm. This attempt to supplement will apply to any import source if one of the required-for-completeness fields (title, authors, publish_date) is missing. See internetarchive#9440.
This commit changes `promise_batch_imports.py` so that it no longer adds variations of `????` in place of the following empty fields:
- `authors`
- `publish_date`
- `title`

These fields are once again required for all imports, including promise items. `????` will also no longer be removed from these fields when importing. When one of these fields is missing, a request is made to BookWorm to stage the item for import, if metadata can be found. A companion commit changes the import process so that if one of the above required fields is missing, it will look to the `import_item` table for `pending` or `staged` metadata. See internetarchive#9440.
This commit updates the `statsd` wrapper to add support for the `gauge` data type, and makes use of it to add a `gauge` to promise item imports so that we can know, per batch, the complete/incomplete rate.
# https://github.com/internetarchive/openlibrary/issues/9440
# import_validator().validate() requires more fields.
required_fields = ["title", "authors", "publish_date"]
has_all_required_fields = all(obj.get(field) for field in required_fields)
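To make the intended mechanism concrete, here is a rough sketch of the supplement step the commit message describes; the helper name `get_staged_import_item` and the exact lookup key are assumptions for illustration, not the PR's actual code.

```python
# Sketch only: fill missing required fields from a pending/staged import_item
# row. `get_staged_import_item` is a hypothetical lookup helper.
REQUIRED_FIELDS = ["title", "authors", "publish_date"]

def supplement_rec_with_staged_metadata(rec: dict, get_staged_import_item) -> dict:
    missing = [field for field in REQUIRED_FIELDS if not rec.get(field)]
    if not missing:
        return rec  # already complete; nothing to do
    staged = get_staged_import_item(rec.get("isbn_13") or rec.get("isbn_10"))
    if staged:
        for field in missing:
            if staged.get(field):
                rec[field] = staged[field]
    return rec
```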
I'm going to double-check the data format for empty dates -- we want to check that the date is a valid, usable date rather than just that some data is present. Strings like "None", "N/A", "unknown", "?" etc. shouldn't count, but there are a range of date formats, and I can't remember which are likely to be encountered.
We're specifically trying to deal with the data from BWB imports, so I'll check the range of possible values in that data.
@scottbarnes, if these imports are coming from promise items like https://archive.org/details/bwb_daily_pallets_2024-07-30, specifically the `.ProductJSON.PublicationDate` of each entry in the JSON file, then I see values like this:
"20010112"
"20100101"
"20020201"
"20050131"
null
"197704"
"20020331"
"19680101"
"20170910"
"20230914"
"20240107"
"19710101"
"20121226"
"198601"
"197703"
"20041001"
"19881201"
"20121114"
"19860327"
"200310"
"1986"
null
"20081001"
"20040810"
"20170502"
null
"199512"
"200802"
"20120321"
"20210909"
"20180406"
"20120920"
"19870501"
"20090101"
"20121020"
"1976"
So: digits of length between 4 and 8, or JSON null; it seems an empty string is also a rare possibility. I did not see any of the wilder string values I was worried about. If you can confirm the data source is bwb_daily_pallets_2024-07-30-style JSON `ProductJSON.PublicationDate`s, we should be good.
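Given that observed range, a publish-date sanity check could be fairly simple. A minimal sketch, assuming we only need to accept the BWB-style 4-to-8-digit strings and reject null, empty, or none-ish values:

```python
import re

def is_usable_publication_date(value: str | None) -> bool:
    """Accept BWB-style dates such as '1986', '198601', or '19860327'."""
    if not value:
        return False  # covers JSON null and the empty string
    return bool(re.fullmatch(r"\d{4,8}", value.strip()))
```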
@hornc, I fear I may have misunderstood the requirements and flow chart in #9440.
In terms of this PR, I was contemplating that the data source would be any import source, and that if a record was not complete, it should not be imported.
I think what I misunderstood was the desire for every imported record to eventually be complete. I understood this as a goal of eventually making it a threshold requirement that records be complete, after initially making it a requirement that records be differentiable and sane.
On this understanding, I was hoping to go straight to that point, but I now think 'eventually' means that, in theory, a record that has already been imported can eventually be made complete by virtue of including a strong identifier. Is that correct?
As it stands, this PR would apply to records of any source and would only allow import if the record is complete (after having checked BookWorm/the affiliate server for more metadata).
In terms of the workflow diagram, under "Complete now or later?", if a record is not currently complete, once a metadata look up is done by the strong ID, the idea is to still import it even if no further metadata is found, correct? So an incomplete but differentiable record with only a title and ISBN would be imported still, is that right?
Also, would you like this differentiable and sane requirement to apply to all imports, or only promise items?
@scottbarnes A record "can eventually be made complete by virtue of including a strong identifier. Is that correct?" I guess in theory. When I was trying to see what could be done to complete some real ones, the existing record really was undifferentiatable. By adding an ISBN, you make a call as to what the correct book is. For some, I think deleting the undifferentiated record is just as good as adding a strong identifier. Basically, if you can do the work to locate the strong identifier, you may as well just import it fresh. Associating with an existing undifferentiated record doesn't really add anything.
An incomplete but differentiable record with only a title and ISBN should be imported. It'd be nice to attempt to get more metadata if possible, but that could be at a later stage by a different process. The goal would be to have it happen before a promised item was scanned. I think you're talking about an attempted lookup that fails to find anything, those need to be imported. We'd have a chance to try other lookups, and I think that's what we want to collect some stats on -- how many of these do we consistently fail to get full metadata for.
The differentiable and sane requirement should apply to all imports. I think undifferentiatable imports from any source could lead to the kinds of mixed-metadata problems that end up at scribe. Promise items seem to be the main source, but I think it applies equally to all sources.
@hornc, you are correct that with respect to a record with only a title and ISBN, I was talking about after an attempted lookup that failed to find more metadata.
I added a new commit that I hope is responsive to your vision. Now the following should be true:
- if a record is incomplete, try to make it complete by fetching metadata from BookWorm/the affiliate-server;
- if a record is still incomplete, after trying to fetch additional metadata, but has a title and a strong identifier, import it;
- if a record is still incomplete, after trying to fetch additional metadata, and does not have a strong identifier, do not import it.
This means that a look up for additional metadata will occur before the item is scanned (assuming that promise item imports aren't broken for some other reason).
Here, strong identifiers are limited to `isbn_10`, `isbn_13`, and `lccn`. The issue had mentioned OCN, but I am not actually sure what that is; is it an `ocaid`?
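To summarize the intended flow in code form, here is a minimal sketch; the helper names are placeholders rather than the functions actually added in this PR:

```python
# Sketch of the complete-or-differentiable decision; names are illustrative.
STRONG_IDENTIFIERS = ("isbn_10", "isbn_13", "lccn")
REQUIRED_FIELDS = ("title", "authors", "publish_date")

def is_complete(rec: dict) -> bool:
    return all(rec.get(field) for field in REQUIRED_FIELDS)

def should_import(rec: dict, fetch_missing_metadata) -> bool:
    if not is_complete(rec):
        # Try to complete the record from BookWorm/the affiliate server,
        # filling only fields that are currently empty.
        rec.update({k: v for k, v in fetch_missing_metadata(rec).items()
                    if v and not rec.get(k)})
    if is_complete(rec):
        return True
    # Still incomplete: import only if differentiable (title + strong identifier).
    return bool(rec.get("title")) and any(rec.get(k) for k in STRONG_IDENTIFIERS)
```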
Currently this will not delete an undifferentiated record if there is a match, but will instead attempt to add metadata to empty fields on that existing, undifferentiated record. If the desire is still to delete such records, perhaps that could be a separate issue?
Additionally, there is no real work done in this PR to ensure the record is sane. The `null` values should drop out for promise items, at least for the `publish_date`, `authors`, and `publishers` fields, due to this function:
openlibrary/scripts/promise_batch_imports.py, lines 46 to 49 in c8d3720:

def clean_null(val: str | None) -> str | None:
    if val in ('', 'null', 'null--'):
        return None
    return val
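For illustration, that means values like 'null' and the empty string collapse to None, while real dates pass through unchanged:

```python
assert clean_null("null") is None
assert clean_null("") is None
assert clean_null("19860327") == "19860327"
```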
To the end of better sanity checking, perhaps this PR could be dusted off: #8123.
Assuming this does what it purports to do, how is it looking at the moment, in terms of meeting the immediate goals?
(One issue of note: when an import fails to validate both as complete and as differentiable, the displayed error message only mentions errors related to completeness. I did this on the theory that we want to encourage people to enter complete records, not just titles and strong identifiers, and also because I wasn't sure it was worth trying to combine the errors. I can do so if that is desired.)
@scottbarnes those changes sound good, thank you!
An OCN is an OCLC number, which is assigned by OCLC, who operate WorldCat: https://search.worldcat.org/
Not deleting records is fine for this piece of work.
As for validating that the data fields are sane, we can look at what is required for dates in more detail.
I'll review all the code against the flow diagram and let you know. It seems good so far.
Marking as Blocked while we wait for the data analysis. @hornc if it would be useful, we could mark this as "Ready for Review" and extend it in a future issue/PR to handle the none-ish values. Up to y'all.
This commit changes the import requirements such that a record must either be: 1. complete, or 2. differentiable. See internetarchive#9440 for the definitions.
@pytest.mark.parametrize('field', ['isbn_13'])
def test_validate_not_complete_no_strong_identifier(field):
    """An incomplete record without a strong identifier won't validate."""
@scottbarnes, we need to ensure that a complete record without a strong identifier will still validate; currently that case fails validation:
This is an example test I tried locally -- Pre-ISBN books with title+author+date should still validate.
def test_validate_differentiable_without_strong_identifier():
    """Confirm differentiable pre-ISBN and non LCCN books can still validate."""
    preisbn = {
        "title": "Beowulf",
        "author": {"name": "William Morris"},
        "publishers": ["Kelmscott Press"],
        "publish_date": "1895",
    }
    assert validator.validate(preisbn)
For some of those test cases it's hard to tell at a glance what a valid import example is.
@hornc, that is a good point about the test cases. I will try to clean that up, and also figure out what's going on with the tests such that everything that should pass passes and things that shouldn't don't.
In terms of importing via the API, with `authors` as a list and a `source_records` field, but no `isbn_*` field, a very similar complete import record will import for me in the local development environment, unless it's from Amazon, which has a separate, pre-existing ISBN requirement:
❯ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf",
"authors": [{"name": "William Morris"}],
"publishers": ["Kelmscott Press"],
"publish_date": "1895",
"source_records": ["amazon:09123456789"]
}'
{"success": false, "error_code": "unhandled-exception", "error": "SourceNeedsISBN()"}%
❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf",
"authors": [{"name": "William Morris"}],
"publishers": ["Kelmscott Press"],
"publish_date": "1895",
"source_records": ["not_amazon:09123456789"]
}'
{"authors": [{"key": "/authors/OL26A", "name": "William Morris", "status": "created"}], "success": true, "edition": {"key": "/books/OL60M", "status": "created"}, "work": {"key": "/works/OL31W", "status": "created"}}
Edit: I realized I should include an example that doesn't have anything ISBN-like at all, just to rule out an ISBN somehow being sourced from the `source_records` field:
❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf Part Deux",
"authors": [{"name": "William Morris"}],
"publishers": ["Kelmscott Press"],
"publish_date": "2005",
"source_records": ["not_amazon:XYZ"]
}'
{"authors": [{"key": "/authors/OL26A", "name": "William Morris", "status": "matched"}], "success": true, "edition": {"key": "/books/OL61M", "status": "created"}, "work": {"key": "/works/OL32W", "status": "created"}}
One last edit: I realized I might as well include just `title`, an `isbn_10` field, and `source_records`:
❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf The Third",
"isbn_10": ["012345689"],
"source_records": ["blah:123"]
}'
{"success": true, "edition": {"key": "/books/OL62M", "status": "created"}, "work": {"key": "/works/OL33W", "status": "created"}}
Completing the testing has moved to #9753.
def gauge(key: str, value: int, rate: float = 1.0) -> None:
    """
    Gauges are a constant data type. Ordinarily the rate should be 1.0.

    See https://statsd.readthedocs.io/en/v3.3/types.html#gauges
    """
    global client
    if client:
        pystats_logger.debug(f"Updating gauge {key} to {value}")
        client.gauge(key, value, rate=rate)
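As a usage illustration only (the metric key is made up, and the exact `stats` module path is an assumption), a per-batch call from the promise-item script might look like:

```python
from openlibrary.core import stats  # assumed wrapper module exposing gauge()

total, incomplete = 1200, 180  # hypothetical per-batch counts
stats.gauge("ol.imports.bwb.incomplete_records_pct", int(incomplete / total * 100))
```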
potentially remove if unused or not working yet
@hornc re: #9587 (review), I think the tests do pass as @scottbarnes said; we may have had a discrepancy between field names. Since everything else LGTM, merging.
# This is the minimum to achieve a complete record. See:
# https://github.com/internetarchive/openlibrary/issues/9440
# import_validator().validate() requires more fields.
minimum_complete_fields = ["title", "authors", "publish_date"]
@scottbarnes This check is effectively re-implementing the `import_validator` check; we could use the same validation check. If the import candidate doesn't validate, try augmenting it, then re-validate using the same checks.
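In code-sketch form, the suggestion might look roughly like this; `validator` and `augment_rec` stand in for the real `import_validator` and the BookWorm/affiliate-server lookup:

```python
def validate_with_augmentation(rec: dict, validator, augment_rec) -> bool:
    """Validate; if that fails, augment missing fields and validate once more."""
    def passes(r: dict) -> bool:
        try:
            return bool(validator.validate(r))
        except Exception:  # e.g. a pydantic ValidationError
            return False

    if passes(rec):
        return True
    # Fill only fields that are currently empty, then try again.
    rec.update({k: v for k, v in augment_rec(rec).items() if v and not rec.get(k)})
    return passes(rec)
```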
title: NonEmptyStr
source_records: NonEmptyList[NonEmptyStr]
authors: NonEmptyList[Author]
publishers: NonEmptyList[NonEmptyStr]
publish_date: NonEmptyStr


class StrongIdentifierBookPlus(BaseModel):
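For a sense of what a differentiable/strong-identifier model could look like, here is a rough sketch based on the discussion above; it assumes pydantic v2 and plain field types rather than the PR's `NonEmptyStr`/`NonEmptyList` helpers, and it is not the actual class body:

```python
from pydantic import BaseModel, model_validator

class StrongIdentifierBookPlusSketch(BaseModel):
    title: str
    source_records: list[str]
    isbn_10: list[str] | None = None
    isbn_13: list[str] | None = None
    lccn: list[str] | None = None

    @model_validator(mode="after")
    def require_strong_identifier(self) -> "StrongIdentifierBookPlusSketch":
        # A record with a title is "differentiable" only if it also carries
        # at least one strong identifier.
        if not (self.isbn_10 or self.isbn_13 or self.lccn):
            raise ValueError("at least one of isbn_10, isbn_13, or lccn is required")
        return self
```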
If we wanted more descriptive terminology to reflect the intent of this check: I was using 'differentiatable' to mean that we can determine which specific edition the import data relates to, and that captures 'complete OR has at least one hard identifier'.
Well, it is great to see some attempt to ensure there is useful data, but "Poems" by "Smith", published by "Independently Published" on 1 Jan 1900, would pass this. Can we do better?
Closes #9440
Feature.
This PR both stops imports that are not complete per #9440 (i.e. having all of `title`, `authors`, and `publish_date`) and also attempts to stage records for any promise item that is not complete (based on BookWorm+AMZ).

The stats currently are not being recorded, as I am likely doing something wrong. My plan is to feed the Graphite output into Grafana so it's easy to see, promise item versus promise item, how things are going in terms of the percentage of records that are complete, per pallet (or whatever the correct terminology is).
Additionally, the stats are only recorded when BWB metadata is insufficient; they do not yet record AMZ insufficiency as measured against promise items in need of supplementation.
Technical
Testing
Screenshot
Stakeholders
@hornc
@mekarpeles