9440/augument more promise items with bookworm metadata if author or publish date missing #9587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When an import record is "incomplete" per internetarchive#9440, this commit will look to the `import_item` table to supplement that `rec` with data loaded by BookWorm. This attempt to supplement will apply to any import source if one of the required-for-completeness fields (title, authors, publish_date) is missing. See internetarchive#9440.
This commit changes `promise_batch_imports.py` so that it no longer adds variations of `????` in place of the following empty fields:
- `authors`
- `publish_date`
- `title`

These fields are once again required for all imports, including promise items. `????` will also no longer be removed from these fields when importing. When one of these fields is missing, a request is made to BookWorm to stage the item for import, if metadata can be found. A companion commit changes the import process so that if one of the above required fields is missing, it will look to the `import_item` table for `pending` or `staged` metadata. See internetarchive#9440.
This commit updates the `statsd` wrapper to add support for the `gauge` data type, and makes use of it to add a `gauge` to promise item imports so that we can know, per batch, the complete/incomplete rate.
# https://github.com/internetarchive/openlibrary/issues/9440
# import_validator().validate() requires more fields.
required_fields = ["title", "authors", "publish_date"]
has_all_required_fields = all(obj.get(field) for field in required_fields)
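To make the intended mechanism concrete, here is a rough sketch of the supplement step the commit message describes; the helper name `get_staged_import_item` and the exact lookup key are assumptions for illustration, not the PR's actual code.

```python
# Sketch only: fill missing required fields from a pending/staged import_item
# row. `get_staged_import_item` is a hypothetical lookup helper.
REQUIRED_FIELDS = ["title", "authors", "publish_date"]

def supplement_rec_with_staged_metadata(rec: dict, get_staged_import_item) -> dict:
    missing = [field for field in REQUIRED_FIELDS if not rec.get(field)]
    if not missing:
        return rec  # already complete; nothing to do
    staged = get_staged_import_item(rec.get("isbn_13") or rec.get("isbn_10"))
    if staged:
        for field in missing:
            if staged.get(field):
                rec[field] = staged[field]
    return rec
```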
I'm going to double-check the data format for empty dates -- we want to check that the date is a valid, usable date rather than just that some data is present. Strings like "None", "N/A", "unknown", "?" etc. shouldn't count, but there are a range of date formats, and I can't remember which are likely to be encountered.
We're specifically trying to deal with the data from BWB imports, so I'll check the range of possible values in that data.
@scottbarnes, if these imports are coming from promise items like https://archive.org/details/bwb_daily_pallets_2024-07-30, specifically the `.ProductJSON.PublicationDate` of each entry in the JSON file, then I see values like this:
"20010112"
"20100101"
"20020201"
"20050131"
null
"197704"
"20020331"
"19680101"
"20170910"
"20230914"
"20240107"
"19710101"
"20121226"
"198601"
"197703"
"20041001"
"19881201"
"20121114"
"19860327"
"200310"
"1986"
null
"20081001"
"20040810"
"20170502"
null
"199512"
"200802"
"20120321"
"20210909"
"20180406"
"20120920"
"19870501"
"20090101"
"20121020"
"1976"
So: digits of length between 4 and 8, or JSON null; it seems an empty string is also a rare possibility. I did not see any of the wilder string values I was worried about. If you can confirm the data source is bwb_daily_pallets_2024-07-30-style JSON `ProductJSON.PublicationDate`s, we should be good.
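Given that observed range, a publish-date sanity check could be fairly simple. A minimal sketch, assuming we only need to accept the BWB-style 4-to-8-digit strings and reject null, empty, or none-ish values:

```python
import re

def is_usable_publication_date(value: str | None) -> bool:
    """Accept BWB-style dates such as '1986', '198601', or '19860327'."""
    if not value:
        return False  # covers JSON null and the empty string
    return bool(re.fullmatch(r"\d{4,8}", value.strip()))
```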
@hornc, I fear I may have misunderstood the requirements and flow chart in #9440.
In terms of this PR, I was contemplating that the data source would be any import source, and that if a record was not complete, it should not be imported.
I think what I misunderstood was the desire for every imported record to eventually be complete. I understood this as a goal of eventually making it a threshold requirement that records be complete, after initially making it a requirement that records be differentiable and sane.
On this understanding, I was hoping to go straight to that point, but I now think 'eventually' means that, in theory, a record that has already been imported can eventually be made complete by virtue of including a strong identifier. Is that correct?
As it stands, this PR would apply to records of any source and would only allow import if the record is complete (after having checked BookWorm/the affiliate server for more metadata).
In terms of the workflow diagram, under "Complete now or later?", if a record is not currently complete, once a metadata look up is done by the strong ID, the idea is to still import it even if no further metadata is found, correct? So an incomplete but differentiable record with only a title and ISBN would be imported still, is that right?
Also, would you like this differentiable and sane requirement to apply to all imports, or only promise items?
@scottbarnes A record "can eventually be made complete by virtue of including a strong identifier. Is that correct?" I guess in theory. When I was trying to see what could be done to complete some real ones, the existing record really was undifferentiatable. By adding an ISBN, you make a call as to what the correct book is. For some, I think deleting the undifferentiated record is just as good as adding a strong identifier. Basically, if you can do the work to locate the strong identifier, you may as well just import it fresh. Associating with an existing undifferentiated record doesn't really add anything.
An incomplete but differentiable record with only a title and ISBN should be imported. It'd be nice to attempt to get more metadata if possible, but that could be at a later stage by a different process. The goal would be to have it happen before a promised item was scanned. I think you're talking about an attempted lookup that fails to find anything, those need to be imported. We'd have a chance to try other lookups, and I think that's what we want to collect some stats on -- how many of these do we consistently fail to get full metadata for.
The differentiable and sane requirement should apply to all imports. I think undifferentiatable imports from any source could lead to the kinds of mixed-metadata problems that end up at scribe. Promise items seem to be the main source, but I think it applies equally to all sources.
@hornc, you are correct that with respect to a record with only a title and ISBN, I was talking about after an attempted lookup that failed to find more metadata.
I added a new commit that I hope is responsive to your vision. Now the following should be true:
- if a record is incomplete, try to make it complete by fetching metadata from BookWorm/the affiliate-server;
- if a record is still incomplete, after trying to fetch additional metadata, but has a title and a strong identifier, import it;
- if a record is still incomplete, after trying to fetch additional metadata, and does not have a strong identifier, do not import it.
This means that a look up for additional metadata will occur before the item is scanned (assuming that promise item imports aren't broken for some other reason).
Here, strong identifiers are limited to `isbn_10`, `isbn_13`, and `lccn`. The issue had mentioned OCN, but I am not actually sure what that is; is it an `ocaid`?
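To summarize the intended flow in code form, here is a minimal sketch; the helper names are placeholders rather than the functions actually added in this PR:

```python
# Sketch of the complete-or-differentiable decision; names are illustrative.
STRONG_IDENTIFIERS = ("isbn_10", "isbn_13", "lccn")
REQUIRED_FIELDS = ("title", "authors", "publish_date")

def is_complete(rec: dict) -> bool:
    return all(rec.get(field) for field in REQUIRED_FIELDS)

def should_import(rec: dict, fetch_missing_metadata) -> bool:
    if not is_complete(rec):
        # Try to complete the record from BookWorm/the affiliate server,
        # filling only fields that are currently empty.
        rec.update({k: v for k, v in fetch_missing_metadata(rec).items()
                    if v and not rec.get(k)})
    if is_complete(rec):
        return True
    # Still incomplete: import only if differentiable (title + strong identifier).
    return bool(rec.get("title")) and any(rec.get(k) for k in STRONG_IDENTIFIERS)
```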
Currently this will not delete an undifferentiated record if there is a match, but will instead attempt to add metadata to empty fields on that existing, undifferentiated record. If the desire is still to delete such records, perhaps that could be a separate issue?
Additionally, there is no real work done in this PR to ensure the record is sane. The `null` values should drop out for promise items, at least for the `publish_date`, `authors`, and `publishers` fields, due to this function:
openlibrary/scripts/promise_batch_imports.py, lines 46 to 49 in c8d3720:

def clean_null(val: str | None) -> str | None:
    if val in ('', 'null', 'null--'):
        return None
    return val
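For illustration, that means values like 'null' and the empty string collapse to None, while real dates pass through unchanged:

```python
assert clean_null("null") is None
assert clean_null("") is None
assert clean_null("19860327") == "19860327"
```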
To the end of better sanity checking, perhaps this PR could be dusted off: #8123.
Assuming this does what it purports to do, how is it looking at the moment, in terms of meeting the immediate goals?
(One issue of note: when an import fails to validate both as complete and as differentiable, the displayed error message only mentions errors related to completeness. I did this on the theory that we want to encourage people to enter complete records, not just titles and strong identifiers, and also because I wasn't sure it was worth trying to combine the errors. I can do so if that is desired.)
@scottbarnes those changes sound good, thank you!
An OCN is an OCLC number, which is assigned by OCLC, who operate WorldCat: https://search.worldcat.org/
Not deleting records is fine for this piece of work.
As for validating that the data fields are sane, we can look at what is required for dates in more detail.
I'll review all the code against the flow diagram and let you know. It seems good so far.
Marking as Blocked while we wait for the data analysis. @hornc if it would be useful, we could mark this as "Ready for Review" and extend it in a future issue/PR to handle the none-ish values. Up to y'all.
This commit changes the import requirements such that a record must either be: 1. complete, or 2. differentiable. See internetarchive#9440 for the definitions.
@pytest.mark.parametrize('field', ['isbn_13'])
def test_validate_not_complete_no_strong_identifier(field):
    """An incomplete record without a strong identifier won't validate."""
@scottbarnes, we need to ensure that a complete record without a strong identifier will still validate; currently that case fails validation:
This is an example test I tried locally -- Pre-ISBN books with title+author+date should still validate.
def test_validate_differentiable_without_strong_identifier():
    """Confirm differentiable pre-ISBN and non LCCN books can still validate."""
    preisbn = {
        "title": "Beowulf",
        "author": {"name": "William Morris"},
        "publishers": ["Kelmscott Press"],
        "publish_date": "1895",
    }
    assert validator.validate(preisbn)
For some of those test cases it's hard to tell at a glance what a valid import example is.
@hornc, that is a good point about the test cases. I will try to clean that up, and also figure out what's going on with the tests such that everything that should pass passes and things that shouldn't don't.
In terms of importing via the API, with `authors` as a list and a `source_records` field, but no `isbn_*` field, a very similar complete import record will import for me in the local development environment, unless it's from Amazon, which has a separate, pre-existing ISBN requirement:
❯ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf",
"authors": [{"name": "William Morris"}],
"publishers": ["Kelmscott Press"],
"publish_date": "1895",
"source_records": ["amazon:09123456789"]
}'
{"success": false, "error_code": "unhandled-exception", "error": "SourceNeedsISBN()"}%
❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf",
"authors": [{"name": "William Morris"}],
"publishers": ["Kelmscott Press"],
"publish_date": "1895",
"source_records": ["not_amazon:09123456789"]
}'
{"authors": [{"key": "/authors/OL26A", "name": "William Morris", "status": "created"}], "success": true, "edition": {"key": "/books/OL60M", "status": "created"}, "work": {"key": "/works/OL31W", "status": "created"}}
Edit: I realized I should include an example that doesn't have anything ISBN-like at all, just to rule out an ISBN somehow being sourced from the `source_records` field:
❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf Part Deux",
"authors": [{"name": "William Morris"}],
"publishers": ["Kelmscott Press"],
"publish_date": "2005",
"source_records": ["not_amazon:XYZ"]
}'
{"authors": [{"key": "/authors/OL26A", "name": "William Morris", "status": "matched"}], "success": true, "edition": {"key": "/books/OL61M", "status": "created"}, "work": {"key": "/works/OL32W", "status": "created"}}
One last edit: I realized I might as well include just `title`, an `isbn_10` field, and `source_records`:
❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
"title": "Beowulf The Third",
"isbn_10": ["012345689"],
"source_records": ["blah:123"]
}'
{"success": true, "edition": {"key": "/books/OL62M", "status": "created"}, "work": {"key": "/works/OL33W", "status": "created"}}
Completing the testing has moved to #9753.
def gauge(key: str, value: int, rate: float = 1.0) -> None:
    """
    Gauges are a constant data type. Ordinarily the rate should be 1.0.

    See https://statsd.readthedocs.io/en/v3.3/types.html#gauges
    """
    global client
    if client:
        pystats_logger.debug(f"Updating gauge {key} to {value}")
        client.gauge(key, value, rate=rate)
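As a usage illustration only (the metric key is made up, and the exact `stats` module path is an assumption), a per-batch call from the promise-item script might look like:

```python
from openlibrary.core import stats  # assumed wrapper module exposing gauge()

total, incomplete = 1200, 180  # hypothetical per-batch counts
stats.gauge("ol.imports.bwb.incomplete_records_pct", int(incomplete / total * 100))
```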
potentially remove if unused or not working yet
@hornc re: #9587 (review), I think the tests do pass as @scottbarnes said; we may have had a discrepancy between field names. Since everything else LGTM, merging.
# This is the minimum to achieve a complete record. See:
# https://github.com/internetarchive/openlibrary/issues/9440
# import_validator().validate() requires more fields.
minimum_complete_fields = ["title", "authors", "publish_date"]
@scottbarnes This check is effectively re-implementing the `import_validator` check; we could use the same validation check. If the import candidate doesn't validate, try augmenting it, then re-validate using the same checks.
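In code-sketch form, the suggestion might look roughly like this; `validator` and `augment_rec` stand in for the real `import_validator` and the BookWorm/affiliate-server lookup:

```python
def validate_with_augmentation(rec: dict, validator, augment_rec) -> bool:
    """Validate; if that fails, augment missing fields and validate once more."""
    def passes(r: dict) -> bool:
        try:
            return bool(validator.validate(r))
        except Exception:  # e.g. a pydantic ValidationError
            return False

    if passes(rec):
        return True
    # Fill only fields that are currently empty, then try again.
    rec.update({k: v for k, v in augment_rec(rec).items() if v and not rec.get(k)})
    return passes(rec)
```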
title: NonEmptyStr
source_records: NonEmptyList[NonEmptyStr]
authors: NonEmptyList[Author]
publishers: NonEmptyList[NonEmptyStr]
publish_date: NonEmptyStr


class StrongIdentifierBookPlus(BaseModel):
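For a sense of what a differentiable/strong-identifier model could look like, here is a rough sketch based on the discussion above; it assumes pydantic v2 and plain field types rather than the PR's `NonEmptyStr`/`NonEmptyList` helpers, and it is not the actual class body:

```python
from pydantic import BaseModel, model_validator

class StrongIdentifierBookPlusSketch(BaseModel):
    title: str
    source_records: list[str]
    isbn_10: list[str] | None = None
    isbn_13: list[str] | None = None
    lccn: list[str] | None = None

    @model_validator(mode="after")
    def require_strong_identifier(self) -> "StrongIdentifierBookPlusSketch":
        # A record with a title is "differentiable" only if it also carries
        # at least one strong identifier.
        if not (self.isbn_10 or self.isbn_13 or self.lccn):
            raise ValueError("at least one of isbn_10, isbn_13, or lccn is required")
        return self
```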
If we wanted more descriptive terminology to reflect the intent of this check: I was using 'differentiatable' to mean that we can determine which specific edition the import data relates to, and that captures 'complete OR has at least one hard identifier'.
Well, it is great to see some attempt to ensure there is useful data, but "Poems" by "Smith", published by "Independently Published" on 1 Jan 1900, would pass this. Can we do better?
Closes #9440
Feature.
This PR both stops imports that are not complete per #9440 (i.e. having all of `title`, `authors`, and `publish_date`) and also attempts to stage records for any promise item that is not complete (based on BookWorm+AMZ).

The stats currently are not being recorded, as I am likely doing something wrong. My plan is to feed the Graphite output into Grafana so it's easy to see, promise item versus promise item, how things are going in terms of the percentage of records that are complete, per pallet (or whatever the correct terminology is).
Additionally, the stats are only recorded when BWB metadata is insufficient; they do not yet record AMZ insufficiency as measured against promise items in need of supplementation.
Technical
Testing
Screenshot
Stakeholders
@hornc
@mekarpeles