9440/augument more promise items with bookworm metadata if author or publish date missing #9587

Conversation

scottbarnes (Collaborator)

Closes #9440

Feature.

This PR both stops imports that are not complete per #9440 (i.e. that do not have all of title, authors, and publish_date) and attempts to stage records for any promise item that is not complete (based on BookWorm+AMZ).

The stats are not currently being recorded, as I am likely doing something wrong. My plan is to feed the Graphite output into Grafana so it's easy to compare promise items and see what percentage of records is complete, per pallet (or whatever the correct terminology is).

Additionally, the stats only record when BWB metadata is insufficient; they do not yet record when AMZ metadata is insufficient for a promise item in need of supplementation.

Technical

Testing

Screenshot

Stakeholders

@hornc
@mekarpeles

When an import record is "incomplete" per internetarchive#9440, this commit will look
to the `import_item` table to supplement that `rec` with data loaded by
BookWorm.

This attempt to supplement will apply to any import source if one of the
required-for-completeness fields (title, authors, publish_date) is
missing.

See internetarchive#9440.
This commit changes `promise_batch_imports.py` so that it no longer adds
variations of `????` in place of the following empty fields:
- `authors`
- `publish_date`
- `title`

These fields are once again required for all imports, including promise
items. `????` will also no longer be removed from these fields when
importing.

When one of these fields is missing, a request is made to BookWorm to
stage the item for import, if metadata can be found. A companion commit
changes the import process so if one of the above required fields is
missing, it will look to the `import_item` table for `pending` or `staged`
metadata.

See internetarchive#9440.
This commit updates the `statsd` wrapper to add support for the `gauge`
data type, and makes use of it to add a `gauge` to promise item imports
so that we can know, per batch, the complete/incomplete rate.
@scottbarnes scottbarnes marked this pull request as draft July 18, 2024 23:06
@github-actions github-actions bot added the Priority: 2 Important, as time permits. [managed] label Jul 19, 2024
# https://github.com/internetarchive/openlibrary/issues/9440
# import_validator().validate() requires more fields.
required_fields = ["title", "authors", "publish_date"]
has_all_required_fields = all(obj.get(field) for field in required_fields)
hornc (Collaborator)

I'm going to double check the data format for empty dates -- we want to check that the date is a valid, usable date rather than just checking that data is present. Strings like "None", "N/A", "unknown", "?" etc. shouldn't count, but there are a range of date formats, and I can't remember which are likely to be encountered.

We're specifically trying to deal with the data from BWB imports, so I'll check the range of possible values in that data.

hornc (Collaborator)

@scottbarnes , if these imports are coming from the promise items like
https://archive.org/details/bwb_daily_pallets_2024-07-30

specifically the .ProductJSON.PublicationDate of each entry in the JSON file, then I see values like this:

"20010112"
"20100101"
"20020201"
"20050131"
null
"197704"
"20020331"
"19680101"
"20170910"
"20230914"
"20240107"
"19710101"
"20121226"
"198601"
"197703"
"20041001"
"19881201"
"20121114"
"19860327"
"200310"
"1986"
null
"20081001"
"20040810"
"20170502"
null
"199512"
"200802"
"20120321"
"20210909"
"20180406"
"20120920"
"19870501"
"20090101"
"20121020"
"1976"

Digit strings of length between 4 and 8, or JSON null; it seems an empty string is also a rare possibility.

I did not see any of the wilder string values I was worried about. If you can confirm the data source is bwb_daily_pallets_2024-07-30 style JSON ProductJSON.PublicationDates we should be good.
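
A minimal sketch of a check for these observed shapes, for illustration only (the helper name and regex are assumptions, not code from this PR):

    import re

    # Accept only the shapes observed above: 4-, 6-, or 8-digit strings
    # (YYYY, YYYYMM, YYYYMMDD); reject None, '', and anything else.
    BWB_DATE_RE = re.compile(r'\d{4}(\d{2}){0,2}')

    def is_usable_bwb_publication_date(value: str | None) -> bool:
        return bool(value) and BWB_DATE_RE.fullmatch(value) is not None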

scottbarnes (Collaborator, Author)

@hornc, I fear I may have misunderstood the requirements and flow chart in #9440.

In terms of this PR, I was contemplating that the data source would be any import source, and that if a record was not complete, it should not be imported.

I think what I misunderstood was the desire for every imported record to eventually be complete. I understood this as a goal of eventually making it a threshold requirement that records be complete, after initially making it a requirement that records be differentiable and sane.

On this understanding, I was hoping to go straight to that point, but now think 'eventually' means that, in theory, a record that has already been imported can eventually be made complete by virtue of including a strong identifier. Is that correct?

As it stands, this PR would apply to records of any source and would only allow import if the record is complete (after having checked BookWorm/the affiliate server for more metadata).

In terms of the workflow diagram, under "Complete now or later?", if a record is not currently complete, once a metadata lookup is done by the strong ID, the idea is to still import it even if no further metadata is found, correct? So an incomplete but differentiable record with only a title and ISBN would still be imported, is that right?

Also, would you like this differentiable and sane requirement to apply to all imports, or only promise items?

hornc (Collaborator)

@scottbarnes A record "can eventually be made complete by virtue of including a strong identifier. Is that correct?" I guess in theory. When I was trying to see what could be done to complete some real ones, the existing record really was undifferentiatable. By adding an ISBN, you make a call as to what the correct book is. For some, I think deleting the undifferentiated record is just as good as adding a strong identifier. Basically, if you can do the work to locate the strong identifier, you may as well just import it fresh. Associating with an existing undifferentiated record doesn't really add anything.

An incomplete but differentiable record with only a title and ISBN should be imported. It'd be nice to attempt to get more metadata if possible, but that could be at a later stage by a different process. The goal would be to have it happen before a promised item was scanned. I think you're talking about an attempted lookup that fails to find anything; those need to be imported. We'd have a chance to try other lookups, and I think that's what we want to collect some stats on -- how many of these do we consistently fail to get full metadata for.

The differentiable and sane requirement should apply to all imports. I think undifferentiatable imports from any source could lead to the problematic kinds of mixed metadata ending up at scribe. Promise items seem to be the main source, but I think it applies equally to all sources.

scottbarnes (Collaborator, Author), Aug 8, 2024

@hornc, you are correct that with respect to a record with only a title and ISBN, I was talking about after an attempted lookup that failed to find more metadata.

I added a new commit that I hope is responsive to your vision. Now the following should be true:

  • if a record is incomplete, try to make it complete by fetching metadata from BookWorm/the affiliate server;
  • if a record is still incomplete after trying to fetch additional metadata, but has a title and a strong identifier, import it;
  • if a record is still incomplete after trying to fetch additional metadata and does not have a strong identifier, do not import it.

This means that a lookup for additional metadata will occur before the item is scanned (assuming that promise item imports aren't broken for some other reason).

Here, strong identifiers are limited to isbn_10, isbn_13, and lccn. The issue had mentioned OCN, but I am not actually sure what that is--is it an ocaid?

Currently this will not delete an undifferentiated record if there is a match, but will instead attempt to add metadata to empty fields on that existing, undifferentiated record. If the desire is still to delete such records, perhaps that could be a separate issue?
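
A rough sketch of that gate, for illustration only (the constant and helper names are assumptions, not the PR's actual code):

    # Illustrative sketch of the import gate described above.
    STRONG_IDENTIFIERS = ('isbn_10', 'isbn_13', 'lccn')
    REQUIRED_FOR_COMPLETE = ('title', 'authors', 'publish_date')

    def supplement_rec_with_staged_metadata(rec: dict) -> dict:
        """Hypothetical stand-in for the BookWorm/affiliate-server lookup
        against the `import_item` table; a no-op in this sketch."""
        return rec

    def should_import(rec: dict) -> bool:
        def is_complete(r: dict) -> bool:
            return all(r.get(field) for field in REQUIRED_FOR_COMPLETE)

        if not is_complete(rec):
            rec = supplement_rec_with_staged_metadata(rec)
        if is_complete(rec):
            return True
        # Still incomplete: import only if differentiable (title + strong ID).
        return bool(rec.get('title')) and any(
            rec.get(i) for i in STRONG_IDENTIFIERS
        )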

Additionally, there is no real work done in this PR to ensure the record is sane. The null values should drop out for promise items, at least for the publish_date, authors, and publishers fields, due to this function:

def clean_null(val: str | None) -> str | None:
    if val in ('', 'null', 'null--'):
        return None
    return val
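
For example, `clean_null('null--')` returns `None`, while `clean_null('1986')` is returned unchanged.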

To the end of better sanity checking, perhaps this PR could be dusted off: #8123.

Assuming this does what it purports to do, how is it looking at the moment, in terms of meeting the immediate goals?

(One issue of note: when an import fails to validate both as complete and as differentiable, the error message only mentions errors related to completeness. I did this on the theory that we want to encourage people to enter complete records and not just titles and strong identifiers, and also because I wasn't sure it was worth trying to combine the errors. I can do so if that is desired.)

hornc (Collaborator)

@scottbarnes those changes sound good, thank you!

An OCN is an OCLC number, which is assigned by OCLC, who operate WorldCat: https://search.worldcat.org/

Not deleting records is fine for this piece of work.

As for validating that the data fields are sane, we can look at what is required for dates in more detail.

I'll review all the code against the flow diagram and let you know. It seems good so far.

@cdrini cdrini added the State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] label Jul 29, 2024
cdrini (Collaborator), Jul 29, 2024

Marking as Blocked while we wait for the data analysis.

@hornc if it would be useful, we could mark this as "Ready for Review" and extend it in a future issue/PR to handle the none-ish values. Up to y'all.

This commit changes the import requirements such that a record must
either be:
1. complete, or
2. differentiable.

See internetarchive#9440 for the
definitions.
@hornc hornc added Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] and removed State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] labels Aug 11, 2024

@pytest.mark.parametrize('field', ['isbn_13'])
def test_validate_not_complete_no_strong_identifier(field):
    """An incomplete record without a strong identifier won't validate."""
hornc (Collaborator)

@scottbarnes we need to ensure that a complete record without a strong identifier will still validate; currently that case fails validation:

This is an example test I tried locally -- Pre-ISBN books with title+author+date should still validate.

def test_validate_differentiable_without_strong_identifier():
    """Confirm differentiable pre-ISBN and non LCCN books can still validate."""
    preisbn = {
        "title": "Beowulf",
        "author": {"name": "William Morris"},
        "publishers": ["Kelmscott Press"],
        "publish_date": "1895",
    }
    assert validator.validate(preisbn)

For some of those test cases it's hard to tell at a glance what a valid import example is.
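
A hedged guess at why the example above fails: the validator's model (quoted later in this thread) wants `authors` as a list plus a `source_records` field, so an adjusted version might look like this (the source record value is hypothetical, and this assumes the same `validator` fixture as the example above):

    def test_validate_differentiable_without_strong_identifier():
        """Confirm differentiable pre-ISBN and non-LCCN books can still validate."""
        preisbn = {
            "title": "Beowulf",
            "authors": [{"name": "William Morris"}],  # a list, not a single dict
            "publishers": ["Kelmscott Press"],
            "publish_date": "1895",
            "source_records": ["kelmscott:beowulf"],  # hypothetical entry
        }
        assert validator.validate(preisbn)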

scottbarnes (Collaborator, Author), Aug 12, 2024

@hornc, that is a good point about the test cases. I will try to clean that up, and also figure out what's going on with the tests such that everything that should pass passes and things that shouldn't don't.

In terms of importing via the API, with authors as a list, and a source_records field, but no isbn_* field, a very similar complete import record will import for me in the local development environment, unless it's from Amazon, which has a separate, pre-existing ISBN requirement:

❯ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
    "title": "Beowulf",
    "authors": [{"name": "William Morris"}],
    "publishers": ["Kelmscott Press"],
    "publish_date": "1895",
    "source_records": ["amazon:09123456789"]
}'
{"success": false, "error_code": "unhandled-exception", "error": "SourceNeedsISBN()"}%                                 

❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
    "title": "Beowulf",
    "authors": [{"name": "William Morris"}],
    "publishers": ["Kelmscott Press"],
    "publish_date": "1895",
    "source_records": ["not_amazon:09123456789"]
}'
{"authors": [{"key": "/authors/OL26A", "name": "William Morris", "status": "created"}], "success": true, "edition": {"key": "/books/OL60M", "status": "created"}, "work": {"key": "/works/OL31W", "status": "created"}}

Edit: I realized I should include an example that doesn't have anything ISBN-like at all, just to rule out an ISBN somehow being sourced from the source_records field:

❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
    "title": "Beowulf Part Deux",
    "authors": [{"name": "William Morris"}],
    "publishers": ["Kelmscott Press"],
    "publish_date": "2005",
    "source_records": ["not_amazon:XYZ"]
}'
{"authors": [{"key": "/authors/OL26A", "name": "William Morris", "status": "matched"}], "success": true, "edition": {"key": "/books/OL61M", "status": "created"}, "work": {"key": "/works/OL32W", "status": "created"}}

One last edit: I realized I might as well include just title, an isbn_10 field, and source_records:

❮ curl -X POST http://localhost:8080/api/import -H "Content-Type: application/json" -b ~/cookies.txt -d '{
    "title": "Beowulf The Third",
    "isbn_10": ["012345689"],
    "source_records": ["blah:123"]
}'
{"success": true, "edition": {"key": "/books/OL62M", "status": "created"}, "work": {"key": "/works/OL33W", "status": "created"}}

hornc (Collaborator)

Completing the testing has moved to #9753.

@mekarpeles mekarpeles marked this pull request as ready for review August 14, 2024 20:02
Comment on lines +59 to +68
def gauge(key: str, value: int, rate: float = 1.0) -> None:
    """
    Gauges are a constant data type. Ordinarily the rate should be 1.0.

    See https://statsd.readthedocs.io/en/v3.3/types.html#gauges
    """
    global client
    if client:
        pystats_logger.debug(f"Updating gauge {key} to {value}")
        client.gauge(key, value, rate=rate)
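
For context, a hedged sketch of how this wrapper might be called from the promise item batch code (the key naming, module path, and call site are illustrative assumptions, not the PR's actual code):

    from openlibrary.core import stats  # assumed location of the wrapper above

    def record_batch_completeness(batch_name: str, num_complete: int, num_total: int) -> None:
        """Illustrative call site: gauge the share of a batch that is complete."""
        if num_total:
            percent = int(num_complete / num_total * 100)
            stats.gauge(f'ol.imports.bwb.{batch_name}.percent_complete', percent)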
mekarpeles (Member)

potentially remove if unused or not working yet

@mekarpeles mekarpeles assigned mekarpeles and unassigned hornc Aug 14, 2024
mekarpeles (Member)

@hornc re: #9587 (review), I think the tests do pass as @scottbarnes said; we may have had a discrepancy between field names. Since everything else lgtm, merging

@mekarpeles mekarpeles merged commit b112069 into internetarchive:master Aug 14, 2024
4 checks passed
# This is the minimum to achieve a complete record. See:
# https://github.com/internetarchive/openlibrary/issues/9440
# import_validator().validate() requires more fields.
minimum_complete_fields = ["title", "authors", "publish_date"]
hornc (Collaborator)

@scottbarnes This check is effectively re-implementing the import_validator check; we could use the same validation check. If the import candidate doesn't validate, try augmenting it, then re-validate using the same checks:

title: NonEmptyStr
source_records: NonEmptyList[NonEmptyStr]
authors: NonEmptyList[Author]
publishers: NonEmptyList[NonEmptyStr]
publish_date: NonEmptyStr
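
A hedged sketch of that validate-augment-revalidate flow (the augmentation helper is hypothetical, and this assumes `import_validator` and pydantic's `ValidationError` are importable here):

    from pydantic import ValidationError

    def stage_bookworm_metadata(rec: dict) -> dict:
        """Hypothetical stand-in for the BookWorm/affiliate-server lookup."""
        return {}

    def validate_with_augmentation(rec: dict) -> None:
        """Run the same import_validator check before and after attempting
        to supplement the record; raises if still invalid."""
        try:
            import_validator().validate(rec)
            return
        except ValidationError:
            pass
        rec.update(stage_bookworm_metadata(rec))
        import_validator().validate(rec)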


class StrongIdentifierBookPlus(BaseModel):
hornc (Collaborator), Aug 15, 2024

If we wanted more descriptive terminology to reflect the intent of this check, I was using 'differentiatable' to mean that we can determine which specific edition the import data relates to; that captures 'complete OR has at least one hard identifier'.

LeadSongDog

Well, it is great to see some attempt to ensure there is useful data, but “Poems” by “Smith”, “Independently Published”, 1 Jan 1900 would pass this. Can we do better?

scottbarnes (Collaborator, Author)

#7865 should have stopped importing "Independently Published" records, and #9138 should have stopped some variations of the 1900 dates for Amazon and BWB imports.

#9753, which is still open, has proposed some additional requirements on fields. More suggestions are always welcome. :)

@scottbarnes scottbarnes deleted the 9440/augument-more-promise-items-with-bookworm-metadata-if-author-or-publish-date-missing branch August 30, 2024 22:21
Linked issue: Promise item imports need to augment metadata by any ASIN/ISBN10 if only title + ASIN is provided (#9440)