Community Batch Import Request: arabooks #9726

Open
@avidseeker

Description

I have a repository with metadata for 12K+ Arabic books: https://github.com/avidseeker/arabooks

If there is a bot that can mass-upload them, that would be a great addition to OpenLibrary, which currently lacks a lot of Arabic book coverage. (There isn't even an Arabic translation of OpenLibrary: #9673.)

Thanks in advance.


Edit:
To complete this issue, one would need to parse the TSV files found at https://github.com/avidseeker/arabooks and create JSONL files that look similar to this:

{"identifiers": {"open_textbook_library": ["1581"]}, "source_records": ["open_textbook_library:1581"], "title": "Legal Fundamentals of Healthcare Law", "languages": ["eng"], "subjects": ["Medicine", "Law"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2024", "authors": [{"name": "Tiffany Jackman"}], "lc_classifications": ["RA440", "KF385.A4"]}
{"identifiers": {"open_textbook_library": ["1580"]}, "source_records": ["open_textbook_library:1580"], "title": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us", "languages": ["eng"], "subjects": ["Humanities", "Literature, Rhetoric, and Poetry"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2023", "authors": [{"name": "Judy Young"}], "lc_classifications": ["PE1408"]}

The minimum required fields are: title, authors, publish_date, source_records, and publishers. The source_records value could come from the name of the source plus an identifier; for example, for loal-en.tsv from the Library of Arabic Literature, it might be "source_records": ["loal:9781479834129"] for the first item in the list.
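
As an illustration, here is a minimal sketch of deriving that value (the `loal` prefix and the use of the ISBN as the identifier are taken from the example above; the helper itself is hypothetical):

```python
# Hypothetical helper: build a source_records entry from a source prefix
# (derived from the TSV's source name) and a per-book identifier such as an ISBN.
def make_source_record(prefix: str, identifier: str) -> str:
    return f"{prefix}:{identifier}"

# make_source_record("loal", "9781479834129") -> "loal:9781479834129"
```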

Here are the publishers for each of the TSV files (a mapping sketch follows the list):

  1. awu-dam.tsv: Arab Writers Union
  2. lisanarb.tsv: contains a publisher entry
  3. loal-en.tsv and loal-ar.tsv: Library of Arabic Literature
  4. shamela.tsv: contains a publisher entry. Dates need to be merged from shamela-dates.tsv by matching on the same title entry.
  5. waqfeya.tsv: set as "publishers": ["????"], since publishers need to be determined on a case-by-case basis.
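
As a sketch, the per-file rules above could be encoded in a small mapping, with shamela.tsv's dates merged in via a title lookup. The column names `title` and `date` below are assumptions; check the actual headers in the repository:

```python
import csv

# Hypothetical mapping from TSV filename to a fixed publisher, or None when
# the publisher should be read from a column in the file itself.
PUBLISHERS = {
    "awu-dam.tsv": "Arab Writers Union",
    "lisanarb.tsv": None,            # publisher column in the file
    "loal-en.tsv": "Library of Arabic Literature",
    "loal-ar.tsv": "Library of Arabic Literature",
    "shamela.tsv": None,             # publisher column in the file
    "waqfeya.tsv": "????",           # unknown; must be resolved one by one
}

def load_shamela_dates(path: str = "shamela-dates.tsv") -> dict:
    """Build a title -> date lookup so shamela.tsv rows can be assigned dates.

    Assumes 'title' and 'date' columns; adjust to the real TSV headers.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return {row["title"]: row["date"] for row in csv.DictReader(f, delimiter="\t")}
```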

Specifically, the values taken from the TSVs and converted into JSONL would need to follow this schema. A script to do this would probably use Python's csv module to read each TSV file, put each row into the format specified by the import schema, call json.dumps() on each record, and write the result to a JSONL file; a sketch follows.
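
A minimal sketch of that conversion, assuming columns named `title`, `author`, `year`, and `isbn` (the real files may use different headers, and some sources may need an identifier other than an ISBN):

```python
import csv
import json

def tsv_to_jsonl(tsv_path, jsonl_path, source_prefix, publisher):
    """Convert one TSV of book metadata into OpenLibrary import JSONL."""
    with open(tsv_path, newline="", encoding="utf-8") as tsv, \
         open(jsonl_path, "w", encoding="utf-8") as out:
        for row in csv.DictReader(tsv, delimiter="\t"):
            record = {
                "title": row["title"],
                "authors": [{"name": row["author"]}],
                "publish_date": row["year"],
                "publishers": [publisher],
                "languages": ["ara"],  # use ["eng"] for loal-en.tsv
                "source_records": [f"{source_prefix}:{row['isbn']}"],
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical usage:
# tsv_to_jsonl("loal-ar.tsv", "loal-ar.jsonl", "loal", "Library of Arabic Literature")
```

Passing ensure_ascii=False to json.dumps keeps the Arabic titles readable in the output instead of escaping them to \uXXXX sequences.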

The output JSONL file could be tested using the endpoint from #8122, though you'd probably want to test with only a few records at a time rather than the whole file.
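
For example, a smoke test could post just the first few records (the endpoint URL below is a placeholder, not the real path; #8122 documents the actual endpoint and any required authentication):

```python
import itertools
import requests

# Placeholder URL; substitute the batch-import endpoint described in #8122.
ENDPOINT = "https://openlibrary.org/REPLACE-WITH-ENDPOINT-FROM-8122"

with open("loal-ar.jsonl", encoding="utf-8") as f:
    sample = "".join(itertools.islice(f, 5))  # only the first five records

response = requests.post(ENDPOINT, data=sample.encode("utf-8"))
print(response.status_code, response.text)
```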

Labels

  - Good First Issue: Easy issue. Good for newcomers. [managed]
  - Lead: @scottbarnes: Issues overseen by Scott (Community Imports)
  - Module: Import: Issues related to the configuration or use of importbot and other bulk import systems. [managed]
  - Needs: Help: Issues, typically substantial ones, that need a dedicated developer to take them on. [managed]
  - Priority: 3: Issues that we can consider at our leisure. [managed]
  - Type: Question: This issue doesn't require code. A question needs an answer. [managed]
