[REVIEW]: Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web #3557


Closed
39 of 40 tasks
whedon opened this issue Aug 2, 2021 · 41 comments
Labels: accepted, HTML, published, Python, recommend-accept, review, Shell

Comments

@whedon

whedon commented Aug 2, 2021

Submitting author: @AlbertWeichselbraun (Albert Weichselbraun)
Repository: https://github.com/weblyzard/inscriptis/
Version: 2.1.1
Editor: @sbenthall
Reviewer: @reality, @rlskoeser
Archive: 10.5281/zenodo.5562417

⚠️ JOSS reduced service mode ⚠️

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/6039d24c1ea4541fd544dfc398dcb5ca"><img src="https://joss.theoj.org/papers/6039d24c1ea4541fd544dfc398dcb5ca/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/6039d24c1ea4541fd544dfc398dcb5ca/status.svg)](https://joss.theoj.org/papers/6039d24c1ea4541fd544dfc398dcb5ca)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@reality & @rlskoeser, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @sbenthall know.

Please start on your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Review checklist for @reality

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@AlbertWeichselbraun) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @rlskoeser

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@AlbertWeichselbraun) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
@whedon

whedon commented Aug 2, 2021

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @reality, @rlskoeser it looks like you're currently assigned to review this paper 🎉.


⭐ Important ⭐

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

@whedon

whedon commented Aug 2, 2021

Wordcount for paper.md is 990

@whedon

whedon commented Aug 2, 2021

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.10 s (1206.6 files/s, 83471.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
HTML                            44            283             79           2602
Python                          50            828           1010           2035
JSON                            13              2              0            251
TeX                              1             38              0            246
reStructuredText                 5            168            143            240
Markdown                         2             88              0            234
YAML                             3             15             32             76
INI                              1              7              0             55
Bourne Shell                     1              8             11             17
make                             2              4              6             16
Dockerfile                       1              3              2             10
-------------------------------------------------------------------------------
SUM:                           123           1444           1283           5782
-------------------------------------------------------------------------------


Statistical information for the repository '8e1f5988f429255a9a903464' was
gathered on 2021/08/02.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Albert Weichselbraun           240         12158           6538           98.13
Fabian Odoni                     1             6              4            0.05
Max Goebel                       1             1              1            0.01
fabian                           2            18              6            0.13
k3njiy                           4           143            151            1.54
max                              2            15             12            0.14

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Albert Weichselbraun       3845           31.6          8.6               13.99
Fabian Odoni                  1           16.7         51.3                0.00
fabian                       15           83.3         38.9               26.67
k3njiy                       11            7.7          0.0                9.09
max                           1            6.7         31.5                0.00

@whedon

whedon commented Aug 2, 2021

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s11042-019-08328-z is OK
- 10.3115/v1/D14-1162 is OK
- 10.1109/JSYST.2015.2466439 is OK
- 10.1016/j.ins.2014.03.096 is OK
- 10.3390/fi13030059 is OK
- 10.1145/3430937 is OK
- 10.1080/14740338.2018.1531847 is OK
- 10.1080/00437956.1954.11659520 is OK

MISSING DOIs

- 10.1109/hicss.2016.133 may be a valid DOI for title: Extracting Opinion Targets from Environmental Web Coverage and Social Media Streams
- 10.18653/v1/2021.acl-long.558 may be a valid DOI for title: SpanNER: Named Entity Re-/Recognition as Span Prediction

INVALID DOIs

- None

@whedon

whedon commented Aug 2, 2021

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@whedon

whedon commented Aug 16, 2021

👋 @reality, please update us on how your review is going (this is an automated reminder).

@whedon

whedon commented Aug 16, 2021

👋 @rlskoeser, please update us on how your review is going (this is an automated reminder).

@debuos512

Hi, very sorry for the late review. I will be able to provide it by Monday 30th, hopefully earlier!

@debuos512

debuos512 commented Aug 30, 2021

If a task is not fulfilled completely, should I write the problem here?

  1. While the GitHub repository itself is annotated with the Apache 2 licence, the repository does not have a LICENSE file in the root.
  2. The documentation itself doesn't include a 'statement of need' in the sense that, in its intro, while it mentions it is layout-aware, it doesn't mention particularly that this will aid in knowledge extraction and data science conducted upon web data.
  3. Statement of need in the article should include a comparison to other HTML parsing and extraction libraries such as cheerio and beautifulsoup. I note that Lynx is used as an example of a text-only HTML parser that is not layout-aware, but it is not really an HTML extraction tool. Like you say, many NLP/CS/biomedical tasks use data from the Web - what software have they been using?
  4. Optional: The documentation is overall very good, but the example feels a bit short of what inscriptis can achieve practically. Something like extracting figures or names from a Wikipedia article would really illustrate the usefulness of the software, I think.
  5. Given it's on GitHub, users can easily gain support, but there are no contribution guidelines (usually found in the form of a CONTRIBUTING.md file)
  6. Other software should be cited, e.g. Lynx

@sbenthall

Thanks for this review, @reality !

@AlbertWeichselbraun

Dear @reality, dear @sbenthall

Thank you for this valuable feedback. Please find below a detailed description of how I addressed the suggestions outlined in the review. All the listed improvements were merged into the project's master branch a few minutes ago.

Best regards,
Albert

--

Issue 1: For historical reasons, the license has been in the COPYING file. I have renamed the file to LICENSE to address this comment.

Issue 2: Based on your suggestion, the project's documentation has been extended to include a statement of need. In addition, the introduction paragraph now also mentions that Inscriptis is particularly well suited for knowledge extraction and data science tasks.

Issue 3: Thank you for pointing this out. I have added a comparison with related software to the paper's "statement of need" section, which considers Beautiful Soup, lxml, Cheerio, HTML2Text, Lynx, jusText, TextSweeper, boilerpy3 and HARVEST. These libraries are discussed based on the following classification:

  1. libraries that focus on other use cases and therefore do not interpret HTML semantics, making them likely to produce serious conversion errors (Beautiful Soup, lxml, Cheerio);
  2. conversion software that provides reasonably good text representations of HTML (HTML2Text, Lynx); and
  3. content-extraction tools that focus on specialized tasks such as boilerplate removal and forum extraction (jusText, TextSweeper, boilerpy3, HARVEST).

In addition, the statement of need now emphasizes the importance of Inscriptis by pointing out that even the popular Common Crawl corpus does not use a layout-aware HTML to text conversion approach, which in turn affects derived datasets such as CCAligned, multilingual C4 and OSCAR, and techniques that build upon them (e.g., the mT5 language model).
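The practical difference between category 1 libraries and layout-aware conversion can be illustrated with a minimal, self-contained sketch. The two extractors below are hypothetical and built only on Python's standard-library html.parser; they are not Inscriptis' implementation, which handles CSS semantics, tables, lists and indentation far more thoroughly.

```python
from html.parser import HTMLParser


class NaiveExtractor(HTMLParser):
    """Concatenates all text nodes, ignoring document structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class BlockAwareExtractor(HTMLParser):
    """Starts a new line for block-level elements, approximating layout."""

    BLOCK_TAGS = {"h1", "h2", "h3", "p", "div", "ul", "ol", "li", "table", "tr", "br"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


html = "<h1>Title</h1><p>First paragraph.</p><ul><li>one</li><li>two</li></ul>"

naive = NaiveExtractor()
naive.feed(html)
naive_text = "".join(naive.chunks)

aware = BlockAwareExtractor()
aware.feed(html)
layout_text = "".join(aware.chunks).strip()

print(repr(naive_text))  # 'TitleFirst paragraph.onetwo' -- blocks run together
print(layout_text)       # each block element starts on its own line
```

The naive extractor runs adjacent block elements together ("TitleFirst paragraph."), which is exactly the kind of conversion error the statement of need attributes to libraries that do not interpret HTML semantics.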

Issue 4: Based on your feedback, I have added examples of Inscriptis' annotation support to the documentation that cover the following four use cases:

  • Wikipedia tables and table metadata
  • References to entities, missing entities and citations from Wikipedia
  • Posts and post metadata from the XDA developer forum
  • Code and metadata from Stackoverflow pages

The prior section on Inscriptis' annotation support still uses the simple "toy" example, since it provides an easy-to-understand use case of how annotations are supposed to work.

Issue 5: Thank you for this suggestion. I have added a CONTRIBUTING.md file to the repository, which has also been integrated into the Inscriptis documentation.

Issue 6: Based on your feedback, I have replaced the URLs to other software with citations.

@rlskoeser

@whedon generate pdf

@whedon

whedon commented Sep 3, 2021

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@rlskoeser

Apologies for the slow review, have been a bit overwhelmed by other commitments.

I worked through the instructions and opened a PR and couple of issues to correct and point out some minor problems I found in the directions; I think they should all be pretty trivial to address.

Thanks to the helpful review from @reality and your work to address them it seems like this is in good shape.

@rlskoeser

@whedon check references

@whedon

whedon commented Sep 3, 2021

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s11042-019-08328-z is OK
- 10.3115/v1/D14-1162 is OK
- 10.1109/JSYST.2015.2466439 is OK
- 10.1016/j.ins.2014.03.096 is OK
- 10.1109/hicss.2016.133 is OK
- 10.3390/fi13030059 is OK
- 10.1145/3430937 is OK
- 10.18653/v1/2021.acl-long.558 is OK
- 10.1080/14740338.2018.1531847 is OK
- 10.1080/00437956.1954.11659520 is OK
- 10.1109/WIIAT50758.2020.00065 is OK
- 10.1145/2487788.2487828 is OK
- 10.18653/v1/2020.emnlp-main.480 is OK
- 10.18653/v1/2021.naacl-main.41 is OK
- 10.14618/IDS-PUB-9021 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@AlbertWeichselbraun

Dear @rlskoeser, dear @sbenthall,

Thank you for your valuable feedback, and especially for spotting the reported documentation issues. Please find below a detailed description of how I addressed the outlined suggestions:

  1. I have merged the documentation fixes with the master branch.
  2. The documentation bug reported in #39 has been resolved.
  3. I have added the Inscriptis web service to the Python package and updated the documentation accordingly, resolving #40.

Best regards,
Albert


@sbenthall

@reality there are some items not checked off in your review.
Have these issues been addressed by the author?

@AlbertWeichselbraun

Dear @reality and @sbenthall,

@reality's review has triggered the improvements summarized in my comment from 3 September 2021.

To the best of my knowledge this should address all the issues raised in the review and the missing tasks in the review checklist. Please let me know if there is still any item that has not been completely fulfilled and/or needs improvement.

Cheers,
Albert :-)

@debuos512

debuos512 commented Oct 3, 2021

Hi,

I have marked all items as complete. Apologies for the delay in confirming. Also, great work on the manuscript, the case is much clearer now :-)

Many thanks,
Luke

@sbenthall

@whedon generate pdf

@whedon

whedon commented Oct 8, 2021

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@sbenthall

@whedon check references

@whedon

whedon commented Oct 8, 2021

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s11042-019-08328-z is OK
- 10.3115/v1/D14-1162 is OK
- 10.1109/JSYST.2015.2466439 is OK
- 10.1016/j.ins.2014.03.096 is OK
- 10.1109/hicss.2016.133 is OK
- 10.3390/fi13030059 is OK
- 10.1145/3430937 is OK
- 10.18653/v1/2021.acl-long.558 is OK
- 10.1080/14740338.2018.1531847 is OK
- 10.1080/00437956.1954.11659520 is OK
- 10.1109/WIIAT50758.2020.00065 is OK
- 10.1145/2487788.2487828 is OK
- 10.18653/v1/2020.emnlp-main.480 is OK
- 10.18653/v1/2021.naacl-main.41 is OK
- 10.14618/IDS-PUB-9021 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@sbenthall

@AlbertWeichselbraun could you please:

  • make a tagged release
  • archive the tagged release (on Zenodo, for example)

and report the version number and archive DOI in this thread.

@AlbertWeichselbraun

Dear @sbenthall,

I have tagged and archived a release on Zenodo.

Cheers,
Albert :-)

@sbenthall

@whedon set 10.5281/zenodo.5562417 as archive

@whedon

whedon commented Oct 15, 2021

OK. 10.5281/zenodo.5562417 is the archive.

@sbenthall

@whedon set 2.1.1 as version

@whedon

whedon commented Oct 15, 2021

OK. 2.1.1 is the version.

@sbenthall

Thank you for this excellent contribution, @AlbertWeichselbraun
I recommend this work to the editors for publication.

@sbenthall

@whedon recommend-accept

@whedon

whedon commented Oct 15, 2021

Attempting dry run of processing paper acceptance...

@whedon whedon added the recommend-accept Papers recommended for acceptance in JOSS. label Oct 15, 2021
@whedon

whedon commented Oct 15, 2021

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s11042-019-08328-z is OK
- 10.3115/v1/D14-1162 is OK
- 10.1109/JSYST.2015.2466439 is OK
- 10.1016/j.ins.2014.03.096 is OK
- 10.1109/hicss.2016.133 is OK
- 10.3390/fi13030059 is OK
- 10.1145/3430937 is OK
- 10.18653/v1/2021.acl-long.558 is OK
- 10.1080/14740338.2018.1531847 is OK
- 10.1080/00437956.1954.11659520 is OK
- 10.1109/WIIAT50758.2020.00065 is OK
- 10.1145/2487788.2487828 is OK
- 10.18653/v1/2020.emnlp-main.480 is OK
- 10.18653/v1/2021.naacl-main.41 is OK
- 10.14618/IDS-PUB-9021 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@whedon

whedon commented Oct 15, 2021

👋 @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof 👉 openjournals/joss-papers#2685

If the paper PDF and Crossref deposit XML look good in openjournals/joss-papers#2685, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

@arfon

arfon commented Oct 16, 2021

@whedon accept deposit=true

@whedon

whedon commented Oct 16, 2021

Doing it live! Attempting automated processing of paper acceptance...

@whedon whedon added accepted published Papers published in JOSS labels Oct 16, 2021
@whedon

whedon commented Oct 16, 2021

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

@whedon

whedon commented Oct 16, 2021

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited 👉 Creating pull request for 10.21105.joss.03557 joss-papers#2689
  2. Wait a couple of minutes, then verify that the paper DOI resolves https://doi.org/10.21105/joss.03557
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...

@arfon

arfon commented Oct 16, 2021

@reality, @rlskoeser – many thanks for your reviews here and to @sbenthall for editing this submission! JOSS relies upon the volunteer effort of people like you and we simply wouldn't be able to do this without you ✨

@AlbertWeichselbraun – your paper is now accepted and published in JOSS ⚡🚀💥

@arfon arfon closed this as completed Oct 16, 2021
@whedon

whedon commented Oct 16, 2021

🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](https://joss.theoj.org/papers/10.21105/joss.03557/status.svg)](https://doi.org/10.21105/joss.03557)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.03557">
  <img src="https://joss.theoj.org/papers/10.21105/joss.03557/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.03557/status.svg
   :target: https://doi.org/10.21105/joss.03557


We need your help!

Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following:
