[REVIEW]: Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web #3557
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @reality, @rlskoeser it looks like you're currently assigned to review this paper 🎉. Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post. ⭐ Important ⭐ If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿 To fix this do the following two things:
For a list of things I can do to help you, just type:
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
👋 @reality, please update us on how your review is going (this is an automated reminder).
👋 @rlskoeser, please update us on how your review is going (this is an automated reminder).
Hi, very sorry for the late review. I will be able to provide it by Monday 30th, hopefully earlier!
If a task is not fulfilled completely, should I write the problem here?
Thanks for this review, @reality!
Dear @reality, dear @sbenthall,
Thank you for this valuable feedback. Please find below a detailed description of how I addressed the suggestions outlined in the review. All the listed improvements were merged into the project's master branch a few minutes ago.
Best regards, --
Issue 1: For historical reasons, the license was kept in the COPYING file. I have renamed the file to LICENSE to address this comment.
Issue 2: Based on your suggestion, the project's documentation has been extended to include a statement of need. In addition, the introductory paragraph now also mentions that Inscriptis is particularly well suited for knowledge extraction and data science tasks.
Issue 3: Thank you for pointing this out. I have added a comparison with related software to the paper's "statement of need" section that considers Beautiful Soup, lxml, Cheerio, HTML2Text, Lynx, jusText, TextSweeper, boilerpy3 and HARVEST. These libraries have been discussed based on the following classification:
In addition, the statement of need now emphasizes the importance of Inscriptis by pointing out that even the popular Common Crawl corpus does not use a layout-aware HTML to text conversion approach, which in turn affects derived datasets such as CCAligned, multilingual C4 and OSCAR, and techniques that build upon them (e.g., the mT5 language model).
Issue 4: Based on your feedback, I have added examples of Inscriptis' annotation support to the documentation that cover the following four use cases:
The prior section on Inscriptis' annotation support still uses the simple "toy" example, since it provides an easy-to-understand use case of how annotations are supposed to work.
Issue 5: Thank you for this suggestion. I have added a
Issue 6: Based on your feedback, I have replaced the URLs to other software with citations.
@whedon generate pdf
Apologies for the slow review; I have been a bit overwhelmed by other commitments. I worked through the instructions and opened a PR and a couple of issues to correct and point out some minor problems I found in the directions; I think they should all be pretty trivial to address. Thanks to the helpful review from @reality and your work to address the points raised, it seems like this is in good shape.
@whedon check references
Dear @rlskoeser, dear @sbenthall, Thank you for your valuable feedback, and especially for spotting the reported documentation issues. Please find below a detailed description of how I addressed the outlined suggestions:
Best regards, --
@reality there are some items not checked off in your review.
Dear @reality and @sbenthall, @reality's review has triggered the improvements summarized in my comment from 3 September 2021. To the best of my knowledge this should address all the issues raised in the review and the missing tasks in the review checklist. Please let me know if there is still any item that has not been completely fulfilled and/or needs improvement. Cheers,
Hi, I have marked all items as complete. Apologies for the delay in confirming. Also, great work on the manuscript, the case is much clearer now :-) Many thanks,
@whedon generate pdf
@whedon check references
@AlbertWeichselbraun could you please:
and report the version number and archive DOI in this thread.
Dear @sbenthall, I have tagged and archived a release on Zenodo:
Cheers,
@whedon set 10.5281/zenodo.5562417 as archive
OK. 10.5281/zenodo.5562417 is the archive.
@whedon set 2.1.1 as version
OK. 2.1.1 is the version.
Thank you for this excellent contribution, @AlbertWeichselbraun
@whedon recommend-accept
👋 @openjournals/joss-eics, this paper is ready to be accepted and published. Check final proof 👉 openjournals/joss-papers#2685 If the paper PDF and Crossref deposit XML look good in openjournals/joss-papers#2685, then you can now move forward with accepting the submission by compiling again with the flag
@whedon accept deposit=true
🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦
🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨 Here's what you must now do:
Any issues? Notify your editorial technical team...
@reality, @rlskoeser – many thanks for your reviews here and to @sbenthall for editing this submission! JOSS relies upon the volunteer effort of people like you and we simply wouldn't be able to do this without you ✨ @AlbertWeichselbraun – your paper is now accepted and published in JOSS ⚡🚀💥
🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉 If you would like to include a link to your paper from your README use the following code snippets:
This is how it will look in your documentation:
We need your help! Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following:
Submitting author: @AlbertWeichselbraun (Albert Weichselbraun)
Repository: https://github.com/weblyzard/inscriptis/
Version: 2.1.1
Editor: @sbenthall
Reviewer: @reality, @rlskoeser
Archive: 10.5281/zenodo.5562417
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@reality & @rlskoeser, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @sbenthall know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Review checklist for @reality
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @rlskoeser
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper