Commit e23f79a

Fix/joss review 2021 08 30 (#37)
* fix: use LICENSE rather than COPYING for the license file.
* add: contribution guidelines. These guidelines are based on the guidelines used by the atom and gnome-todo projects and have been adapted to inscriptis.
* chg: refined documentation.
* add: more examples for annotation profiles.
* chg: improved examples.
* fix: documentation on annotation rules.
* wip: improved statement of need.
* add: improved statement of need based on the reviewer comments
* fix: grammar, spelling and linking errors.
* chg: extended guidelines.
* fix: spacing and phrasing of non-specialized tools.
* fix: codefactor issues.
* chg: do not consider conf.py for codefactor.io
* chg: do not consider conf.py for codefactor.io
1 parent 586dc22 commit e23f79a

11 files changed

+452
-44
lines changed

CONTRIBUTING.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
# Contributing to Inscriptis
2+
3+
First off, thank you for considering contributing to inscriptis.
4+
There are many ways you can contribute to the project, and these guidelines aim to support you in doing so.
5+
6+
1. [Reporting bugs and seeking support](#reporting-bugs-and-seeking-support)
7+
2. [Suggesting enhancements](#suggesting-enhancements)
8+
3. [Pull requests](#pull-requests) (contributing code)
9+
4. [Python style guide](#python-style-guide)
10+
11+
12+
## Reporting bugs and seeking support
13+
14+
Bugs and support requests are tracked as GitHub issues.
15+
16+
To create an effective, high-quality ticket, please include the following information in your
17+
ticket:
18+
19+
1. **Use a clear and descriptive title** for the issue to identify the problem. This also helps other users to quickly locate bug reports that affect them.
20+
2. **Describe the exact steps necessary for reproducing the problem** including at least information on
21+
- the affected URL
22+
- the command line parameters or function arguments you used
23+
3. Describe the **expected behavior**.
24+
4. Describe the **observed behavior**.
25+
5. Provide any additional information which might be helpful in reproducing and/or fixing this issue.
26+
27+
28+
## Suggesting enhancements
29+
30+
Enhancements are also tracked as GitHub issues and should contain the following information:
31+
32+
1. **A clear and descriptive title** helps other people to identify enhancements they like, so that they can also add their thoughts and suggestions.
33+
2. **Provide a step-by-step description** of the suggested enhancement.
34+
3. **Describe the current behavior** and **explain which behavior you expected to see instead** and why.
35+
36+
37+
## Pull requests
38+
39+
1. Ensure that your code complies with our [Python style guide](#python-style-guide).
40+
2. Write a unit test that covers your new code and put it into the `./tests` directory.
41+
3. Execute `tox .` in the project's root directory to ensure that your code passes the static code analysis, coding style guidelines and security checks.
42+
4. In addition, please document any new API functions in the Inscriptis documentation.
43+
44+
45+
## Python style guide
46+
47+
Inscriptis code should comply with
48+
- the [PEP8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/), and
49+
- the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
50+
51+
Please also ensure that
52+
1. functions are properly documented with docstrings that comply with the Google Python Style Guide, and
53+
2. any new code is covered by unit tests.
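
As an illustration of both requirements, the sketch below pairs a Google-style docstring with a matching unit test. The helper `collapse_whitespace` and its test are hypothetical and are not part of the Inscriptis code base; they only show the expected shape of a contribution.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces.

    Hypothetical helper used for illustration only.

    Args:
        text: The input string.

    Returns:
        The input with consecutive whitespace characters replaced by a
        single space and leading and trailing whitespace removed.
    """
    return re.sub(r"\s+", " ", text).strip()


def test_collapse_whitespace():
    """Unit test of the kind that would be placed in ./tests."""
    assert collapse_whitespace("  first \n\t second ") == "first second"
    assert collapse_whitespace("") == ""
```

The docstring sections (`Args`, `Returns`) follow the Google Python Style Guide, and the test function uses plain assertions so that a test runner such as pytest could collect it.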
54+

COPYING renamed to LICENSE

File renamed without changes.

README.rst

Lines changed: 136 additions & 11 deletions
@@ -32,7 +32,10 @@ inscriptis -- HTML to text conversion library, command line client and Web servi
3232

3333
A python based HTML to text conversion library, command line client and Web
3434
service with support for **nested tables**, a **subset of CSS** and optional
35-
support for providing an **annotated output**.
35+
support for providing an **annotated output**.
36+
37+
Inscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.
38+
3639
Please take a look at the
3740
`Rendering <https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md>`_
3841
document for a demonstration of inscriptis' conversion quality.
@@ -47,6 +50,38 @@ This document provides a short introduction to Inscriptis.
4750

4851
.. contents:: Table of contents
4952

53+
Statement of need - why inscriptis?
54+
===================================
55+
56+
1. Inscriptis provides a **layout-aware** conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements.
57+
58+
Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.
59+
60+
Beautiful Soup's `get_text()` function, for example, converts the following HTML list to the string `firstsecond`.
61+
62+
.. code-block:: HTML
63+
64+
<ul>
65+
<li>first</li>
66+
<li>second</li>
67+
</ul>
68+
69+
70+
Inscriptis, in contrast, not only returns the correct output
71+
72+
.. code-block::
73+
74+
* first
75+
* second
76+
77+
but also supports much more complex constructs such as nested tables and interprets a subset of HTML (e.g., `align`, `valign`) and CSS (e.g., `display`, `white-space`, `margin-top`, `vertical-align`) attributes that determine the text alignment. Whenever the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis), an accurate HTML to text conversion is essential.
78+
79+
2. Inscriptis supports `annotation rules <#annotation-rules>`_, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to
80+
81+
- provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.
82+
- assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). ``Inscriptis`` supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool `doccano <https://github.com/doccano/doccano>`_.
83+
- enable the use of ``Inscriptis`` for tasks such as content extraction (i.e., extract task-specific relevant content from a Web page) which rely on information on the HTML document's structure.
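
The difference between markup-agnostic and layout-aware conversion described in the first point can be sketched with the Python standard library alone. The two parser classes below are illustrative assumptions, not Inscriptis' implementation: the first simply concatenates text nodes, while the second starts a new bulleted line for every ``<li>`` element.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Concatenates text nodes and ignores all markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class ListAwareTextExtractor(HTMLParser):
    """Starts a new bulleted line for every <li> element."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.lines.append("* ")

    def handle_data(self, data):
        # Only text that follows an opened <li> is attached to a line.
        if self.lines and data.strip():
            self.lines[-1] += data.strip()


def extract(parser_cls, html):
    """Feed the HTML to a fresh parser instance and return it."""
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return parser


HTML = "<ul><li>first</li><li>second</li></ul>"

naive = "".join(extract(NaiveTextExtractor, HTML).parts)
layout_aware = "\n".join(extract(ListAwareTextExtractor, HTML).lines)

print(naive)         # firstsecond
print(layout_aware)  # one "* ..." line per list item
```

Even this toy version shows why structure matters: once list items are merged into a single token, downstream components can no longer tell where one item ends and the next begins.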
84+
5085

5186
Installation
5287
============
@@ -125,11 +160,8 @@ The inscript.py command line client supports the following parameters::
125160
-v, --version display version information
126161
127162

128-
Examples
129-
--------
130-
131163
HTML to text conversion
132-
~~~~~~~~~~~~~~~~~~~~~~~
164+
-----------------------
133165
convert the given page to text and output the result to the screen::
134166

135167
$ inscript.py https://www.fhgr.ch
@@ -144,7 +176,7 @@ convert HTML provided via stdin and save the output to output.txt::
144176

145177

146178
HTML to annotated text conversion
147-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
179+
---------------------------------
148180
convert and annotate HTML from a Web page using the provided annotation rules::
149181

150182
$ inscript.py https://www.fhgr.ch -r ./examples/annotation-profile.json
@@ -188,10 +220,11 @@ yields the following JSONL output
188220
The provided list of labels contains all annotated text elements with their
189221
start index, end index and the assigned label.
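
A consumer of such output might recover the annotated surface forms by slicing the text with each label's indices. The following sketch assumes a record with a ``text`` field and a ``label`` list of ``[start, end, label]`` entries, with an exclusive end index as in Python slices; the sample record and the helper name ``surface_forms`` are hypothetical.

```python
import json

# Hypothetical record shaped as described above (field names are assumed).
record = {
    "text": "Chur\n\nChur is the capital",
    "label": [[0, 4, "heading"], [6, 10, "emphasis"]],
}
jsonl_line = json.dumps(record)


def surface_forms(line):
    """Return (label, text span) pairs for one JSONL line."""
    doc = json.loads(line)
    return [(label, doc["text"][start:end])
            for start, end, label in doc["label"]]


print(surface_forms(jsonl_line))
```

If the end index turned out to be inclusive in a given export, the slice would need to be ``doc["text"][start:end + 1]`` instead.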
190222

223+
191224
Annotation postprocessors
192-
~~~~~~~~~~~~~~~~~~~~~~~~~
225+
-------------------------
193226
Annotation postprocessors enable the post processing of annotations to formats
194-
that are suitable for you particular application. Post processors can be
227+
that are suitable for your particular application. Post processors can be
195228
specified with the `-p` or `--postprocessor` command line argument::
196229

197230
$ inscript.py https://www.fhgr.ch \
@@ -233,7 +266,7 @@ Currently, inscriptis supports the following postprocessors:
233266

234267
.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/paper/images/annotations.png
235268
:align: left
236-
:alt: Annotations extracted from the Wikipedia entry for Chur wht the `--postprocess html` postprocessor.
269+
:alt: Annotations extracted from the Wikipedia entry for Chur with the `--postprocess html` postprocessor.
237270

238271
Snippet of the rendered HTML file created with the following command line options and annotation rules:
239272

@@ -293,6 +326,97 @@ The service also supports a version call::
293326
$ curl http://localhost:5000/version
294327

295328

329+
Example annotation profiles
330+
===========================
331+
332+
The following section provides a number of example annotation profiles illustrating the use of Inscriptis' annotation support.
333+
Each example presents the annotation rules used and an image that highlights a snippet of the annotated text on the converted web page, which has been
334+
created using the HTML postprocessor as outlined in Section `annotation postprocessors <#annotation-postprocessors>`_.
335+
336+
Wikipedia tables and table metadata
337+
-----------------------------------
338+
339+
340+
The following annotation rules extract tables from Wikipedia pages, and annotate table headings that are typically used to indicate column or row headings.
341+
342+
.. code-block:: json
343+
344+
{
345+
"table": ["table"],
346+
"th": ["tableheading"],
347+
"caption": ["caption"]
348+
}
349+
350+
The figure below outlines an example table from Wikipedia that has been annotated using these rules.
351+
352+
.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-table-annotation.png
353+
:alt: Table and table metadata annotations extracted from the Wikipedia entry for Chur.
354+
355+
356+
References to entities, missing entities and citations from Wikipedia
357+
---------------------------------------------------------------------
358+
359+
This profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn't perfect, since it also annotates `[ edit ]` links.
360+
361+
.. code-block:: json
362+
363+
{
364+
"a#title": ["entity"],
365+
"a#class=new": ["missing"],
366+
"#class=reference": ["citation"]
367+
}
368+
369+
The figure shows entities and citations that have been identified on a Wikipedia page using these rules.
370+
371+
.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-entry-annotation.png
372+
:alt: Metadata on entries, missing entries and citations extracted from the Wikipedia entry for Chur.
373+
374+
375+
376+
377+
378+
Posts and post metadata from the XDA developer forum
379+
----------------------------------------------------
380+
381+
The annotation rules below extract posts with metadata on the post's time, user and the user's job title from the XDA developer forum.
382+
383+
.. code-block:: json
384+
385+
{
386+
"article#class=message-body": ["article"],
387+
"li#class=u-concealed": ["time"],
388+
"#itemprop=name": ["user-name"],
389+
"#itemprop=jobTitle": ["user-title"]
390+
}
391+
392+
The figure illustrates the annotated metadata on posts from the XDA developer forum.
393+
394+
.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/xda-posts-annotation.png
395+
:alt: Posts and post metadata extracted from the XDA developer forum.
396+
397+
398+
399+
Code and metadata from Stackoverflow pages
400+
------------------------------------------
401+
The rules below extract code and metadata on users and comments from Stackoverflow pages.
402+
403+
.. code-block:: json
404+
405+
{
406+
"code": ["code"],
407+
"#itemprop=dateCreated": ["creation-date"],
408+
"#class=user-details": ["user"],
409+
"#class=reputation-score": ["reputation"],
410+
"#class=comment-date": ["comment-date"],
411+
"#class=comment-copy": ["comment-comment"]
412+
}
413+
414+
Applying these rules to a Stackoverflow page on text extraction from HTML yields the following snippet:
415+
416+
.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/stackoverflow-code-annotation.png
417+
:alt: Code and metadata from Stackoverflow pages.
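
Most rule keys in the profiles above appear to follow the pattern ``tag#attribute=value``, where each part is optional. The sketch below splits keys of that shape into their components; the function ``parse_rule_key`` is a hypothetical helper inferred from these examples and is not Inscriptis' own rule parser.

```python
def parse_rule_key(key):
    """Split a rule key into (tag, attribute, value); missing parts are None."""
    tag, _, attr_part = key.partition("#")
    if not attr_part:
        # Plain tag rule such as "table" or "code".
        return (tag or None, None, None)
    attr, _, value = attr_part.partition("=")
    return (tag or None, attr, value or None)


rules = {
    "table": ["table"],
    "a#title": ["entity"],
    "a#class=new": ["missing"],
    "#itemprop=name": ["user-name"],
}

for key, labels in rules.items():
    print(key, "->", parse_rule_key(key), labels)
```

Read this way, ``a#class=new`` targets ``a`` elements whose ``class`` attribute is ``new``, whereas ``#itemprop=name`` matches any element carrying that attribute value.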
418+
419+
296420
Advanced topics
297421
===============
298422

@@ -354,7 +478,7 @@ The following options are available for fine tuning inscriptis' HTML rendering:
354478
parameter `indentation='extended'` to also use indentation for tags such as
355479
`<div>` and `<span>` that do not provide indentation in their standard
356480
definition. This strategy is the default in `inscript.py` and many other
357-
tools such as lynx. If you do not want extended indentation you can use the
481+
tools such as Lynx. If you do not want extended indentation you can use the
358482
parameter `indentation='standard'` instead.
359483
360484
2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions
@@ -380,10 +504,11 @@ The following options are available for fine tuning inscriptis' HTML rendering:
380504
config = ParserConfig(css=css)
381505
parser = Inscriptis(html_tree, config)
382506
text = parser.get_text()
383-
507+
384508
385509
Changelog
386510
=========
387511
388512
A full list of changes can be found in the
389513
`release notes <https://github.com/weblyzard/inscriptis/releases>`_.
514+

docs/conf.py

Lines changed: 4 additions & 4 deletions
@@ -37,15 +37,17 @@
3737
extensions = ['sphinx.ext.autodoc',
3838
'sphinx.ext.viewcode',
3939
'sphinx.ext.githubpages',
40-
'sphinx.ext.napoleon']
40+
'sphinx.ext.napoleon',
41+
'myst_parser']
4142

4243
# Add any paths that contain templates here, relative to this directory.
4344
templates_path = ['_templates']
4445

4546
# The suffix(es) of source filenames.
4647
# You can specify multiple suffix as a list of string:
4748
#
48-
source_suffix = ['.rst', '.md']
49+
source_suffix = {'.rst': 'restructuredtext',
50+
'.md': 'markdown'}
4951

5052
# The master toctree document.
5153
master_doc = 'index'
@@ -193,5 +195,3 @@
193195

194196
# A list of files that should not be packed into the epub file.
195197
epub_exclude_files = ['search.html']
196-
197-

docs/contributing.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
../CONTRIBUTING.md

docs/index.rst

Lines changed: 2 additions & 1 deletion
@@ -13,8 +13,9 @@ Contents:
1313
.. toctree::
1414
:maxdepth: 2
1515

16-
installation
16+
Documentation <README>
1717
benchmarking
18+
contributing
1819
inscriptis-module-documentation
1920

2021

docs/installation.rst

Lines changed: 0 additions & 19 deletions
This file was deleted.
