* fix: use LICENSE rather than COPYING for the license file.
* add: contribution guidelines
These guidelines are based on those used by the Atom
and gnome-todo projects and have been adapted to Inscriptis.
* chg: refined documentation.
* add: more examples for annotation profiles.
* chg: improved examples.
* fix: documentation on annotation rules.
* wip: improved statement of need.
* add: improved statement of need based on the reviewer comments
* fix: grammar, spelling and linking errors.
* chg: extended guidelines.
* fix: spacing and phrasing of non-specialized tools.
* fix: codefactor issues.
* chg: do not consider conf.py for codefactor.io
Bugs and support requests are tracked as GitHub issues.
+
+To create an effective and high-quality ticket, please include the following information in your
+ticket:
+
+1. **Use a clear and descriptive title** for the issue to identify the problem. This also helps other users to quickly locate bug reports that affect them.
+2. **Describe the exact steps necessary for reproducing the problem** including at least information on
+   - the affected URL
+   - the command line parameters or function arguments you used
+3. What would have been the **expected behavior**?
+4. Describe the **observed behavior**.
+5. Provide any additional information which might be helpful in reproducing and/or fixing this issue.
+
+
+## Suggesting enhancements
+
+Enhancements are also tracked as GitHub issues and should contain the following information:
+
+1. **A clear and descriptive title** helps other people to identify enhancements they like, so that they can also add their thoughts and suggestions.
+2. **Provide a step-by-step description** of the suggested enhancement.
+3. **Describe the current behavior** and **explain which behavior you expected to see instead** and why.
+
+
+## Pull requests
+
+1. Ensure that your code complies with our [Python style guide](#python-style-guide).
+2. Write a unit test that covers your new code and put it into the `./tests` directory.
+3. Execute `tox .` in the project's root directory to ensure that your code passes the static code analysis, coding style guidelines and security checks.
+4. In addition, please document any new API functions in the Inscriptis documentation.
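As a rough illustration of the pull request steps on unit tests and documentation, the sketch below pairs a function documented with a Google-style docstring with a small test. The helper `normalize_whitespace` and its test are hypothetical and not part of Inscriptis; the test is written as a plain function with assertions, assuming a pytest-style test runner.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces.

    Hypothetical helper that only serves as an illustration.

    Args:
        text: The text to normalize.

    Returns:
        The text without leading or trailing whitespace and with every
        internal whitespace run replaced by a single space.
    """
    return re.sub(r"\s+", " ", text).strip()


def test_normalize_whitespace():
    """Cover the new function with a small unit test."""
    assert normalize_whitespace("  first \n second\t") == "first second"
    assert normalize_whitespace("") == ""


if __name__ == "__main__":
    # Allows running the sketch directly without a test runner.
    test_normalize_whitespace()
```

A test like this would be placed in the `./tests` directory so that it is picked up when the checks are run.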
+
+
+## Python style guide
+
+Inscriptis code should comply with
+- the [PEP8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/), and
+- the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
+
+Please also ensure that
+1. functions are properly documented with docstrings that comply with the Google Python Style Guide, and
@@ -32,7 +32,10 @@ inscriptis -- HTML to text conversion library, command line client and Web servi
 
 A python based HTML to text conversion library, command line client and Web
 service with support for **nested tables**, a **subset of CSS** and optional
-support for providing an **annotated output**.
+support for providing an **annotated output**.
+
+Inscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.
document for a demonstration of inscriptis' conversion quality.
@@ -47,6 +50,38 @@ This document provides a short introduction to Inscriptis.
 
 .. contents:: Table of contents
 
+Statement of need - why inscriptis?
+===================================
+
+1. Inscriptis provides a **layout-aware** conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements.
+
+   Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.
+
+   Beautiful Soup's `get_text()` function, for example, converts the following HTML list to the string `firstsecond`.
+
+   .. code-block:: HTML
+
+      <ul>
+        <li>first</li>
+        <li>second</li>
+      </ul>
+
+
+   Inscriptis, in contrast, not only returns the correct output
+
+   .. code-block::
+
+      * first
+      * second
+
+   but also supports much more complex constructs such as nested tables and interprets a subset of HTML (e.g., `align`, `valign`) and CSS (e.g., `display`, `white-space`, `margin-top`, `vertical-align`) attributes that determine the text alignment. Whenever the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis), an accurate HTML to text conversion is essential.
+
+2. Inscriptis supports `annotation rules <#annotation-rules>`_, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to
+
+   - provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.
+   - assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). ``Inscriptis`` supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool `doccano <https://github.com/doccano/doccano>`_.
+   - enable the use of ``Inscriptis`` for tasks such as content extraction (i.e., extracting task-specific relevant content from a Web page) which rely on information on the HTML document's structure.
+
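The difference between structure-agnostic and list-aware conversion described in the statement of need can be sketched with Python's standard library alone. This is a minimal illustration and not how Inscriptis or Beautiful Soup are implemented; the classes `NaiveTextExtractor` and `ListAwareTextExtractor` and the two wrapper functions are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Concatenate all text nodes and ignore the HTML structure."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data.strip())


class ListAwareTextExtractor(HTMLParser):
    """Start a new bulleted line for every <li> element."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.lines.append("* ")
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Only text inside a list item is kept in this simplified sketch.
        if self.in_item:
            self.lines[-1] += data.strip()


def naive_text(html: str) -> str:
    parser = NaiveTextExtractor()
    parser.feed(html)
    return "".join(parser.parts)


def list_aware_text(html: str) -> str:
    parser = ListAwareTextExtractor()
    parser.feed(html)
    return "\n".join(parser.lines)


HTML = "<ul><li>first</li><li>second</li></ul>"

print(naive_text(HTML))       # firstsecond
print(list_aware_text(HTML))
# * first
# * second
```

The sketch handles only flat lists; nested tables and the HTML and CSS attributes mentioned above require considerably more layout logic.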
 
 Installation
 ============
@@ -125,11 +160,8 @@ The inscript.py command line client supports the following parameters::
   -v, --version display version information
 
 
-Examples
---------
-
 HTML to text conversion
-~~~~~~~~~~~~~~~~~~~~~~~
+-----------------------
 convert the given page to text and output the result to the screen::
 
   $ inscript.py https://www.fhgr.ch
@@ -144,7 +176,7 @@ convert HTML provided via stdin and save the output to output.txt::
 
 
 HTML to annotated text conversion
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------------
 convert and annotate HTML from a Web page using the provided annotation rules::
-   :alt: Annotations extracted from the Wikipedia entry for Chur wht the `--postprocess html` postprocessor.
+   :alt: Annotations extracted from the Wikipedia entry for Chur with the `--postprocess html` postprocessor.
 
 Snippet of the rendered HTML file created with the following command line options and annotation rules:
 
@@ -293,6 +326,97 @@ The service also supports a version call::
   $ curl http://localhost:5000/version
 
 
+Example annotation profiles
+===========================
+
+This section provides a number of example annotation profiles illustrating the use of Inscriptis' annotation support.
+Each example presents the annotation rules used and an image highlighting a snippet of the annotated text on the converted web page, which has been
+created using the HTML postprocessor as outlined in Section `annotation postprocessors <#annotation-postprocessors>`_.
+
+Wikipedia tables and table metadata
+-----------------------------------
+
+
+The following annotation rules extract tables from Wikipedia pages, and annotate table headings that are typically used to indicate column or row headings.
+
+.. code-block:: json
+
+   {
+      "table": ["table"],
+      "th": ["tableheading"],
+      "caption": ["caption"]
+   }
+
+The figure below shows an example table from Wikipedia that has been annotated using these rules.
+This profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn't perfect, since it also annotates `[ edit ]` links.
+
+.. code-block:: json
+
+   {
+      "a#title": ["entity"],
+      "a#class=new": ["missing"],
+      "class=reference": ["citation"]
+   }
+
+The figure shows entities and citations that have been identified on a Wikipedia page using these rules.
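The keys used in the profiles above appear to follow the patterns `tag`, `tag#attribute`, `tag#attribute=value` and `attribute=value`. The sketch below illustrates how such keys could be matched against an HTML tag and its attributes. The key syntax is inferred from the examples only, and `RuleKey`, `parse_rule_key` and `labels_for` are hypothetical helpers, not the Inscriptis implementation.

```python
from typing import Dict, List, NamedTuple, Optional


class RuleKey(NamedTuple):
    tag: Optional[str]        # None matches any tag
    attribute: Optional[str]  # None means only the tag name is tested
    value: Optional[str]      # None matches any attribute value


def parse_rule_key(key: str) -> RuleKey:
    """Split a rule key into its tag, attribute and value parts."""
    if "#" not in key and "=" not in key:
        return RuleKey(tag=key, attribute=None, value=None)
    tag, _, attr_part = key.rpartition("#")
    attribute, _, value = attr_part.partition("=")
    return RuleKey(tag or None, attribute, value or None)


def labels_for(rules: Dict[str, List[str]], tag: str,
               attrs: Dict[str, str]) -> List[str]:
    """Return the labels of all rules that match the given tag."""
    labels = []
    for key, rule_labels in rules.items():
        rule = parse_rule_key(key)
        if rule.tag is not None and rule.tag != tag:
            continue
        if rule.attribute is not None:
            if rule.attribute not in attrs:
                continue
            # Assumption: a value matches if it equals one of the
            # whitespace-separated tokens of the attribute (e.g., classes).
            tokens = attrs[rule.attribute].split()
            if rule.value is not None and rule.value not in tokens:
                continue
        labels.extend(rule_labels)
    return labels


RULES = {
    "a#title": ["entity"],
    "a#class=new": ["missing"],
    "class=reference": ["citation"],
}

print(labels_for(RULES, "a", {"href": "/wiki/Chur", "title": "Chur"}))
# ['entity']
print(labels_for(RULES, "sup", {"class": "reference"}))
# ['citation']
```

Under these assumptions, a link that carries both a `title` attribute and the class `new` would receive the labels `entity` and `missing`.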