From f7f3e971eebaf642d287021fde54e42b6cd60111 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 15 Apr 2022 12:01:29 +0300 Subject: [PATCH 01/18] Create alto-4-4.xsd --- v4/alto-4-4.xsd | 1250 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1250 insertions(+) create mode 100644 v4/alto-4-4.xsd diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd new file mode 100644 index 0000000..81e96e6 --- /dev/null +++ b/v4/alto-4-4.xsd @@ -0,0 +1,1250 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ALTO (analyzed layout and text object) stores layout information and + OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. + ALTO is a standardized XML format to store layout and content information. + It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), + where METS provides metadata and structural information while ALTO contains content and physical information. + + + + + + + + Describes general settings of the alto file like measurement units and metadata + + + + + Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. + + + + + + Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. + This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags + + + + + + + Describes alternative hierarchical orderings of the page (i.e. total orders over its segments, for linear text flow), + in addition to the explicit flat reading order defined by @IDNEXT on the block level, + and the implicit flat reading order implied by the segment element ordering. + + + + + + The root layout element. + + + + + + Schema version of the ALTO file. 
+ + + + + + + + + + Element deprecated. 'Processing' should be used instead. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + There are following variation of tag types available: + LayoutTag – criteria about arrangement or graphical appearance + StructureTag – criteria about grouping or formation + RoleTag – criteria about function or mission + NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) + OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. + + + + + + + + + + + + + + + + Defines one or more reading orders within the + page. Groups may be either unordered or ordered and can + contain other groups, e.g. a page containing + unrelated texts that are ordered individually + would be encoded as an UnorderedGroup containing + multiple OrderedGroups. The granularity of + elements can vary inside groups. + + + + + + + + + + + + + A reference to an element such as a block, TextLine, String, or Glyph. + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, String, or Glyph. + + + + + + + Optionally annotates the role of the + referenced element in the reading order + with one or more tags. Examples could be + interlinear additions or marginalia. + + + + + + + + A group containing ordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is ordered). + + + + + + + + + + + + + + Optionally annotates the role of the + group in the reading order + with one or more tags. Examples could be + distinguishing + parallel texts or apparatus criticus and + main text. + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, or String. + + + + + + + + A group containing unordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is arbitrary). 
+ + + + + + + + + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, or String. + + + + + + + Gives brief information about original page quality + + + + + + + + + + + + + + Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information + + + + + + Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. + + + + + + + + + + + + Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). + + + + + + + + + One page of a book or journal. + + + + + The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. + + + + + The area between the printspace and the left border of a page. May contain margin notes. + + + + + The area between the printspace and the right border of a page. May contain margin notes. + + + + + The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. + + + + + Rectangle covering the printed area of a page. Page number and running title are not part of the print space. + + + + + + + Any user-defined class like title page. + + + + + + + + + The number of the page within the document. + + + + + The page number that is printed on the page. + + + + + + + + A link to the processing description that has been used for this page. + + + + + Estimated percentage of OCR Accuracy in range from 0 to 100 + + + + + + + + + + + + + A text style defines font properties of text. + + + + + + + A paragraph style defines formatting properties of text blocks. + + + + + Indicates the alignement of the paragraph. Could be left, right, center or justify. + + + + + + + + + + + + + Left indent of the paragraph in relation to the column. + + + + + Right indent of the paragraph in relation to the column. 
+ + + + + Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. + + + + + Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. + + + + + + + + + + + + + + + + + + + + + + + + + + + Group of available block types + + + + + A block of text. + + + + + A picture or image. + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + A block that consists of other blocks + + + + + + + Base type for any kind of block on the page. + + + + + + + + + + + + + + + Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. + + + + + The next block in reading order of the page (if ReadingOrder is not specified, and elements are not in order). + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + + + A white space. + + + + + + + + + + Type of the substitution (if any). + + + + + + + + + + + + + + + Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). + + + + + + + + + + Any alternative for the word. + Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. + The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". + As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. + Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. + + + + + + + Identifies the purpose of the alternative. + + + + + + + + A sequence of chars. Strings are separated by white spaces or hyphenation chars. 
+ + + + + + + + + + + + + + + + + + + + Content of the substitution. + + + + + + Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Attribute to record language of the string. The language should be recorded at the highest level possible. + + + + + + A region on a page + + + + + + + + + + + + + + + + + + A list of points + + + + + + Describes the bounding shape of a block, if it is not rectangular. + + + + + + + + + + Describes the inline base direction and line orientation of a line or of all lines inside a text block. + The meaning of these terms is defined by the W3C writing modes document: + These values should correspond to the base direction set in the BiDi algorithm to the respective elements during Unicode encoding. A value of "ttb" (top-to-bottom) implies a base direction of left-to-right, a value of "btt" (bottom-to-top) a base direction of right-to-left. + + + + + + + + + + + A polygon shape. + + + + + + An ellipse shape. HPOS and VPOS describe the center of the ellipse. + HLENGTH and VLENGTH are the width and height of the described ellipse. + The attribute ROTATION tells the rotation of the e.g. text or + illustration within the block. The value is in degrees counterclockwise. + + + + + + + + + + A circle shape. HPOS and VPOS describe the center of the circle. + + + + + + + + Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. + + + + The font name. + + + + + + + The font size, in points (1/72 of an inch). 
+ + + + + Font color as RGB value + + + + + + + Serif or Sans-Serif + + + + + + + + + fixed or proportional + + + + + + + + + + + All measurement values inside the alto file are related to + this unit, except the font size. + Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. + The upper left corner of the page is defined as coordinate (0/0). + + values meaning: + mm10: 1/10th of millimeter + inch1200: 1/1200th of inch + pixel: 1 pixel + + The values for pixel will be related to the resolution of the image based + on which the layout is described. Incase the original image is not known + the scaling factor can be calculated based on total width and height of + the image and the according information of the PAGE element. + + + + + + + + + + + Information to identify the image file from which the OCR text was created. + + + + + + + + + + + + + + + + + + + A unique identifier for the image file. This is drawn from MIX. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + + + + + + + A unique identifier for the document. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + Deprecated. processingStepType should be used instead. + Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. + + + + + + + + + + Description of the processing step. 
+ + + + + Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. + + + + + Date or DateTime the image was processed. + + + + + Identifies the organizationlevel producer(s) of the processed image. + + + + + An ordinal listing of the image processing steps performed. For example, "image despeckling." + + + + + A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. + + + + + + + + + + + + + + + + + + + + + Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. + + + + + The name of the organization or company that created the application. + + + + + The name of the application. + + + + + The version of the application. + + + + + A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. + + + + + + + + + + List of any combination of font styles + + + + + + + + + + + + + + + + + + + + + + + A block that consists of other blocks + + + + + + + + + A user defined string to identify the type of composed block (e.g. table, advertisement, ...) + + + + + An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. + + + + + + + + A picture or image. + + + + + + A user defined string to identify the type of illustration like photo, map, drawing, chart, ... + + + + + A link to an image which contains only the illustration. + + + + + + + + A graphic used to separate blocks. Usually a line or rectangle. 
+ + + + + + + + A block of text. + + + + + + + A single line of text. + + + + + + + + + + + + + A hyphenation char. Can appear only at the end of a line. + + + + + + + + + + + + + + + + + + + + + Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. + + + + + Attribute to record language of the textline. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Indicates the inline base direction of this TextLine. Overrides the value on elements higher in the hierarchy. + + + + + + + + Attribute deprecated. LANG should be used instead. + + + + + Attribute to record language of the textblock. + + + + + Indicates the inline base direction of the TextBlock. + + + + + + + + + + + The xml data wrapper element XmlData is used to contain XML encoded metadata. + The content of an XmlData element can be in any namespace or in no namespace. + As permitted by the XML Schema Standard, the processContents attribute value for the + metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are + identified by means of an XML schemaLocation attribute, then an XML processor will validate + the elements for which it can find declarations. If a source schema is not identified, or cannot be + found at the specified schemaLocation, then an XML validator will check for well-formedness, + but otherwise skip over the elements appearing in the XmlData element. + + + + + + + + + + + + + Type can be used to classify and group the information within each tag element type. + + + + + Content / information value of the tag. + + + + + Description text for tag information for clarification. + + + + + Any URI for authority or description relevant information. + + + + + + + Modern OCR software stores information on glyph level. 
A glyph is essentially a character or ligature. + Accordingly the value for the glyph element will be defined as follows: + Pre-composed representation = base + combining character(s) (decomposed representation) + See http://www.fileformat.info/info/unicode/char/0101/index.htm + "U+0101" = (U+0061) + (U+0304) + "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. + + Each glyph has its own coordinate information and must be separately addressable as a distinct object. + Correction and verification processes can be carried out for individual characters. + + Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on glyph level. + In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. + The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. + The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly. + + The glyph elements are in order of the word. Each glyph need to be recorded to built up the whole word sequence. + + The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. + Due to post-processing steps such as correction the values of both attributes may be inconsistent. + + + + + + + + + + + CONTENT contains the precomposed representation (combining character) of the character from the parent String element. + The sequence position of the Gylph element matches the position of the character in the String. + + + + + + + + + + + + + This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the glyph where 1 is certain. + This attribute is optional. If it is not available, the default value for the glyph is “0”. 
+ The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element. + + + + + + + + + + + + + + + + + + Alternative (combined) character for the glyph, outlined by OCR engine or similar recognition processes. + In case the variant are two (combining) characters, two characters are outlined in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + Details for different use-cases see on the samples on GitHub. + + + + + + Each Variant represents an option for the glyph that the OCR software detected as possible alternatives. + In case the variant are two (combining) characters, two characters are outlined in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + Details for different use-cases see on the samples on GitHub. + + + + + + + + + + + + + This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. + This attribute is optional. If it is not available, the default value for the variant is “0”. + The VC attribute semantic is the same as the GC attribute on the Glyph element. + + + + + + + + + + + From 002d6bb0e0d424048f55faab1ed4b612c4ce9132 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 15 Apr 2022 13:03:14 +0300 Subject: [PATCH 02/18] Update alto-4-4.xsd Issue 55 (https://github.com/altoxml/schema/issues/55) --- v4/alto-4-4.xsd | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 81e96e6..9fdcc09 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -110,6 +110,8 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 2. Add support for explicit reading order definitions with "ReadingOrder" element containing "UnorderedGroup"s, "OrderedGroup"s, and "ElementRef"s. 
--> @@ -431,6 +433,16 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. + + + Default rotation for text or illustrations on this page. The value is in degrees counterclockwise. The default value can be overridden on lower levels (TextBlock, TextLine, etc.) + + + + + Default language for text on this page. The default value can be overridden on lower levels (TextBlock, TextLine, etc.) + + From 536c55435aeeddcb8c03ede576e9087601ca8b37 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 15 Apr 2022 13:42:56 +0300 Subject: [PATCH 03/18] Update alto-4-4.xsd possible solution for Issue 66 --- v4/alto-4-4.xsd | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 9fdcc09..fd1a063 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -443,7 +443,15 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. Default language for text on this page. The default value can be overridden on lower levels (TextBlock, TextLine, etc.) + + + Other languages that appear on this page. Provides a convenient way to summarize all the languages found on a particular page without parsing the entire file. + + + + + From ca271e09d0a0f258229e81a36a3c2cbc4bd387c9 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 15 Apr 2022 13:45:54 +0300 Subject: [PATCH 04/18] Update alto-4-4.xsd modify change history --- v4/alto-4-4.xsd | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index fd1a063..33a68d5 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -110,8 +110,9 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 2. Add support for explicit reading order definitions with "ReadingOrder" element containing "UnorderedGroup"s, "OrderedGroup"s, and "ElementRef"s.
--> From e92da64d1ae3197c58c045fcb3e094e5571fbc6b Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 15 Apr 2022 13:46:52 +0300 Subject: [PATCH 05/18] Update alto-4-4.xsd --- v4/alto-4-4.xsd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 33a68d5..7cfee8d 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -8,7 +8,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. - + @@ -114,7 +114,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 2. Add @ROTATION attribute on PageType level to describe the default rotation used in document 3. Add @OTHERLANGS attribute on PageType to summarize all the languages present into a particular document --> - + From 30cda334978472f60c4dd963b85f8fa43532256e Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 16 Sep 2022 11:33:51 +0300 Subject: [PATCH 06/18] Update v4/alto-4-4.xsd Co-authored-by: Stefan Weil --- v4/alto-4-4.xsd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 7cfee8d..a97583a 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -1197,7 +1197,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. CONTENT contains the precomposed representation (combining character) of the character from the parent String element. - The sequence position of the Gylph element matches the position of the character in the String. + The sequence position of the Glyph element matches the position of the character in the String. 
From 8b1a09a0bd5ef170e7ee33058a54d8782767aa23 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 16 Sep 2022 11:34:01 +0300 Subject: [PATCH 07/18] Update v4/alto-4-4.xsd Co-authored-by: Stefan Weil --- v4/alto-4-4.xsd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index a97583a..98f7411 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -473,7 +473,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. - Indicates the alignement of the paragraph. Could be left, right, center or justify. + Indicates the alignment of the paragraph. Could be left, right, center or justify. From 5e1f9a6b94b87671f091f2fc787ded7049516966 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 16 Sep 2022 11:34:19 +0300 Subject: [PATCH 08/18] Update v4/alto-4-4.xsd Co-authored-by: Stefan Weil --- v4/alto-4-4.xsd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 98f7411..4af0224 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -20,7 +20,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 4. internal changes to validate with Xerces parser 5. define fontstyles by enumerations 6. change "WC" (word confidence) attribute to xsd:float in range of "0" to "1". - 7. Add "ALTERNATIVE" als childs to "STRING" element + 7. Add "ALTERNATIVE" as children to "STRING" element 8. Add "language" attribute to "Textblock" and "STRING" element --> - @@ -583,7 +585,11 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - + + + Obsolete. 
Planned to be removed in future versions due to issues created on mixed validation and because in practice it is not used very often + + @@ -702,12 +708,13 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. A list of coordinate-pairs that are absolute to the upper-left corner of a page. - The upper left corner of the page is defined as coordinate (0,0) - Even there are no rules to enforce a particular format for a points list recommended formats are: + The upper-left corner of the page is defined as x=0 and y=0. + Currently no rules enforce a particular format for a points list, but the recommended formats are: "x1 y1 x2 y2 ... xn yn" "x1,y1 x2,y2 ... xn,yn" "(x1 y1) (x2 y2) ... (xn yn)" "(x1,y1) (x2,y2) ... (xn,yn)" + Future versions are planned to enforce these formats. From 2288e47a9c3c873d0e59f2faf0579dc5893462ad Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Fri, 27 Jan 2023 10:43:20 +0200 Subject: [PATCH 13/18] Update alto-4-4.xsd --- v4/alto-4-4.xsd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 13ff22d..7ef8054 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -587,7 +587,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. - Obsolete. Planned to be removed in future versions due to issues created on mixed validation and because in practice it is not used very often + Attribute group deprecated.
Planned to be removed in future versions due to issues created on mixed validation and because in practice it is not used very often From c83508998e67cda7e21709cad77dff178a4ba472 Mon Sep 17 00:00:00 2001 From: Stefan Weil Date: Fri, 16 Sep 2022 12:49:59 +0200 Subject: [PATCH 14/18] Replace CRLF by LF and remove whitespace at line endings Signed-off-by: Stefan Weil --- v4/alto-4-2.xsd | 2208 ++++++++++++++++++++--------------------- v4/alto-4-3.xsd | 2496 +++++++++++++++++++++++------------------------ 2 files changed, 2352 insertions(+), 2352 deletions(-) diff --git a/v4/alto-4-2.xsd b/v4/alto-4-2.xsd index cfae776..6bd8b3a 100644 --- a/v4/alto-4-2.xsd +++ b/v4/alto-4-2.xsd @@ -1,1105 +1,1105 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ALTO (analyzed layout and text object) stores layout information and - OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. - ALTO is a standardized XML format to store layout and content information. - It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), - where METS provides metadata and structural information while ALTO contains content and physical information. - - - - - - - - Describes general settings of the alto file like measurement units and metadata - - - - - Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. - - - - - - Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. - This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags - - - - - - The root layout element. - - - - - - Schema version of the ALTO file. - - - - - - - - - - Element deprecated. 'Processing' should be used instead. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - There are following variation of tag types available: - LayoutTag – criteria about arrangement or graphical appearance - StructureTag – criteria about grouping or formation - RoleTag – criteria about function or mission - NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) - OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. - - - - - - - - - - - - - - - Gives brief information about original page quality - - - - - - - - - - - - - - Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information - - - - - - Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. - - - - - - - - - - - - Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). - - - - - - - - - One page of a book or journal. - - - - - The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. - - - - - The area between the printspace and the left border of a page. May contain margin notes. - - - - - The area between the printspace and the right border of a page. May contain margin notes. - - - - - The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. - - - - - Rectangle covering the printed area of a page. Page number and running title are not part of the print space. - - - - - - - Any user-defined class like title page. - - - - - - - - - The number of the page within the document. - - - - - The page number that is printed on the page. - - - - - - - - A link to the processing description that has been used for this page. 
- - - - - Estimated percentage of OCR Accuracy in range from 0 to 100 - - - - - - - - - - - - - A text style defines font properties of text. - - - - - - - A paragraph style defines formatting properties of text blocks. - - - - - Indicates the alignement of the paragraph. Could be left, right, center or justify. - - - - - - - - - - - - - Left indent of the paragraph in relation to the column. - - - - - Right indent of the paragraph in relation to the column. - - - - - Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. - - - - - Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. - - - - - - - - - - - - - - - - - - - - - - - - - - - Group of available block types - - - - - A block of text. - - - - - A picture or image. - - - - - A graphic used to separate blocks. Usually a line or rectangle. - - - - - A block that consists of other blocks - - - - - - - Base type for any kind of block on the page. - - - - - - - - - - - - - - - Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. - - - - - The next block in reading sequence on the page. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - - - A white space. - - - - - - - - - - Type of the substitution (if any). - - - - - - - - - - - - - - - Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). - - - - - - - - - - Any alternative for the word. - Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. - The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". 
- As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. - Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. - - - - - - - Identifies the purpose of the alternative. - - - - - - - - A sequence of chars. Strings are separated by white spaces or hyphenation chars. - - - - - - - - - - - - - - - - - - - - Content of the substitution. - - - - - - Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - Attribute to record language of the string. The language should be recorded at the highest level possible. - - - - - - A region on a page - - - - - - - - - - - - - - - - - - A list of points - - - - - - Describes the bounding shape of a block, if it is not rectangular. - - - - - - - - - - A polygon shape. - - - - - - An ellipse shape. HPOS and VPOS describe the center of the ellipse. - HLENGTH and VLENGTH are the width and height of the described ellipse. - The attribute ROTATION tells the rotation of the e.g. text or - illustration within the block. The value is in degrees counterclockwise. - - - - - - - - - - A circle shape. HPOS and VPOS describe the center of the circle. - - - - - - - - Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. - - - - The font name. - - - - - - - The font size, in points (1/72 of an inch). - - - - - Font color as RGB value - - - - - - - Serif or Sans-Serif - - - - - - - - - fixed or proportional - - - - - - - - - - - All measurement values inside the alto file are related to - this unit, except the font size. 
- Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. - The upper left corner of the page is defined as coordinate (0/0). - - values meaning: - mm10: 1/10th of millimeter - inch1200: 1/1200th of inch - pixel: 1 pixel - - The values for pixel will be related to the resolution of the image based - on which the layout is described. Incase the original image is not known - the scaling factor can be calculated based on total width and height of - the image and the according information of the PAGE element. - - - - - - - - - - - Information to identify the image file from which the OCR text was created. - - - - - - - - - - - - - - - - - - - A unique identifier for the image file. This is drawn from MIX. - This identifier must be unique within the local system. - To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. - - - - - - A location qualifier, i.e., a namespace. - - - - - - - - - - - - - - A unique identifier for the document. - This identifier must be unique within the local system. - To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. - - - - - - A location qualifier, i.e., a namespace. - - - - - - - - Deprecated. processingType should be used instead. - Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. - - - - - - - - - - Description of the processing step. - - - - - Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. - - - - - Date or DateTime the image was processed. - - - - - Identifies the organizationlevel producer(s) of the processed image. 
- - - - - An ordinal listing of the image processing steps performed. For example, "image despeckling." - - - - - A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. - - - - - - - - - - - - - - - - - - - - - Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. - - - - - The name of the organization or company that created the application. - - - - - The name of the application. - - - - - The version of the application. - - - - - A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. - - - - - - - - - - List of any combination of font styles - - - - - - - - - - - - - - - - - - - - - - - A block that consists of other blocks - - - - - - - - - A user defined string to identify the type of composed block (e.g. table, advertisement, ...) - - - - - An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. - - - - - - - - A picture or image. - - - - - - A user defined string to identify the type of illustration like photo, map, drawing, chart, ... - - - - - A link to an image which contains only the illustration. - - - - - - - - A graphic used to separate blocks. Usually a line or rectangle. - - - - - - - - A block of text. - - - - - - - A single line of text. - - - - - - - - - - - - - A hyphenation char. Can appear only at the end of a line. 
- - - - - - - - - - - - - - - - - - - - - Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. - - - - - Attribute to record language of the textline. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - - - - Attribute deprecated. LANG should be used instead. - - - - - Attribute to record language of the textblock. - - - - - - - - - - - The xml data wrapper element XmlData is used to contain XML encoded metadata. - The content of an XmlData element can be in any namespace or in no namespace. - As permitted by the XML Schema Standard, the processContents attribute value for the - metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are - identified by means of an XML schemaLocation attribute, then an XML processor will validate - the elements for which it can find declarations. If a source schema is not identified, or cannot be - found at the specified schemaLocation, then an XML validator will check for well-formedness, - but otherwise skip over the elements appearing in the XmlData element. - - - - - - - - - - - - - Type can be used to classify and group the information within each tag element type. - - - - - Content / information value of the tag. - - - - - Description text for tag information for clarification. - - - - - Any URI for authority or description relevant information. - - - - - - - Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. 
- Accordingly the value for the glyph element will be defined as follows: - Pre-composed representation = base + combining character(s) (decomposed representation) - See http://www.fileformat.info/info/unicode/char/0101/index.htm - "U+0101" = (U+0061) + (U+0304) - "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. - - Each glyph has its own coordinate information and must be separately addressable as a distinct object. - Correction and verification processes can be carried out for individual characters. - - Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on glyph level. - In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. - The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. - The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly. - - The glyph elements are in order of the word. Each glyph need to be recorded to built up the whole word sequence. - - The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. - Due to post-processing steps such as correction the values of both attributes may be inconsistent. - - - - - - - - - - - CONTENT contains the precomposed representation (combining character) of the character from the parent String element. - The sequence position of the Gylph element matches the position of the character in the String. - - - - - - - - - - - - - This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. - This attribute is optional. If it is not available, the default value for the variant is “0”. 
- The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element. - - - - - - - - - - - - - - - - - - Alternative (combined) character for the glyph, outlined by OCR engine or similar recognition processes. - In case the variant are two (combining) characters, two characters are outlined in one Variant element. - E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". - Details for different use-cases see on the samples on GitHub. - - - - - - Each Variant represents an option for the glyph that the OCR software detected as possible alternatives. - In case the variant are two (combining) characters, two characters are outlined in one Variant element. - E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". - Details for different use-cases see on the samples on GitHub. - - - - - - - - - - - - - This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. - This attribute is optional. If it is not available, the default value for the variant is “0”. - The VC attribute semantic is the same as the GC attribute on the Glyph element. - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ALTO (analyzed layout and text object) stores layout information and + OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. + ALTO is a standardized XML format to store layout and content information. + It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), + where METS provides metadata and structural information while ALTO contains content and physical information. + + + + + + + + Describes general settings of the alto file like measurement units and metadata + + + + + Styles define properties of layout elements. 
A style defined in a parent element is used as default style for all related children elements. + + + + + + Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. + This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags + + + + + + The root layout element. + + + + + + Schema version of the ALTO file. + + + + + + + + + + Element deprecated. 'Processing' should be used instead. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + There are following variation of tag types available: + LayoutTag – criteria about arrangement or graphical appearance + StructureTag – criteria about grouping or formation + RoleTag – criteria about function or mission + NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) + OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. + + + + + + + + + + + + + + + Gives brief information about original page quality + + + + + + + + + + + + + + Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information + + + + + + Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. + + + + + + + + + + + + Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). + + + + + + + + + One page of a book or journal. + + + + + The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. + + + + + The area between the printspace and the left border of a page. May contain margin notes. + + + + + The area between the printspace and the right border of a page. May contain margin notes. 
+ + + + + The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. + + + + + Rectangle covering the printed area of a page. Page number and running title are not part of the print space. + + + + + + + Any user-defined class like title page. + + + + + + + + + The number of the page within the document. + + + + + The page number that is printed on the page. + + + + + + + + A link to the processing description that has been used for this page. + + + + + Estimated percentage of OCR Accuracy in range from 0 to 100 + + + + + + + + + + + + + A text style defines font properties of text. + + + + + + + A paragraph style defines formatting properties of text blocks. + + + + + Indicates the alignement of the paragraph. Could be left, right, center or justify. + + + + + + + + + + + + + Left indent of the paragraph in relation to the column. + + + + + Right indent of the paragraph in relation to the column. + + + + + Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. + + + + + Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. + + + + + + + + + + + + + + + + + + + + + + + + + + + Group of available block types + + + + + A block of text. + + + + + A picture or image. + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + A block that consists of other blocks + + + + + + + Base type for any kind of block on the page. + + + + + + + + + + + + + + + Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. + + + + + The next block in reading sequence on the page. + + + + + Correction Status. Indicates whether manual correction has been done or not. 
The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + + + A white space. + + + + + + + + + + Type of the substitution (if any). + + + + + + + + + + + + + + + Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). + + + + + + + + + + Any alternative for the word. + Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. + The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". + As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. + Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. + + + + + + + Identifies the purpose of the alternative. + + + + + + + + A sequence of chars. Strings are separated by white spaces or hyphenation chars. + + + + + + + + + + + + + + + + + + + + Content of the substitution. + + + + + + Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Attribute to record language of the string. The language should be recorded at the highest level possible. + + + + + + A region on a page + + + + + + + + + + + + + + + + + + A list of points + + + + + + Describes the bounding shape of a block, if it is not rectangular. + + + + + + + + + + A polygon shape. + + + + + + An ellipse shape. HPOS and VPOS describe the center of the ellipse. + HLENGTH and VLENGTH are the width and height of the described ellipse. + The attribute ROTATION tells the rotation of the e.g. text or + illustration within the block. The value is in degrees counterclockwise. 
+ + + + + + + + + + A circle shape. HPOS and VPOS describe the center of the circle. + + + + + + + + Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. + + + + The font name. + + + + + + + The font size, in points (1/72 of an inch). + + + + + Font color as RGB value + + + + + + + Serif or Sans-Serif + + + + + + + + + fixed or proportional + + + + + + + + + + + All measurement values inside the alto file are related to + this unit, except the font size. + Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. + The upper left corner of the page is defined as coordinate (0/0). + + values meaning: + mm10: 1/10th of millimeter + inch1200: 1/1200th of inch + pixel: 1 pixel + + The values for pixel will be related to the resolution of the image based + on which the layout is described. Incase the original image is not known + the scaling factor can be calculated based on total width and height of + the image and the according information of the PAGE element. + + + + + + + + + + + Information to identify the image file from which the OCR text was created. + + + + + + + + + + + + + + + + + + + A unique identifier for the image file. This is drawn from MIX. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + + + + + + + A unique identifier for the document. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + Deprecated. 
processingType should be used instead. + Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. + + + + + + + + + + Description of the processing step. + + + + + Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. + + + + + Date or DateTime the image was processed. + + + + + Identifies the organizationlevel producer(s) of the processed image. + + + + + An ordinal listing of the image processing steps performed. For example, "image despeckling." + + + + + A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. + + + + + + + + + + + + + + + + + + + + + Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. + + + + + The name of the organization or company that created the application. + + + + + The name of the application. + + + + + The version of the application. + + + + + A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. + + + + + + + + + + List of any combination of font styles + + + + + + + + + + + + + + + + + + + + + + + A block that consists of other blocks + + + + + + + + + A user defined string to identify the type of composed block (e.g. table, advertisement, ...) + + + + + An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. + + + + + + + + A picture or image. 
+ + + + + + A user defined string to identify the type of illustration like photo, map, drawing, chart, ... + + + + + A link to an image which contains only the illustration. + + + + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + + + + A block of text. + + + + + + + A single line of text. + + + + + + + + + + + + + A hyphenation char. Can appear only at the end of a line. + + + + + + + + + + + + + + + + + + + + + Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. + + + + + Attribute to record language of the textline. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + + + + Attribute deprecated. LANG should be used instead. + + + + + Attribute to record language of the textblock. + + + + + + + + + + + The xml data wrapper element XmlData is used to contain XML encoded metadata. + The content of an XmlData element can be in any namespace or in no namespace. + As permitted by the XML Schema Standard, the processContents attribute value for the + metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are + identified by means of an XML schemaLocation attribute, then an XML processor will validate + the elements for which it can find declarations. If a source schema is not identified, or cannot be + found at the specified schemaLocation, then an XML validator will check for well-formedness, + but otherwise skip over the elements appearing in the XmlData element. + + + + + + + + + + + + + Type can be used to classify and group the information within each tag element type. + + + + + Content / information value of the tag. + + + + + Description text for tag information for clarification. + + + + + Any URI for authority or description relevant information. 
+ + + + + + + Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. + Accordingly the value for the glyph element will be defined as follows: + Pre-composed representation = base + combining character(s) (decomposed representation) + See http://www.fileformat.info/info/unicode/char/0101/index.htm + "U+0101" = (U+0061) + (U+0304) + "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. + + Each glyph has its own coordinate information and must be separately addressable as a distinct object. + Correction and verification processes can be carried out for individual characters. + + Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on glyph level. + In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. + The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. + The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly. + + The glyph elements are in order of the word. Each glyph need to be recorded to built up the whole word sequence. + + The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. + Due to post-processing steps such as correction the values of both attributes may be inconsistent. + + + + + + + + + + + CONTENT contains the precomposed representation (combining character) of the character from the parent String element. + The sequence position of the Gylph element matches the position of the character in the String. + + + + + + + + + + + + + This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. + This attribute is optional. 
If it is not available, the default value for the variant is “0”. + The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element. + + + + + + + + + + + + + + + + + + Alternative (combined) character for the glyph, outlined by OCR engine or similar recognition processes. + In case the variant are two (combining) characters, two characters are outlined in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + Details for different use-cases see on the samples on GitHub. + + + + + + Each Variant represents an option for the glyph that the OCR software detected as possible alternatives. + In case the variant are two (combining) characters, two characters are outlined in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + Details for different use-cases see on the samples on GitHub. + + + + + + + + + + + + + This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. + This attribute is optional. If it is not available, the default value for the variant is “0”. + The VC attribute semantic is the same as the GC attribute on the Glyph element. + + + + + + + + + + \ No newline at end of file diff --git a/v4/alto-4-3.xsd b/v4/alto-4-3.xsd index 7130407..5fd8220 100644 --- a/v4/alto-4-3.xsd +++ b/v4/alto-4-3.xsd @@ -1,1248 +1,1248 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ALTO (analyzed layout and text object) stores layout information and - OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. - ALTO is a standardized XML format to store layout and content information. 
- It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), - where METS provides metadata and structural information while ALTO contains content and physical information. - - - - - - - - Describes general settings of the alto file like measurement units and metadata - - - - - Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. - - - - - - Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. - This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags - - - - - - - Describes alternative hierarchical orderings of the page (i.e. total orders over its segments, for linear text flow), - in addition to the explicit flat reading order defined by @IDNEXT on the block level, - and the implicit flat reading order implied by the segment element ordering. - - - - - - The root layout element. - - - - - - Schema version of the ALTO file. - - - - - - - - - - Element deprecated. 'Processing' should be used instead. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - There are following variation of tag types available: - LayoutTag – criteria about arrangement or graphical appearance - StructureTag – criteria about grouping or formation - RoleTag – criteria about function or mission - NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) - OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. - - - - - - - - - - - - - - - - Defines one or more reading orders within the - page. Groups may be either unordered or ordered and can - contain other groups, e.g. 
a page containing - unrelated texts that are ordered individually - would be encoded as an UnorderedGroup containing - multiple OrderedGroups. The granularity of - elements can vary inside groups. - - - - - - - - - - - - - A reference to an element such as a block, TextLine, String, or Glyph. - - - - - - - A link to the referenced element. Valid - target elements are any block type, - TextLine, String, or Glyph. - - - - - - - Optionally annotates the role of the - referenced element in the reading order - with one or more tags. Examples could be - interlinear additions or marginalia. - - - - - - - - A group containing ordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is ordered). - - - - - - - - - - - - - - Optionally annotates the role of the - group in the reading order - with one or more tags. Examples could be - distinguishing - parallel texts or apparatus criticus and - main text. - - - - - - - A link to the referenced element. Valid - target elements are any block type, - TextLine, or String. - - - - - - - - A group containing unordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is arbitrary). - - - - - - - - - - - - - - - A link to the referenced element. Valid - target elements are any block type, - TextLine, or String. - - - - - - - Gives brief information about original page quality - - - - - - - - - - - - - - Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information - - - - - - Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. - - - - - - - - - - - - Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). - - - - - - - - - One page of a book or journal. - - - - - The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 
- - - - - The area between the printspace and the left border of a page. May contain margin notes. - - - - - The area between the printspace and the right border of a page. May contain margin notes. - - - - - The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. - - - - - Rectangle covering the printed area of a page. Page number and running title are not part of the print space. - - - - - - - Any user-defined class like title page. - - - - - - - - - The number of the page within the document. - - - - - The page number that is printed on the page. - - - - - - - - A link to the processing description that has been used for this page. - - - - - Estimated percentage of OCR Accuracy in range from 0 to 100 - - - - - - - - - - - - - A text style defines font properties of text. - - - - - - - A paragraph style defines formatting properties of text blocks. - - - - - Indicates the alignement of the paragraph. Could be left, right, center or justify. - - - - - - - - - - - - - Left indent of the paragraph in relation to the column. - - - - - Right indent of the paragraph in relation to the column. - - - - - Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. - - - - - Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. - - - - - - - - - - - - - - - - - - - - - - - - - - - Group of available block types - - - - - A block of text. - - - - - A picture or image. - - - - - A graphic used to separate blocks. Usually a line or rectangle. - - - - - A block that consists of other blocks - - - - - - - Base type for any kind of block on the page. - - - - - - - - - - - - - - - Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. 
- - - - - The next block in reading order of the page (if ReadingOrder is not specified, and elements are not in order). - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - - - A white space. - - - - - - - - - - Type of the substitution (if any). - - - - - - - - - - - - - - - Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). - - - - - - - - - - Any alternative for the word. - Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. - The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". - As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. - Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. - - - - - - - Identifies the purpose of the alternative. - - - - - - - - A sequence of chars. Strings are separated by white spaces or hyphenation chars. - - - - - - - - - - - - - - - - - - - - Content of the substitution. - - - - - - Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - Attribute to record language of the string. The language should be recorded at the highest level possible. - - - - - - A region on a page - - - - - - - - - - - - - - - - - - A list of points - - - - - - Describes the bounding shape of a block, if it is not rectangular. - - - - - - - - - - Describes the inline base direction and line orientation of a line or of all lines inside a text block. 
- The meaning of these terms is defined by the W3C writing modes document: - These values should correspond to the base direction set in the BiDi algorithm to the respective elements during Unicode encoding. A value of "ttb" (top-to-bottom) implies a base direction of left-to-right, a value of "btt" (bottom-to-top) a base direction of right-to-left. - - - - - - - - - - - A polygon shape. - - - - - - An ellipse shape. HPOS and VPOS describe the center of the ellipse. - HLENGTH and VLENGTH are the width and height of the described ellipse. - The attribute ROTATION tells the rotation of the e.g. text or - illustration within the block. The value is in degrees counterclockwise. - - - - - - - - - - A circle shape. HPOS and VPOS describe the center of the circle. - - - - - - - - Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. - - - - The font name. - - - - - - - The font size, in points (1/72 of an inch). - - - - - Font color as RGB value - - - - - - - Serif or Sans-Serif - - - - - - - - - fixed or proportional - - - - - - - - - - - All measurement values inside the alto file are related to - this unit, except the font size. - Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. - The upper left corner of the page is defined as coordinate (0/0). - - values meaning: - mm10: 1/10th of millimeter - inch1200: 1/1200th of inch - pixel: 1 pixel - - The values for pixel will be related to the resolution of the image based - on which the layout is described. Incase the original image is not known - the scaling factor can be calculated based on total width and height of - the image and the according information of the PAGE element. - - - - - - - - - - - Information to identify the image file from which the OCR text was created. - - - - - - - - - - - - - - - - - - - A unique identifier for the image file. This is drawn from MIX. 
- This identifier must be unique within the local system. - To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. - - - - - - A location qualifier, i.e., a namespace. - - - - - - - - - - - - - - A unique identifier for the document. - This identifier must be unique within the local system. - To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. - - - - - - A location qualifier, i.e., a namespace. - - - - - - - - Deprecated. processingStepType should be used instead. - Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. - - - - - - - - - - Description of the processing step. - - - - - Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. - - - - - Date or DateTime the image was processed. - - - - - Identifies the organizationlevel producer(s) of the processed image. - - - - - An ordinal listing of the image processing steps performed. For example, "image despeckling." - - - - - A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. - - - - - - - - - - - - - - - - - - - - - Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. - - - - - The name of the organization or company that created the application. - - - - - The name of the application. - - - - - The version of the application. 
- - - - - A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. - - - - - - - - - - List of any combination of font styles - - - - - - - - - - - - - - - - - - - - - - - A block that consists of other blocks - - - - - - - - - A user defined string to identify the type of composed block (e.g. table, advertisement, ...) - - - - - An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. - - - - - - - - A picture or image. - - - - - - A user defined string to identify the type of illustration like photo, map, drawing, chart, ... - - - - - A link to an image which contains only the illustration. - - - - - - - - A graphic used to separate blocks. Usually a line or rectangle. - - - - - - - - A block of text. - - - - - - - A single line of text. - - - - - - - - - - - - - A hyphenation char. Can appear only at the end of a line. - - - - - - - - - - - - - - - - - - - - - Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. - - - - - Attribute to record language of the textline. - - - - - Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). - - - - - Indicates the inline base direction of this TextLine. Overrides the value on elements higher in the hierarchy. - - - - - - - - Attribute deprecated. LANG should be used instead. - - - - - Attribute to record language of the textblock. - - - - - Indicates the inline base direction of the TextBlock. - - - - - - - - - - - The xml data wrapper element XmlData is used to contain XML encoded metadata. 
- The content of an XmlData element can be in any namespace or in no namespace. - As permitted by the XML Schema Standard, the processContents attribute value for the - metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are - identified by means of an XML schemaLocation attribute, then an XML processor will validate - the elements for which it can find declarations. If a source schema is not identified, or cannot be - found at the specified schemaLocation, then an XML validator will check for well-formedness, - but otherwise skip over the elements appearing in the XmlData element. - - - - - - - - - - - - - Type can be used to classify and group the information within each tag element type. - - - - - Content / information value of the tag. - - - - - Description text for tag information for clarification. - - - - - Any URI for authority or description relevant information. - - - - - - - Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. - Accordingly the value for the glyph element will be defined as follows: - Pre-composed representation = base + combining character(s) (decomposed representation) - See http://www.fileformat.info/info/unicode/char/0101/index.htm - "U+0101" = (U+0061) + (U+0304) - "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. - - Each glyph has its own coordinate information and must be separately addressable as a distinct object. - Correction and verification processes can be carried out for individual characters. - - Post-OCR analysis of the text as well as adaptive OCR algorithm must be able to record information on glyph level. - In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. 
- The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. - The confidence score expresses how confident the OCR software is that a single glyph had been recognized correctly. - - The glyph elements are in order of the word. Each glyph need to be recorded to built up the whole word sequence. - - The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. - Due to post-processing steps such as correction the values of both attributes may be inconsistent. - - - - - - - - - - - CONTENT contains the precomposed representation (combining character) of the character from the parent String element. - The sequence position of the Gylph element matches the position of the character in the String. - - - - - - - - - - - - - This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the glyph where 1 is certain. - This attribute is optional. If it is not available, the default value for the glyph is “0”. - The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element. - - - - - - - - - - - - - - - - - - Alternative (combined) character for the glyph, outlined by OCR engine or similar recognition processes. - In case the variant are two (combining) characters, two characters are outlined in one Variant element. - E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". - Details for different use-cases see on the samples on GitHub. - - - - - - Each Variant represents an option for the glyph that the OCR software detected as possible alternatives. - In case the variant are two (combining) characters, two characters are outlined in one Variant element. - E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". - Details for different use-cases see on the samples on GitHub. 
- - - - - - - - - - - - - This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where is 1 is certain. - This attribute is optional. If it is not available, the default value for the variant is “0”. - The VC attribute semantic is the same as the GC attribute on the Glyph element. - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ALTO (analyzed layout and text object) stores layout information and + OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. + ALTO is a standardized XML format to store layout and content information. + It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), + where METS provides metadata and structural information while ALTO contains content and physical information. + + + + + + + + Describes general settings of the alto file like measurement units and metadata + + + + + Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. + + + + + + Tag define properties of additional characteristic. The tags are referenced from related content element on Block or String element by attribute TAGREF via the tag ID. + This container element contains the individual elements for LayoutTags, StructureTags, RoleTags, NamedEntityTags and OtherTags + + + + + + + Describes alternative hierarchical orderings of the page (i.e. total orders over its segments, for linear text flow), + in addition to the explicit flat reading order defined by @IDNEXT on the block level, + and the implicit flat reading order implied by the segment element ordering. + + + + + + The root layout element. + + + + + + Schema version of the ALTO file. + + + + + + + + + + Element deprecated. 'Processing' should be used instead. 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + There are following variation of tag types available: + LayoutTag – criteria about arrangement or graphical appearance + StructureTag – criteria about grouping or formation + RoleTag – criteria about function or mission + NamedEntityTag – criteria about assignment of terms to their relationship / meaning (NER) + OtherTag – criteria about any other characteristic not listed above, the TYPE attribute is intended to be used for classification within those. + + + + + + + + + + + + + + + + Defines one or more reading orders within the + page. Groups may be either unordered or ordered and can + contain other groups, e.g. a page containing + unrelated texts that are ordered individually + would be encoded as an UnorderedGroup containing + multiple OrderedGroups. The granularity of + elements can vary inside groups. + + + + + + + + + + + + + A reference to an element such as a block, TextLine, String, or Glyph. + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, String, or Glyph. + + + + + + + Optionally annotates the role of the + referenced element in the reading order + with one or more tags. Examples could be + interlinear additions or marginalia. + + + + + + + + A group containing ordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is ordered). + + + + + + + + + + + + + + Optionally annotates the role of the + group in the reading order + with one or more tags. Examples could be + distinguishing + parallel texts or apparatus criticus and + main text. + + + + + + + A link to the referenced element. Valid + target elements are any block type, + TextLine, or String. + + + + + + + + A group containing unordered elements (i.e. the sequence of OrderedGroup, UnorderedGroup or ElementRef subelements is arbitrary). + + + + + + + + + + + + + + + A link to the referenced element. 
Valid + target elements are any block type, + TextLine, or String. + + + + + + + Gives brief information about original page quality + + + + + + + + + + + + + + Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information + + + + + + Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. + + + + + + + + + + + + Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). + + + + + + + + + One page of a book or journal. + + + + + The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. + + + + + The area between the printspace and the left border of a page. May contain margin notes. + + + + + The area between the printspace and the right border of a page. May contain margin notes. + + + + + The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. + + + + + Rectangle covering the printed area of a page. Page number and running title are not part of the print space. + + + + + + + Any user-defined class like title page. + + + + + + + + + The number of the page within the document. + + + + + The page number that is printed on the page. + + + + + + + + A link to the processing description that has been used for this page. + + + + + Estimated percentage of OCR Accuracy in range from 0 to 100 + + + + + + + + + + + + + A text style defines font properties of text. + + + + + + + A paragraph style defines formatting properties of text blocks. + + + + + Indicates the alignement of the paragraph. Could be left, right, center or justify. + + + + + + + + + + + + + Left indent of the paragraph in relation to the column. + + + + + Right indent of the paragraph in relation to the column. + + + + + Line spacing between two lines of the paragraph. 
Measurement calculated from baseline to baseline. + + + + + Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. + + + + + + + + + + + + + + + + + + + + + + + + + + + Group of available block types + + + + + A block of text. + + + + + A picture or image. + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + A block that consists of other blocks + + + + + + + Base type for any kind of block on the page. + + + + + + + + + + + + + + + Tells the rotation of e.g. text or illustration within the block. The value is in degree counterclockwise. + + + + + The next block in reading order of the page (if ReadingOrder is not specified, and elements are not in order). + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + + + A white space. + + + + + + + + + + Type of the substitution (if any). + + + + + + + + + + + + + + + Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). + + + + + + + + + + Any alternative for the word. + Alternative can outline a variant of writing by new typing / spelling rules, typically manually done or by dictionary replacements. + The above sample is an old composed character "Æ" of ancient time, which is replaced now by "Ä". + As variant are meant alternatives of the real printed content which are options outlined by the text recognition process. + Similar sample: "Straße" vs. "Strasse". Such alternatives are not coming from text recognition. + + + + + + + Identifies the purpose of the alternative. + + + + + + + + A sequence of chars. Strings are separated by white spaces or hyphenation chars. + + + + + + + + + + + + + + + + + + + + Content of the substitution. 
+ + + + + + Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Attribute to record language of the string. The language should be recorded at the highest level possible. + + + + + + A region on a page + + + + + + + + + + + + + + + + + + A list of points + + + + + + Describes the bounding shape of a block, if it is not rectangular. + + + + + + + + + + Describes the inline base direction and line orientation of a line or of all lines inside a text block. + The meaning of these terms is defined by the W3C writing modes document: + These values should correspond to the base direction set in the BiDi algorithm to the respective elements during Unicode encoding. A value of "ttb" (top-to-bottom) implies a base direction of left-to-right, a value of "btt" (bottom-to-top) a base direction of right-to-left. + + + + + + + + + + + A polygon shape. + + + + + + An ellipse shape. HPOS and VPOS describe the center of the ellipse. + HLENGTH and VLENGTH are the width and height of the described ellipse. + The attribute ROTATION tells the rotation of the e.g. text or + illustration within the block. The value is in degrees counterclockwise. + + + + + + + + + + A circle shape. HPOS and VPOS describe the center of the circle. + + + + + + + + Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. + + + + The font name. + + + + + + + The font size, in points (1/72 of an inch). + + + + + Font color as RGB value + + + + + + + Serif or Sans-Serif + + + + + + + + + fixed or proportional + + + + + + + + + + + All measurement values inside the alto file are related to + this unit, except the font size. 
+ Coordinates as being used in HPOS and VPOS are absolute coordinates referring to the upper-left corner of a page. + The upper left corner of the page is defined as coordinate (0/0). + + values meaning: + mm10: 1/10th of millimeter + inch1200: 1/1200th of inch + pixel: 1 pixel + + The values for pixel will be related to the resolution of the image based + on which the layout is described. Incase the original image is not known + the scaling factor can be calculated based on total width and height of + the image and the according information of the PAGE element. + + + + + + + + + + + Information to identify the image file from which the OCR text was created. + + + + + + + + + + + + + + + + + + + A unique identifier for the image file. This is drawn from MIX. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + + + + + + + A unique identifier for the document. + This identifier must be unique within the local system. + To facilitate file sharing or interoperability with other systems, documentIdentifierLocation may be added to designate the system or application where the identifier is unique. + + + + + + A location qualifier, i.e., a namespace. + + + + + + + + Deprecated. processingStepType should be used instead. + Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history. + + + + + + + + + + Description of the processing step. + + + + + Classification of the category of operation, how the file was created, including generation, modification, preprocessing, postprocessing or any other steps. + + + + + Date or DateTime the image was processed. 
+ + + + + Identifies the organizationlevel producer(s) of the processed image. + + + + + An ordinal listing of the image processing steps performed. For example, "image despeckling." + + + + + A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. + + + + + + + + + + + + + + + + + + + + + Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help -- About. + + + + + The name of the organization or company that created the application. + + + + + The name of the application. + + + + + The version of the application. + + + + + A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. + + + + + + + + + + List of any combination of font styles + + + + + + + + + + + + + + + + + + + + + + + A block that consists of other blocks + + + + + + + + + A user defined string to identify the type of composed block (e.g. table, advertisement, ...) + + + + + An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. + + + + + + + + A picture or image. + + + + + + A user defined string to identify the type of illustration like photo, map, drawing, chart, ... + + + + + A link to an image which contains only the illustration. + + + + + + + + A graphic used to separate blocks. Usually a line or rectangle. + + + + + + + + A block of text. + + + + + + + A single line of text. + + + + + + + + + + + + + A hyphenation char. Can appear only at the end of a line. 
+ + + + + + + + + + + + + + + + + + + + + Pixel coordinates based on the left-hand top corner of an image which define a polyline on which a line of text rests. + + + + + Attribute to record language of the textline. + + + + + Correction Status. Indicates whether manual correction has been done or not. The correction status should be recorded at the highest level possible (Block, TextLine, String). + + + + + Indicates the inline base direction of this TextLine. Overrides the value on elements higher in the hierarchy. + + + + + + + + Attribute deprecated. LANG should be used instead. + + + + + Attribute to record language of the textblock. + + + + + Indicates the inline base direction of the TextBlock. + + + + + + + + + + + The xml data wrapper element XmlData is used to contain XML encoded metadata. + The content of an XmlData element can be in any namespace or in no namespace. + As permitted by the XML Schema Standard, the processContents attribute value for the + metadata in an XmlData is set to “lax”. Therefore, if the source schema and its location are + identified by means of an XML schemaLocation attribute, then an XML processor will validate + the elements for which it can find declarations. If a source schema is not identified, or cannot be + found at the specified schemaLocation, then an XML validator will check for well-formedness, + but otherwise skip over the elements appearing in the XmlData element. + + + + + + + + + + + + + Type can be used to classify and group the information within each tag element type. + + + + + Content / information value of the tag. + + + + + Description text for tag information for clarification. + + + + + Any URI for authority or description relevant information. + + + + + + + Modern OCR software stores information on glyph level. A glyph is essentially a character or ligature. 
+ Accordingly, the value for the glyph element will be defined as follows: + Pre-composed representation = base + combining character(s) (decomposed representation) + See http://www.fileformat.info/info/unicode/char/0101/index.htm + "U+0101" = (U+0061) + (U+0304) + "combining characters" ("base characters" in combination with non-spacing marks or characters which are combined to one) are represented as one "glyph", e.g. áàâ. + + Each glyph has its own coordinate information and must be separately addressable as a distinct object. + Correction and verification processes can be carried out for individual characters. + + Post-OCR analysis of the text as well as adaptive OCR algorithms must be able to record information on glyph level. + In order to reproduce the decision of the OCR software, optional characters must be recorded. These are called variants. + The OCR software evaluates each variant and picks the one with the highest confidence score as the glyph. + The confidence score expresses how confident the OCR software is that a single glyph has been recognized correctly. + + The glyph elements are in order of the word. Each glyph needs to be recorded to build up the whole word sequence. + + The glyph’s CONTENT attribute is no replacement for the string’s CONTENT attribute. + Due to post-processing steps such as correction the values of both attributes may be inconsistent. + + + + + + + + + + + CONTENT contains the precomposed representation (combining character) of the character from the parent String element. + The sequence position of the Glyph element matches the position of the character in the String. + + + + + + + + + + + + + This GC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the glyph where 1 is certain. + This attribute is optional. If it is not available, the default value for the glyph is “0”. + The GC attribute semantic is the same as the WC attribute on the String element and VC on Variant element.
+ + + + + + + + + + + + + + + + + + Alternative (combined) character for the glyph, outlined by the OCR engine or similar recognition processes. + If the variant consists of two (combining) characters, both characters are recorded in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + For details on different use cases, see the samples on GitHub. + + + + + + Each Variant represents an option for the glyph that the OCR software detected as a possible alternative. + If the variant consists of two (combining) characters, both characters are recorded in one Variant element. + E.g. a Glyph element with CONTENT="m" can have a Variant element with the content "rn". + For details on different use cases, see the samples on GitHub. + + + + + + + + + + + + + This VC attribute records a float value between 0.0 and 1.0 that expresses the level of confidence for the variant where 1 is certain. + This attribute is optional. If it is not available, the default value for the variant is “0”. + The VC attribute semantic is the same as the GC attribute on the Glyph element. + + + + + + + + + + + From 7c41170c8581ad6b6de3ccab14378129fb3626fa Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Thu, 16 Feb 2023 09:02:40 +0200 Subject: [PATCH 15/18] Update alto-4-4.xsd --- v4/alto-4-4.xsd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 7ef8054..7034161 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -105,7 +105,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 2. Make FONTSIZE optional. 3. Add "strikethrough" to list of allowed values for FONTSTYLE.
--> - From 4e2c06e0d29c7a0597e787481001ae428f10075d Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Thu, 16 Feb 2023 09:11:56 +0200 Subject: [PATCH 16/18] Update alto-4-3.xsd --- v4/alto-4-3.xsd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/v4/alto-4-3.xsd b/v4/alto-4-3.xsd index 5fd8220..0f2f508 100644 --- a/v4/alto-4-3.xsd +++ b/v4/alto-4-3.xsd @@ -105,7 +105,7 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 2. Make FONTSIZE optional. 3. Add "strikethrough" to list of allowed values for FONTSTYLE. --> - From fa33084f9b9d8856f3ca5b557e5b3f01927d0d80 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Thu, 16 Feb 2023 09:18:08 +0200 Subject: [PATCH 17/18] Update alto-4-4.xsd --- v4/alto-4-4.xsd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 7034161..6eb189f 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -106,13 +106,13 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. 3. Add "strikethrough" to list of allowed values for FONTSTYLE. --> From a4e9e0338691ca934397262ef41d4e204af2f7a5 Mon Sep 17 00:00:00 2001 From: Ciprian Dinu <56022421+cipriandinu@users.noreply.github.com> Date: Thu, 16 Feb 2023 17:52:03 +0200 Subject: [PATCH 18/18] Update alto-4-4.xsd --- v4/alto-4-4.xsd | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/v4/alto-4-4.xsd b/v4/alto-4-4.xsd index 6eb189f..c6cd8cc 100644 --- a/v4/alto-4-4.xsd +++ b/v4/alto-4-4.xsd @@ -709,12 +709,9 @@ For the full text see https://creativecommons.org/licenses/by-sa/4.0/legalcode. A list of coordinate-pairs that are absolute to the upper-left corner of a page. The upper left corner of the page is defined as x=0 and y=0 - Currently there are no rules to enforce a particular format for a points list but recommended formats are: - "x1 y1 x2 y2 ... 
xn yn" - "x1,y1 x2,y2 ... xn,yn" - "(x1 y1) (x2 y2) ... (xn yn)" - "(x1,y1) (x2,y2) ... (xn,yn)" - On future versions is planned to enforce these rules accordingly + Currently no rules enforce a particular format for a points list, but future versions are planned to restrict it to the following options: + "x1,y1 x2,y2 ... xn,yn" - highly recommended, as it is widely used and easy to read for both humans and machines + "x1 y1 x2 y2 ... xn yn" - kept for backward compatibility, since some existing tools use this format
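
Note for consumers of the points list: the two retained formats differ only in whether a comma or a space separates x from y, so a reader can accept both with one code path. The following is a minimal sketch, not part of the schema or of any ALTO tooling; the function name `parse_points` is hypothetical, and it assumes coordinates are plain decimal numbers.

```python
import re
from typing import List, Tuple

# Hypothetical helper for illustration only; not defined by the ALTO schema.
# Assumption: each coordinate is a plain decimal number such as 12 or 12.5.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_points(points: str) -> List[Tuple[float, float]]:
    """Parse 'x1,y1 x2,y2 ... xn,yn' or 'x1 y1 x2 y2 ... xn yn' into (x, y) pairs."""
    # Treat commas as whitespace so both formats yield the same flat token list.
    tokens = points.replace(",", " ").split()
    if len(tokens) % 2 != 0:
        raise ValueError("points list must contain an even number of coordinates")
    values = []
    for token in tokens:
        if not _NUMBER.fullmatch(token):
            raise ValueError("not a decimal coordinate: %r" % token)
        values.append(float(token))
    # Pair up consecutive values: (x1, y1), (x2, y2), ...
    return list(zip(values[0::2], values[1::2]))
```

Because commas are simply treated as separators, this sketch is more lenient than either format alone (it would also accept a mixed list); a stricter reader could check the separator style before parsing.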