Commit bb89dc3

Shreeshrii authored and zdenop committed
Add info regarding LSTM components and options (#1346)
1 parent 44588a3 commit bb89dc3

1 file changed: +62 -21 lines changed

doc/combine_tessdata.1.asc

@@ -3,15 +3,15 @@ COMBINE_TESSDATA(1)
 
 NAME
 ----
-combine_tessdata - combine/extract/overwrite Tesseract data
+combine_tessdata - combine/extract/overwrite/list/compact Tesseract data
 
 SYNOPSIS
 --------
 *combine_tessdata* ['OPTION'] 'FILE'...
 
 DESCRIPTION
 -----------
-combine_tessdata(1) is the main program to combine/extract/overwrite
+combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact
 tessdata components in [lang].traineddata files.
 
 To combine all the individual tessdata components (unicharset, DAWGs,
@@ -56,6 +56,13 @@ components from tessdata/eng.traineddata.
 
 OPTIONS
 -------
+
+*-c* '.traineddata' 'FILE'...:
+Compacts the LSTM component in the .traineddata file to int.
+
+*-d* '.traineddata' 'FILE'...:
+Lists directory of components from the .traineddata file.
+
 *-e* '.traineddata' 'FILE'...:
 Extracts the specified components from the .traineddata file
 
@@ -74,69 +81,103 @@ CAVEATS
 COMPONENTS
 ----------
 The components in a Tesseract lang.traineddata file as of
-Tesseract 3.02 are briefly described below; For more information on
+Tesseract 4.00alpha are briefly described below; For more information on
 many of these files, see
 <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
+and
+<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
 
 lang.config::
 (Optional) Language-specific overrides to default config variables.
+For 4.00alpha traineddata files, lang.config provides control parameters which
+can affect layout analysis, and sub-languages.
 
 lang.unicharset::
-(Required) The list of symbols that Tesseract recognizes, with properties.
+(Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties.
 See unicharset(5).
 
 lang.unicharambigs::
-(Optional) This file contains information on pairs of recognized symbols
+(Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols
 which are often confused. For example, 'rn' and 'm'.
 
 lang.inttemp::
-(Required) Character shape templates for each unichar. Produced by
+(Required - 3.0x legacy tesseract) Character shape templates for each unichar. Produced by
 mftraining(1).
 
 lang.pffmtable::
-(Required) The number of features expected for each unichar.
+(Required - 3.0x legacy tesseract) The number of features expected for each unichar.
 Produced by mftraining(1) from *.tr* files.
 
 lang.normproto::
-(Required) Character normalization prototypes generated by cntraining(1)
+(Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1)
 from *.tr* files.
 
 lang.punc-dawg::
-(Optional) A dawg made from punctuation patterns found around words.
+(Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words.
 The "word" part is replaced by a single space.
 
 lang.word-dawg::
-(Optional) A dawg made from dictionary words from the language.
+(Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language.
 
 lang.number-dawg::
-(Optional) A dawg made from tokens which originally contained digits.
+(Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits.
 Each digit is replaced by a space character.
 
 lang.freq-dawg::
-(Optional) A dawg made from the most frequent words which would have
+(Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have
 gone into word-dawg.
 
 lang.fixed-length-dawgs::
-(Optional) Several dawgs of different fixed lengths -- useful for
+(Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths -- useful for
 languages like Chinese.
 
 lang.shapetable::
-(Optional) When present, a shapetable is an extra layer between the character
+(Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character
 classifier and the word recognizer that allows the character classifier to
 return a collection of unichar ids and fonts instead of a single unichar-id
 and font.
 
 lang.bigram-dawg::
-(Optional) A dawg of word bigrams where the words are separated by a space
+(Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space
 and each digit is replaced by a '?'.
 
 lang.unambig-dawg::
-(Optional) TODO: Describe.
-
-lang.params-training-model::
-(Optional) TODO: Describe.
-
-
+(Optional - 3.0x legacy tesseract) .
+
+lang.params-model::
+(Optional - 3.0x legacy tesseract) .
+
+lang.lstm::
+(Required - 4.00alpha LSTM) Neural net trained recognition model generated by lstmtraining.
+
+lang.lstm-punc-dawg::
+(Optional - 4.00alpha LSTM) A dawg made from punctuation patterns found around words.
+The "word" part is replaced by a single space. Uses lang.lstm-unicharset.
+
+lang.lstm-word-dawg::
+(Optional - 4.00alpha LSTM) A dawg made from dictionary words from the language.
+Uses lang.lstm-unicharset.
+
+lang.lstm-number-dawg::
+(Optional - 4.00alpha LSTM) A dawg made from tokens which originally contained digits.
+Each digit is replaced by a space character. Uses lang.lstm-unicharset.
+
+lang.lstm-unicharset::
+(Required - 4.00alpha LSTM) The unicode character set that Tesseract recognizes, with properties.
+Same unicharset must be used to train the LSTM and build the lstm-*-dawgs files.
+
+lang.lstm-recoder::
+(Required - 4.00alpha LSTM) Unicharcompress, aka the recoder, which maps the unicharset
+further to the codes actually used by the neural network recognizer. This is created as
+part of the starter traineddata by combine_lang_model.
+
+lang.version::
+(Optional) Version string for the traineddata file.
+First appeared in version 4.00alpha of Tesseract.
+Old version of traineddata files will report Version string:Pre-4.0.0.
+4.00alpha version of traineddata files may include the network spec
+used for LSTM training as part of version string.
+
 HISTORY
 -------
 combine_tessdata(1) first appeared in version 3.00 of Tesseract
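
As a reading aid for the COMPONENTS section this commit rewrites, the required/optional labels can be collected into a small lookup table. The sketch below is not part of Tesseract: the names `COMPONENTS` and `missing_required` are hypothetical, and the table is transcribed from the diff above rather than from the Tesseract sources. `lang.config` and `lang.version` are left out because the diff does not tie them to a single engine.

```python
# Hypothetical helper, not part of Tesseract. The table below is transcribed
# from the COMPONENTS section as changed by this commit.
LEGACY = "3.0x legacy"
LSTM = "4.00alpha LSTM"

# component suffix -> (engine label, required?)
COMPONENTS = {
    "unicharset": (LEGACY, True),
    "unicharambigs": (LEGACY, False),
    "inttemp": (LEGACY, True),
    "pffmtable": (LEGACY, True),
    "normproto": (LEGACY, True),
    "punc-dawg": (LEGACY, False),
    "word-dawg": (LEGACY, False),
    "number-dawg": (LEGACY, False),
    "freq-dawg": (LEGACY, False),
    "fixed-length-dawgs": (LEGACY, False),
    "shapetable": (LEGACY, False),
    "bigram-dawg": (LEGACY, False),
    "unambig-dawg": (LEGACY, False),
    "params-model": (LEGACY, False),
    "lstm": (LSTM, True),
    "lstm-punc-dawg": (LSTM, False),
    "lstm-word-dawg": (LSTM, False),
    "lstm-number-dawg": (LSTM, False),
    "lstm-unicharset": (LSTM, True),
    "lstm-recoder": (LSTM, True),
}


def missing_required(present, engine):
    """Return, sorted, the components marked required for `engine`
    that do not appear in `present`."""
    present = set(present)
    return sorted(
        name
        for name, (eng, required) in COMPONENTS.items()
        if eng == engine and required and name not in present
    )


if __name__ == "__main__":
    # A component list that lacks the recoder.
    print(missing_required(["lstm", "lstm-unicharset", "lstm-word-dawg"], LSTM))
```

The component names passed to `missing_required` could come from the listing produced by the new *-d* option, with the `lang.` prefix stripped; parsing that listing is deliberately left out because its output format is not shown in this commit.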
