@@ -3,15 +3,15 @@ COMBINE_TESSDATA(1)
NAME
----
- combine_tessdata - combine/extract/overwrite Tesseract data
+ combine_tessdata - combine/extract/overwrite/list/compact Tesseract data

SYNOPSIS
--------
*combine_tessdata* ['OPTION' ] 'FILE' ...

DESCRIPTION
-----------
- combine_tessdata(1) is the main program to combine/extract/overwrite
+ combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact
tessdata components in [lang].traineddata files.

To combine all the individual tessdata components (unicharset, DAWGs,
@@ -56,6 +56,13 @@ components from tessdata/eng.traineddata.
OPTIONS
-------
+
+ *-c* '.traineddata' 'FILE'...:
+ Compacts the LSTM component in the .traineddata file to int.
+
+ *-d* '.traineddata' 'FILE'...:
+ Lists directory of components from the .traineddata file.
+
*-e* '.traineddata' 'FILE'...:
Extracts the specified components from the .traineddata file
@@ -74,69 +81,103 @@ CAVEATS
COMPONENTS
----------
The components in a Tesseract lang.traineddata file as of
- Tesseract 3.02 are briefly described below; For more information on
+ Tesseract 4.00alpha are briefly described below; For more information on
many of these files, see
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
+ and
+ <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

lang.config::
(Optional) Language-specific overrides to default config variables.
+ For 4.00alpha traineddata files, lang.config provides control parameters which
+ can affect layout analysis, and sub-languages.

lang.unicharset::
- (Required) The list of symbols that Tesseract recognizes, with properties.
+ (Required - 3.0x legacy tesseract ) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).

lang.unicharambigs::
- (Optional) This file contains information on pairs of recognized symbols
+ (Optional - 3.0x legacy tesseract ) This file contains information on pairs of recognized symbols
which are often confused. For example, 'rn' and 'm'.

lang.inttemp::
- (Required) Character shape templates for each unichar. Produced by
+ (Required - 3.0x legacy tesseract ) Character shape templates for each unichar. Produced by
mftraining(1).

lang.pffmtable::
- (Required) The number of features expected for each unichar.
+ (Required - 3.0x legacy tesseract ) The number of features expected for each unichar.
Produced by mftraining(1) from *.tr* files.

lang.normproto::
- (Required) Character normalization prototypes generated by cntraining(1)
+ (Required - 3.0x legacy tesseract ) Character normalization prototypes generated by cntraining(1)
from *.tr* files.

lang.punc-dawg::
- (Optional) A dawg made from punctuation patterns found around words.
+ (Optional - 3.0x legacy tesseract ) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.

lang.word-dawg::
- (Optional) A dawg made from dictionary words from the language.
+ (Optional - 3.0x legacy tesseract ) A dawg made from dictionary words from the language.

lang.number-dawg::
- (Optional) A dawg made from tokens which originally contained digits.
+ (Optional - 3.0x legacy tesseract ) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.

lang.freq-dawg::
- (Optional) A dawg made from the most frequent words which would have
+ (Optional - 3.0x legacy tesseract ) A dawg made from the most frequent words which would have
gone into word-dawg.

lang.fixed-length-dawgs::
- (Optional) Several dawgs of different fixed lengths -- useful for
+ (Optional - 3.0x legacy tesseract ) Several dawgs of different fixed lengths -- useful for
languages like Chinese.

lang.shapetable::
- (Optional) When present, a shapetable is an extra layer between the character
+ (Optional - 3.0x legacy tesseract ) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.

lang.bigram-dawg::
- (Optional) A dawg of word bigrams where the words are separated by a space
+ (Optional - 3.0x legacy tesseract ) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a '?'.

lang.unambig-dawg::
- (Optional) TODO: Describe.
-
- lang.params-training-model::
- (Optional) TODO: Describe.
-
-
+ (Optional - 3.0x legacy tesseract) .
+
+ lang.params-model::
+ (Optional - 3.0x legacy tesseract) .
+
+ lang.lstm::
+ (Required - 4.00alpha LSTM) Neural net trained recognition model generated by lstmtraining.
+
+ lang.lstm-punc-dawg::
+ (Optional - 4.00alpha LSTM) A dawg made from punctuation patterns found around words.
+ The "word" part is replaced by a single space. Uses lang.lstm-unicharset.
+
+ lang.lstm-word-dawg::
+ (Optional - 4.00alpha LSTM) A dawg made from dictionary words from the language.
+ Uses lang.lstm-unicharset.
+
+ lang.lstm-number-dawg::
+ (Optional - 4.00alpha LSTM) A dawg made from tokens which originally contained digits.
+ Each digit is replaced by a space character. Uses lang.lstm-unicharset.
+
+ lang.lstm-unicharset::
+ (Required - 4.00alpha LSTM) The unicode character set that Tesseract recognizes, with properties.
+ Same unicharset must be used to train the LSTM and build the lstm-*-dawgs files.
+
+ lang.lstm-recoder::
+ (Required - 4.00alpha LSTM) Unicharcompress, aka the recoder, which maps the unicharset
+ further to the codes actually used by the neural network recognizer. This is created as
+ part of the starter traineddata by combine_lang_model.
+
+ lang.version::
+ (Optional) Version string for the traineddata file.
+ First appeared in version 4.00alpha of Tesseract.
+ Old version of traineddata files will report Version string:Pre-4.0.0.
+ 4.00alpha version of traineddata files may include the network spec
+ used for LSTM training as part of version string.
+

HISTORY
-------
combine_tessdata(1) first appeared in version 3.00 of Tesseract
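
Reviewer note: the following is a minimal sketch, not part of the change itself. It groups the component suffixes listed in the COMPONENTS hunk by the engine the new text assigns them to, and builds an argument list for the two options this diff adds (`-c` and `-d`). The helper names `engine_for` and `build_command` are hypothetical, and the grouping is transcribed from the text of this diff rather than taken from Tesseract's source, so treat it as an assumption.

```python
# Component suffixes grouped by engine, transcribed from the COMPONENTS
# section of this diff (an assumption; not read from Tesseract's source).
LEGACY = {
    "unicharset", "unicharambigs", "inttemp", "pffmtable", "normproto",
    "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
    "fixed-length-dawgs", "shapetable", "bigram-dawg", "unambig-dawg",
    "params-model",
}
LSTM = {
    "lstm", "lstm-punc-dawg", "lstm-word-dawg", "lstm-number-dawg",
    "lstm-unicharset", "lstm-recoder",
}
# The diff does not tie these two components to a single engine.
GENERAL = {"config", "version"}


def engine_for(component):
    """Classify a component file name such as 'eng.lstm-recoder'."""
    suffix = component.split(".", 1)[-1]  # drop the language prefix, if any
    if suffix in LSTM:
        return "lstm"
    if suffix in LEGACY:
        return "legacy"
    if suffix in GENERAL:
        return "general"
    return "unknown"


def build_command(option, traineddata):
    """Build an argv list for the two options added in this diff."""
    if option not in {"-c", "-d"}:
        raise ValueError("option not covered by this sketch: " + option)
    return ["combine_tessdata", option, traineddata]


if __name__ == "__main__":
    # Only prints the command; it does not invoke the binary.
    print(build_command("-d", "tessdata/eng.traineddata"))
    print(engine_for("eng.lstm-unicharset"))
```

The list returned by `build_command` could be passed to `subprocess.run` on a machine where `combine_tessdata` is installed. The sketch stops short of that so it runs anywhere.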