Add info regarding LSTM components and options (#1346)

Shreeshrii · zdenop · commit bb89dc3594a9 · 2018-02-23T21:59:50.000+01:00
diff --git a/doc/combine_tessdata.1.asc b/doc/combine_tessdata.1.asc
@@ -3,15 +3,15 @@ COMBINE_TESSDATA(1)
 
 NAME
 ----
-combine_tessdata - combine/extract/overwrite Tesseract data
+combine_tessdata - combine/extract/overwrite/list/compact Tesseract data
 
 SYNOPSIS
 --------
 *combine_tessdata* ['OPTION'] 'FILE'...
 
 DESCRIPTION
 -----------
-combine_tessdata(1) is the main program to combine/extract/overwrite
+combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact
 tessdata components in [lang].traineddata files.
 
 To combine all the individual tessdata components (unicharset, DAWGs,
@@ -56,6 +56,13 @@ components from tessdata/eng.traineddata.
 
 OPTIONS
 -------
+
+*-c* '.traineddata' 'FILE'...:
+    Compacts the LSTM component in the .traineddata file by converting it to integer form.
+    
+*-d* '.traineddata' 'FILE'...:
+    Lists directory of components from the .traineddata file.
+    
 *-e* '.traineddata' 'FILE'...:
     Extracts the specified components from the .traineddata file
 
@@ -74,69 +81,103 @@ CAVEATS
 COMPONENTS
 ----------
 The components in a Tesseract lang.traineddata file as of
-Tesseract 3.02 are briefly described below; For more information on
+Tesseract 4.00alpha are briefly described below. For more information on
 many of these files, see
 <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
+and
+<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
 
 lang.config::
   (Optional) Language-specific overrides to default config variables.
+  For 4.00alpha traineddata files, lang.config provides control parameters that
+  can affect layout analysis and sub-languages.
 
 lang.unicharset::
-  (Required) The list of symbols that Tesseract recognizes, with properties.
+  (Required - 3.0x legacy Tesseract) The list of symbols that Tesseract recognizes, with properties.
   See unicharset(5).
 
 lang.unicharambigs::
-  (Optional) This file contains information on pairs of recognized symbols
+  (Optional - 3.0x legacy Tesseract) This file contains information on pairs of recognized symbols
   which are often confused.  For example, 'rn' and 'm'.
 
 lang.inttemp::
-  (Required) Character shape templates for each unichar.  Produced by
+  (Required - 3.0x legacy Tesseract) Character shape templates for each unichar.  Produced by
   mftraining(1).
 
 lang.pffmtable::
-  (Required) The number of features expected for each unichar.
+  (Required - 3.0x legacy Tesseract) The number of features expected for each unichar.
   Produced by mftraining(1) from *.tr* files.
 
 lang.normproto::
-  (Required) Character normalization prototypes generated by cntraining(1)
+  (Required - 3.0x legacy Tesseract) Character normalization prototypes generated by cntraining(1)
   from *.tr* files.
 
 lang.punc-dawg::
-  (Optional) A dawg made from punctuation patterns found around words.
+  (Optional - 3.0x legacy Tesseract) A dawg made from punctuation patterns found around words.
   The "word" part is replaced by a single space.
 
 lang.word-dawg::
-  (Optional) A dawg made from dictionary words from the language.
+  (Optional - 3.0x legacy Tesseract) A dawg made from dictionary words from the language.
 
 lang.number-dawg::
-  (Optional) A dawg made from tokens which originally contained digits.
+  (Optional - 3.0x legacy Tesseract) A dawg made from tokens which originally contained digits.
   Each digit is replaced by a space character.
 
 lang.freq-dawg::
-  (Optional) A dawg made from the most frequent words which would have
+  (Optional - 3.0x legacy Tesseract) A dawg made from the most frequent words which would have
   gone into word-dawg.
 
 lang.fixed-length-dawgs::
-  (Optional) Several dawgs of different fixed lengths -- useful for
+  (Optional - 3.0x legacy Tesseract) Several dawgs of different fixed lengths -- useful for
   languages like Chinese.
 
 lang.shapetable::
-  (Optional) When present, a shapetable is an extra layer between the character
+  (Optional - 3.0x legacy Tesseract) When present, a shapetable is an extra layer between the character
   classifier and the word recognizer that allows the character classifier to
   return a collection of unichar ids and fonts instead of a single unichar-id
   and font.
 
 lang.bigram-dawg::
-  (Optional) A dawg of word bigrams where the words are separated by a space
+  (Optional - 3.0x legacy Tesseract) A dawg of word bigrams where the words are separated by a space
   and each digit is replaced by a '?'.
 
 lang.unambig-dawg::
-  (Optional) TODO: Describe.
-
-lang.params-training-model::
-  (Optional) TODO: Describe.
-
-
+  (Optional - 3.0x legacy Tesseract).
+
+lang.params-model::
+  (Optional - 3.0x legacy Tesseract).
+
+lang.lstm::
+  (Required - 4.00alpha LSTM) Trained neural network recognition model generated by lstmtraining.
+
+lang.lstm-punc-dawg::
+  (Optional - 4.00alpha LSTM) A dawg made from punctuation patterns found around words.
+  The "word" part is replaced by a single space. Uses lang.lstm-unicharset.
+  
+lang.lstm-word-dawg::
+  (Optional - 4.00alpha LSTM) A dawg made from dictionary words from the language.
+  Uses lang.lstm-unicharset.
+
+lang.lstm-number-dawg::
+  (Optional - 4.00alpha LSTM) A dawg made from tokens which originally contained digits.
+  Each digit is replaced by a space character. Uses lang.lstm-unicharset.
+  
+lang.lstm-unicharset::
+  (Required - 4.00alpha LSTM) The Unicode character set that Tesseract recognizes, with properties.
+  The same unicharset must be used to train the LSTM and to build the lstm-*-dawg files.
+
+lang.lstm-recoder::
+  (Required - 4.00alpha LSTM) Unicharcompress, also known as the recoder, which maps the unicharset
+  further to the codes actually used by the neural network recognizer. It is created as
+  part of the starter traineddata by combine_lang_model.
+  
+lang.version::
+  (Optional) Version string for the traineddata file.
+  First appeared in version 4.00alpha of Tesseract.
+  Older traineddata files report Version string:Pre-4.0.0.
+  4.00alpha traineddata files may include the network spec
+  used for LSTM training as part of the version string.
+  
 HISTORY
 -------
 combine_tessdata(1) first appeared in version 3.00 of Tesseract