Commit df58108

Shreeshrii authored and zdenop committed
Manpages (#1378)

* Add missing man pages
* Update lstmeval.1.asc
* Update combine_lang_model.1.asc
* Update lstmtraining.1.asc
* Update merge_unicharsets.1.asc
* Update set_unicharset_properties.1.asc
* Update text2image.1.asc
* Update text2image.1.asc
* Update combine_lang_model.1.asc
1 parent 79c6fa6 commit df58108

8 files changed, +594 -5 lines changed

doc/Makefile.am

+21 -5
@@ -2,11 +2,27 @@ if MAINTAINER_MODE
 
 asciidoc=asciidoc -d manpage
 
-man_MANS = cntraining.1 combine_tessdata.1 mftraining.1 tesseract.1 \
-	unicharset_extractor.1 wordlist2dawg.1 unicharambigs.5 \
-	unicharset.5 ambiguous_words.1 shapeclustering.1 dawg2wordlist.1
-
-EXTRA_DIST = $(man_MANS) Doxyfile
+man_MANS = \
+	ambiguous_words.1 \
+	classifier_tester.1 \
+	cntraining.1 \
+	combine_lang_model.1 \
+	combine_tessdata.1 \
+	dawg2wordlist.1 \
+	lstmeval.1 \
+	lstmtraining.1 \
+	merge_unicharsets.1 \
+	mftraining.1 \
+	set_unicharset_properties.1 \
+	shapeclustering.1 \
+	tesseract.1 \
+	text2image.1 \
+	unicharambigs.5 \
+	unicharset.5 \
+	unicharset_extractor.1 \
+	wordlist2dawg.1
+
+EXTRA_DIST = $(man_MANS) Doxyfile
 
 %: %.asc
 	$(asciidoc) -o $@ $<
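
For readers less familiar with make's automatic variables, the pattern rule `%: %.asc` in doc/Makefile.am can be illustrated with a small Python sketch (not part of this commit). It only shows how `$@` and `$<` expand for a given man page; the function name `asciidoc_command` is hypothetical.

```python
import shlex
from typing import List

# Value of the `asciidoc` variable defined in doc/Makefile.am.
ASCIIDOC = ["asciidoc", "-d", "manpage"]


def asciidoc_command(target: str) -> List[str]:
    """Expand `$(asciidoc) -o $@ $<` for the pattern rule `%: %.asc`."""
    source = f"{target}.asc"  # $< is the prerequisite matched by %.asc
    return ASCIIDOC + ["-o", target, source]  # $@ is the target itself


if __name__ == "__main__":
    for page in ("lstmeval.1", "unicharset.5"):
        print(shlex.join(asciidoc_command(page)))
```

Each entry in `man_MANS` is built the same way from its `.asc` source when maintainer mode is enabled.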

doc/classifier_tester.1.asc

+61
@@ -0,0 +1,61 @@
+CLASSIFIER_TESTER(1)
+====================
+
+NAME
+----
+classifier_tester - for *legacy tesseract* engine.
+
+SYNOPSIS
+--------
+*classifier_tester* -U 'unicharset_file' -F 'font_properties_file' -X 'xheights_file' -classifier 'x' -lang 'lang' [-output_trainer trainer] *.tr
+
+DESCRIPTION
+-----------
+classifier_tester(1) runs Tesseract in a special mode.
+It takes a list of .tr files and tests a character classifier
+on data as formatted for training,
+but it doesn't have to be the same as the training data.
+
+IN/OUT ARGUMENTS
+----------------
+
+a list of .tr files
+
+OPTIONS
+-------
+-l 'lang'::
+(Input) three character language code; default value 'eng'.
+
+-classifier 'x'::
+(Input) One of "pruner", "full".
+
+
+-U 'unicharset'::
+(Input) The unicharset for the language.
+
+-F 'font_properties_file'::
+(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
+
+*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
+
+-X 'xheights_file'::
+(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
+
+*font_name* *xheight*
+
+-output_trainer 'trainer'::
+(Output, Optional) Filename for output trainer.
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
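
The two line formats described under `-F` and `-X` in classifier_tester.1.asc can be sketched as a small Python parser (an illustration, not part of this commit). The names `FontProperties`, `parse_font_properties_line` and `parse_xheight_line` are hypothetical, and splitting from the right so that a font name may contain spaces is an assumption, since the man page does not say how such names are written.

```python
from typing import NamedTuple, Tuple


class FontProperties(NamedTuple):
    font_name: str
    italic: bool
    bold: bool
    fixed_pitch: bool
    serif: bool
    fraktur: bool


def parse_font_properties_line(line: str) -> FontProperties:
    """Parse '<font_name> <italic> <bold> <fixed_pitch> <serif> <fraktur>'."""
    # Assumption: split from the right so the name keeps any embedded spaces.
    fields = line.strip().rsplit(None, 5)
    if len(fields) != 6:
        raise ValueError(f"expected a font name and 5 flags: {line!r}")
    name, *flags = fields
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return FontProperties(name, *(flag == "1" for flag in flags))


def parse_xheight_line(line: str) -> Tuple[str, int]:
    """Parse '<font_name> <xheight>', with xheight in pixels."""
    fields = line.strip().rsplit(None, 1)
    if len(fields) != 2:
        raise ValueError(f"expected a font name and an x height: {line!r}")
    name, xheight = fields
    return name, int(xheight)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from any real langdata file.
    print(parse_font_properties_line("Arial 0 1 0 0 0"))
    print(parse_xheight_line("Arial 48"))
```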

doc/combine_lang_model.1.asc

+71
@@ -0,0 +1,71 @@
+COMBINE_LANG_MODEL(1)
+=====================
+:doctype: manpage
+
+NAME
+----
+combine_lang_model - generate starter traineddata
+
+SYNOPSIS
+--------
+*combine_lang_model* --input_unicharset 'filename' --script_dir 'dirname' --output_dir 'rootdir' --lang 'lang' [--lang_is_rtl] [--pass_through_recoder] [--words file --puncs file --numbers file]
+
+DESCRIPTION
+-----------
+combine_lang_model(1) generates a starter traineddata file that can be used to train an LSTM-based neural network model. It takes as input a unicharset and an optional set of wordlists. It eliminates the need to run set_unicharset_properties(1), wordlist2dawg(1), some non-existent binary to generate the recoder (unicode compressor), and finally combine_tessdata(1).
+
+OPTIONS
+-------
+'--lang lang'::
+The language to use.
+Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
+
+'--script_dir PATH'::
+Directory name for input script unicharsets. It should point to the location of the langdata (github repo) directory. (type:string default:)
+
+'--input_unicharset FILE'::
+Unicharset to complete and use in encoding. It can be a hand-created file with incomplete fields. Its basic and script properties will be set before it is used. (type:string default:)
+
+'--lang_is_rtl BOOL'::
+True if the language being processed is written right-to-left (e.g. Arabic/Hebrew). (type:bool default:false)
+
+'--pass_through_recoder BOOL'::
+If true, the recoder is a simple pass-through of the unicharset. Otherwise, potentially a compression of it by encoding Hangul in Jamos, decomposing multi-unicode symbols into sequences of unicodes, and encoding Han using the data in the radical_table_data, which must be the content of the file: langdata/radical-stroke.txt. (type:bool default:false)
+
+'--version_str STRING'::
+An arbitrary version label to add to the traineddata file (type:string default:)
+
+'--words FILE'::
+(Optional) File listing words to use for the system dictionary (type:string default:)
+
+'--numbers FILE'::
+(Optional) File listing number patterns (type:string default:)
+
+'--puncs FILE'::
+(Optional) File listing punctuation patterns. The words/puncs/numbers lists may be all empty. If any are non-empty then puncs must be non-empty. (type:string default:)
+
+'--output_dir PATH'::
+Root directory for output files. Output files will be written to <output_dir>/<lang>/<lang>.* (type:string default:)
+
+HISTORY
+-------
+combine_lang_model(1) was first made available for tesseract4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
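
Two rules stated in combine_lang_model.1.asc lend themselves to a short Python illustration (not part of this commit): the constraint that puncs must be non-empty whenever any of the three lists is non-empty, and the `<output_dir>/<lang>/<lang>.*` output layout. The helper names are hypothetical, and the `.traineddata` suffix is an assumption taken from the `train_output_dir/lang/lang.traineddata` path in the lstmtraining synopsis.

```python
from pathlib import PurePosixPath
from typing import Sequence


def check_wordlists(words: Sequence[str],
                    puncs: Sequence[str],
                    numbers: Sequence[str]) -> None:
    """Raise ValueError if any list is non-empty while puncs is empty."""
    any_non_empty = bool(words) or bool(puncs) or bool(numbers)
    if any_non_empty and not puncs:
        raise ValueError("puncs must be non-empty when words or numbers are given")


def starter_traineddata_path(output_dir: str, lang: str) -> PurePosixPath:
    """Return <output_dir>/<lang>/<lang>.traineddata (suffix is an assumption)."""
    return PurePosixPath(output_dir) / lang / f"{lang}.traineddata"


if __name__ == "__main__":
    check_wordlists([], [], [])            # all empty: allowed
    check_wordlists(["word"], ["."], [])   # puncs present: allowed
    print(starter_traineddata_path("train_output_dir", "eng"))
```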

doc/lstmeval.1.asc

+55
@@ -0,0 +1,55 @@
+LSTMEVAL(1)
+===========
+:doctype: manpage
+
+NAME
+----
+lstmeval - Evaluation program for LSTM-based networks.
+
+SYNOPSIS
+--------
+*lstmeval* --model 'lang.lstm|langtrain_checkpoint|pluscharsN.NNN_NN.checkpoint' [--traineddata lang/lang.traineddata] --eval_listfile 'lang.eval_files.txt' [--verbosity N] [--max_image_MB NNNN]
+
+DESCRIPTION
+-----------
+lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, '--traineddata' should also be specified.
+
+OPTIONS
+-------
+'--model FILE'::
+Name of model file (training or recognition) (type:string default:)
+
+'--traineddata FILE'::
+If model is a training checkpoint, then traineddata must be the traineddata file that was given to the trainer (type:string default:)
+
+'--eval_listfile FILE'::
+File listing sample files in lstmf training format. (type:string default:)
+
+'--max_image_MB INT'::
+Max memory to use for images. (type:int default:2000)
+
+'--verbosity INT'::
+Amount of diagnostic information to output (0-2). (type:int default:1)
+
+HISTORY
+-------
+lstmeval(1) was first made available for tesseract4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
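
The checkpoint rule in lstmeval.1.asc (a training checkpoint needs `--traineddata`) could be enforced by a wrapper script along these lines. This is a sketch, not part of this commit: `lstmeval_argv` and `is_checkpoint` are hypothetical helpers, and recognising a checkpoint by a file name ending in `checkpoint` is only a heuristic assumption based on the names in the synopsis. The flags themselves are the ones listed in the man page.

```python
from typing import List, Optional


def is_checkpoint(model: str) -> bool:
    # Heuristic assumption: the synopsis names checkpoints
    # 'langtrain_checkpoint' and 'pluscharsN.NNN_NN.checkpoint'.
    return model.endswith("checkpoint")


def lstmeval_argv(model: str,
                  eval_listfile: str,
                  traineddata: Optional[str] = None,
                  verbosity: Optional[int] = None,
                  max_image_mb: Optional[int] = None) -> List[str]:
    """Build an lstmeval command line from the documented options."""
    if is_checkpoint(model) and traineddata is None:
        raise ValueError("--traineddata is needed when evaluating a checkpoint")
    argv = ["lstmeval", "--model", model]
    if traineddata is not None:
        argv += ["--traineddata", traineddata]
    argv += ["--eval_listfile", eval_listfile]
    if verbosity is not None:
        if verbosity not in (0, 1, 2):  # documented range is 0-2
            raise ValueError("verbosity must be 0, 1 or 2")
        argv += ["--verbosity", str(verbosity)]
    if max_image_mb is not None:
        argv += ["--max_image_MB", str(max_image_mb)]
    return argv


if __name__ == "__main__":
    print(lstmeval_argv("lang.lstm", "lang.eval_files.txt"))
```

The resulting list can be passed to `subprocess.run` on a machine where the training tools are installed.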

doc/lstmtraining.1.asc

+117
@@ -0,0 +1,117 @@
+LSTMTRAINING(1)
+===============
+:doctype: manpage
+
+NAME
+----
+lstmtraining - Training program for LSTM-based networks.
+
+SYNOPSIS
+--------
+*lstmtraining*
+--continue_from 'train_output_dir/continue_from_lang.lstm'
+--old_traineddata 'bestdata_dir/continue_from_lang.traineddata'
+--traineddata 'train_output_dir/lang/lang.traineddata'
+--max_iterations 'NNN'
+--debug_interval '0|-1'
+--train_listfile 'train_output_dir/lang.training_files.txt'
+--model_output 'train_output_dir/newlstmmodel'
+
+DESCRIPTION
+-----------
+lstmtraining(1) trains LSTM-based networks using a list of lstmf files and a starter traineddata file as the main input. Training from scratch is not recommended for users. Fine-tuning (example command shown in the synopsis above) or replacing a layer can be used instead. Different options apply to different types of training. Read the https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Training Wiki page] for details.
+
+OPTIONS
+-------
+
+'--debug_interval '::
+How often to display the alignment. (type:int default:0)
+
+'--net_mode '::
+Controls network behavior. (type:int default:192)
+
+'--perfect_sample_delay '::
+How many imperfect samples between perfect ones. (type:int default:0)
+
+'--max_image_MB '::
+Max memory to use for images. (type:int default:6000)
+
+'--append_index '::
+Index in continue_from Network at which to attach the new network defined by net_spec (type:int default:-1)
+
+'--max_iterations '::
+If set, exit after this many iterations (type:int default:0)
+
+'--target_error_rate '::
+Final error rate in percent. (type:double default:0.01)
+
+'--weight_range '::
+Range of initial random weights. (type:double default:0.1)
+
+'--learning_rate '::
+Weight factor for new deltas. (type:double default:0.001)
+
+'--momentum '::
+Decay factor for repeating deltas. (type:double default:0.5)
+
+'--adam_beta '::
+Decay factor for repeating deltas. (type:double default:0.999)
+
+'--stop_training '::
+Just convert the training model to a runtime model. (type:bool default:false)
+
+'--convert_to_int '::
+Convert the recognition model to an integer model. (type:bool default:false)
+
+'--sequential_training '::
+Use the training files sequentially instead of round-robin. (type:bool default:false)
+
+'--debug_network '::
+Get info on distribution of weight values (type:bool default:false)
+
+'--randomly_rotate '::
+Train OSD and randomly turn training samples upside-down (type:bool default:false)
+
+'--net_spec '::
+Network specification (type:string default:)
+
+'--continue_from '::
+Existing model to extend (type:string default:)
+
+'--model_output '::
+Basename for output models (type:string default:lstmtrain)
+
+'--train_listfile '::
+File listing training files in lstmf training format. (type:string default:)
+
+'--eval_listfile '::
+File listing eval files in lstmf training format. (type:string default:)
+
+'--traineddata '::
+Starter traineddata with combined Dawgs/Unicharset/Recoder for language model (type:string default:)
+
+'--old_traineddata '::
+When changing the character set, this specifies the traineddata with the old character set that is to be replaced (type:string default:)
+
+HISTORY
+-------
+lstmtraining(1) was first made available for tesseract4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).

doc/merge_unicharsets.1.asc

+51
@@ -0,0 +1,51 @@
+MERGE_UNICHARSETS(1)
+====================
+:doctype: manpage
+
+NAME
+----
+merge_unicharsets - Simple tool to merge two or more unicharsets.
+
+SYNOPSIS
+--------
+*merge_unicharsets* 'unicharset-in-1' ... 'unicharset-in-n' 'unicharset-out'
+
+DESCRIPTION
+-----------
+merge_unicharsets(1) is a simple tool to merge two or more unicharsets.
+It could be used to create a combined unicharset for a script-level engine,
+like the new Latin or Devanagari.
+
+IN/OUT ARGUMENTS
+----------------
+'unicharset-in-1'::
+(Input) The name of the first unicharset file to be merged.
+
+'unicharset-in-n'::
+(Input) The name of the nth unicharset file to be merged.
+
+'unicharset-out'::
+(Output) The name of the merged unicharset file.
+
+HISTORY
+-------
+merge_unicharsets(1) was first made available for tesseract4.00.00alpha.
+
+RESOURCES
+---------
+Main web site: <https://github.com/tesseract-ocr> +
+Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>
+
+SEE ALSO
+--------
+tesseract(1)
+
+COPYING
+-------
+Copyright \(C) 2012 Google, Inc.
+Licensed under the Apache License, Version 2.0
+
+AUTHOR
+------
+The Tesseract OCR engine was written by Ray Smith and his research groups
+at Hewlett Packard (1985-1995) and Google (2006-present).
