|
1 | 1 | [](https://travis-ci.org/tesseract-ocr/tesseract)
|
2 | 2 | [](https://ci.appveyor.com/project/zdenop/tesseract/)
|
3 | 3 |
|
4 |
| -Note that this is possibly out-of-date version of the wiki ReadMe, |
5 |
| -which is located at: |
6 | 4 |
|
7 |
| - https://github.com/tesseract-ocr/tesseract/blob/master/README.md |
| 5 | +#About |
8 | 6 |
|
9 |
| -Introduction |
10 |
| -============ |
| 7 | +This package contains an OCR engine - `libtesseract` and a command line program - `tesseract`. |
11 | 8 |
|
12 |
| -This package contains the Tesseract Open Source OCR Engine. |
13 |
| -Originally developed at Hewlett-Packard Laboratories Bristol and |
14 |
| -at Hewlett-Packard Co, Greeley Colorado, all the code |
15 |
| -in this distribution is now licensed under the Apache License: |
| 9 | +The lead developer is Ray Smith. The maintainer is Zdenko Podobny. |
| 10 | +For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS) and GitHub's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors). |
16 | 11 |
|
17 |
| - Licensed under the Apache License, Version 2.0 (the "License"); |
18 |
| - you may not use this file except in compliance with the License. |
19 |
| - You may obtain a copy of the License at |
20 |
| - |
21 |
| - http://www.apache.org/licenses/LICENSE-2.0 |
22 |
| - |
23 |
| - Unless required by applicable law or agreed to in writing, software |
24 |
| - distributed under the License is distributed on an "AS IS" BASIS, |
25 |
| - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
26 |
| - See the License for the specific language governing permissions and |
27 |
| - limitations under the License. |
28 |
| - |
29 |
| - |
30 |
| -Dependencies and Licenses |
31 |
| -========================= |
32 |
| - |
33 |
| -[Leptonica](http://www.leptonica.com) is required. Tesseract no longer |
34 |
| -compiles without Leptonica. |
35 |
| - |
36 |
| -Libtiff is no longer required as a direct dependency. |
37 |
| - |
38 |
| - |
39 |
| -Installing and Running Tesseract |
40 |
| --------------------------------- |
41 |
| - |
42 |
| -All Users Do NOT Ignore! |
43 |
| - |
44 |
| -The tarballs are split into pieces. |
45 |
| - |
46 |
| -tesseract-x.xx.tar.gz contains all the source code. |
47 |
| - |
48 |
| -tesseract-x.xx.`<lang>`.tar.gz contains the language data files for `<lang>`. |
49 |
| -You need at least one of these or Tesseract will not work. |
| 12 | +Tesseract has Unicode (UTF-8) support, and can recognize more than 100 |
| 13 | +languages "out of the box". It can be trained to recognize other languages. See [Tesseract Training](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) for more information. |
50 | 14 |
|
51 |
| -Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory. |
52 |
| -tesseract-x.xx.`<lang>`.tar.gz unpacks to the tessdata directory which |
53 |
| -belongs inside your tesseract-ocr directory. It is therefore best to |
54 |
| -download them into your tesseract-x.xx directory, so you can use unpack |
55 |
| -here or equivalent. You can unpack as many of the language packs as you |
56 |
| -care to, as they all contain different files. Note that if you are using |
57 |
| -make install you should unpack your language data to your source tree |
58 |
| -before you run make install. If you unpack them as root to the |
59 |
| -destination directory of make install, then the user ids and access |
60 |
| -permissions might be messed up. |
| 15 | +Tesseract supports various output formats: plain text, hOCR (HTML), and PDF. |
61 | 16 |
|
62 |
| -boxtiff-2.xx.`<lang>`.tar.gz contains data that was used in training for |
63 |
| -those that want to do their own training. Most users should NOT download |
64 |
| -these files. |
| 17 | +The latest stable version is 3.04, released in July 2015. |
65 | 18 |
|
66 |
| -Instructions for using the training tools are documented separately at |
67 |
| -[Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) |
| 19 | +#Brief history |
68 | 20 |
|
| 21 | +Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and |
| 22 | +at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some |
| 23 | +more changes made in 1996 to port to Windows, and some C++izing in 1998. |
69 | 24 |
|
70 |
| -Windows |
71 |
| -------- |
72 |
| - |
73 |
| -Please use the installer (for 3.00 and above). Tesseract is a library with a |
74 |
| -command line interface. If you need a GUI, please check the [3rdParty wiki page](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty#gui). |
75 |
| - |
76 |
| -If you are building from the sources, the recommended build platform is |
77 |
| -VC++ Express 2010. |
78 |
| - |
79 |
| -The executables are built with static linking, so they stand more chance |
80 |
| -of working out of the box on more Windows systems. |
81 |
| - |
82 |
| -The executable must reside in the same directory as the tessdata |
83 |
| -directory or you need to set up environment variable TESSDATA_PREFIX. |
84 |
| -Installer will set it up for you. |
| 25 | +In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google. |
85 | 26 |
|
86 |
| -The command line is: |
| 27 | +[Release Notes](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes) |
87 | 28 |
|
88 |
| - tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...] |
| 29 | +#License |
89 | 30 |
|
90 |
| -If you need interface to other applications, please check wrapper section |
91 |
| -on [AddOns wiki page](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#for-tesseract-ocr-30x). |
| 31 | + The code in this repository is licensed under the Apache License, Version 2.0 (the "License"); |
| 32 | + you may not use this file except in compliance with the License. |
| 33 | + You may obtain a copy of the License at |
92 | 34 |
|
| 35 | + http://www.apache.org/licenses/LICENSE-2.0 |
93 | 36 |
|
94 |
| -Non-Windows (or Cygwin) |
95 |
| ------------------------ |
| 37 | + Unless required by applicable law or agreed to in writing, software |
| 38 | + distributed under the License is distributed on an "AS IS" BASIS, |
| 39 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 40 | + See the License for the specific language governing permissions and |
| 41 | + limitations under the License. |
96 | 42 |
|
97 |
| -You have to tell Tesseract through a standard unix mechanism where to |
98 |
| -find its data directory. You must either: |
| 43 | +**NOTE**: This software depends on other packages that may be licensed under different open source licenses. |
99 | 44 |
|
100 |
| - ./autogen.sh |
101 |
| - ./configure |
102 |
| - make |
103 |
| - sudo make install |
104 |
| - sudo ldconfig |
| 45 | +#Installing Tesseract |
105 | 46 |
|
106 |
| -to move the data files to the standard place, or: |
| 47 | +You can either [install Tesseract via a pre-built binary package](https://github.com/tesseract-ocr/tesseract/wiki) or [build it from source](https://github.com/tesseract-ocr/tesseract/wiki/Compiling). |
107 | 48 |
|
108 |
| - export TESSDATA_PREFIX="directory in which your tessdata resides/" |
| 49 | +#Running Tesseract |
109 | 50 |
|
110 |
| -In either case the command line is: |
| 51 | +Basic command line usage: |
111 | 52 |
|
112 | 53 | tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
|
113 | 54 |
|
114 |
| -New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for |
115 |
| -the help.) It might work with your OS if you know how to do that. |
| 55 | +To see the full usage, run `tesseract --help`. |
116 | 56 |
|
117 |
| -If you are linking to the libraries, as Ocropus does, please link to |
118 |
| -libtesseract_api. |
| 57 | +#Support |
119 | 58 |
|
| 59 | +Mailing lists: |
| 60 | +* [tesseract-ocr](https://groups.google.com/d/forum/tesseract-ocr) - For tesseract users. |
| 61 | +* [tesseract-dev](https://groups.google.com/d/forum/tesseract-dev) - For tesseract developers. |
120 | 62 |
|
121 |
| -If you get `leptonica not found` and you've installed it with e.g. homebrew, you |
122 |
| -can run `CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure` |
123 |
| -instead of `./configure` above. |
124 |
| - |
125 |
| - |
126 |
| -History |
127 |
| -======= |
128 |
| -The engine was developed at Hewlett-Packard Laboratories Bristol and |
129 |
| -at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some |
130 |
| -more changes made in 1996 to port to Windows, and some C++izing in 1998. |
131 |
| -A lot of the code was written in C, and then some more was written in C++. |
132 |
| -Since then all the code has been converted to at least compile with a C++ |
133 |
| -compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows |
134 |
| -with VC++2010. The C++ code makes heavy use of a list system using macros. |
135 |
| -This predates stl, was portable before stl, and is more efficient than stl |
136 |
| -lists, but has the big negative that if you do get a segmentation violation, |
137 |
| -it is hard to debug. |
138 |
| - |
139 |
| -The most recent change is that Tesseract can now recognize 39 languages, |
140 |
| -including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, |
141 |
| -is fully UTF8 capable, and is fully trainable. See TrainingTesseract for |
142 |
| -more information on training. |
143 |
| - |
144 |
| -Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. |
145 |
| -Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. |
146 |
| -With Tesseract 2.00, scripts were included to allow anyone to reproduce |
147 |
| -some of these tests. See TestingTesseract for more details. |
148 |
| - |
149 |
| - |
150 |
| -About the Engine |
151 |
| -================ |
152 |
| -This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple |
153 |
| -OUTPUT FORMATTING (txt, hocr/html), and NO UI. |
154 |
| -Having said that, in 1995, this engine was in the top 3 in terms of character |
155 |
| -accuracy, and it compiles and runs on both Linux and Windows. |
156 |
| -As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39 |
157 |
| -languages "out of the box." Code and documentation is provided for the brave |
158 |
| -to train in other languages. |
159 |
| -See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) |
160 |
| -for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen. |
| 63 | +Please read the [FAQ](https://github.com/tesseract-ocr/tesseract/wiki/FAQ) before asking a question on the mailing lists or reporting an issue. |
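
The basic command line shown under "Running Tesseract" can also be assembled programmatically. The following is a minimal sketch, not part of Tesseract itself: the helper name `build_tesseract_command` and the file names `page.png` / `page` are hypothetical, and the `hocr` config file is assumed to be present in the installed tessdata configs. It only builds the argument list in the order given by the usage line; running it requires a working `tesseract` binary on the `PATH`.

```python
import shlex


def build_tesseract_command(imagename, outputbase, lang=None, psm=None, configfiles=()):
    """Build an argv list following the README usage line:

    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
    """
    cmd = ["tesseract", imagename, outputbase]
    if lang is not None:
        # Optional language selection, e.g. "eng"
        cmd += ["-l", lang]
    if psm is not None:
        # Optional page segmentation mode, passed through as a string
        cmd += ["-psm", str(psm)]
    # Optional config files come last (assumption: "hocr" is installed)
    cmd += list(configfiles)
    return cmd


if __name__ == "__main__":
    # Hypothetical input image and output base name
    cmd = build_tesseract_command("page.png", "page", lang="eng", configfiles=["hocr"])
    # Print a shell-safe rendering of the command instead of executing it
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the command, the returned list could be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting problems with file names that contain spaces.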