Commit 7a90446

Update README.md
1 parent 1826ac1 commit 7a90446

1 file changed, +35 -132 lines changed

README.md

@@ -1,160 +1,63 @@
 [![Build Status](https://travis-ci.org/tesseract-ocr/tesseract.svg?branch=master)](https://travis-ci.org/tesseract-ocr/tesseract)
 [![Build status](https://ci.appveyor.com/api/projects/status/miah0ikfsf0j3819?svg=true)](https://ci.appveyor.com/project/zdenop/tesseract/)

-Note that this is possibly out-of-date version of the wiki ReadMe,
-which is located at:

-https://github.com/tesseract-ocr/tesseract/blob/master/README.md
+#About

-Introduction
-============
+This package contains an OCR engine - `libtesseract` and a command line program - `tesseract`.

-This package contains the Tesseract Open Source OCR Engine.
-Originally developed at Hewlett-Packard Laboratories Bristol and
-at Hewlett-Packard Co, Greeley Colorado, all the code
-in this distribution is now licensed under the Apache License:
+The lead developer is Ray Smith. The maintainer is Zdenko Podobny.
+For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS) and github's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors).

-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
-
-Dependencies and Licenses
-=========================
-
-[Leptonica](http://www.leptonica.com) is required. Tesseract no longer
-compiles without Leptonica.
-
-Libtiff is no longer required as a direct dependency.
-
-
-Installing and Running Tesseract
---------------------------------
-
-All Users Do NOT Ignore!
-
-The tarballs are split into pieces.
-
-tesseract-x.xx.tar.gz contains all the source code.
-
-tesseract-x.xx.`<lang>`.tar.gz contains the language data files for `<lang>`.
-You need at least one of these or Tesseract will not work.
+Tesseract has unicode (UTF-8) support, and can recognize more than 100
+languages "out of the box". It can be trained to recognize other languages. See [Tesseract Training](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) for more information.

-Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
-tesseract-x.xx.`<lang>`.tar.gz unpacks to the tessdata directory which
-belongs inside your tesseract-ocr directory. It is therefore best to
-download them into your tesseract-x.xx directory, so you can use unpack
-here or equivalent. You can unpack as many of the language packs as you
-care to, as they all contain different files. Note that if you are using
-make install you should unpack your language data to your source tree
-before you run make install. If you unpack them as root to the
-destination directory of make install, then the user ids and access
-permissions might be messed up.
+Tesseract supports various output formats: plain-text, hocr(html), pdf.

-boxtiff-2.xx.`<lang>`.tar.gz contains data that was used in training for
-those that want to do their own training. Most users should NOT download
-these files.
+The latest stable version is 3.04, released in July 2015.

-Instructions for using the training tools are documented separately at
-[Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract)
+#Brief history

+Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
+more changes made in 1996 to port to Windows, and some C++izing in 1998.

-Windows
--------
-
-Please use the installer (for 3.00 and above). Tesseract is a library with a
-command line interface. If you need a GUI, please check the [3rdParty wiki page](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty#gui).
-
-If you are building from the sources, the recommended build platform is
-VC++ Express 2010.
-
-The executables are built with static linking, so they stand more chance
-of working out of the box on more Windows systems.
-
-The executable must reside in the same directory as the tessdata
-directory or you need to set up environment variable TESSDATA_PREFIX.
-Installer will set it up for you.
+In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.

-The command line is:
+[Release Notes](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes)

-tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
+#License

-If you need interface to other applications, please check wrapper section
-on [AddOns wiki page](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#for-tesseract-ocr-30x).
+The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at

+http://www.apache.org/licenses/LICENSE-2.0

-Non-Windows (or Cygwin)
------------------------
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.

-You have to tell Tesseract through a standard unix mechanism where to
-find its data directory. You must either:
+**NOTE**: This software depends on other packages that may be licensed under different open source licenses.

-./autogen.sh
-./configure
-make
-sudo make install
-sudo ldconfig
+#Installing Tesseract

-to move the data files to the standard place, or:
+You can either [Install Tesseract via pre-built binary package](https://github.com/tesseract-ocr/tesseract/wiki) or [build it from source](https://github.com/tesseract-ocr/tesseract/wiki/Compiling).

-export TESSDATA_PREFIX="directory in which your tessdata resides/"
+#Running Tesseract

-In either case the command line is:
+Basic command line usage:

 tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

-New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for
-the help.) It might work with your OS if you know how to do that.
+To see the full usage use `tesseract --help`

-If you are linking to the libraries, as Ocropus does, please link to
-libtesseract_api.
+#Support

+Mailing-lists:
+* [tesseract-ocr](https://groups.google.com/d/forum/tesseract-ocr) - For tesseract users.
+* [tesseract-dev](https://groups.google.com/d/forum/tesseract-dev) - For tesseract developers.

-If you get `leptonica not found` and you've installed it with e.g. homebrew, you
-can run `CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure`
-instead of `./configure` above.
-
-
-History
-=======
-The engine was developed at Hewlett-Packard Laboratories Bristol and
-at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
-more changes made in 1996 to port to Windows, and some C++izing in 1998.
-A lot of the code was written in C, and then some more was written in C++.
-Since then all the code has been converted to at least compile with a C++
-compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
-with VC++2010. The C++ code makes heavy use of a list system using macros.
-This predates stl, was portable before stl, and is more efficient than stl
-lists, but has the big negative that if you do get a segmentation violation,
-it is hard to debug.
-
-The most recent change is that Tesseract can now recognize 39 languages,
-including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants,
-is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
-more information on training.
-
-Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
-Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
-With Tesseract 2.00, scripts were included to allow anyone to reproduce
-some of these tests. See TestingTesseract for more details.
-
-
-About the Engine
-================
-This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
-OUTPUT FORMATTING (txt, hocr/html), and NO UI.
-Having said that, in 1995, this engine was in the top 3 in terms of character
-accuracy, and it compiles and runs on both Linux and Windows.
-As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
-languages "out of the box." Code and documentation is provided for the brave
-to train in other languages.
-See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract)
-for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.
+Please read the [FAQ](https://github.com/tesseract-ocr/tesseract/wiki/FAQ) before asking any question in the mailing-list or reporting an issue.
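
The usage line that this commit keeps in the README (`tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]`) can also be driven from a script. The following is a minimal Python sketch, not part of the commit: the helper names `build_tesseract_command` and `run_tesseract` are hypothetical, and the only flags used are the ones shown in that usage line. It assumes a 3.04-era `tesseract` executable on `PATH`. Later releases spell the page segmentation flag `--psm`, so check `tesseract --help` for your version.

```python
import shutil
import subprocess


def build_tesseract_command(image, outputbase, lang=None, psm=None, configfiles=()):
    """Assemble an argument list that mirrors the README usage line.

    Hypothetical helper: it only arranges arguments in the order
    `tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]`.
    """
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang is not None:
        cmd += ["-l", lang]          # language code, e.g. "eng"
    if psm is not None:
        cmd += ["-psm", str(psm)]    # flag spelling as shown in this README
    cmd += list(configfiles)         # optional config file names come last
    return cmd


def run_tesseract(image, outputbase, **options):
    """Run tesseract if it is installed; raise a clear error otherwise."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image, outputbase, **options)
    # check=True raises CalledProcessError on a non-zero exit status
    return subprocess.run(cmd, check=True)
```

For example, `run_tesseract("page.png", "page", lang="eng")` would invoke `tesseract page.png page -l eng`, where `page.png` is a placeholder file name. Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.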
