This repository was archived by the owner on Jun 14, 2018. It is now read-only.

Add support for Tesseract version 3.05.00 #62

Merged (1 commit) on Apr 11, 2017

Conversation

aszlig
Contributor

@aszlig aszlig commented Apr 9, 2017

This is a bit more involved, because Tesseract 3.05.00 comes not only with improvements but also with a few quirks we need to deal with.

The first quirk is that the order of the arguments to the tesseract command now matters: the list of configurations has to come at the end of the command line. So we add a new attribute, tesseract_flags, to the BaseBuilder class that contains a list of all the flags to pass to tesseract. The tesseract_configs attribute remains pretty much the same, but now contains only a list of configs instead of being mixed with flag arguments.
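
The ordering constraint can be sketched roughly as follows. This is only an illustration, not the code from this PR: the helper name build_tesseract_command and its parameters are hypothetical, and the only thing taken from the description above is that flags go first and config names go last.

```python
def build_tesseract_command(input_name, output_base, lang=None,
                            flags=(), configs=()):
    """Assemble an argv list with the config names at the very end.

    Hypothetical helper: `flags` plays the role of tesseract_flags and
    `configs` the role of tesseract_configs.
    """
    cmd = ["tesseract", input_name, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    cmd += list(flags)    # option flags, e.g. ["-psm", "3"]
    cmd += list(configs)  # config names, e.g. ["hocr"]; always last
    return cmd


cmd = build_tesseract_command("input.bmp", "output", lang="eng",
                              flags=["-psm", "3"], configs=["hocr"])
print(" ".join(cmd))
```

Keeping the two lists separate means a builder can append to either one without having to know where the other one ends.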

Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00 now requires. Leptonica has special handling of files that reside in /tmp and assumes that any such file is one of its own internal temporary files. To deal with this, we now run Tesseract in a temporary directory that contains the input/output files and refer to these files by their relative names, because Leptonica only checks for path names beginning with /tmp.
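
A minimal sketch of that workaround, assuming the command is started with its working directory set to the temporary directory. The function name run_with_relative_paths, the make_command callback and the file names "input"/"output" are made up for illustration and are not pyocr's actual API; the sketch also assumes the tool writes its text result to "<output base>.txt".

```python
import os
import shutil
import subprocess
import tempfile


def run_with_relative_paths(image_path, make_command,
                            runner=subprocess.check_call):
    """Copy the input into a fresh directory and run the command from there.

    make_command(input_name, output_base) must return an argv list that
    uses only the relative names it is given (hypothetical callback).
    """
    workdir = tempfile.mkdtemp(prefix="ocr_")
    try:
        input_name = "input" + os.path.splitext(image_path)[1]
        shutil.copy(image_path, os.path.join(workdir, input_name))
        # With cwd=workdir the relative names resolve inside the directory,
        # so the command line never contains a path starting with /tmp.
        runner(make_command(input_name, "output"), cwd=workdir)
        with open(os.path.join(workdir, "output.txt"),
                  encoding="utf-8") as fd:
            return fd.read()
    finally:
        # Remove the directory and everything in it, even on failure.
        shutil.rmtree(workdir, ignore_errors=True)
```

The runner parameter exists only so the sketch can be exercised without Tesseract installed; by default it calls subprocess.check_call.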

Fortunately, the last item we need to address is not really a quirk but an API change. Tesseract 3.05.00 adds a new function called TessBaseAPIDetectOrientationScript(), which no longer fills the OSResults object but instead returns the values we're interested in directly through arguments passed by reference.
We need to use this new function because the old function TessBaseAPIDetectOS() now always returns false.
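
One way to stay compatible with both versions is to check at runtime which function the loaded library exports. The sketch below shows only that selection step: the helper pick_osd_entry_point is hypothetical, the libraries are stand-in objects rather than real ctypes handles, and the argument lists of the two C functions are deliberately left out.

```python
from types import SimpleNamespace


def pick_osd_entry_point(lib):
    """Return the name of the orientation-detection function to call.

    Prefers the function added in Tesseract 3.05.00 and falls back to the
    older one when the library does not export it.
    """
    if hasattr(lib, "TessBaseAPIDetectOrientationScript"):
        return "TessBaseAPIDetectOrientationScript"
    return "TessBaseAPIDetectOS"


# Stand-ins for loaded libraries; the lambdas are placeholders only.
lib_305 = SimpleNamespace(
    TessBaseAPIDetectOrientationScript=lambda *args: True,
    TessBaseAPIDetectOS=lambda *args: False,
)
lib_304 = SimpleNamespace(TessBaseAPIDetectOS=lambda *args: True)

print(pick_osd_entry_point(lib_305))
print(pick_osd_entry_point(lib_304))
```

A ctypes.CDLL object raises AttributeError when a symbol is missing, so the same hasattr() check works on a real library handle.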

Ran the test suite successfully with Python 3.5 and both Tesseract 3.04.01 and 3.05.00 except the following tests, which also didn't succeed prior to this commit:

  • cuneiform:TestTxt.test_basic
  • cuneiform:TestTxt.test_european
  • cuneiform:TestTxt.test_french
  • cuneiform:TestWordBox.test_basic
  • cuneiform:TestWordBox.test_european
  • cuneiform:TestWordBox.test_french
  • libtesseract:TestBasicDoc.test_basic
  • libtesseract:TestDigitLineBox.test_digits
  • libtesseract:TestLineBox.test_japanese
  • libtesseract:TestTxt.test_japanese
  • libtesseract:TestWordBox.test_japanese
  • tesseract:TestDigitLineBox.test_digits
  • tesseract:TestTxt.test_japanese

These failures are probably related to issue #52; judging from the test output, they do not appear to be caused by this change.

Signed-off-by: aszlig <[email protected]>
aszlig added a commit to NixOS/nixpkgs that referenced this pull request Apr 11, 2017
This is from the commit message I've written for the upstream pull
request (openpaperwork/pyocr#62):

    This is a bit more involved, because Tesseract 3.05.00 comes not
    only with improvements but also with a few quirks we need to deal
    with.

    The first quirk is that the order arguments of the `tesseract'
    command now matters and the list of configurations has to be at the
    end of the command line. So we add a new attribute tesseract_flags
    to the BaseBuilder class that contains a list of all the flags to
    pass to `tesseract', the tesseract_configs attribute however remains
    pretty much the same but now only really contains a list of configs
    instead of being mixed with flag arguments.

    Another quirk has to do with Leptonica >= 1.74 which Tesseract
    3.05.00 now requires. Leptonica has special handling of files that
    reside in /tmp and assumes that it's an internal temporary file of
    Leptonica. In order to deal with it, we now run Tesseract in a
    temporary directory, which contains the input/output files and use
    the relative name of these files because Leptonica only searches for
    path names beginning with /tmp.

    Fortunately the last item we need to address is not really a quirk,
    but an API change. In Tesseract 3.05.00 there is now a new function
    called TessBaseAPIDetectOrientationScript(), which doesn't fill the
    OSResults object anymore but now allows to pass the values we're
    interested in directly by reference. We need to use this new
    function because the old function TessBaseAPIDetectOS() now *always*
    returns false.

I've tested this specifically on NixOS and in conjunction with Paperwork
(the only package that's using pyocr so far) and all the tests of the
dependency chain are now succeeding. I didn't do manual tests of
Paperwork, though.

Signed-off-by: aszlig <[email protected]>
@jflesch
Member

jflesch commented Apr 11, 2017

Really good contribution. Thanks :-)
And thank you also for taking care of not breaking Tesseract 3.04 support.

I'll test it later with Tesseract 3.05 and make a new release (maybe next week I hope).

<rant>

> Leptonica has special handling of files that reside in /tmp

Yeah, my first question would be "what the f*ck?" .. but I guess they must have their (weird) reasons, and there is nothing we can do about it.

> We need to use this new function because the old function TessBaseAPIDetectOS() now always returns false.

This is the second time I've seen @tesseract-ocr break the C API in such a way. This is getting frustrating. They could simply mark the old function obsolete and call the new one from the old one.
Since we are using Python, it's easy for us to remain compatible, but I seriously wonder how C developers are supposed to handle this kind of change without using dlopen() & friends.

</rant>

Anyway, thank you again for this great contribution. I'll merge it right now :-)

@jflesch jflesch merged commit 76f0ad0 into openpaperwork:master Apr 11, 2017
@aszlig
Contributor Author

aszlig commented Apr 24, 2017

@jflesch: I guess C programmers simply didn't use the TessBaseAPIDetectOS function because it requires an OSResults C++ object, which is why they changed the API in the first place. Apart from that, if I wanted to use the API from C code, I'd handle that within the build and use the preprocessor to handle the different cases based on the version from pkg-config.

@jflesch
Member

jflesch commented Apr 24, 2017

@aszlig Good points. I totally forgot it was actually a C++ object.

Regarding pkg-config, however, there is no .pc file with Libtesseract 3.03 (in Debian at least), so it isn't a valid option if you want to support tesseract 3.04.
Also, even if there were one, it would still break the build silently. Tests could (and should) catch it, but still, it's bad practice to break an API when it can be easily avoided.

@aszlig aszlig deleted the tesseract-3.5 branch April 24, 2017 14:34
@jflesch
Member

jflesch commented May 13, 2017

Included in Pyocr 0.4.7
