Add support for Tesseract version 3.05.00 #62
This is a bit more involved, because Tesseract 3.05.00 comes not only with improvements but also with a few quirks we need to deal with.

The first quirk is that the order of the arguments of the `tesseract` command now matters, and the list of configurations has to be at the end of the command line. So we add a new attribute `tesseract_flags` to the `BaseBuilder` class that contains a list of all the flags to pass to `tesseract`. The `tesseract_configs` attribute remains pretty much the same, but it now only contains a list of configs instead of being mixed with flag arguments.

Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00 now requires. Leptonica has special handling of files that reside in `/tmp` and assumes that any such file is an internal temporary file of Leptonica. To deal with this, we now run Tesseract in a temporary directory that contains the input/output files, and we use the relative names of these files, because Leptonica only looks for path names beginning with `/tmp`.

Fortunately, the last item we need to address is not really a quirk but an API change. Tesseract 3.05.00 has a new function called `TessBaseAPIDetectOrientationScript()`, which no longer fills the `OSResults` object but instead lets us pass the values we're interested in directly by reference. We need to use this new function because the old function `TessBaseAPIDetectOS()` now *always* returns false.
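To illustrate the first two quirks, here is a minimal Python sketch. It is not the code from this pull request. The helper names `build_command` and `run_in_tmpdir`, the file names `input.bmp` and `output`, and the injectable `run` parameter are assumptions made for illustration, as is reading the result from `output.txt`. The only things taken from the description above are that flags come before the config names and that Tesseract runs inside a temporary directory with relative file names.

```python
import os
import subprocess
import tempfile


def build_command(input_name, output_base, flags, configs):
    """Put the flags right after the file names and the config names last."""
    return ["tesseract", input_name, output_base] + list(flags) + list(configs)


def run_in_tmpdir(image_bytes, flags, configs, run=subprocess.check_call):
    """Run tesseract inside a fresh directory, passing only relative names.

    `run` is injectable (hypothetical design choice) so the sketch can be
    exercised without tesseract being installed.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write the input image into the working directory.
        with open(os.path.join(workdir, "input.bmp"), "wb") as fd:
            fd.write(image_bytes)
        # Relative names only: no argument starts with /tmp.
        cmd = build_command("input.bmp", "output", flags, configs)
        run(cmd, cwd=workdir)
        # Assumption: plain-text output lands in <output_base>.txt.
        with open(os.path.join(workdir, "output.txt"), encoding="utf-8") as fd:
            return fd.read()
```

Keeping flags and configs in two separate lists means the ordering rule is enforced in one place, when the command is assembled, rather than by every caller.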
Ran the test suite successfully with Python 3.5 and both Tesseract 3.04.01 and 3.05.00, except for the following tests, which also didn't succeed prior to this commit:

* cuneiform:TestTxt.test_basic
* cuneiform:TestTxt.test_european
* cuneiform:TestTxt.test_french
* cuneiform:TestWordBox.test_basic
* cuneiform:TestWordBox.test_european
* cuneiform:TestWordBox.test_french
* libtesseract:TestBasicDoc.test_basic
* libtesseract:TestDigitLineBox.test_digits
* libtesseract:TestLineBox.test_japanese
* libtesseract:TestTxt.test_japanese
* libtesseract:TestWordBox.test_japanese
* tesseract:TestDigitLineBox.test_digits
* tesseract:TestTxt.test_japanese

The failure of these test cases is probably related to issue openpaperwork#52, but from looking at the failures it doesn't seem to be related to this change anyway.

Signed-off-by: aszlig <[email protected]>
This is from the commit message I've written for the upstream pull request (openpaperwork/pyocr#62):

This is a bit more involved, because Tesseract 3.05.00 comes not only with improvements but also with a few quirks we need to deal with.

The first quirk is that the order of the arguments of the `tesseract` command now matters, and the list of configurations has to be at the end of the command line. So we add a new attribute `tesseract_flags` to the `BaseBuilder` class that contains a list of all the flags to pass to `tesseract`. The `tesseract_configs` attribute remains pretty much the same, but it now only contains a list of configs instead of being mixed with flag arguments.

Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00 now requires. Leptonica has special handling of files that reside in `/tmp` and assumes that any such file is an internal temporary file of Leptonica. To deal with this, we now run Tesseract in a temporary directory that contains the input/output files, and we use the relative names of these files, because Leptonica only looks for path names beginning with `/tmp`.

Fortunately, the last item we need to address is not really a quirk but an API change. Tesseract 3.05.00 has a new function called `TessBaseAPIDetectOrientationScript()`, which no longer fills the `OSResults` object but instead lets us pass the values we're interested in directly by reference. We need to use this new function because the old function `TessBaseAPIDetectOS()` now *always* returns false.

I've tested this specifically on NixOS and in conjunction with Paperwork (the only package that's using pyocr so far), and all the tests of the dependency chain are now succeeding. However, I didn't do any manual tests of Paperwork.

Signed-off-by: aszlig <[email protected]>
Really good contribution. Thanks :-) I'll test it later with Tesseract 3.05 and make a new release (maybe next week, I hope).

<rant>
Yeah, my first question would be "what the f*ck?" .. but I guess they must have their (weird) reasons, and there is nothing we can do about it.
This is the second time I've seen @tesseract-ocr break the C API in such a way. This is getting frustrating. They could simply mark the old function obsolete, and call the new one from the old one.
</rant>

Anyway, thank you again for this great contribution. I'll merge it right now :-)
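For readers following the API discussion, the calling pattern at issue (prefer the new entry point when the library exports it, and receive results through by-reference arguments rather than a filled struct) can be sketched in Python with `ctypes`. This is only an illustration. `detect_orientation` and `FakeLib` are hypothetical names, and the two-pointer argument list is a simplification that is not the real C signature of `TessBaseAPIDetectOrientationScript()`. A stand-in object replaces the loaded library so the sketch runs without libtesseract.

```python
import ctypes


def detect_orientation(lib, handle):
    """Return (degrees, confidence), or None if detection is unavailable or fails.

    `lib` stands in for a loaded libtesseract. The argument list used here
    is simplified for illustration and is NOT the real C signature.
    """
    new_func = getattr(lib, "TessBaseAPIDetectOrientationScript", None)
    if new_func is None:
        # A real caller would fall back to TessBaseAPIDetectOS() here.
        return None
    # Out-parameters: the callee writes its results into these objects.
    orient_deg = ctypes.c_int(0)
    orient_conf = ctypes.c_float(0.0)
    ok = new_func(handle, ctypes.pointer(orient_deg), ctypes.pointer(orient_conf))
    if not ok:
        return None
    return orient_deg.value, orient_conf.value


class FakeLib:
    """Hypothetical stand-in for the shared library, for demonstration only."""

    @staticmethod
    def TessBaseAPIDetectOrientationScript(handle, deg_ptr, conf_ptr):
        # Write through the pointers, as a C function would.
        deg_ptr[0] = 90
        conf_ptr[0] = 12.5
        return True
```

Checking for the symbol at run time is one way to support several library versions from the same Python code without compiling against a specific header.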
@jflesch: I guess C programmers simply didn't use the
@aszlig Good points. I totally forgot it was actually a C++ object. Regarding pkg-config, however, there is no .pc file with libtesseract 3.03 (in Debian at least), so it isn't a valid option if you want to support Tesseract 3.04.
Included in pyocr 0.4.7