Description
Current Behavior
I use OCRmyPDF on Arch Linux. It has been crashing since yesterday, when tesseract was updated from version 5.3.4-2 to 5.4.0-1. After downgrading to 5.3.4-2, tesseract works as expected with the same image.
I executed the ocrmypdf commands manually:
$ gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -r200.161797x200.161797 -dPDFSTOPONERROR -o image.png -sstdout=%stderr -dAutoRotatePages=/None -f doc20240608121758.pdf
$ tesseract -l deu image.png 000001_ocr_hocr hocr txt
[1] 9771 floating point exception (core dumped)
$ pacman -U tesseract-5.3.4-2-x86_64.pkg.tar.zst
$ tesseract -l deu image.png 000001_ocr_hocr hocr txt
(works fine)
For data protection reasons, I recreated a document that caused the program to crash:
Expected Behavior
No response
Suggested Fix
No response
tesseract -v
tesseract 5.4.0
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.2) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.8.0 OpenSSL/3.3.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0
Operating System
No response
Other Operating System
Arch Linux
uname -a
Linux pc 6.9.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 31 May 2024 15:14:45 +0000 x86_64 GNU/Linux
Compiler
No response
CPU
Intel i7-8650U
Virtualization / Containers
No response
Other Information
$ gdb --args tesseract -l deu image.png 000001_ocr_hocr hocr txt
(gdb) run
Starting program: /usr/bin/tesseract -l deu image.png 000001_ocr_hocr hocr txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff20006c0 (LWP 10820)]
[New Thread 0x7ffff16006c0 (LWP 10821)]
[New Thread 0x7ffff0c006c0 (LWP 10822)]
Thread 1 "tesseract" received signal SIGFPE, Arithmetic exception.
0x00007ffff7d8afa4 in tesseract::Classify::ComputeNormMatch(int, tesseract::FEATURE_STRUCT const&, bool) () from /usr/lib/libtesseract.so.5
(gdb) bt
#0 0x00007ffff7d8afa4 in tesseract::Classify::ComputeNormMatch(int, tesseract::FEATURE_STRUCT const&, bool) () from /usr/lib/libtesseract.so.5
#1 0x00007ffff7d806c5 in tesseract::Classify::ComputeIntCharNormArray(tesseract::FEATURE_STRUCT const&, unsigned char*) () from /usr/lib/libtesseract.so.5
#2 0x00007ffff7d705ad in tesseract::Classify::ComputeCharNormArrays(tesseract::FEATURE_STRUCT*, tesseract::INT_TEMPLATES_STRUCT*, unsigned char*, unsigned char*) () from /usr/lib/libtesseract.so.5
#3 0x00007ffff7d709fe in tesseract::Classify::CharNormTrainingSample(bool, int, tesseract::TrainingSample const&, std::vector<tesseract::UnicharRating, std::allocator<tesseract::UnicharRating> >*) () from /usr/lib/libtesseract.so.5
#4 0x00007ffff7d9312b in tesseract::TessClassifier::UnicharClassifySample(tesseract::TrainingSample const&, tesseract::Image, int, int, std::vector<tesseract::UnicharRating, std::allocator<tesseract::UnicharRating> >*) () from /usr/lib/libtesseract.so.5
#5 0x00007ffff7d6e575 in tesseract::Classify::CharNormClassifier(tesseract::TBLOB*, tesseract::TrainingSample const&, tesseract::ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.5
#6 0x00007ffff7d73e76 in tesseract::Classify::DoAdaptiveMatch(tesseract::TBLOB*, tesseract::ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.5
#7 0x00007ffff7d6c5a3 in tesseract::Classify::AdaptiveClassifier(tesseract::TBLOB*, tesseract::BLOB_CHOICE_LIST*) () from /usr/lib/libtesseract.so.5
#8 0x00007ffff7e4d1a4 in tesseract::Wordrec::call_matcher(tesseract::TBLOB*) () from /usr/lib/libtesseract.so.5
#9 0x00007ffff7e5aaeb in tesseract::Wordrec::classify_blob(tesseract::TBLOB*, char const*, tesseract::ScrollView::Color, tesseract::BlamerBundle*) () from /usr/lib/libtesseract.so.5
#10 0x00007ffff7e5ac41 in tesseract::Wordrec::classify_piece(std::vector<tesseract::SEAM*, std::allocator<tesseract::SEAM*> > const&, short, short, char const*, tesseract::TWERD*, tesseract::BlamerBundle*) () from /usr/lib/libtesseract.so.5
#11 0x00007ffff7e4b1d4 in tesseract::Wordrec::chop_word_main(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#12 0x00007ffff7e4b6f2 in tesseract::Wordrec::cc_recog(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#13 0x00007ffff7d273f9 in tesseract::Tesseract::recog_word_recursive(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#14 0x00007ffff7d2854b in tesseract::Tesseract::recog_word(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#15 0x00007ffff7d28927 in tesseract::Tesseract::tess_segment_pass_n(int, tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#16 0x00007ffff7cda5a2 in tesseract::Tesseract::match_word_pass_n(int, tesseract::WERD_RES*, tesseract::ROW*, tesseract::BLOCK*) () from /usr/lib/libtesseract.so.5
#17 0x00007ffff7ce1102 in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*) () from /usr/lib/libtesseract.so.5
#18 0x00007ffff7cd0c51 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*), bool, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*) () from /usr/lib/libtesseract.so.5
#19 0x00007ffff7cd1ac5 in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) () from /usr/lib/libtesseract.so.5
#20 0x00007ffff7cd573d in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) () from /usr/lib/libtesseract.so.5
#21 0x00007ffff7cd5ee5 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.5
#22 0x00007ffff7c9a23d in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) () from /usr/lib/libtesseract.so.5
#23 0x00007ffff7c9d963 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#24 0x00007ffff7c9ef88 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#25 0x00007ffff7c9f1b4 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#26 0x0000555555558797 in ?? ()
#27 0x00007ffff714ec88 in ?? () from /usr/lib/libc.so.6
#28 0x00007ffff714ed4c in __libc_start_main () from /usr/lib/libc.so.6
#29 0x00005555555598b5 in ?? ()