Floating point exception with tessdata models since version 5.4.0 #4257

@yeezy69

Current Behavior

I use OCRmyPDF on Arch Linux. It has been crashing since yesterday, when tesseract was updated from version 5.3.4-2 to 5.4.0-1. After downgrading to 5.3.4-2, tesseract processes the same image as expected.

To isolate the problem, I ran the commands that OCRmyPDF executes manually:

$ gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -r200.161797x200.161797 -dPDFSTOPONERROR -o image.png -sstdout=%stderr -dAutoRotatePages=/None -f doc20240608121758.pdf

$ tesseract -l deu image.png 000001_ocr_hocr hocr txt
[1] 9771 floating point exception (core dumped)

$ pacman -U tesseract-5.3.4-2-x86_64.pkg.tar.zst

$ tesseract -l deu image.png 000001_ocr_hocr hocr txt
(works fine)
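
When the crash happens inside a wrapper such as OCRmyPDF, only the exit status may be visible. The following is a minimal sketch, assuming a POSIX shell on Linux, of decoding that status; `describe_exit` is a hypothetical helper and not part of tesseract or OCRmyPDF. It relies on the shell convention that a status of 128 + N usually means the process was terminated by signal N (SIGFPE is signal 8 on Linux).

```shell
# Hypothetical helper, not part of tesseract or OCRmyPDF.
# Prints a short description of how a command ended, given its exit status.
describe_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        # By shell convention, 128 + N usually means "terminated by signal N".
        echo "killed by SIG$(kill -l $((status - 128)))"
    else
        echo "exit status $status"
    fi
}

# Usage, assuming image.png from the steps above is in the current directory:
#   tesseract -l deu image.png 000001_ocr_hocr hocr txt; describe_exit $?
```

A program can also exit deliberately with a status above 128, so the result is a hint rather than proof.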

For data protection reasons I cannot share the original document, so I created a substitute document that triggers the same crash:

(image attachment: recreated sample document)

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract 5.4.0
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.2) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.8.0 OpenSSL/3.3.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0

Operating System

No response

Other Operating System

Arch Linux

uname -a

Linux pc 6.9.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 31 May 2024 15:14:45 +0000 x86_64 GNU/Linux

Compiler

No response

CPU

Intel i7-8650U

Virtualization / Containers

No response

Other Information

$ gdb --args tesseract -l deu image.png 000001_ocr_hocr hocr txt

(gdb) run
Starting program: /usr/bin/tesseract -l deu image.png 000001_ocr_hocr hocr txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff20006c0 (LWP 10820)]
[New Thread 0x7ffff16006c0 (LWP 10821)]
[New Thread 0x7ffff0c006c0 (LWP 10822)]

Thread 1 "tesseract" received signal SIGFPE, Arithmetic exception.
0x00007ffff7d8afa4 in tesseract::Classify::ComputeNormMatch(int, tesseract::FEATURE_STRUCT const&, bool) () from /usr/lib/libtesseract.so.5

(gdb) bt
#0 0x00007ffff7d8afa4 in tesseract::Classify::ComputeNormMatch(int, tesseract::FEATURE_STRUCT const&, bool) () from /usr/lib/libtesseract.so.5
#1 0x00007ffff7d806c5 in tesseract::Classify::ComputeIntCharNormArray(tesseract::FEATURE_STRUCT const&, unsigned char*) () from /usr/lib/libtesseract.so.5
#2 0x00007ffff7d705ad in tesseract::Classify::ComputeCharNormArrays(tesseract::FEATURE_STRUCT*, tesseract::INT_TEMPLATES_STRUCT*, unsigned char*, unsigned char*) () from /usr/lib/libtesseract.so.5
#3 0x00007ffff7d709fe in tesseract::Classify::CharNormTrainingSample(bool, int, tesseract::TrainingSample const&, std::vector<tesseract::UnicharRating, std::allocator<tesseract::UnicharRating> >*) () from /usr/lib/libtesseract.so.5
#4 0x00007ffff7d9312b in tesseract::TessClassifier::UnicharClassifySample(tesseract::TrainingSample const&, tesseract::Image, int, int, std::vector<tesseract::UnicharRating, std::allocator<tesseract::UnicharRating> >*) () from /usr/lib/libtesseract.so.5
#5 0x00007ffff7d6e575 in tesseract::Classify::CharNormClassifier(tesseract::TBLOB*, tesseract::TrainingSample const&, tesseract::ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.5
#6 0x00007ffff7d73e76 in tesseract::Classify::DoAdaptiveMatch(tesseract::TBLOB*, tesseract::ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.5
#7 0x00007ffff7d6c5a3 in tesseract::Classify::AdaptiveClassifier(tesseract::TBLOB*, tesseract::BLOB_CHOICE_LIST*) () from /usr/lib/libtesseract.so.5
#8 0x00007ffff7e4d1a4 in tesseract::Wordrec::call_matcher(tesseract::TBLOB*) () from /usr/lib/libtesseract.so.5
#9 0x00007ffff7e5aaeb in tesseract::Wordrec::classify_blob(tesseract::TBLOB*, char const*, tesseract::ScrollView::Color, tesseract::BlamerBundle*) () from /usr/lib/libtesseract.so.5
#10 0x00007ffff7e5ac41 in tesseract::Wordrec::classify_piece(std::vector<tesseract::SEAM*, std::allocator<tesseract::SEAM*> > const&, short, short, char const*, tesseract::TWERD*, tesseract::BlamerBundle*) () from /usr/lib/libtesseract.so.5
#11 0x00007ffff7e4b1d4 in tesseract::Wordrec::chop_word_main(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#12 0x00007ffff7e4b6f2 in tesseract::Wordrec::cc_recog(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#13 0x00007ffff7d273f9 in tesseract::Tesseract::recog_word_recursive(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#14 0x00007ffff7d2854b in tesseract::Tesseract::recog_word(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#15 0x00007ffff7d28927 in tesseract::Tesseract::tess_segment_pass_n(int, tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#16 0x00007ffff7cda5a2 in tesseract::Tesseract::match_word_pass_n(int, tesseract::WERD_RES*, tesseract::ROW*, tesseract::BLOCK*) () from /usr/lib/libtesseract.so.5
#17 0x00007ffff7ce1102 in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*) () from /usr/lib/libtesseract.so.5
#18 0x00007ffff7cd0c51 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*), bool, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*) () from /usr/lib/libtesseract.so.5
#19 0x00007ffff7cd1ac5 in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) () from /usr/lib/libtesseract.so.5
#20 0x00007ffff7cd573d in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) () from /usr/lib/libtesseract.so.5
#21 0x00007ffff7cd5ee5 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.5
#22 0x00007ffff7c9a23d in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) () from /usr/lib/libtesseract.so.5
#23 0x00007ffff7c9d963 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#24 0x00007ffff7c9ef88 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#25 0x00007ffff7c9f1b4 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#26 0x0000555555558797 in ?? ()
#27 0x00007ffff714ec88 in ?? () from /usr/lib/libc.so.6
#28 0x00007ffff714ed4c in __libc_start_main () from /usr/lib/libc.so.6
#29 0x00005555555598b5 in ?? ()
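
The backtrace above has no source file or line information, so it cannot show which operation inside `ComputeNormMatch` raised the exception. A possible way to capture a more detailed trace non-interactively is sketched below. `backtrace_of` is a hypothetical wrapper; it assumes a recent gdb built with debuginfod support and a debuginfod server configured through `DEBUGINFOD_URLS`, which may not hold on every system.

```shell
# Hypothetical wrapper: run a command under gdb in batch mode and print a
# backtrace when it stops. GDB can be overridden, e.g. GDB=echo for a dry run.
backtrace_of() {
    "${GDB:-gdb}" -batch \
        -ex "set debuginfod enabled on" \
        -ex run \
        -ex bt \
        --args "$@"
}

# Usage, assuming image.png from the report is in the current directory:
#   backtrace_of tesseract -l deu image.png 000001_ocr_hocr hocr txt
```

If debug symbols are found, the frames should include file names and line numbers instead of bare addresses.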
