Description
Overview
When using certain page segmentation modes (PSMs) with certain inputs, the PageIterator::Baseline function produces incorrect results because of a bug in how line bounding boxes are obtained. I noticed this when using psm 8 (single word). This affects API users trying to get a line's baseline, and also causes incorrect results in CLI output formats that report the baseline (.hocr).
Reproducible Example
While this is most noticeable when calling it->Baseline through the API, the problem can also be demonstrated using the CLI with the example image below.
The word in the image is recognized correctly, with the same bounding box, whether psm is set to 6 (single block) or 8 (single word). However, the latter does not calculate the baseline correctly.
When psm is set to 6, the baseline attribute is set to -0.036 0, which is correct.
tesseract simple_c2.png stdout --oem 0 --psm 6 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 5.1.0-471-gbc490' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "simple_c2.png"; bbox 0 0 328 194; ppageno 0; scan_res 96 96'>
<div class='ocr_carea' id='block_1_1' title="bbox 85 84 195 104">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 85 84 195 104">
<span class='ocr_line' id='line_1_1' title="bbox 85 84 195 104; baseline -0.036 0; x_size 24.310345; x_descenders 5.3103447; x_ascenders 5">
<span class='ocrx_word' id='word_1_1' title='bbox 85 84 195 104; x_wconf 83'>Tesseract</span>
</span>
</p>
</div>
</div>
</body>
</html>
However, when psm is set to 8, the baseline attribute is set to -0 -2.005, which is incorrect.
tesseract simple_c2.png stdout --oem 0 --psm 8 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 5.1.0-471-gbc490' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "simple_c2.png"; bbox 0 0 328 194; ppageno 0; scan_res 96 96'>
<div class='ocr_carea' id='block_1_1' title="bbox 85 84 195 104">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 85 84 195 104">
<span class='ocr_line' id='line_1_1' title="bbox 85 84 195 104; baseline -0 -2.005; x_size 24.310345; x_descenders 5.3103447; x_ascenders 5">
<span class='ocrx_word' id='word_1_1' title='bbox 85 84 195 104; x_wconf 83'>Tesseract</span>
</span>
</p>
</div>
</div>
</body>
</html>
Cause
I investigated, and the root cause is that the PageIterator::Baseline function assumes that the line's bounding box has already been calculated, which is not always the case. PageIterator::Baseline gets the line's bounding box using row->bounding_box(), which does not force these values to be calculated; it simply returns the default values (-32767 or 32767) if they have not been calculated already.
tesseract/src/ccmain/pageiterator.cpp
Lines 534 to 542 in 215b023
bool PageIterator::Baseline(PageIteratorLevel level, int *x1, int *y1, int *x2,
                            int *y2) const {
  if (it_->word() == nullptr) {
    return false; // Already at the end!
  }
  ROW *row = it_->row()->row;
  WERD *word = it_->word()->word;
  TBOX box = (level == RIL_WORD || level == RIL_SYMBOL) ? word->bounding_box()
                                                        : row->bounding_box();
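The failure mode can be illustrated with a small self-contained model. This is only a sketch: the Box, Row, and Baseline types and the baseline_endpoints function below are hypothetical stand-ins, not Tesseract's TBOX, ROW, or PageIterator. The only detail taken from this report is that an uncomputed box holds 32767 for left/bottom and -32767 for right/top; the straight-line baseline y = m*x + c is an assumption made for illustration.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a bounding box. An uncomputed box is "inverted":
// left/bottom hold the largest int16 value and right/top hold its negation.
struct Box {
  int16_t left = std::numeric_limits<int16_t>::max();
  int16_t bottom = std::numeric_limits<int16_t>::max();
  int16_t right = -std::numeric_limits<int16_t>::max();
  int16_t top = -std::numeric_limits<int16_t>::max();
};

// Hypothetical stand-in for a text row whose baseline is y = m * x + c.
struct Row {
  Box box;       // never filled in, to mimic the psm 8 case
  double m = 0.0;
  double c = 0.0;
  // Plain getter: returns whatever is stored, computed or not.
  const Box &bounding_box() const { return box; }
  double base_line(double x) const { return m * x + c; }
};

struct Baseline {
  int x1, y1, x2, y2;
};

// Evaluates the baseline at the left and right edges of the row's box,
// trusting that the box has already been computed.
Baseline baseline_endpoints(const Row &row) {
  const Box &box = row.bounding_box();
  return {box.left, static_cast<int>(row.base_line(box.left) + 0.5),
          box.right, static_cast<int>(row.base_line(box.right) + 0.5)};
}
```

In this model, calling baseline_endpoints on a default-constructed Row yields x-coordinates of 32767 and -32767, so the "left" endpoint ends up to the right of the "right" endpoint and both lie far outside any real image.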
This can be confirmed by adding tprintf statements within the PageIterator::Baseline function:
bool PageIterator::Baseline(PageIteratorLevel level, int *x1, int *y1, int *x2,
int *y2) const {
if (it_->word() == nullptr) {
return false; // Already at the end!
}
ROW *row = it_->row()->row;
WERD *word = it_->word()->word;
TBOX box = (level == RIL_WORD || level == RIL_SYMBOL) ? word->bounding_box()
: row->bounding_box();
tprintf("Box: %d,%d -> %d,%d\n", box.left(), box.bottom(), box.right(),
box.top());
int left = box.left();
ICOORD startpt(left, static_cast<int16_t>(row->base_line(left) + 0.5));
int right = box.right();
ICOORD endpt(right, static_cast<int16_t>(row->base_line(right) + 0.5));
// Rotate to image coordinates and convert to global image coords.
startpt.rotate(it_->block()->block->re_rotation());
endpt.rotate(it_->block()->block->re_rotation());
*x1 = startpt.x() / scale_ + rect_left_;
*y1 = (rect_height_ - startpt.y()) / scale_ + rect_top_;
*x2 = endpt.x() / scale_ + rect_left_;
*y2 = (rect_height_ - endpt.y()) / scale_ + rect_top_;
tprintf("Baseline: (%d,%d)->(%d,%d)\n", *x1, *y1, *x2, *y2);
return true;
}
When run with psm set to 8, this produces the following:
Box: 32767,32767 -> -32767,-32767
Baseline: (32767,-990)->(-32767,1342)
Potential Fixes
I see three potential approaches to fixing this:
- Add an ad-hoc check within PageIterator::Baseline for whether the default value is being returned, and if it is, calculate the actual bounding box.
  - This would fix the issue, but may leave other bugs related to the bounding box never being calculated outstanding.
- Modify the row->bounding_box() function to calculate the bounding box if it has never been calculated before.
- Figure out why row bounding boxes are not being calculated with specific psm settings, and change the code so they are calculated upon creation.
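As a rough illustration of the second approach (compute on demand), here is a minimal sketch using a hypothetical model rather than Tesseract's real classes. The Box and Row types, the is_null and expand helpers, and the idea of deriving the row box as the union of its word boxes are all assumptions made for this example. Coordinates follow a y-up convention, so bottom is less than top.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical box type; the default state is the inverted "not computed" box.
struct Box {
  int16_t left = std::numeric_limits<int16_t>::max();
  int16_t bottom = std::numeric_limits<int16_t>::max();
  int16_t right = -std::numeric_limits<int16_t>::max();
  int16_t top = -std::numeric_limits<int16_t>::max();

  Box() = default;
  Box(int16_t l, int16_t b, int16_t r, int16_t t)
      : left(l), bottom(b), right(r), top(t) {}

  // True while the box is empty or still holds its inverted default values.
  bool is_null() const { return left >= right || top <= bottom; }

  // Grows this box so that it also covers `other`.
  void expand(const Box &other) {
    left = std::min(left, other.left);
    bottom = std::min(bottom, other.bottom);
    right = std::max(right, other.right);
    top = std::max(top, other.top);
  }
};

// Hypothetical row type that caches its bounding box.
struct Row {
  std::vector<Box> words;
  mutable Box cached;  // mutable so the const getter can fill it in lazily

  // Returns the cached box, computing it from the word boxes the first time
  // it is requested. A row without words keeps the null box.
  const Box &bounding_box() const {
    if (cached.is_null()) {
      for (const Box &w : words) {
        cached.expand(w);
      }
    }
    return cached;
  }
};
```

The same is_null test could instead be placed in the caller, which would correspond to the first approach; the trade-off noted above still applies, since other callers would keep seeing the uncomputed box.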
Environment
Ubuntu 22.04 Jammy
tesseract 5.1.0-471-gbc490
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1