jpg input files result in much bigger pdf #1961

ShinjiLE · 2018-10-07T14:40:44Z

Environment

Tesseract Version: tesseract 4.0.0-beta.4-165-g971f
leptonica-1.77.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.7.0beta84 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 1.0.0 : libopenjp2 2.3.0
Found AVX
Found SSE
Commit Number: 5fe1390
Platform: Linux leo.mybase 4.18.9-1-default #1 SMP PREEMPT Thu Sep 20 06:37:04 UTC 2018 (67901ec) x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

Running OCR on JPEG input files produces a PDF that is much larger than the input.
Input file size = 642K
PDF = 2.5M

Expected Behavior:

The PDF should not be much bigger than the input.
PDF = 645K

Suggested Fix:

Not a fix, but commit 5fe1390 introduced the problem, specifically at src/api/pdfrenderer.cpp line 719.


stweil · 2018-10-07T17:16:16Z

This looks like a regression. @zdenop, maybe we can try to fix it for 4.0.0, perhaps by writing the image in JPEG 2000 format (as at least one commercial OCR product does) if that is supported by Leptonica. That could reduce the size of the PDF a lot.

zdenop · 2018-10-07T17:58:36Z

Can you please provide your image for testing?
@stweil: the pdfrenderer philosophy is not to modify the input image for output (unless that is needed for OCR or some other reason). So if a user wants JP2 or TIFF G4 compression in the PDF, the user should provide images encoded that way.

zdenop · 2018-10-07T18:58:39Z

Thanks for the report. It should be fixed now, please check.

ShinjiLE · 2018-10-08T05:25:32Z

Yep, it works.
Thanks.

* 'master' of https://github.com/tesseract-ocr/tesseract:
  Remove code for _MSC_VER < 1900
  keep API compatibility with #1265
  Update googletest submodule to release v1.8.1
  Update test submodule
  Always use isascii() with isspace()
  Avoid crash with --psm 0 and LSTM traineddata
  SVPaint: Remove empty block
  Classify: Don't hide debug parameter
  UNICHARMAP: Remove comparison which is always false
  svpaint: Change a variable from global to local
  pgedit: remove unused declaration of display_bln_lines
  Plumbing: Remove comparison which is always false
  Release candidate 2
  use pdf L_FLATE_ENCODE only for png input; fixes #1961
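
The policy described in this thread (embed the input image without re-encoding it, and use Flate only for PNG input) can be illustrated with a small sketch. This is not the code in src/api/pdfrenderer.cpp: the function name `choose_pdf_filter`, the format strings, and the decision to reject unknown formats are all assumptions made for illustration. Only the PDF filter names themselves are standard.

```python
from enum import Enum


class PdfFilter(Enum):
    """Standard PDF stream filter names relevant to embedded page images."""
    DCT = "DCTDecode"          # JPEG data, stored as-is
    JPX = "JPXDecode"          # JPEG 2000 data, stored as-is
    CCITT = "CCITTFaxDecode"   # bilevel fax encoding such as TIFF G4
    FLATE = "FlateDecode"      # zlib/deflate, lossless


# Hypothetical mapping: each input format keeps its own encoding.
# Only PNG input is routed to Flate, mirroring the fix commit's title.
_PASS_THROUGH = {
    "jpeg": PdfFilter.DCT,
    "jp2": PdfFilter.JPX,
    "tiff-g4": PdfFilter.CCITT,
    "png": PdfFilter.FLATE,
}


def choose_pdf_filter(input_format: str) -> PdfFilter:
    """Return the filter that preserves the input's encoding.

    Raising on unknown formats is a choice made for this sketch,
    not a description of what Tesseract does.
    """
    key = input_format.strip().lower()
    if key not in _PASS_THROUGH:
        raise ValueError(f"no pass-through filter for {input_format!r}")
    return _PASS_THROUGH[key]


if __name__ == "__main__":
    for fmt in ("jpeg", "png", "tiff-g4", "jp2"):
        print(fmt, "->", choose_pdf_filter(fmt).value)
```

Under this policy a 642K JPEG would be stored as JPEG data, so the PDF stays close to the input size. Storing the same photographic page losslessly with Flate would be expected to produce a much larger stream, which is consistent with the 2.5M result reported above.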

zdenop closed this as completed in f794571 Oct 7, 2018

amitdo added the bug label May 14, 2020
