jpg input files result in much bigger pdf #1961

Closed
ShinjiLE opened this issue Oct 7, 2018 · 4 comments

ShinjiLE commented Oct 7, 2018

Environment

  • Tesseract Version: tesseract 4.0.0-beta.4-165-g971f
    leptonica-1.77.0
    libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.7.0beta84 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 1.0.0 : libopenjp2 2.3.0
    Found AVX
    Found SSE
  • Commit Number: 5fe1390
  • Platform: Linux leo.mybase 4.18.9-1-default #1 SMP PREEMPT Thu Sep 20 06:37:04 UTC 2018 (67901ec) x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

Running OCR on JPEG input files produces a much bigger output PDF.
Input file size = 642K
pdf = 2.5M

Expected Behavior:

The PDF size should not be much bigger than the input.
pdf = 645K

Suggested Fix:

Not a fix, but this commit (5fe1390) introduced the problem, specifically at src/api/pdfrenderer.cpp line 719.

Member

stweil commented Oct 7, 2018

This looks like a regression. @zdenop, maybe we can try to fix it for 4.0.0, perhaps by writing the image in JPEG 2000 format (like at least one commercial OCR software does) if that is supported by Leptonica. That could reduce the size of the PDF a lot.

Contributor

zdenop commented Oct 7, 2018

Can you please provide your image for testing?
@stweil: The pdfrenderer philosophy is not to modify the input image for output (unless that is needed for OCR or some other reason). So if a user wants jp2 or tiff g4 compression in the PDF, the user should provide images encoded that way.
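
The pass-through policy described in this comment can be sketched as a small lookup from input format to PDF stream filter. This is only an illustration: `InputFormat`, `PdfEncoding`, `ChooseEncoding` and `FilterName` are hypothetical names, not Tesseract or Leptonica identifiers, and the handling of formats other than JPEG, JPEG 2000, TIFF G4 and PNG (deferring to a library default) is an assumption rather than something stated in this thread. The filter names themselves are the standard PDF ones.

```cpp
// Hypothetical types for illustration only; these are NOT the
// identifiers used by Tesseract's pdfrenderer or by Leptonica.
enum class InputFormat { Jpeg, Jpeg2000, TiffG4, Png, Other };
enum class PdfEncoding { Dct, Jpx, CcittG4, Flate, LibraryDefault };

// Keep already-compressed inputs in their native encoding so the
// embedded image is not re-encoded; use Flate only for PNG input.
constexpr PdfEncoding ChooseEncoding(InputFormat fmt) {
  switch (fmt) {
    case InputFormat::Jpeg:     return PdfEncoding::Dct;
    case InputFormat::Jpeg2000: return PdfEncoding::Jpx;
    case InputFormat::TiffG4:   return PdfEncoding::CcittG4;
    case InputFormat::Png:      return PdfEncoding::Flate;
    case InputFormat::Other:    break;
  }
  // Assumption: anything else is left to whatever the image library picks.
  return PdfEncoding::LibraryDefault;
}

// Standard PDF stream filter name for each encoding.
constexpr const char* FilterName(PdfEncoding enc) {
  switch (enc) {
    case PdfEncoding::Dct:            return "DCTDecode";
    case PdfEncoding::Jpx:            return "JPXDecode";
    case PdfEncoding::CcittG4:        return "CCITTFaxDecode";
    case PdfEncoding::Flate:          return "FlateDecode";
    case PdfEncoding::LibraryDefault: break;
  }
  return "(library default)";
}
```

Under this sketch a JPEG input maps to `DCTDecode`, so its compressed data can be embedded as-is. Storing the same image as losslessly compressed pixels with `FlateDecode` is the kind of change that would inflate the PDF the way this issue reports.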

@zdenop zdenop closed this as completed in f794571 Oct 7, 2018
Contributor

zdenop commented Oct 7, 2018

Thanks for the report. Should be fixed - please check.

Author

ShinjiLE commented Oct 8, 2018

Yep, it works.
Thx

zdenop added a commit that referenced this issue Oct 9, 2018
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Remove code for _MSC_VER < 1900
  keep API compatibility with #1265
  Update googletest submodule to release v1.8.1
  Update test submodule
  Always use isascii() with isspace()
  Avoid crash with --psm 0 and LSTM traineddata
  SVPaint: Remove empty block
  Classify: Don't hide debug parameter
  UNICHARMAP: Remove comparison which is always false
  svpaint: Change a variable from global to local
  pgedit: remove unused declaration of display_bln_lines
  Plumbing: Remove comparison which is always false
  Release candidate 2
  use pdf L_FLATE_ENCODE only for png input; fixes #1961
@amitdo amitdo added the bug label May 14, 2020