Issue with searchable PDF #1889
Comments
This code change looks fine to me. But I am curious how you ran into this problem. The filename is set from api->GetInputName(). In what scenario is that empty? |
My understanding from the documentation was that the InputName was optional except for certain special uses, such as training. I had a Pix in memory which I had preprocessed with Leptonica functions and wanted to convert to a searchable PDF. I didn't know that I had to call api->SetInputName with the name of the image file (and what if there is no file name?). We could probably also eliminate the first part of this logic, which uses Deflate if the original image was a PNG file, and let Leptonica decide on the encoding based on the color depth and the alpha channel. |
The overall philosophy of the PDF rendering module is to be as hands-off as possible with the images. The less we do, the less trouble we can get into ("You made my PDF file too big!") and the less pressure for continuous development ("Leptonica needs better transcode heuristics!" "Change the pixel values in the image!"). Forcing Flate for PNG input is part of the philosophy of minimal image meddling. Why change that? PS. I'd still like to encourage use of api->SetInputName whenever possible. Do you have any documentation suggestions? |
Thanks for all your efforts and I totally agree with your philosophy! Tesseract is about OCR and not PDF generation. A lot of libraries can take OCR data and create fancy PDF files. I will be building one since the use of a zero-glyph CID font is not going to work for us. We need a switch to toggle between "image with hidden text behind" and "no image with recognized text visible". The issue I reported was not about the approach but about the fact that if the input name was not detected, a corrupt PDF was being generated. Calling api->SetInputName corrects that problem. Keep up the good work! |
@FarhadKhalafi Do you want to turn the code change in this bug into a pull request? |
@jbreiden : I can commit to source code. Should I? |
Yes please, if you could. My workload is crazy right now and I am not that
familiar with pull requests. Thanks!
|
Yes, approved. |
Merge branch 'master' into jpg_quality_option * master: (577 commits) fix issue tesseract-ocr#1889 Add badges for download , licence and lgtm Replace macro MINGW by __MINGW32__ EquationDetectBase: Define virtual destructor in .cpp file BlobGrid: Define virtual destructor in .cpp file GridBase: Define virtual destructor in .cpp file AlignedBlob: Define virtual destructor in .cpp file TransposedArray: Define virtual destructor in .cpp file IndexMapBiDi: Define virtual destructor in .cpp file Add missing include file (fixes linker error for Visual Studio) NthItemTest: Add definition for virtual destructor HeapTest: Add definition for virtual destructor IcuErrorCode: Define virtual destructor in .cpp file Validator: Define virtual destructor in .cpp file Dawg: Define virtual destructor in .cpp file CUtil: Define virtual destructor in .cpp file IndexMap: Define virtual destructor in .cpp file CCUtil: Define virtual destructor in .cpp file MATRIX: Define virtual destructor in .cpp file CCStruct: Define virtual destructor in .cpp file ...
@dthornley In your commit, I believe you forgot to remove (at the beginning of the module) if (!filename) which must be removed for the fix to work. |
IMO your proposed change is not safe because l_generateCIDataForPdf needs filename or pix. So we should change it this way:
Can you test it please? |
@zdenop I think the idea of using a file name provided by the caller independent of the Pix image is a problem. The user can pass any filename (or none) and break this code. Is it possible to use the Leptonica function pixGetInputFormat and get rid of the filename? Something like:
I believe the reason the Leptonica function l_generateCIDataForPdf accepts a filename as well as a pix is optimization: certain image files (e.g. TIFF or JPEG) can be inserted directly into the PDF without being decompressed first. If this doesn't work, then I would agree with your proposed check (or just checking for pix, as filename is checked later in the code). The same check as your suggestion is also done inside l_generateCIDataForPdf, but it is OK to check multiple times. |
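To make the idea above concrete, here is a minimal, self-contained sketch of choosing the encoding from the format recorded with the image instead of from a caller-supplied file name. It is only a model: `InputFormat`, `Encoding`, `ImageInfo` and `ChooseEncoding` are hypothetical stand-ins, not Tesseract or Leptonica names. In real code the format would come from pixGetInputFormat on the Pix, and the "let the library decide" branch would correspond to calling l_generateCIDataForPdf.

```cpp
// Hypothetical stand-in for the input format that Leptonica records on a Pix.
enum class InputFormat { kUnknown, kPng, kJpeg, kTiff };

// Hypothetical outcome of the decision: force Flate, or defer to the library.
enum class Encoding { kFlate, kLetLibraryDecide };

// Hypothetical stand-in for a Pix; only the recorded input format matters here.
struct ImageInfo {
  InputFormat input_format = InputFormat::kUnknown;
};

// Decide from the image itself, so a wrong or missing file name cannot
// influence the choice. PNG input keeps Flate (minimal image meddling);
// everything else is left to the library's own heuristics.
Encoding ChooseEncoding(const ImageInfo& image) {
  if (image.input_format == InputFormat::kPng) {
    return Encoding::kFlate;
  }
  return Encoding::kLetLibraryDecide;
}
```

An image built purely in memory would carry `kUnknown` in this model and fall through to the library's default choice, which is the case the file-name-based logic could not handle.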
The code was working fine, so I do not see a reason to eliminate the use of filename. |
I am OK with your comment. Originally I got into this situation when I was not setting the filename and the code was generating a corrupt PDF where the indirect reference to the image stream was pointing to the PDF info stream instead without complaining. Now that I know about the filename, I can set it prior to calling PDF renderer and everything is OK. Sorry if I created a diversion and thank you for all your efforts! |
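The failure mode described in this comment can be modelled in a few lines. The sketch below is not the renderer's code: `ObjectTable` and `WritePageAndResolveImage` are hypothetical names, and the only assumption is that objects receive their numbers in the order they are appended, while the page records the image's expected number in advance.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a PDF object table: objects are numbered from 1
// in the order in which they are appended.
class ObjectTable {
 public:
  // Number that the next appended object will receive.
  int NextNumber() const { return static_cast<int>(objects_.size()) + 1; }

  // Append an object and return the number it was given.
  int Append(const std::string& kind) {
    objects_.push_back(kind);
    return static_cast<int>(objects_.size());
  }

  // Follow an indirect reference to whatever object holds that number.
  const std::string& Resolve(int number) const { return objects_.at(number - 1); }

 private:
  std::vector<std::string> objects_;
};

// Writes catalog, page, (optionally) image, then info, and reports which
// object the page's image reference actually lands on.
std::string WritePageAndResolveImage(bool write_image) {
  ObjectTable table;
  table.Append("catalog");
  // The page is written next and the image is expected right after it.
  const int image_ref = table.NextNumber() + 1;
  table.Append("page");
  if (write_image) {
    table.Append("image");
  }
  table.Append("info");
  return table.Resolve(image_ref);
}
```

With `write_image` set to true the reference resolves to the image; with it set to false the info object silently takes the image's number, which matches the symptom reported here.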
As of 8/31/2018, if a PDF is generated from the current image without setting the Input Name, the PDF is corrupt. The code sets up all the pointers to the PDF image resource, both in the content stream and in the page resources, but skips storing the encoded image itself. As a result, the indirect reference to the image points to the next object in the file (in my case it was the PDF info dictionary). The problem can be prevented by setting the InputName to the name of the image being processed. I think this requirement can be relaxed as follows.
The current code in pdfrenderer.cpp:
And here is a simple fix (Leptonica needs either the filename or the pix):
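A minimal sketch of that relaxed check follows. It uses hypothetical stand-in types and is not the pdfrenderer.cpp code: `FakePix`, `ImageSource` and `PickImageSource` are illustration names only. The point it shows is that the function bails out only when neither a file name nor an in-memory image is available.

```cpp
// Hypothetical stand-in for Leptonica's Pix; only its presence matters here.
struct FakePix {
  int width = 0;
  int height = 0;
};

// Hypothetical result type: where the image data would be taken from.
enum class ImageSource { kNone, kFile, kPixInMemory };

// Fail only when there is neither a file name nor a pix. A usable
// (non-empty) file name is preferred, since some files can be embedded
// without being decoded first; otherwise fall back to the pix.
ImageSource PickImageSource(const char* filename, const FakePix* pix) {
  if (filename == nullptr && pix == nullptr) {
    return ImageSource::kNone;
  }
  if (filename != nullptr && filename[0] != '\0') {
    return ImageSource::kFile;
  }
  return pix != nullptr ? ImageSource::kPixInMemory : ImageSource::kNone;
}
```

In this model a caller who preprocessed a Pix in memory and never set an input name gets `kPixInMemory` rather than a silent failure.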