Characters swapped in results from 3.05.01 #1253

QuLogic opened this issue Dec 31, 2017 · 11 comments

@QuLogic

QuLogic commented Dec 31, 2017

With 3.05.01, the basic English test from pyocr started failing. pyocr supports OCR both through the library and by calling out to the tesseract executable. The basic test takes this file, does some conversion, and passes the result along to tesseract. I have attached the bitmap file that goes directly into tesseract, though the problem also reproduces if you pass the png unchanged. The expected text is here. The effective command is: tesseract input.bmp output -l eng -psm 3.

Environment

  • Tesseract Version: 3.05.00 / 3.05.01
  • Commit Number: 8ff183b / acf318a
  • Platform: Fedora 27 x86_64

Current Behavior:

With 3.05.01, the text comes out as expected except that the word "ocr" comes out as "cor".

Expected Behavior:

With 3.05.00, the text matches entirely and tesseract produces "ocr".
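A minimal sketch of how the comparison above could be automated, assuming a tesseract binary on PATH and a plain-text file holding the expected output. The function names and the `expected.txt` / `output` file names are hypothetical and not taken from pyocr's test suite; only the command line itself comes from the report.

```python
import subprocess


def tesseract_command(image, output_base, lang="eng", psm=3):
    """Build the argv for the command quoted in the report.

    Tesseract 3.05 spells the page segmentation flag as -psm.
    """
    return ["tesseract", image, output_base, "-l", lang, "-psm", str(psm)]


def differing_words(expected, actual):
    """Return (expected_word, actual_word) pairs that differ.

    Words are compared position by position. zip() stops at the shorter
    text, so missing or extra trailing words are not reported.
    """
    pairs = zip(expected.split(), actual.split())
    return [(e, a) for e, a in pairs if e != a]


def run_check(image, expected_path, output_base="output"):
    """Run tesseract on `image` and list the words that do not match.

    Hypothetical helper: tesseract writes its text to <output_base>.txt.
    """
    subprocess.run(tesseract_command(image, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        actual = f.read()
    with open(expected_path, encoding="utf-8") as f:
        expected = f.read()
    return differing_words(expected, actual)
```

Calling `run_check("input.bmp", "expected.txt")` against a build that shows the regression should return a list containing the pair `("ocr", "cor")`, and an empty list for a build without it.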

Suggested Fix:

I have bisected this change to be somewhere between 8ff183b and acf318a. I could not bisect any further because the commits did not compile (bisect is a great reason to ensure that commits compile!). As all the intervening commits appear to be code cleanup, I believe this change was an unintended regression.
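For ranges where some commits do not build, `git bisect run` accepts a script whose exit code 125 means "this commit cannot be tested, skip it", while 0 marks a good commit and any other code from 1 to 127 marks a bad one. The following is a rough sketch of such a script, not something used in this report: the build command, the location of the freshly built binary, and the file names are all assumptions.

```python
import subprocess

SKIP = 125  # git bisect run: commit cannot be tested, skip it
GOOD = 0
BAD = 1


def bisect_exit_code(build_ok, actual_text, expected_text):
    """Map a build result and OCR output to a git bisect run exit code."""
    if not build_ok:
        return SKIP
    # Compare word by word so whitespace differences do not matter.
    return GOOD if actual_text.split() == expected_text.split() else BAD


def main(build_cmd=("make",), tesseract_bin="api/tesseract",
         image="input.bmp", expected_path="expected.txt"):
    """Build the current commit, run OCR, and return the bisect exit code.

    build_cmd and tesseract_bin are assumptions about the local setup.
    """
    if subprocess.run(list(build_cmd)).returncode != 0:
        return SKIP
    ocr = subprocess.run(
        [tesseract_bin, image, "output", "-l", "eng", "-psm", "3"])
    if ocr.returncode != 0:
        return SKIP
    with open("output.txt", encoding="utf-8") as f:
        actual = f.read()
    with open(expected_path, encoding="utf-8") as f:
        expected = f.read()
    return bisect_exit_code(True, actual, expected)
```

To use it, the script would end with `import sys; sys.exit(main())`, be saved outside the work tree, and be started with `git bisect start acf318a 8ff183b` followed by `git bisect run python3 /path/to/check.py`. If every intermediate commit fails to build, as described above, git can only report the range of skipped commits rather than a single culprit.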

@QuLogic
Author

QuLogic commented Dec 31, 2017

If I go back to 8ff183b (the last working commit), then cherry-pick 90db5a1, it starts to fail again.

So, heading back to the earliest broken commit (or even the 3.05 branch), then reverting 90db5a1 fixes the problem. I don't really know the code so I can't say that's the exact fix, but it seems to work for this case.

@amitdo
Collaborator

amitdo commented Dec 31, 2017

Thanks for the report and analysis.

90db5a1 was backported from 9a5ed19 (4.00). Ray Smith reverted it in da03e4e.

@stweil
Member

stweil commented Dec 31, 2017

That sounds as if it has to be reverted for the 3.05 branch, too. @rfschtkt, what do you think?

@amitdo
Collaborator

amitdo commented Dec 31, 2017

Let's talk next year :-)

@amitdo
Collaborator

amitdo commented Jan 1, 2018

#529 (comment)

From a quick look, Dict::Load() doesn't add load_bigram_dawg to dawgs_, which either is a bug or should be documented in a comment

Ray also added this comment to Dict::Load()

  if (load_bigram_dawg) {
    bigram_dawg_ = dawg_cache_->GetSquishedDawg(lang, TESSDATA_BIGRAM_DAWG,
                                                dawg_debug_level, data_file);
    // The bigram_dawg_ is NOT used like the other dawgs! DO NOT add to the
    // dawgs_!!
  }

@Shreeshrii
Collaborator

@zdenop Please label: accuracy.

90db5a1 was backported from 9a5ed19 (4.00). Ray Smith reverted it in da03e4e.

That sounds as if it has to be reverted for the 3.05 branch, too.

@QuLogic
Author

QuLogic commented Aug 4, 2018

Soo, is there any chance the fix will be backported?

@zdenop zdenop added this to the 3.05 milestone Sep 29, 2018
@zdenop
Contributor

zdenop commented Sep 29, 2018

Is it needed? There is no plan to release a new 3.05 version. We are focused on the 4.0 release.

@amitdo
Collaborator

amitdo commented Jan 9, 2019

This issue should be fixed or closed.

zdenop added a commit that referenced this issue Jan 17, 2019
@zdenop
Contributor

zdenop commented Jan 17, 2019

Can you please check the 3.05 branch?

@QuLogic
Author

QuLogic commented Jan 19, 2019

Yep, looks like it's working there now.

@QuLogic QuLogic closed this as completed Jan 19, 2019
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021