
Simplify, fix and improve similar images algorithm #983


Merged
qarmin merged 13 commits into master from random_changes on Jun 9, 2023


qarmin
Owner

@qarmin qarmin commented Jun 1, 2023

  • New algorithm for finding similar images: it should be faster, give slightly better results and fix some problems; it is also easier to understand
  • Fixes a crash when using reference folders, introduced in the recent GUI cleanups
  • Added tests
  • The app should now find more broken PDF files

Times in mm:ss format

78000 tested image files, hash size 16 - Nearest - 40 - double gradient:

Old algorithm - 8:13
Current implementation - 4:42
First implementation of the current algorithm, when computations ran mostly in one thread - 16:51

78000 tested image files, hash size 16 - Nearest - 10 - double gradient:

Old algorithm - 0:52
Current implementation - 0:35
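
To make the comparison concrete: 8:13 is 493 seconds and 4:42 is 282 seconds, so the current implementation is roughly 1.75x faster in the first benchmark, and roughly 1.49x faster in the second (52 s vs 35 s). The small sketch below reproduces that arithmetic; `parse_mm_ss` and `speedup` are illustrative helper names, not functions from this project.

```rust
/// Parses a "mm:ss" string into total seconds.
/// Returns None for malformed input or a seconds field of 60 or more.
fn parse_mm_ss(s: &str) -> Option<u32> {
    let (minutes, seconds) = s.trim().split_once(':')?;
    let minutes: u32 = minutes.parse().ok()?;
    let seconds: u32 = seconds.parse().ok()?;
    if seconds >= 60 {
        return None;
    }
    Some(minutes * 60 + seconds)
}

/// Ratio old/new of two "mm:ss" timings (values above 1.0 mean `new` is faster).
fn speedup(old: &str, new: &str) -> Option<f64> {
    let old = parse_mm_ss(old)?;
    let new = parse_mm_ss(new)?;
    if new == 0 {
        return None;
    }
    Some(f64::from(old) / f64::from(new))
}
```

For example, `speedup("8:13", "4:42")` evaluates to about 1.75.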

Initial implementation:

  • Calculate hashes
  • Find similar hashes for each hash (single-threaded)
  • If a hash was not used before, use it now
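
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: hashes are modelled as byte vectors, similarity is a plain brute-force Hamming distance with a `tolerance` in bits, and the names `hamming_distance` and `group_similar` are hypothetical. The actual lookup structure used by the project is not described in this PR.

```rust
/// Number of differing bits between two hashes of equal length.
fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Greedy, single-threaded grouping (assumed reading of the list above):
/// every hash that was not used before becomes an "original" and claims
/// all later unused hashes that differ by at most `tolerance` bits.
/// Returns groups of indexes into `hashes`; the first index is the original.
fn group_similar(hashes: &[Vec<u8>], tolerance: u32) -> Vec<Vec<usize>> {
    let mut used = vec![false; hashes.len()];
    let mut groups = Vec::new();

    for i in 0..hashes.len() {
        if used[i] {
            continue; // already part of an earlier group
        }
        used[i] = true;
        let mut group = vec![i];
        for j in (i + 1)..hashes.len() {
            if !used[j] && hamming_distance(&hashes[i], &hashes[j]) <= tolerance {
                used[j] = true;
                group.push(j);
            }
        }
        // Only report hashes that actually have at least one similar partner.
        if group.len() > 1 {
            groups.push(group);
        }
    }
    groups
}
```

Because every comparison runs in one loop on one thread, the cost grows quadratically with the number of hashes, which is the kind of bottleneck the later implementations try to address.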

Previous implementation:

  • Calculate hashes
  • Split the hashes to check into n + 1 parts (where n is the number of threads)
  • Check these chunks in parallel
  • For each thread, maintain a list of hashes that are used as "originals" and info about similarity between files
  • Compare one hash with all others (multithreaded)
  • Fill and compare results from previous comparisons (multithreaded)
  • Merge all results from the initial chunks and compare the results to each other (single-threaded)

Current implementation:

  • Calculate hashes
  • Split hashes into groups of 1000 items
  • Maintain only one list of hashes used as "originals" and
  • Compare each hash from this group to every other hash (multithreaded)
  • Discard results that will certainly not be used (e.g. hashes with too little similarity) (multithreaded)
  • Merge the results from these 1000 items with the previous results (single-threaded)
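
A minimal sketch of this chunked flow, using only the standard library, might look like the code below. It is not the code from this PR: hashes are modelled as byte vectors, the comparison is a brute-force Hamming distance, `std::thread::scope` stands in for whatever thread pool the project uses, and `find_similar_chunked`, `tolerance` and `chunk_size` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::thread;

/// Number of differing bits between two hashes of equal length.
fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Processes hashes in chunks of `chunk_size` (1000 in the description above).
/// Returns a map: index of an "original" -> indexes of hashes similar to it.
fn find_similar_chunked(
    hashes: &[Vec<u8>],
    tolerance: u32,
    chunk_size: usize,
) -> HashMap<usize, Vec<usize>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let indexes: Vec<usize> = (0..hashes.len()).collect();

    // The single shared list of hashes already used, touched only while merging.
    let mut used = vec![false; hashes.len()];
    let mut groups: HashMap<usize, Vec<usize>> = HashMap::new();

    for chunk in indexes.chunks(chunk_size) {
        let per_worker = (chunk.len() + workers - 1) / workers;

        // Multithreaded part: compare every hash of the chunk with every other
        // hash and drop candidates that can never be used (no similar hashes).
        let candidates: Vec<(usize, Vec<usize>)> = thread::scope(|s| {
            let handles: Vec<_> = chunk
                .chunks(per_worker)
                .map(|part| {
                    s.spawn(move || {
                        part.iter()
                            .map(|&i| {
                                let similar: Vec<usize> = (0..hashes.len())
                                    .filter(|&j| {
                                        j != i
                                            && hamming_distance(&hashes[i], &hashes[j])
                                                <= tolerance
                                    })
                                    .collect();
                                (i, similar)
                            })
                            .filter(|(_, similar)| !similar.is_empty())
                            .collect::<Vec<_>>()
                    })
                })
                .collect();
            handles
                .into_iter()
                .flat_map(|h| h.join().expect("worker thread panicked"))
                .collect()
        });

        // Single-threaded part: merge this chunk's results with earlier ones.
        for (i, similar) in candidates {
            if used[i] {
                continue;
            }
            let free: Vec<usize> = similar.into_iter().filter(|&j| !used[j]).collect();
            if free.is_empty() {
                continue;
            }
            used[i] = true;
            for &j in &free {
                used[j] = true;
            }
            groups.insert(i, free);
        }
    }
    groups
}
```

Keeping the merge step on one thread means the shared `used` list needs no locking, while the expensive comparisons still run in parallel within each chunk.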

@qarmin qarmin changed the title from "Update dependencies" to "Simplify, fix and improve similar images algorithm" Jun 8, 2023
@qarmin qarmin merged commit 55b2744 into master Jun 9, 2023
@qarmin qarmin deleted the random_changes branch June 9, 2023 20:11