diff --git a/README.md b/README.md index b6adb63a..e66975cc 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ pip install difPy ``` -> ✨🚀 **Join the [difPy for Desktop beta tester](https://difpy.app/) program and be among to first to test the new difPy desktop app!** +> ✨🚀 **Join the [difPy for Desktop beta tester](https://difpy.short.gy/desktop-beta-ghb) program and be among the first to test the new difPy desktop app!** > :open_hands: Our motto? We :heart: Open Source! **Contributions and new ideas for difPy are always welcome** - check our [Contributor Guidelines](https://difpy.readthedocs.io/en/latest/contributing.html) for more information. @@ -204,7 +204,7 @@ difPy_xxx_stats.json The new difPy desktop app brings difPy directly to your desktop. We are now accepting beta tester sign ups and will soon be starting our first tester access wave. -✨🚀 **Join the [difPy for Desktop beta tester](https://difpy.app/) program now and be among to first to test the new difPy desktop app!** +✨🚀 **Join the [difPy for Desktop beta tester](https://difpy.short.gy/desktop-beta-ghb) program now and be among the first to test the new difPy desktop app!** ------- diff --git a/docs/conf.py b/docs/conf.py index cbc91b47..e47b1d63 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -18,8 +18,11 @@ 'sphinx.ext.autosummary', 'sphinx.ext.intersphinx', 'sphinx_rtd_theme', + #-- 'sphinxcontrib.googleanalytics' ] +# -- googleanalytics_id = 'G-X002SSZTWC' + intersphinx_mapping = { 'python': ('https://docs.python.org/3/', None), 'sphinx': ('https://www.sphinx-doc.org/en/master/', None), @@ -32,6 +35,8 @@ html_theme = 'sphinx_rtd_theme' +html_static_path = ['_static'] + # -- Options for EPUB output epub_show_urls = 'footnote' diff --git a/docs/contributing.rst b/docs/contributing/contributing.rst similarity index 92% rename from docs/contributing.rst rename to docs/contributing/contributing.rst index 43ae181e..073832b5 100644 --- a/docs/contributing.rst +++ b/docs/contributing/contributing.rst 
@@ -3,8 +3,6 @@ Contributing to difPy .. _Contributing: -.. include:: /misc/support_difpy.rst - difPy is constantly updated with code improvements, new features and requests from the community. Contributions are a good way to give feedback and to improve the functionalities and quality of the package. **Do you feel like difPy is missing a certain feature? Or do you have an idea of how to improve difPy?** @@ -35,8 +33,4 @@ Your pull request and implementation will be reviewed and approved if it passes 👉 comment your code |br| 👉 follow the code style of the project, including indentation |br| -👉 update the `README.md `_ instructions - ------------- - -.. include:: /misc/support_difpy.rst \ No newline at end of file +👉 update the `README.md `_ instructions \ No newline at end of file diff --git a/docs/contributing/support.rst b/docs/contributing/support.rst new file mode 100644 index 00000000..6404b95d --- /dev/null +++ b/docs/contributing/support.rst @@ -0,0 +1,6 @@ +Support difPy +===== + +.. _Support: + +.. include:: /misc/support_difpy.rst \ No newline at end of file diff --git a/docs/using/basic_usage.rst b/docs/getting_started/basic_usage.rst similarity index 100% rename from docs/using/basic_usage.rst rename to docs/getting_started/basic_usage.rst diff --git a/docs/usage.rst b/docs/getting_started/cli_usage.rst similarity index 81% rename from docs/usage.rst rename to docs/getting_started/cli_usage.rst index 731bed35..3c4f7b24 100644 --- a/docs/usage.rst +++ b/docs/getting_started/cli_usage.rst @@ -1,19 +1,3 @@ -Using difPy -===== - -.. _using difPy: - -**difPy** is a Python package that automates the search for duplicate/similar images. - -.. include:: /using/installation.rst - -.. include:: /using/basic_usage.rst - -.. raw:: html - -
- - .. _cli usage: CLI Usage @@ -66,27 +50,4 @@ The output of difPy is written to files and **saved in the working directory** b difPy_xxx_results.json difPy_xxx_lower_quality.txt - difPy_xxx_stats.json - - -.. raw:: html - -
- - -.. include:: /parameters/main.rst - - -.. raw:: html - -
- -.. include:: /output/main.rst - -.. include:: /output/result.rst - -.. include:: /output/result_infolder.rst - -.. include:: /output/lower_quality.rst - -.. include:: /output/stats.rst + difPy_xxx_stats.json \ No newline at end of file diff --git a/docs/using/installation.rst b/docs/getting_started/installation.rst similarity index 100% rename from docs/using/installation.rst rename to docs/getting_started/installation.rst diff --git a/docs/getting_started/output.rst b/docs/getting_started/output.rst new file mode 100644 index 00000000..ee892975 --- /dev/null +++ b/docs/getting_started/output.rst @@ -0,0 +1,104 @@ +.. _output: + +Output +---------------- + +difPy returns various types of output: + +.. _search.result: + +Search Result +^^^^^^^^^^ + +A **dictionary** of duplicates/similar images (i. e. **match groups**) that were found. Each match group has a primary image (the key of the dictionary) which holds the list of its duplicates including their filename and MSE (Mean Squared Error). The lower the MSE, the more similar the primary image and the matched images are. Therefore, an MSE of 0 indicates that two images are exact duplicates. + +.. code-block:: python + + search.result + + > Output: + {'C:/Path/image1.jpg' : [['C:/Path/duplicate_image1a.jpg', 0.0], + ['C:/Path/duplicate_image1b.jpg', 0.0]], + 'C:/Path/image2.jpg' : [['C:/Path/duplicate_image2a.jpg', 0.0]], + ... + } + +When :ref:`in_folder` is set to ``True``, the result output is slightly modified and matches are grouped in their separate folders, with the key of the dictionary being the folder path. + +.. code-block:: python + + search.result + + > Output: + {'C:/Path1/' : {'C:/Path1/image1.jpg' : [['C:/Path1/duplicate_image1a.jpg', 0.0], + ['C:/Path1/duplicate_image1b.jpg', 0.0]], + 'C:/Path1/image2.jpg' : [['C:/Path1/duplicate_image2a.jpg', 0.0]], + 'C:/Path2/' : {'C:/Path2/image1.jpg' : [['C:/Path2/duplicate_image1a.jpg', 0.0]], + ... + } + +.. 
_search.lower_quality: + +Lower Quality Files +^^^^^^^^^^ + +A **list** of duplicates/similar images that have the **lowest resolution** among match groups: + +.. code-block:: python + + search.lower_quality + + > Output: + ['C:/Path/duplicate_image1.jpg', + 'C:/Path/duplicate_image2.jpg', ...] + +To find the lower quality images, difPy compares the **image resolutions** (pixel width x pixel height) within a match group and selects all images that have lowest image file resolutions among the group. + +Lower quality images then can be **moved** to a different location (see :ref:`search.move_to`): + +.. code-block:: python + + search.move_to(destination_path='C:/Path/to/Destination/') + +Or **deleted** (see :ref:`search.delete`): + +.. code-block:: python + + search.delete(silent_del=False) + +.. _search.stats: + +Search Statistics +^^^^^^^^^^ + +A **JSON formatted collection** with statistics on the completed difPy process: + +.. code-block:: python + + search.stats + + > Output: + {'directory': ['C:/Path1/', 'C:/Path2/', ... ], + 'process': {'build': {'duration': {'start': '2024-02-18T19:52:39.479548', + 'end': '2024-02-18T19:52:41.630027', + 'seconds_elapsed': 2.1505}, + 'parameters': {'recursive': True, + 'in_folder': False, + 'limit_extensions': True, + 'px_size': 50, + 'processes': 5}}, + 'search': {'duration': {'start': '2024-02-18T19:52:41.630027', + 'end': '2024-02-18T19:52:46.770077', + 'seconds_elapsed': 5.14}, + 'parameters': {'similarity_mse': 0, + 'rotate': True, + 'lazy': True, + 'processes': 5, + 'chunksize': None}, + 'files_searched': 3228, + 'matches_found': {'duplicates': 3030, + 'similar': 0}}}, + 'total_files': 3232, + 'invalid_files': {'count': 4, + 'logs': {'C:/Path/invalid_File.pdf': 'Unsupported file type', + ... }}}} diff --git a/docs/index.rst b/docs/index.rst index 471931c6..ae531266 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,3 +1,41 @@ +.. 
toctree:: + :maxdepth: 2 + :hidden: + :caption: Getting started + + /getting_started/installation + /getting_started/basic_usage + /getting_started/cli_usage + /getting_started/output + +.. toctree:: + :maxdepth: 2 + :hidden: + :caption: Methods and parameters + + /methods/build + /methods/search + /methods/search_moveto + /methods/search_delete + +.. toctree:: + :maxdepth: 2 + :hidden: + :caption: Contributing + + /contributing/contributing + /contributing/support + +.. toctree:: + :maxdepth: 2 + :hidden: + :caption: Further Resources + + /resources/desktop + /resources/large_datasets + /resources/supported_filetypes + /resources/report_bug + difPy Guide =================================== @@ -8,27 +46,17 @@ difPy Guide **difPy** is a Python package that automates the search for duplicate/similar images. -.. note:: - - ✨ Update to `difPy v4 `_ for up to **10x performance increases** to previous versions! :ref:`What's new in v4?` - -difPy searches for images in **one or more different directories**, compares the images it found and checks whether these are duplicates. It then outputs the **image files classified as duplicates** as well as the **images having the lowest resolutions**, so you know which of the duplicate images are safe to be deleted. You can then either delete them manually, or let difPy delete them for you. +difPy searches for images in **one or more directories**, compares the images it found and checks whether these are duplicates. It then outputs the **image files classified as duplicates**, as well as the **images having the lowest resolutions**, so that you know which of the duplicate images are safe to be moved/deleted. You can then either move/delete them manually, or let difPy do this for you. -difPy does not compare images based on their hashes. It compares them based on their tensors i. e. the image content. This allows difPy to **not only search for duplicate images, but also for similar images**. 
+difPy does not compare images based on their hashes. It compares them based on their tensors i. e. the image content. This allows you to let difPy **not only search for duplicate images, but also for similar images**. difPy leverages Python's multiprocessing capabilities and is therefore able to perform at high performance even on large datasets. -View difPy on `GitHub `_ and `PyPi `_. +.. note:: + ✨ difPy will soon be available as an app for your desktop! `Learn more `_. -Guide Content --------- -.. toctree:: - :maxdepth: 3 - - usage - contributing - faq +View difPy on `GitHub `_ and `PyPi `_. ------------ diff --git a/docs/methods/build.rst b/docs/methods/build.rst new file mode 100644 index 00000000..ffc3a52f --- /dev/null +++ b/docs/methods/build.rst @@ -0,0 +1,146 @@ +.. _difPy.build: + +difPy.build +^^^^^^^^^^ + +Before difPy can perform any search, it needs to build its image repository and transform the images in the provided directory into tensors. This is what is done when ``difPy.build()`` is invoked. + +Upon completion, ``difPy.build()`` returns a ``dif`` object that can be used in :ref:`difPy.search` to start the search process. + +``difPy.build`` supports the following parameters: + +.. code-block:: python + + difPy.build(*directory, recursive=True, in_folder=False, limit_extensions=True, px_size=50, show_progress=True, processes=None) + +.. csv-table:: + :header: Parameter,Input Type,Default Value,Other Values + :widths: 10, 10, 10, 20 + :class: tight-table + + :ref:`directory`,"``str``, ``list``",, + :ref:`recursive`,``bool``,``True``,``False`` + :ref:`in_folder`,``bool``,``False``,``True`` + :ref:`limit_extensions`,``bool``,``True``,``False`` + :ref:`px_size`,"``int``, ``float``",50, ``int`` + :ref:`show_progress`,``bool``,``True``,``False`` + :ref:`processes`,``int``,``None`` (``os.cpu_count()``), ``int`` + +.. 
note:: + + If you want to reuse the image tensors generated by difPy in your own application, you can access the generated repository by calling ``difPy.build._tensor_dictionary``. To reverse the image IDs to the original filenames, use ``difPy.build._filename_dictionary``. + +.. _directory: + +directory (str, list) +++++++++++++ + +difPy supports single and multi-folder search. + +**Single Folder Search**: + +.. code-block:: python + + import difPy + dif = difPy.build("C:/Path/to/Folder/") + search = difPy.search(dif) + +**Multi Folder Search**: + +.. code-block:: python + + import difPy + dif = difPy.build(["C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/", "C:/Path/to/Folder_C/", ... ]) + search = difPy.search(dif) + +Folder paths can be specified as standalone Python strings, or within a list. + +.. _recursive: + +recursive (bool) +++++++++++++ + +By default, difPy will search for matching images recursively within the subdirectories of the :ref:`directory` parameter. If set to ``False``, subdirectories will not be scanned. + +``True`` = (default) searches recursively through all subdirectories in the directory paths + +``False`` = disables recursive search through subdirectories in the directory paths + +.. _in_folder: + +in_folder (bool) +++++++++++++ + +By default, difPy will search for matches in the union of all directories specified in the :ref:`directory` parameter. To have difPy only search for matches within each folder separately, set ``in_folder`` to ``True``. The structure of the ``search.result`` output will be slightly different if ``in_folder`` is set to ``True`` (see :ref:`output`). + +``True`` = searches for matches only among each individual directory, including subdirectories + +``False`` = (default) searches for matches in the union of all directories + +.. _limit_extensions: + +limit_extensions (bool) +++++++++++++ + +.. warning:: + Recommended not to change default value. Only adjust this value if you know what you are doing. 
difPy result accuracy cannot be guaranteed for file formats not covered by "limit_extensions". + +By default, difPy only searches for images with a predefined file type. This speeds up the process, since difPy does not have to attempt to decode files it might not support. Nonetheless, you can let difPy try to decode other file types by setting ``limit_extensions`` to ``False``. + +.. note:: + + Predefined image types include: ``apng``, ``bw``, ``cdf``, ``cur``, ``dcx``, ``dds``, ``dib``, ``emf``, ``eps``, ``fli``, ``flc``, ``fpx``, ``ftex``, ``fits``, ``gd``, ``gd2``, ``gif``, ``gbr``, ``icb``, ``icns``, ``iim``, ``ico``, ``im``, ``imt``, ``j2k``, ``jfif``, ``jfi``, ``jif``, ``jp2``, ``jpe``, ``jpeg``, ``jpg``, ``jpm``, ``jpf``, ``jpx``, ``jpeg``, ``mic``, ``mpo``, ``msp``, ``nc``, ``pbm``, ``pcd``, ``pcx``, ``pgm``, ``png``, ``ppm``, ``psd``, ``pixar``, ``ras``, ``rgb``, ``rgba``, ``sgi``, ``spi``, ``spider``, ``sun``, ``tga``, ``tif``, ``tiff``, ``vda``, ``vst``, ``wal``, ``webp``, ``xbm``, ``xpm``. + +``True`` = (default) difPy's search is limited to a set of predefined image types + +``False`` = difPy searches through all the input files + +difPy supports most popular image formats. Nevertheless, since it relies on the Pillow library for image decoding, the supported formats are restricted to the ones listed in the `Pillow Documentation`_. Unsupported file types will be marked as invalid and included in the process statistics output under ``invalid_files`` (see :ref:`Process Statistics`). + +.. _Pillow Documentation: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html + +.. _px_size: + +px_size (int) +++++++++++++ + +.. note:: + + Recommended not to change default value. + +Absolute size in pixels (width x height) of the images before being compared. The higher the ``px_size``, the more precise the comparison, but in turn more computational resources are required for difPy to compare the images. 
The lower the ``px_size``, the faster, but the less precise, the comparison process gets. + +By default, ``px_size`` is set to ``50``. + +**Manual setting**: ``px_size`` can be manually adjusted by setting it to any ``int``. + +.. _show_progress: + +show_progress (bool) +++++++++++++ + +By default, difPy will show a progress bar of the running process. + +``True`` = (default) displays the progress bar + +``False`` = disables the progress bar + +.. _processes: + +processes (int) +++++++++++++ + +.. warning:: + Recommended not to change default value. Only adjust this value if you know what you are doing. + +difPy leverages `Multiprocessing`_ to speed up the image comparison process, meaning multiple comparison tasks will be performed in parallel. The ``processes`` parameter defines the maximum number of worker processes (i. e. parallel tasks) to perform when multiprocessing. The higher the parameter, the more performance can be achieved, but in turn, the more computing resources will be required. To learn more, please refer to the `Python Multiprocessing documentation`_. + +.. _Multiprocessing: https://docs.python.org/3/library/multiprocessing.html + +.. _Python Multiprocessing documentation: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool + +By default, ``processes`` is set to `os.cpu_count()`_. This means that difPy will spawn as many processes as there are CPUs in your machine, which can lead to increased performance, but can also cause a **big computational overhead** depending on the size of your dataset. To reduce the required computing power, it is recommended to reduce this value. + +.. _os.cpu_count(): https://docs.python.org/3/library/os.html#os.cpu_count + +**Manual setting**: ``processes`` can be manually adjusted by setting it to any ``int``. It is dependent on the values supported by the ``processes`` parameter in the Python Multiprocessing package. 
To learn more about this parameter, please refer to the `Python Multiprocessing documentation`_. \ No newline at end of file diff --git a/docs/methods/search.rst b/docs/methods/search.rst new file mode 100644 index 00000000..f1ae82c7 --- /dev/null +++ b/docs/methods/search.rst @@ -0,0 +1,115 @@ +.. _difPy.search: + +difPy.search +^^^^^^^^^^ + +After the ``dif`` object has been built using :ref:`difPy.build`, the search can be initiated with ``difPy.search``. + +When invoking ``difPy.search()``, difPy starts comparing the images to find duplicates or similarities, based on the MSE (Mean Squared Error) between both image tensors. The target similarity rate i. e. MSE value is set with the :ref:`similarity` parameter. + +After the search is completed, further actions can be performed using :ref:`search.move_to` and :ref:`search.delete`. + +.. code-block:: python + + difPy.search(difPy_obj, similarity='duplicates', lazy=True, rotate=True, processes=None, chunksize=None, show_progress=False, logs=True) + +``difPy.search`` supports the following parameters: + +.. csv-table:: + :header: Parameter,Input Type,Default Value,Other Values + :widths: 10, 10, 10, 20 + :class: tight-table + + :ref:`difPy_obj`,"``difPy_obj``",, + :ref:`similarity`,"``str``, ``int``",``'duplicates'``, "``'similar'``, any ``int`` or ``float``" + :ref:`lazy`,``bool``,``True``,``False`` + :ref:`rotate`,``bool``,``True``,``False`` + :ref:`show_progress2`,``bool``,``True``,``False`` + :ref:`processes`,``int``,``None`` (``os.cpu_count()``), any ``int`` + :ref:`chunksize`,``int``,``None``, any ``int`` + +.. _difPy_obj: + +difPy_obj +++++++++++++ + +The required ``difPy_obj`` parameter should be pointing to the ``dif`` object that was built during the invocation of :ref:`difPy.build`. + +.. _similarity: + +similarity (str, int) +++++++++++++ + +difPy compares the images to find duplicates or similarities, based on the MSE (Mean Squared Error) between both image tensors. The target similarity rate i. e. 
MSE value is set with the ``similarity`` parameter. + +``"duplicates"`` = (default) searches for duplicates. MSE threshold is set to ``0``. + +``"similar"`` = searches for similar images. MSE threshold is set to ``5``. + +The search for similar images can be useful when searching for duplicate files that might have different file **types** (e. g. imageA.png has a duplicate imageA.jpg) and/or different file **sizes** (e. g. imageA.png (100MB) has a duplicate imageA.png (50MB)). In these cases, the MSE between the two image tensors might not be exactly 0, hence they would not be classified as being duplicates even though in reality they are. Setting ``similarity`` to ``"similar"`` searches for duplicates with a certain tolerance, increasing the likelihood of finding duplicate images of different file types and sizes. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`lazy`). + +.. figure:: static/assets/choosing_similarity.png + :width: 540 + :height: 390 + :alt: Setting the "similarity" & "lazy" Parameter + :align: center + + Setting the "similarity" and "lazy" parameter + +**Manual setting**: the match MSE threshold can be adjusted manually by setting the ``similarity`` parameter to any ``int`` or ``float``. difPy will then search for images whose MSE is **equal to or lower than** the specified threshold. + +.. _lazy: + +lazy (bool) +++++++++++++ + +By default, difPy searches using a Lazy algorithm. This algorithm assumes that the image matches we are looking for have **the same dimensions**, i. e. duplicate images have the same width and height. If two images do not have the same dimensions, they are automatically assumed to not be duplicates. Therefore, because these images are skipped, this algorithm can provide a significant **improvement in performance**. 
+ +``True`` = (default) applies the Lazy algorithm + +``False`` = regular algorithm is used + +**When should the Lazy algorithm not be used?** +The Lazy algorithm can speed up the comparison process significantly. Nonetheless, the algorithm might not be suited for your use case and might result in missing some matches. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`similarity`). Set ``lazy = False`` if you are searching for duplicate images with: + +* different **file types** (e. g. imageA.png is a duplicate of imageA.jpg) +* and/or different **file sizes** (e. g. imageA.png (100MB) is a duplicate of imageA_compressed.png (50MB)) + +.. _rotate: + +rotate (bool) +++++++++++++ + +By default, difPy will rotate the images on comparison. In total, 3 rotations are performed: 90°, 180° and 270°. + +``True`` = (default) rotates images on comparison + +``False`` = images are not rotated before comparison + +show_progress (bool) +++++++++++++ + +See :ref:`show_progress`. + +processes (int) +++++++++++++ + +See :ref:`processes`. + +.. _chunksize: + +chunksize (int) +++++++++++++ + +.. warning:: + Recommended not to change default value. Only adjust this value if you know what you are doing. + +``chunksize`` is only used when dealing with image datasets of **more than 5k images**. See the ":ref:`Using difPy with Large Datasets`" section for further details. + +difPy leverages a different comparison algorithm depending on the size of the input dataset. If the dataset contains more than 5k images, then the Chunking algorithm is used, which leverages generators and vectorization for more efficient computation with large datasets. The ``chunksize`` parameter defines how many chunks of image sets should be compared at once. Therefore, the higher the ``chunksize`` value, the faster the computation but the higher the memory consumption. 
+ +The ``chunksize`` parameter is already **automatically set to an optimal value** relative to the size of the dataset. Nonetheless, it can also be adjusted manually, in order to provide more control over Multiprocessing strategies and memory consumption. + +By default, ``chunksize`` is set to ``None`` which implies: ``1'000'000 / number of images in dataset``. Parameter can only be >= 1. + +**Manual setting**: ``chunksize`` can be manually adjusted by setting it to any ``int`` >= 1. \ No newline at end of file diff --git a/docs/methods/search_delete.rst b/docs/methods/search_delete.rst new file mode 100644 index 00000000..ffff4bbb --- /dev/null +++ b/docs/methods/search_delete.rst @@ -0,0 +1,37 @@ +.. _search.delete: + +search.delete +^^^^^^^^^^ + +difPy can automatically delete the lower quality duplicate/similar images it found. Images can be deleted by invoking ``delete`` on the difPy search: + +.. warning:: + + Please use with care, as this cannot be undone. + +.. code-block:: python + + import difPy + dif = difPy.build("C:/Path/to/Folder_A/") + search = difPy.search(dif) + search.delete(silent_del=False) + +.. code-block:: console + + > Output + Deleted 756 files(s) + +The images are deleted based on the ``lower_quality`` output as described under section :ref:`output`. After auto-deleting the images, every match group will be left with one single image: the image with the highest quality among its match group. + +``delete`` asks for user confirmation before deleting the images. The user confirmation can be skipped by setting :ref:`silent_del` to ``True``. + +.. _silent_del: + +silent_del (bool) +++++++++++++ + +.. note:: + + Please use with care, as this cannot be undone. + +When set to ``True``, the user confirmation for :ref:`search.delete` is skipped and the lower resolution matched images that were found by difPy are automatically deleted from their folder(s). 
\ No newline at end of file diff --git a/docs/methods/search_moveto.rst b/docs/methods/search_moveto.rst new file mode 100644 index 00000000..36ab8d15 --- /dev/null +++ b/docs/methods/search_moveto.rst @@ -0,0 +1,25 @@ +.. _search.move_to: + +search.move_to +^^^^^^^^^^ + +difPy can automatically move the lower quality duplicate/similar images it found to another directory. Images can be moved by invoking ``move_to`` on the difPy search: + +.. code-block:: python + + import difPy + dif = difPy.build("C:/Path/to/Folder_A/") + search = difPy.search(dif) + search.move_to(destination_path="C:/Path/to/Destination/") + +.. code-block:: console + + > Output + Moved 756 files(s) to "C:/Path/to/Destination" + +.. _destination_path: + +destination_path (str) +++++++++++++ + +Directory to which the lower quality files should be moved. Should be given as a Python ``str``. \ No newline at end of file diff --git a/docs/output/lower_quality.rst b/docs/output/lower_quality.rst deleted file mode 100644 index 9a730add..00000000 --- a/docs/output/lower_quality.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. _search.lower_quality: - -Lower Quality Files -^^^^^^^^^^ - -A **list** of duplicates/similar images that have the **lowest resolution** among match groups: - -.. code-block:: python - - search.lower_quality - - > Output: - ['C:/Path/duplicate_image1.jpg', - 'C:/Path/duplicate_image2.jpg', ...] - -To find the lower quality images, difPy compares the **image resolutions** (pixel width x pixel height) within a match group and selects all images that have lowest image file resolutions among the group. - -Lower quality images then can be **moved** to a different location (see :ref:`search.move_to`): - -.. code-block:: python - - search.move_to(destination_path='C:/Path/to/Destination/') - -Or **deleted** (see :ref:`search.delete`): - -.. 
code-block:: python - - search.delete(silent_del=False) \ No newline at end of file diff --git a/docs/output/main.rst b/docs/output/main.rst deleted file mode 100644 index ba586e93..00000000 --- a/docs/output/main.rst +++ /dev/null @@ -1,6 +0,0 @@ -.. _output: - -Output ----------------- - -difPy returns various types of output: \ No newline at end of file diff --git a/docs/output/result.rst b/docs/output/result.rst deleted file mode 100644 index 8c52d2db..00000000 --- a/docs/output/result.rst +++ /dev/null @@ -1,17 +0,0 @@ -.. _search.result: - -Search Result -^^^^^^^^^^ - -A **dictionary** of duplicates/similar images (i. e. **match groups**) that were found. Each match group has a primary image (the key of the dictionary) which holds the list of its duplicates including their filename and MSE (Mean Squared Error). The lower the MSE, the more similar the primary image and the matched images are. Therefore, an MSE of 0 indicates that two images are exact duplicates. - -.. code-block:: python - - search.result - - > Output: - {'C:/Path/image1.jpg' : [['C:/Path/duplicate_image1a.jpg', 0.0], - ['C:/Path/duplicate_image1b.jpg', 0.0]], - 'C:/Path/image2.jpg' : [['C:/Path/duplicate_image2a.jpg', 0.0]], - ... - } \ No newline at end of file diff --git a/docs/output/result_infolder.rst b/docs/output/result_infolder.rst deleted file mode 100644 index 75c36558..00000000 --- a/docs/output/result_infolder.rst +++ /dev/null @@ -1,13 +0,0 @@ -When :ref:`in_folder` is set to ``True``, the result output is slightly modified and matches are grouped in their separate folders, with the key of the dictionary being the folder path. - -.. code-block:: python - - search.result - - > Output: - {'C:/Path1/' : {'C:/Path1/image1.jpg' : [['C:/Path1/duplicate_image1a.jpg', 0.0], - ['C:/Path1/duplicate_image1b.jpg', 0.0]], - 'C:/Path1/image2.jpg' : [['C:/Path1/duplicate_image2a.jpg', 0.0]], - 'C:/Path2/' : {'C:/Path2/image1.jpg' : [['C:/Path2/duplicate_image1a.jpg', 0.0]], - ... 
- } \ No newline at end of file diff --git a/docs/output/stats.rst b/docs/output/stats.rst deleted file mode 100644 index c22162ee..00000000 --- a/docs/output/stats.rst +++ /dev/null @@ -1,36 +0,0 @@ -.. _search.stats: - -Search Statistics -^^^^^^^^^^ - -A **JSON formatted collection** with statistics on the completed difPy process: - -.. code-block:: python - - search.stats - - > Output: - {'directory': ['C:/Path1/', 'C:/Path2/', ... ], - 'process': {'build': {'duration': {'start': '2024-02-18T19:52:39.479548', - 'end': '2024-02-18T19:52:41.630027', - 'seconds_elapsed': 2.1505}, - 'parameters': {'recursive': True, - 'in_folder': False, - 'limit_extensions': True, - 'px_size': 50, - 'processes': 5}}, - 'search': {'duration': {'start': '2024-02-18T19:52:41.630027', - 'end': '2024-02-18T19:52:46.770077', - 'seconds_elapsed': 5.14}, - 'parameters': {'similarity_mse': 0, - 'rotate': True, - 'lazy': True, - 'processes': 5, - 'chunksize': None}, - 'files_searched': 3228, - 'matches_found': {'duplicates': 3030, - 'similar': 0}}}, - 'total_files': 3232, - 'invalid_files': {'count': 4, - 'logs': {'C:/Path/invalid_File.pdf': 'Unsupported file type', - ... }}}} \ No newline at end of file diff --git a/docs/parameters/chunksize.rst b/docs/parameters/chunksize.rst deleted file mode 100644 index 4c81724e..00000000 --- a/docs/parameters/chunksize.rst +++ /dev/null @@ -1,15 +0,0 @@ -chunksize (int) -++++++++++++ - -.. warning:: - Recommended not to change default value. Only adjust this value if you know what you are doing. - -``chunksize`` is only used when dealing with image datasets of **more than 5k images**. See the ":ref:`Using difPy with Large Datasets`" section for further details. - -difPy leverages a different comparison algorithm depending on the size of the input dataset. If the dataset contains more than 5k images, then the Chunking algorithm is used, which leverages generators and vectorization for more efficient computation with large datasets. 
The ``chunksize`` parameter defines how many chunks of image sets should be compared at once. Therefore, the higher the ``chunksize`` value, the faster the computation but the higher the memory consumption. - -The ``chunksize`` parameter is already **automatically set to an optimal value** relative to the size of the dataset. Nonetheless, it can also be adjusted manually, in order to provide more control over Multiprocessing strategies and memory consumption. - -By default, ``chunksize`` is set to ``None`` which implies: ``1'000'000 / number of images in dataset``. Parameter can only be >= 1. - -**Manual setting**: ``chunksize`` can be manually adjusted by setting it to any ``int`` >= 1. \ No newline at end of file diff --git a/docs/parameters/deprecated/logs.rst b/docs/parameters/deprecated/logs.rst deleted file mode 100644 index 6d58b249..00000000 --- a/docs/parameters/deprecated/logs.rst +++ /dev/null @@ -1,6 +0,0 @@ -logs (bool) -++++++++++++ - -``logs`` was deprecated as of v4.1. See the `release notes`_. - -.. _release notes: https://github.com/elisemercury/Duplicate-Image-Finder/releases \ No newline at end of file diff --git a/docs/parameters/destination_path.rst b/docs/parameters/destination_path.rst deleted file mode 100644 index f8afc9f0..00000000 --- a/docs/parameters/destination_path.rst +++ /dev/null @@ -1,4 +0,0 @@ -destination_path (str) -++++++++++++ - -Directory of where the lower quality files should me moved. Should be given as Python ``string``. \ No newline at end of file diff --git a/docs/parameters/directory.rst b/docs/parameters/directory.rst deleted file mode 100644 index 765228f4..00000000 --- a/docs/parameters/directory.rst +++ /dev/null @@ -1,22 +0,0 @@ -directory (str, list) -++++++++++++ - -difPy supports single and multi-folder search. - -**Single Folder Search**: - -.. code-block:: python - - import difPy - dif = difPy.build("C:/Path/to/Folder/") - search = difPy.search(dif) - -**Multi Folder Search**: - -.. 
code-block:: python - - import difPy - dif = difPy.build(["C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/", "C:/Path/to/Folder_C/", ... ]) - search = difPy.search(dif) - -Folder paths can be specified as standalone Python strings, or within a list. \ No newline at end of file diff --git a/docs/parameters/in_folder.rst b/docs/parameters/in_folder.rst deleted file mode 100644 index 8537a07e..00000000 --- a/docs/parameters/in_folder.rst +++ /dev/null @@ -1,8 +0,0 @@ -in_folder (bool) -++++++++++++ - -By default, difPy will search for matches in the union of all directories specified in the :ref:`directory` parameter. To have difPy only search for matches within each folder separately, set ``in_folder`` to ``True``. The structure of the ``search.result`` output will be slightly different if ``in_folder`` is set to ``True`` (see :ref:`output`). - -``True`` = searches for matches only among each individual directory, including subdirectories - -``False`` = (default) searches for matches in the union of all directories \ No newline at end of file diff --git a/docs/parameters/lazy.rst b/docs/parameters/lazy.rst deleted file mode 100644 index 351bc5fc..00000000 --- a/docs/parameters/lazy.rst +++ /dev/null @@ -1,14 +0,0 @@ -lazy (bool) -++++++++++++ - -By default, difPy searches using a Lazy algorithm. This algorithm assumes that the image matches we are looking for have **the same dimensions**, i. e.duplicate images have the same width and height. If two images do not have the same dimensions, they are automatically assumed to not be duplicates. Therefore, because these images are skipped, this algorithm can provide a significant **improvement in performance**. - -``True`` = (default) applies the Lazy algorithm - -``False`` = regular algorithm is used - -**When should the Lazy algorithm not be used?** -The Lazy algorithm can speed up the comparison process significantly. Nonetheless, the algorithm might not be suited for your use case and might result in missing some matches. 
Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`similarity`). Set ``lazy = False`` if you are searching for duplicate images with: - -* different **file types** (i. e. imageA.png is a duplicate of imageA.jpg) -* and/or different **file sizes** (i. e. imageA.png (100MB) is a duplicate of imageA_compressed.png (50MB)) \ No newline at end of file diff --git a/docs/parameters/limit_extensions.rst b/docs/parameters/limit_extensions.rst deleted file mode 100644 index cd0e796c..00000000 --- a/docs/parameters/limit_extensions.rst +++ /dev/null @@ -1,19 +0,0 @@ -limit_extensions (bool) -++++++++++++ - -.. warning:: - Recommended not to change default value. Only adjust this value if you know what you are doing. difPy result accuracy can not be guaranteed for file formats not covered by "limit_extensions". - -By default, difPy only searches for images with a predefined file type. This speeds up the process, since difPy does not have to attempt to decode files it might not support. Nonetheless, you can let difPy try to decode other file types by setting ``limit_extensions`` to ``False``. - -.. note:: - - Predefined image types includes: ``apng``, ``bw``, ``cdf``, ``cur``, ``dcx``, ``dds``, ``dib``, ``emf``, ``eps``, ``fli``, ``flc``, ``fpx``, ``ftex``, ``fits``, ``gd``, ``gd2``, ``gif``, ``gbr``, ``icb``, ``icns``, ``iim``, ``ico``, ``im``, ``imt``, ``j2k``, ``jfif``, ``jfi``, ``jif``, ``jp2``, ``jpe``, ``jpeg``, ``jpg``, ``jpm``, ``jpf``, ``jpx``, ``jpeg``, ``mic``, ``mpo``, ``msp``, ``nc``, ``pbm``, ``pcd``, ``pcx``, ``pgm``, ``png``, ``ppm``, ``psd``, ``pixar``, ``ras``, ``rgb``, ``rgba``, ``sgi``, ``spi``, ``spider``, ``sun``, ``tga``, ``tif``, ``tiff``, ``vda``, ``vst``, ``wal``, ``webp``, ``xbm``, ``xpm``. - -``True`` = (default) difPy's search is limited to a set of predefined image types - -``False`` = difPy searches through all the input files - -difPy supports most popular image formats. 
Nevertheless, since it relies on the Pillow library for image decoding, the supported formats are restricted to the ones listed in the `Pillow Documentation`_. Unsupported file types will be marked as invalid and included in the process statistics output under ``invalid_files`` (see :ref:`Process Statistics`). - -.. _Pillow Documentation: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html \ No newline at end of file diff --git a/docs/parameters/main.rst b/docs/parameters/main.rst deleted file mode 100644 index 7a74e54c..00000000 --- a/docs/parameters/main.rst +++ /dev/null @@ -1,199 +0,0 @@ -.. _parameters: - -Parameters ----------------- - -.. _difPy.build: - -difPy.build -^^^^^^^^^^ - -Before difPy can perform any search, it needs to build its image repository and transform the images in the provided directory into tensors. This is what is done when ``difPy.build()`` is invoked. - -Upon completion, ``difPy.build()`` returns a ``dif`` object that can be used in :ref:`difPy.search` to start the search process. - -``difPy.build`` supports the following parameters: - -.. code-block:: python - - difPy.build(*directory, recursive=True, in_folder=False, limit_extensions=True, px_size=50, show_progress=True, processes=None) - -.. csv-table:: - :header: Parameter,Input Type,Default Value,Other Values - :widths: 10, 10, 10, 20 - :class: tight-table - - :ref:`directory`,"``str``, ``list``",, - :ref:`recursive`,``bool``,``True``,``False`` - :ref:`in_folder`,``bool``,``False``,``True`` - :ref:`limit_extensions`,``bool``,``True``,``False`` - :ref:`px_size`,"``int``, ``float``",50, ``int`` - :ref:`show_progress`,``bool``,``True``,``False`` - :ref:`processes`,``int``,``None`` (``os.cpu_count()``), ``int`` - -.. note:: - - If you want to reuse the image tensors generated by difPy in your own application, you can access the generated repository by calling ``difPy.build._tensor_dictionary``. 
To reverse the image IDs to the original filenames, use ``difPy.build._filename_dictionary``. - -.. _directory: - -.. include:: /parameters/directory.rst - -.. _recursive: - -.. include:: /parameters/recursive.rst - -.. _in_folder: - -.. include:: /parameters/in_folder.rst - -.. _limit_extensions: - -.. include:: /parameters/limit_extensions.rst - -.. _px_size: - -.. include:: /parameters/px_size.rst - -.. _show_progress: - -.. include:: /parameters/show_progress.rst - -.. _processes: - -.. include:: /parameters/processes.rst - -.. _logs: - -.. include:: /parameters/deprecated/logs.rst - -.. raw:: html - -
- -.. _difPy.search: - -difPy.search -^^^^^^^^^^ - -After the ``dif`` object has been built using :ref:`difPy.build`, the search can be initiated with ``difPy.search``. - -When invoking ``difPy.search()``, difPy starts comparing the images to find duplicates or similarities, based on the MSE (Mean Squared Error) between both image tensors. The target similarity rate i. e. MSE value is set with the :ref:`similarity` parameter. - -After the search is completed, further actions can be performed using :ref:`search.move_to` and :ref:`search.delete`. - -.. code-block:: python - - difPy.search(difPy_obj, similarity='duplicates', lazy=True, rotate=True, processes=None, chunksize=None, show_progress=False, logs=True) - -``difPy.search`` supports the following parameters: - -.. csv-table:: - :header: Parameter,Input Type,Default Value,Other Values - :widths: 10, 10, 10, 20 - :class: tight-table - - :ref:`difPy_obj`,"``difPy_obj``",, - :ref:`similarity`,"``str``, ``int``",``'duplicates'``, "``'similar'``, any ``int`` or ``float``" - :ref:`lazy`,``bool``,``True``,``False`` - :ref:`rotate`,``bool``,``True``,``False`` - :ref:`show_progress2`,``bool``,``True``,``False`` - :ref:`processes`,``int``,``None`` (``os.cpu_count()``), any ``int`` - :ref:`chunksize`,``int``,``None``, any ``int`` - -.. _difPy_obj: - -difPy_obj -++++++++++++ - -The required ``difPy_obj`` parameter should be pointing to the ``dif`` object that was built during the invocation of :ref:`difPy.build`. - -.. _similarity: - -.. include:: /parameters/similarity.rst - -.. _lazy: - -.. include:: /parameters/lazy.rst - -.. _rotate: - -.. include:: /parameters/rotate.rst - -.. _show_progress2: - -.. include:: /parameters/show_progress.rst - -.. _processes2: - -.. include:: /parameters/processes.rst - -.. _chunksize: - -.. include:: /parameters/chunksize.rst - -.. _logs2: - -.. include:: /parameters/deprecated/logs.rst - -.. raw:: html - -
- -.. _search.move_to: - -search.move_to -^^^^^^^^^^ - -difPy can automatically move the lower quality duplicate/similar images it found to another directory. Images can be moved by invoking ``move_to`` on the difPy search: - -.. code-block:: python - - import difPy - dif = difPy.build("C:/Path/to/Folder_A/") - search = difPy.search(dif) - search.move_to(destination_path="C:/Path/to/Destination/") - -.. code-block:: console - - > Output - Moved 756 files(s) to "C:/Path/to/Destination" - -.. _destination_path: - -.. include:: /parameters/destination_path.rst - -.. raw:: html - -
- -.. _search.delete: - -search.delete -^^^^^^^^^^ - -difPy can automatically delete the lower quality duplicate/similar images it found. Images can be deleted by invoking ``delete`` on the difPy search: - -.. warning:: - - Please use with care, as this cannot be undone. - -.. code-block:: python - - import difPy - dif = difPy.build("C:/Path/to/Folder_A/") - search = difPy.search(dif) - search.delete(silent_del=False) - -.. code-block:: console - - > Output - Deleted 756 files(s) - -The images are deleted based on the ``lower_quality`` output as described under section :ref:`output`. After auto-deleting the images, every match group will be left with one single image: the image with the highest quality among its match group. - -``delete`` asks for user confirmation before deleting the images. The user confirmation can be skipped by setting :ref:`silent_del` to ``True``. - -.. _silent_del: - -.. include:: /parameters/silent_del.rst \ No newline at end of file diff --git a/docs/parameters/processes.rst b/docs/parameters/processes.rst deleted file mode 100644 index c1571a55..00000000 --- a/docs/parameters/processes.rst +++ /dev/null @@ -1,17 +0,0 @@ -processes (int) -++++++++++++ - -.. warning:: - Recommended not to change default value. Only adjust this value if you know what you are doing. - -difPy leverages `Multiprocessing`_ to speed up the image comparison process, meaning multiple comparison tasks will be performed in parallel. The ``processes`` parameter defines the maximum number of worker processes (i. e. parallel tasks) to perform when multiprocessing. The higher the parameter, the more performance can be achieved, but in turn, the more computing resources will be required. To learn more, please refer to the `Python Multiprocessing documentation`_. - -.. _Multiprocessing: https://docs.python.org/3/library/multiprocessing.html - -.. 
_Python Multiprocessing documentation: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool - -By default, ``processes`` is set to `os.cpu_count()`_. This means that difPy will spawn as many processes as number of CPUs in your machine, which can lead to increased performance, but can also cause a **big computational overhead** depending on the size of your dataset. To reduce the required computing power, it is recommended to reduce this value. - -.. _os.cpu_count(): https://docs.python.org/3/library/os.html#os.cpu_count - -**Manual setting**: ``processes`` can be manually adjusted by setting it to any ``int``. It is dependant on values supported by the ``process`` parameter in the Python Multiprocessing package. To learn more about this parameter, please refer to the `Python Multiprocessing documentation`_. \ No newline at end of file diff --git a/docs/parameters/px_size.rst b/docs/parameters/px_size.rst deleted file mode 100644 index 9d1b2cd5..00000000 --- a/docs/parameters/px_size.rst +++ /dev/null @@ -1,12 +0,0 @@ -px_size (int) -++++++++++++ - -.. note:: - - Recommended not to change default value. - -Absolute size in pixels (width x height) of the images before being compared. The higher the ``px_size``, the more precise the comparison, but in turn more computational resources are required for difPy to compare the images. The lower the ``px_size``, the faster, but the more imprecise the comparison process gets. - -By default, ``px_size`` is set to ``50``. - -**Manual setting**: ``px_size`` can be manually adjusted by setting it to any ``int``. \ No newline at end of file diff --git a/docs/parameters/recursive.rst b/docs/parameters/recursive.rst deleted file mode 100644 index b3d3571c..00000000 --- a/docs/parameters/recursive.rst +++ /dev/null @@ -1,8 +0,0 @@ -recursive (bool) -++++++++++++ - -By default, difPy will search for matching images recursively within the subdirectories of the :ref:`directory` parameter. 
If set to ``False``, subdirectories will not be scanned. - -``True`` = (default) searches recursively through all subdirectories in the directory paths - -``False`` = disables recursive search through subdirectories in the directory paths \ No newline at end of file diff --git a/docs/parameters/rotate.rst b/docs/parameters/rotate.rst deleted file mode 100644 index f6a34013..00000000 --- a/docs/parameters/rotate.rst +++ /dev/null @@ -1,8 +0,0 @@ -rotate (bool) -++++++++++++ - -By default, difPy will rotate the images on comparison. In total, 3 rotations are performed: 90°, 180° and 270° degree rotations. - -``True`` = (default) rotates images on comparison - -``False`` = images are not rotated before comparison \ No newline at end of file diff --git a/docs/parameters/show_progress.rst b/docs/parameters/show_progress.rst deleted file mode 100644 index 3e148c5c..00000000 --- a/docs/parameters/show_progress.rst +++ /dev/null @@ -1,8 +0,0 @@ -show_progress (bool) -++++++++++++ - -By default, difPy will show a progress bar of the running process. - -``True`` = (default) displays the progress bar - -``False`` = disables the progress bar \ No newline at end of file diff --git a/docs/parameters/silent_del.rst b/docs/parameters/silent_del.rst deleted file mode 100644 index a6261406..00000000 --- a/docs/parameters/silent_del.rst +++ /dev/null @@ -1,8 +0,0 @@ -silent_del (bool) -++++++++++++ - -.. note:: - - Please use with care, as this cannot be undone. - -When set to ``True``, the user confirmation for :ref:`search.delete` is skipped and the lower resolution matched images that were found by difPy are automatically deleted from their folder(s). 
\ No newline at end of file diff --git a/docs/parameters/similarity.rst b/docs/parameters/similarity.rst deleted file mode 100644 index 217851b6..00000000 --- a/docs/parameters/similarity.rst +++ /dev/null @@ -1,20 +0,0 @@ -similarity (str, int) -++++++++++++ - -difPy compares the images to find duplicates or similarities, based on the MSE (Mean Squared Error) between both image tensors. The target similarity rate i. e. MSE value is set with the ``similarity`` parameter. - -``"duplicates"`` = (default) searches for duplicates. MSE threshold is set to ``0``. - -``"similar"`` = searches for similar images. MSE threshold is set to ``5``. - -The search for similar images can be useful when searching for duplicate files that might have different file **types** (i. e. imageA.png has a duplicate imageA.jpg) and/or different file **sizes** (f. e. imageA.png (100MB) has a duplicate imageA.png (50MB)). In these cases, the MSE between the two image tensors might not be exactly == 0, hence they would not be classified as being duplicates even though in reality they are. Setting ``similarity`` to ``"similar"`` searches for duplicates with a certain tolerance, increasing the likelihood of finding duplicate images of different file types and sizes. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`lazy`). - -.. figure:: static/assets/choosing_similarity.png - :width: 540 - :height: 390 - :alt: Setting the "similarity" & "lazy" Parameter - :align: center - - Setting the "similarity" and "lazy" parameter - -**Manual setting**: the match MSE threshold can be adjusted manually by setting the ``similarity`` parameter to any ``int`` or ``float``. difPy will then search for images that match an MSE threshold **equal to or lower than** the one specified. 
\ No newline at end of file diff --git a/docs/resources/desktop.rst b/docs/resources/desktop.rst new file mode 100644 index 00000000..2dd5d602 --- /dev/null +++ b/docs/resources/desktop.rst @@ -0,0 +1,8 @@ +.. _desktop: + +difPy for Desktop +---------------- + +difPy for Desktop brings image deduplication as an easy-to-use app to your desktop. We are now accepting beta tester sign-ups and will soon be starting our first tester access wave. + +✨🚀 `Join the difPy for Desktop beta tester program `_ now and be among the first to test the new difPy desktop app! \ No newline at end of file diff --git a/docs/faq.rst b/docs/resources/faq.rst similarity index 99% rename from docs/faq.rst rename to docs/resources/faq.rst index 6ad36713..66dc81a4 100644 --- a/docs/faq.rst +++ b/docs/resources/faq.rst @@ -14,7 +14,7 @@ Starting with `v4.1.0`_, difPy handles small and larger datasets differently. Si When difPy receives a **"small" dataset** (<= 5k images), it uses its classic algorithm and compares **all image combinations at once**, hence all of the image data is loaded into memory. This can speed up the comparison processing time, but in turn is heavier on memory consumption. Therefore, this algorithm is only used on smaller datasets. -.. figure:: static/assets/simple_algorithm.png +.. figure:: ../static/assets/simple_algorithm.png :width: 480 :height: 170 :alt: Simple algorithm visualized diff --git a/docs/resources/large_datasets.rst b/docs/resources/large_datasets.rst new file mode 100644 index 00000000..2e27598e --- /dev/null +++ b/docs/resources/large_datasets.rst @@ -0,0 +1,34 @@ +.. _Using difPy with Large Datasets: + +Using difPy with Large Datasets +---------------- + +Starting with `v4.1.0`_, difPy handles small and larger datasets differently. 
Since the computational overhead and especially memory consumption can become very high on large image datasets, difPy utilizes a different algorithm specifically to process larger datasets more efficiently and with lower memory consumption. + +.. _v4.1.0: https://github.com/elisemercury/Duplicate-Image-Finder/releases + +When difPy receives a **"small" dataset** (<= 5k images), it uses its classic algorithm and compares **all image combinations at once**, hence all of the image data is loaded into memory. This can speed up the comparison processing time, but in turn is heavier on memory consumption. Therefore, this algorithm is only used on smaller datasets. + +.. figure:: ../static/assets/simple_algorithm.png + :width: 480 + :height: 170 + :alt: Simple algorithm visualized + :align: center + + Classic algorithm visualized + +When difPy receives a **"large" dataset** (> 5k images), a different algorithm is used which **splits images into smaller groups** and processes these chunk-by-chunk leveraging `Python generators`_. This leads to a significant reduction in memory overhead, as less data is loaded into memory at a time. Furthermore, images are compared leveraging vectorization, which also allows for faster comparison times on larger datasets. + +.. _Python generators: https://docs.python.org/3/reference/expressions.html#yield-expressions + +.. figure:: ../static/assets/batch_algorithm.png + :width: 480 + :height: 250 + :alt: Chunking algorithm visualized + :align: center + + Chunking algorithm visualized + +The picture above visualizes how chunks are processed by the chunking algorithm. Each of the image columns represents a chunk. + +The ``chunksize`` parameter defines **how many of these chunks will be processed at once** (see :ref:`chunksize`). By default, ``chunksize`` is set to ``None`` which implies: ``1'000'000 / number of images in dataset``. 
This ratio is used to automatically size the ``chunksize`` according to the size of the dataset, with the goal of keeping memory consumption low. This is a good technique for datasets smaller than 1 million images. Once the number of images exceeds that, higher memory consumption becomes inevitable, as the number of potential image combinations (matches) becomes increasingly large. **It is not recommended to adjust this parameter manually**. diff --git a/docs/resources/report_bug.rst b/docs/resources/report_bug.rst new file mode 100644 index 00000000..9263a86b --- /dev/null +++ b/docs/resources/report_bug.rst @@ -0,0 +1,8 @@ +.. _Report a Bug: + +Report a Bug 🐛 +---------------- + +Should you encounter any issue or unwanted behavior when using difPy, `you can open an issue here `_. + +Since difPy is fully open source, you can also fix the bug yourself and contribute to making difPy better. See :ref:`Contributing` for more information. \ No newline at end of file diff --git a/docs/resources/supported_filetypes.rst b/docs/resources/supported_filetypes.rst new file mode 100644 index 00000000..b9020c7c --- /dev/null +++ b/docs/resources/supported_filetypes.rst @@ -0,0 +1,8 @@ +.. _Supported File Types: + +Supported File Types +---------------- + +difPy supports most popular image formats. Nevertheless, since it relies on the Pillow library for image decoding, the supported formats are restricted to the ones listed in the `Pillow Documentation`_. Unsupported file types will be marked as invalid and included in the process statistics output under ``invalid_files`` (see :ref:`Process Statistics`). + +.. 
_Pillow Documentation: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html \ No newline at end of file diff --git a/docs/static/assets/app.png b/docs/static/assets/app.png deleted file mode 100644 index 5341acf2..00000000 Binary files a/docs/static/assets/app.png and /dev/null differ diff --git a/docs/static/assets/app_transp.png b/docs/static/assets/app_transp.png deleted file mode 100644 index 1f6dd31b..00000000 Binary files a/docs/static/assets/app_transp.png and /dev/null differ diff --git a/docs/static/assets/difPy_logo_1.png b/docs/static/assets/difPy_logo_1.png deleted file mode 100644 index 7bb66e19..00000000 Binary files a/docs/static/assets/difPy_logo_1.png and /dev/null differ diff --git a/docs/static/assets/difPy_logo_2.png b/docs/static/assets/difPy_logo_2.png deleted file mode 100644 index 27d06634..00000000 Binary files a/docs/static/assets/difPy_logo_2.png and /dev/null differ diff --git a/docs/static/assets/difPyweb_demo.gif b/docs/static/assets/difPyweb_demo.gif deleted file mode 100644 index efbd468b..00000000 Binary files a/docs/static/assets/difPyweb_demo.gif and /dev/null differ diff --git a/docs/static/assets/result.png b/docs/static/assets/result.png deleted file mode 100644 index 8a794de2..00000000 Binary files a/docs/static/assets/result.png and /dev/null differ diff --git a/docs/static/logos/logo-min.png b/docs/static/logos/logo-min.png new file mode 100644 index 00000000..2974e5c1 Binary files /dev/null and b/docs/static/logos/logo-min.png differ diff --git a/docs/static/assets/difPy_logo_3.png b/docs/static/logos/logo.png similarity index 100% rename from docs/static/assets/difPy_logo_3.png rename to docs/static/logos/logo.png diff --git a/docs/static/logos/logo.svg b/docs/static/logos/logo.svg new file mode 100644 index 00000000..2e482262 --- /dev/null +++ b/docs/static/logos/logo.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/static/logos/logo_small.ico b/docs/static/logos/logo_small.ico new 
file mode 100644 index 00000000..d5b8db9c Binary files /dev/null and b/docs/static/logos/logo_small.ico differ diff --git a/docs/static/logos/logo_small.png b/docs/static/logos/logo_small.png new file mode 100644 index 00000000..5c85626c Binary files /dev/null and b/docs/static/logos/logo_small.png differ diff --git a/docs/static/logos/logo_small.svg b/docs/static/logos/logo_small.svg new file mode 100644 index 00000000..6b1d9020 --- /dev/null +++ b/docs/static/logos/logo_small.svg @@ -0,0 +1 @@ + \ No newline at end of file