:ref:`processes`,``int``,``os.cpu_count()``, "``int`` >= 1 and <= ``os.cpu_count()``"

 .. note::

@@ -131,7 +131,7 @@ processes (int)
 ++++++++++++

 .. warning::
-Recommended not to change default value. Only adjust this value if you know what you are doing.
+Recommended not to change default value. Only adjust this value if you know what you are doing. See :ref:`Adjusting processes and chunksize`.

 difPy leverages `Multiprocessing`_ to speed up the image comparison process, meaning multiple comparison tasks will be performed in parallel. The ``processes`` parameter defines the maximum number of worker processes (i. e. parallel tasks) to perform when multiprocessing. The higher the parameter, the more performance can be achieved, but in turn, the more computing resources will be required. To learn more, please refer to the `Python Multiprocessing documentation`_.
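As a rough illustration of the bounds this change documents for ``processes`` (an ``int`` of at least 1 and at most ``os.cpu_count()``), the check could be sketched as below. ``resolve_processes`` is a hypothetical helper written for this note; it is not part of difPy, and difPy's own validation may differ.

```python
import os


def resolve_processes(requested=None):
    """Hypothetical helper: return a worker count within 1..os.cpu_count()."""
    # os.cpu_count() can return None when the count cannot be determined,
    # so fall back to a single worker in that case (an assumption).
    available = os.cpu_count() or 1
    if requested is None:
        # Mirrors the documented default of os.cpu_count().
        return available
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise TypeError("processes must be an int")
    if not 1 <= requested <= available:
        raise ValueError(f"processes must be >= 1 and <= {available}")
    return requested
```

Calling ``resolve_processes()`` with no argument returns the machine's CPU count, while a value such as ``0`` raises ``ValueError``.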
``difPy.search`` supports the following parameters:
@@ -21,12 +21,12 @@ After the search is completed, further actions can be performed using :ref:`sear
 :class: tight-table

 :ref:`difPy_obj`,"``difPy_obj``",,
-:ref:`similarity`,"``str``, ``int``",``'duplicates'``, "``'similar'``, any ``int`` or ``float``"
-:ref:`lazy`,``bool``,``True``,``False``
+:ref:`similarity`,"``str``, ``int``, ``float``",``'duplicates'``, "``'similar'``, ``int`` or ``float`` >= 0"
+:ref:`same_dim`,``bool``,``True``,``False``
 :ref:`rotate`,``bool``,``True``,``False``
-:ref:`show_progress2`,``bool``,``True``,``False``
-:ref:`processes`,``int``,``None`` (``os.cpu_count()``), any ``int``
-:ref:`chunksize`,``int``,``None``, any ``int``
+:ref:`show_progress`,``bool``,``True``,``False``
+:ref:`processes`,``int``,``os.cpu_count()``, "``int`` >= 1 and <= ``os.cpu_count()``"
+:ref:`chunksize`,``int``,``None``, "``int`` >= 1"

 .. _difPy_obj:

@@ -37,7 +37,7 @@ The required ``difPy_obj`` parameter should be pointing to the ``dif`` object th

 .. _similarity:

-similarity (str, int)
+similarity (str, int, float)
 ++++++++++++

 difPy compares the images to find duplicates or similarities, based on the MSE (Mean Squared Error) between both image tensors. The target similarity rate i. e. MSE value is set with the ``similarity`` parameter.
@@ -46,34 +46,30 @@ difPy compares the images to find duplicates or similarities, based on the MSE (

 ``"similar"`` = searches for similar images. MSE threshold is set to ``5``.

-The search for similar images can be useful when searching for duplicate files that might have different file **types** (i. e. imageA.png has a duplicate imageA.jpg) and/or different file **sizes** (f. e. imageA.png (100MB) has a duplicate imageA.png (50MB)). In these cases, the MSE between the two image tensors might not be exactly == 0, hence they would not be classified as being duplicates even though in reality they are. Setting ``similarity`` to ``"similar"`` searches for duplicates with a certain tolerance, increasing the likelihood of finding duplicate images of different file types and sizes. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`lazy`).
+The search for similar images can be useful when searching for duplicate files that:
+
+* have different file **types** (f. e. imageA.png has a duplicate imageA.jpg)
+* have different file **sizes** (f. e. imageA.png (100MB) has a duplicate imageA.png (50MB))
+* are **cropped** versions of one another (f. e. imageA.png is a cropped version of imageB.png) (in this case, :ref:`same_dim` should be set to ``False``)

-Setting the "similarity" and "lazy" parameter
+In these cases, the MSE between the two image tensors might not be exactly == 0, hence they would not be classified as being duplicates even though in reality they are. Setting ``similarity`` to ``"similar"`` searches for duplicates with a certain tolerance, increasing the likelihood of finding duplicate images of different file types and sizes.

 **Manual setting**: the match MSE threshold can be adjusted manually by setting the ``similarity`` parameter to any ``int`` or ``float``. difPy will then search for images that match an MSE threshold **equal to or lower than** the one specified.

-.. _lazy:
+.. _same_dim:

-lazy (bool)
+same_dim (bool)
 ++++++++++++

-By default, difPy searches using a Lazy algorithm. This algorithm assumes that the image matches we are looking for have **the same dimensions**, i. e.duplicate images have the same width and height. If two images do not have the same dimensions, they are automatically assumed to not be duplicates. Therefore, because these images are skipped, this algorithm can provide a significant **improvement in performance**.
+By default, when searching for matches, difPy assumes images to have **the same dimensions** (width x height).

-``True`` = (default) applies the Lazy algorithm
+``True`` = (default) assumes matches have the same dimensions

-``False`` = regular algorithm is used
+``False`` = assumes matches can have different dimensions

-**When should the Lazy algorithm not be used?**
-The Lazy algorithm can speed up the comparison process significantly. Nonetheless, the algorithm might not be suited for your use case and might result in missing some matches. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`similarity`). Set ``lazy = False`` if you are searching for duplicate images with:
-
-* different **file types** (i. e. imageA.png is a duplicate of imageA.jpg)
-* and/or different **file sizes** (i. e. imageA.png (100MB) is a duplicate of imageA_compressed.png (50MB))
+.. note::
+``same_dim`` should be set to ``False`` if you are searching for image matches that have different **file types** (i. e. imageA.png is a duplicate of imageA.jpg)
+and/or if images are **cropped** versions of one another.

 .. _rotate:

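To make the ``similarity`` and ``same_dim`` wording in this hunk concrete, here is a minimal sketch of MSE-based matching. It is not difPy's implementation: ``mse``, ``shape`` and ``is_match`` are hypothetical names, images are plain nested lists of grayscale values rather than tensors, a threshold of ``0`` for ``"duplicates"`` is an assumption read from the "exactly == 0" remark, and only the ``same_dim = True`` behaviour (differing dimensions are never a match) is modelled.

```python
def shape(image):
    """Return (height, width) of a rectangular 2-D pixel grid."""
    return (len(image), len(image[0]) if image else 0)


def mse(image_a, image_b):
    """Mean squared error between two equally sized 2-D pixel grids."""
    flat_a = [pixel for row in image_a for pixel in row]
    flat_b = [pixel for row in image_b for pixel in row]
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


def is_match(image_a, image_b, similarity="duplicates"):
    """Hypothetical check: is the MSE equal to or lower than the threshold?"""
    # Named presets; 5 comes from the docs, 0 is an assumption.
    presets = {"duplicates": 0, "similar": 5}
    if isinstance(similarity, str):
        if similarity not in presets:
            raise ValueError(f"unknown similarity preset: {similarity!r}")
        threshold = presets[similarity]
    else:
        # Manual setting: any int or float >= 0.
        if similarity < 0:
            raise ValueError("similarity must be >= 0")
        threshold = similarity
    # same_dim = True behaviour: differing dimensions are never a match.
    if shape(image_a) != shape(image_b):
        return False
    return mse(image_a, image_b) <= threshold


original = [[10, 10], [10, 10]]
recompressed = [[10, 12], [10, 10]]  # one pixel differs by 2, so MSE is 1.0

print(is_match(original, recompressed))             # False
print(is_match(original, recompressed, "similar"))  # True
print(is_match(original, recompressed, 0.5))        # False
```

The recompressed image fails the strict duplicate check but passes with ``"similar"``, which is the tolerance the added paragraph describes.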
@@ -102,7 +98,7 @@ chunksize (int)
 ++++++++++++

 .. warning::
-Recommended not to change default value. Only adjust this value if you know what you are doing.
+Recommended not to change default value. Only adjust this value if you know what you are doing. See :ref:`Adjusting processes and chunksize`.

 ``chunksize`` is only used when dealing with image datasets of **more than 5k images**. See the ":ref:`Using difPy with Large Datasets`" section for further details.