Commit 84d1313

Merge pull request #116 from elisemercury/v4.2.0-updates
V4.2.0 updates
2 parents b6dc4d3 + dba97f4 commit 84d1313

20 files changed, +181 -154 lines changed

LICENSE.txt

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2024 Elise Landman
+Copyright (c) 2025 Elise Landman

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 6 additions & 6 deletions
@@ -127,7 +127,7 @@ search.stats
 'seconds_elapsed': 5.14},
 'parameters': {'similarity_mse': 0,
 'rotate': True,
-'lazy': True,
+'same_dim': True,
 'processes': 5,
 'chunksize': None},
 'files_searched': 3232,
@@ -143,12 +143,12 @@ difPy supports the following parameters:

 ```python
 difPy.build(*directory, recursive=True, in_folder=False, limit_extensions=True, px_size=50,
-show_progress=True, processes=None)
+show_progress=True, processes=os.cpu_count())
 ```

 ```python
-difPy.search(difpy_obj, similarity='duplicates', rotate=True, lazy=True, show_progress=True,
-processes=None, chunksize=None)
+difPy.search(difpy_obj, similarity='duplicates', rotate=True, same_dim=True, show_progress=True,
+processes=os.cpu_count(), chunksize=None)
 ```

 :notebook: For a **detailed usage guide**, please view the official **[difPy Usage Documentation](https://difpy.readthedocs.io/)**.
@@ -172,14 +172,14 @@ difPy CLI supports the following arguments:
 dif.py [-h] [-D DIRECTORY [DIRECTORY ...]] [-Z OUTPUT_DIRECTORY]
 [-r {True,False}] [-i {True,False}] [-le {True,False}]
 [-px PX_SIZE] [-s SIMILARITY] [-ro {True,False}]
-[-la {True,False}] [-proc PROCESSES] [-ch CHUNKSIZE]
+[-dim {True,False}] [-proc PROCESSES] [-ch CHUNKSIZE]
 [-mv MOVE_TO] [-d {True,False}] [-sd {True,False}]
 [-p {True,False}]
 ```

 | | Parameter | | Parameter |
 | :---: | ------ | :---: | ------ |
-| `-D` | directory | `-la` | lazy |
+| `-D` | directory | `-dim` | same_dim |
 | `-Z` | output_directory | `-proc` | processes |
 | `-r`| recursive | `-ch` | chunksize |
 | `-i`| in_folder | `-mv` | move_to |
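
The README changes rename the `lazy` parameter to `same_dim` and the CLI flag `-la` to `-dim`, so keyword arguments written for 4.1.x need updating. Below is a minimal sketch of one way to translate old-style keyword arguments. The helper `migrate_search_kwargs` and the `RENAMED` mapping are hypothetical and not part of difPy, and the sketch assumes that `same_dim` keeps the boolean meaning of `lazy`, which the diff suggests but does not state outright.

```python
# Hypothetical helper, not part of difPy: maps parameter names used
# before v4.2.0 to the names shown in this commit.
RENAMED = {"lazy": "same_dim"}


def migrate_search_kwargs(kwargs):
    """Return a copy of kwargs with pre-4.2.0 names replaced."""
    migrated = {}
    for name, value in kwargs.items():
        new_name = RENAMED.get(name, name)
        # Refuse ambiguous input that passes both the old and the new name.
        if new_name in migrated:
            raise ValueError(f"both old and new name given for {new_name!r}")
        migrated[new_name] = value
    return migrated


old_style = {"similarity": "duplicates", "rotate": True, "lazy": False}
new_style = migrate_search_kwargs(old_style)
print(new_style)
```

The result could then be passed on as `difPy.search(dif, **new_style)`, where `dif` is the object returned by `difPy.build`.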

difPy/dif.py

Lines changed: 97 additions & 87 deletions
Large diffs are not rendered by default.

difPy/version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '4.1.3'
+__version__ = '4.2.0'

docs/getting_started/cli_usage.rst renamed to docs/01_getting_started/cli_usage.rst

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ difPy in the CLI supports the following arguments:
 dif.py [-h] [-D DIRECTORY [DIRECTORY ...]] [-Z OUTPUT_DIRECTORY]
 [-r {True,False}] [-i {True,False}] [-le {True,False}]
 [-px PX_SIZE] [-s SIMILARITY] [-ro {True,False}]
-[-la {True,False}] [-proc PROCESSES] [-ch CHUNKSIZE]
+[-dim {True,False}] [-proc PROCESSES] [-ch CHUNKSIZE]
 [-mv MOVE_TO] [-d {True,False}] [-sd {True,False}]
 [-p {True,False}]

@@ -33,7 +33,7 @@ difPy in the CLI supports the following arguments:
 :widths: 5, 10, 5, 10
 :class: tight-table

-``-D``,:ref:`directory`,``-la``,:ref:`lazy`
+``-D``,:ref:`directory`,``-la``,:ref:`same_dim`
 ``-Z``,output_directory,``-proc``,:ref:`processes`
 ``-r``,:ref:`recursive`,``-ch``,:ref:`chunksize`
 ``-i``,:ref:`in_folder`,``-mv``,move_to (see :ref:`search.move_to`)

docs/getting_started/output.rst renamed to docs/01_getting_started/output.rst

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ A **JSON formatted collection** with statistics on the completed difPy process:
 'seconds_elapsed': 5.14},
 'parameters': {'similarity_mse': 0,
 'rotate': True,
-'lazy': True,
+'same_dim': True,
 'processes': 5,
 'chunksize': None},
 'files_searched': 3228,

docs/methods/build.rst renamed to docs/02_methods/build.rst

Lines changed: 4 additions & 4 deletions
@@ -20,11 +20,11 @@ Upon completion, ``difPy.build()`` returns a ``dif`` object that can be used in

 :ref:`directory`,"``str``, ``list``",,
 :ref:`recursive`,``bool``,``True``,``False``
-:ref:`in_folder`,"``bool``, ``False``",``True``
+:ref:`in_folder`,``bool``,``True``,``False``
 :ref:`limit_extensions`,``bool``,``True``,``False``
-:ref:`px_size`,"``int``, ``float``",50, ``int``
+:ref:`px_size`,``int``,50, "``int`` >= 10 and <= 5000"
 :ref:`show_progress`,``bool``,``True``,``False``
-:ref:`processes`,``int``,``None`` (``os.cpu_count()``), ``int``
+:ref:`processes`,``int``,``os.cpu_count()``, "``int`` >= 1 and <= ``os.cpu_count()``"

 .. note::

@@ -131,7 +131,7 @@ processes (int)
 ++++++++++++

 .. warning::
-Recommended not to change default value. Only adjust this value if you know what you are doing.
+Recommended not to change default value. Only adjust this value if you know what you are doing. See :ref:`Adjusting processes and chunksize`.

 difPy leverages `Multiprocessing`_ to speed up the image comparison process, meaning multiple comparison tasks will be performed in parallel. The ``processes`` parameter defines the maximum number of worker processes (i. e. parallel tasks) to perform when multiprocessing. The higher the parameter, the more performance can be achieved, but in turn, the more computing resources will be required. To learn more, please refer to the `Python Multiprocessing documentation`_.
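
The updated `difPy.build` parameter table documents explicit ranges: `px_size` between 10 and 5000, and `processes` between 1 and `os.cpu_count()`. Below is a minimal sketch of how a caller might check values against those ranges before building. `check_build_params` is a hypothetical helper, and how difPy itself validates these values is not shown in this diff.

```python
import os


def check_build_params(px_size=50, processes=None):
    """Validate values against the ranges documented for difPy.build.

    Hypothetical helper, not part of difPy.
    """
    # os.cpu_count() can return None on some platforms; fall back to 1.
    max_processes = os.cpu_count() or 1
    if processes is None:
        processes = max_processes
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(px_size, bool) or not isinstance(px_size, int):
        raise TypeError("px_size must be an int")
    if not 10 <= px_size <= 5000:
        raise ValueError("px_size must be >= 10 and <= 5000")
    if isinstance(processes, bool) or not isinstance(processes, int):
        raise TypeError("processes must be an int")
    if not 1 <= processes <= max_processes:
        raise ValueError(f"processes must be >= 1 and <= {max_processes}")
    return px_size, processes


print(check_build_params(px_size=100, processes=1))
```

Leaving `processes` unset in this sketch falls back to the machine's CPU count, mirroring the new documented default.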

docs/methods/search.rst renamed to docs/02_methods/search.rst

Lines changed: 21 additions & 25 deletions
@@ -11,7 +11,7 @@ After the search is completed, further actions can be performed using :ref:`sear

 .. code-block:: python

-difPy.search(difPy_obj, similarity='duplicates', lazy=True, rotate=True, processes=None, chunksize=None, show_progress=False, logs=True)
+difPy.search(difPy_obj, similarity='duplicates', same_dim=True, rotate=True, processes=None, chunksize=None, show_progress=False, logs=True)

 ``difPy.search`` supports the following parameters:

@@ -21,12 +21,12 @@ After the search is completed, further actions can be performed using :ref:`sear
 :class: tight-table

 :ref:`difPy_obj`,"``difPy_obj``",,
-:ref:`similarity`,"``str``, ``int``",``'duplicates'``, "``'similar'``, any ``int`` or ``float``"
-:ref:`lazy`,``bool``,``True``,``False``
+:ref:`similarity`,"``str``, ``int``, ``float``",``'duplicates'``, "``'similar'``, ``int`` or ``float`` >= 0"
+:ref:`same_dim`,``bool``,``True``,``False``
 :ref:`rotate`,``bool``,``True``,``False``
-:ref:`show_progress2`,``bool``,``True``,``False``
-:ref:`processes`,``int``,``None`` (``os.cpu_count()``), any ``int``
-:ref:`chunksize`,``int``,``None``, any ``int``
+:ref:`show_progress`,``bool``,``True``,``False``
+:ref:`processes`,``int``,``os.cpu_count()``, "``int`` >= 1 and <= ``os.cpu_count()``"
+:ref:`chunksize`,``int``,``None``, "``int`` >= 1"

 .. _difPy_obj:

@@ -37,7 +37,7 @@ The required ``difPy_obj`` parameter should be pointing to the ``dif`` object th

 .. _similarity:

-similarity (str, int)
+similarity (str, int, float)
 ++++++++++++

 difPy compares the images to find duplicates or similarities, based on the MSE (Mean Squared Error) between both image tensors. The target similarity rate i. e. MSE value is set with the ``similarity`` parameter.
@@ -46,34 +46,30 @@ difPy compares the images to find duplicates or similarities, based on the MSE (

 ``"similar"`` = searches for similar images. MSE threshold is set to ``5``.

-The search for similar images can be useful when searching for duplicate files that might have different file **types** (i. e. imageA.png has a duplicate imageA.jpg) and/or different file **sizes** (f. e. imageA.png (100MB) has a duplicate imageA.png (50MB)). In these cases, the MSE between the two image tensors might not be exactly == 0, hence they would not be classified as being duplicates even though in reality they are. Setting ``similarity`` to ``"similar"`` searches for duplicates with a certain tolerance, increasing the likelihood of finding duplicate images of different file types and sizes. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`lazy`).
+The search for similar images can be useful when searching for duplicate files that:

-.. figure:: docs/static/assets/choosing_similarity.png
-:width: 540
-:height: 390
-:alt: Setting the "similarity" & "lazy" Parameter
-:align: center
+* have different file **types** (f. e. imageA.png has a duplicate imageA.jpg)
+* have different file **sizes** (f. e. imageA.png (100MB) has a duplicate imageA.png (50MB))
+* are **cropped** versions of one another (f. e. imageA.png is a cropped version of imageB.png) (in this case, :ref:`same_dim` should be set to ``False``)

-Setting the "similarity" and "lazy" parameter
+In these cases, the MSE between the two image tensors might not be exactly == 0, hence they would not be classified as being duplicates even though in reality they are. Setting ``similarity`` to ``"similar"`` searches for duplicates with a certain tolerance, increasing the likelihood of finding duplicate images of different file types and sizes.

 **Manual setting**: the match MSE threshold can be adjusted manually by setting the ``similarity`` parameter to any ``int`` or ``float``. difPy will then search for images that match an MSE threshold **equal to or lower than** the one specified.

-.. _lazy:
+.. _same_dim:

-lazy (bool)
+same_dim (bool)
 ++++++++++++

-By default, difPy searches using a Lazy algorithm. This algorithm assumes that the image matches we are looking for have **the same dimensions**, i. e.duplicate images have the same width and height. If two images do not have the same dimensions, they are automatically assumed to not be duplicates. Therefore, because these images are skipped, this algorithm can provide a significant **improvement in performance**.
+By default, when searching for matches, difPy assumes images to have **the same dimensions** (width x height).

-``True`` = (default) applies the Lazy algorithm
+``True`` = (default) assumes matches have the same dimensions

-``False`` = regular algorithm is used
+``False`` = assumes matches can have different dimensions

-**When should the Lazy algorithm not be used?**
-The Lazy algorithm can speed up the comparison process significantly. Nonetheless, the algorithm might not be suited for your use case and might result in missing some matches. Depending on which ``similarity`` level is chosen, the ``lazy`` parameter should be adjusted accordingly (see :ref:`similarity`). Set ``lazy = False`` if you are searching for duplicate images with:
-
-* different **file types** (i. e. imageA.png is a duplicate of imageA.jpg)
-* and/or different **file sizes** (i. e. imageA.png (100MB) is a duplicate of imageA_compressed.png (50MB))
+.. note::
+``same_dim`` should be set to ``False`` if you are searching for image matches that have different **file types** (i. e. imageA.png is a duplicate of imageA.jpg)
+and/or if images are **cropped** versions of one another.

 .. _rotate:

@@ -102,7 +98,7 @@ chunksize (int)
 ++++++++++++

 .. warning::
-Recommended not to change default value. Only adjust this value if you know what you are doing.
+Recommended not to change default value. Only adjust this value if you know what you are doing. See :ref:`Adjusting processes and chunksize`.

 ``chunksize`` is only used when dealing with image datasets of **more than 5k images**. See the ":ref:`Using difPy with Large Datasets`" section for further details.
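
The `similarity` and `same_dim` descriptions in `search.rst` can be sketched in a few lines. The example below is only an illustration under stated assumptions, not difPy's implementation. Each image is a hypothetical `(width, height, pixels)` tuple whose `pixels` have already been scaled to a common length, and `mse`, `is_match` and `PRESETS` are made-up names. The thresholds `0` and `5` follow the documentation text for `'duplicates'` and `'similar'`.

```python
# Thresholds taken from the documentation text: 'duplicates' means an MSE
# of 0, 'similar' means an MSE of at most 5.
PRESETS = {"duplicates": 0, "similar": 5}


def mse(pixels_a, pixels_b):
    """Mean squared error between two equally long pixel sequences."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pixel sequences must have the same length")
    return sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b)) / len(pixels_a)


def is_match(image_a, image_b, similarity="duplicates", same_dim=True):
    """Illustrative comparison; each image is (width, height, pixels)."""
    # A preset name maps to its threshold; a number is used as given.
    threshold = PRESETS.get(similarity, similarity)
    if threshold < 0:
        raise ValueError("similarity must be >= 0")
    width_a, height_a, pixels_a = image_a
    width_b, height_b, pixels_b = image_b
    # Assumption: with same_dim=True, images whose original dimensions
    # differ are never reported as matches.
    if same_dim and (width_a, height_a) != (width_b, height_b):
        return False
    # A match is an MSE equal to or lower than the threshold.
    return mse(pixels_a, pixels_b) <= threshold


original = (4, 2, [10, 20, 30, 40, 50, 60, 70, 80])
recompressed = (4, 2, [11, 19, 31, 39, 51, 59, 71, 79])  # every pixel off by 1
resized = (8, 4, [10, 20, 30, 40, 50, 60, 70, 80])  # same picture, other dimensions

print(is_match(original, recompressed))                        # False
print(is_match(original, recompressed, similarity="similar"))  # True
print(is_match(original, resized))                             # False
print(is_match(original, resized, same_dim=False))             # True
```

In this toy data the recompressed image has an MSE of 1 against the original, so it only matches once a tolerance is allowed, and the resized image only matches once `same_dim` is switched off.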
