Reimplementation of DifPy for performance #113

Open
AliSot2000 opened this issue Dec 19, 2024 · 2 comments
Labels
comment/feedback A comment or feedback about difPy.

Comments

@AliSot2000

Hi 👋

I've been running into issues with your implementation. Unfortunately, I have a very large image dataset that I want to deduplicate, and difPy hits its limits in both memory footprint and the size of the JSON files it produces.

I've implemented the same basic idea, using the MSE (mean squared error) to compare compressed image tensors, but with a full focus on performance, using every trick I know to make the computation faster. To accommodate larger datasets, I've implemented a file-system cache, I use SQLite as the backend to store the file indexes and the pairs of diffs, and I've added checkpoints so that the computation can be interrupted on extremely large datasets.
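For readers unfamiliar with the approach, the core comparison can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of difPy or FastDiffPy: the function names `downscale` and `mse`, the block-averaging used as a stand-in for real image resizing, and the 4×4 target size are all hypothetical.

```python
import numpy as np


def downscale(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Shrink an image to size x size by block averaging.

    Hypothetical stand-in for a real resize step; assumes the image is at
    least `size` pixels in each dimension.
    """
    h, w = img.shape[:2]
    # Crop so that both dimensions split evenly into `size` blocks.
    h_crop, w_crop = h - h % size, w - w % size
    cropped = img[:h_crop, :w_crop].astype(np.float64)
    # Group pixels into blocks; the trailing -1 keeps the channel axis
    # (it becomes 1 for grayscale input).
    blocks = cropped.reshape(size, h_crop // size, size, w_crop // size, -1)
    return blocks.mean(axis=(1, 3))


def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally shaped arrays."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.integers(0, 200, size=(64, 64, 3))
    brighter = original + 10  # same picture, uniformly brighter
    print(mse(downscale(original, 4), downscale(brighter, 4)))
```

Two images would then be treated as duplicates when the MSE of their downscaled tensors falls below a chosen threshold; the threshold itself is a tuning parameter and is not specified in this thread.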

Your difPy is already tied into frontends, and I'm not entirely sure whether it would be worth the effort to convert everything to this new implementation. Additionally, the way FastDiffPy is written at the moment, it's more of a framework than a script like this repo. Output to results.json and lower_quality.txt is implemented, but I think the frontends would benefit from using the DB in their backend instead of those files.
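To make the last point concrete, here is a minimal sketch of how a frontend might query an SQLite database for duplicate pairs instead of parsing a JSON file. The schema (tables `files` and `diffs`, their columns) and the helper names are purely hypothetical and are not FastDiffPy's actual layout; only the standard-library `sqlite3` module is used.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical files/diffs schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE files (
            id   INTEGER PRIMARY KEY,
            path TEXT NOT NULL UNIQUE
        );
        CREATE TABLE diffs (
            file_a INTEGER NOT NULL REFERENCES files(id),
            file_b INTEGER NOT NULL REFERENCES files(id),
            mse    REAL NOT NULL,
            PRIMARY KEY (file_a, file_b)
        );
        """
    )
    conn.executemany(
        "INSERT INTO files (id, path) VALUES (?, ?)",
        [(1, "a.jpg"), (2, "b.jpg"), (3, "c.jpg")],
    )
    conn.executemany(
        "INSERT INTO diffs (file_a, file_b, mse) VALUES (?, ?, ?)",
        [(1, 2, 0.0), (1, 3, 512.5), (2, 3, 498.0)],
    )
    conn.commit()
    return conn


def duplicate_pairs(conn: sqlite3.Connection, threshold: float):
    """Return (path_a, path_b, mse) rows whose MSE is at or below threshold."""
    cursor = conn.execute(
        """
        SELECT fa.path, fb.path, d.mse
        FROM diffs AS d
        JOIN files AS fa ON fa.id = d.file_a
        JOIN files AS fb ON fb.id = d.file_b
        WHERE d.mse <= ?
        ORDER BY d.mse
        """,
        (threshold,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for row in duplicate_pairs(build_demo_db(), threshold=200.0):
        print(row)
```

With a layout along these lines, a frontend could page through results or change the threshold with a new query, rather than loading one large result file into memory.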

This is FastDiffPy at the moment.

Best

AliSot2000

@elisemercury elisemercury added the comment/feedback A comment or feedback about difPy. label Dec 19, 2024
@elisemercury
Owner

Hi @AliSot2000,

Great project, congrats! Thanks for sharing it with us - it's always great to see when adaptations or improvements of difPy are developed :) Looking forward to seeing the benchmark results!

Best
Elise

@AliSot2000
Author

Hi

The first benchmark results are now published.
