Skip to content

Clustering performance improvements #5319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

KristinnVikar
Copy link
Contributor

@KristinnVikar KristinnVikar commented Jun 21, 2024

Proposed changes

Reduce clustering from O(n^2) where n is the number of total templates loaded, to O(n).

On my machine, this reduces total nuclei startup time from 19s -> 8.5s (2.23x faster!) (Edit: re-did timing tests)

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

@GeorginaReeder
Copy link

Thanks so much for your contribution @KristinnVikar , we appreciate it!

We also have a Discord server, which you’re more than welcome to join. It's a great place to connect with fellow contributors and stay updated with the latest developments!

@KristinnVikar
Copy link
Contributor Author

Using a HashMap based clustering reduces clustering time even further!

However I'm getting a slight regression with the change, where I get "Templates clustered: 1544" but without I get "Templates clustered: 1550", some 6 templates that no longer want to cluster together.

I did some further measurements, originally clustering took 14000ms (single threaded bottleneck), now it takes ~18ms, about 777x faster

@KristinnVikar KristinnVikar marked this pull request as ready for review June 24, 2024 13:04
@ehsandeep ehsandeep linked an issue Jun 24, 2024 that may be closed by this pull request
Copy link
Member

@tarunKoyalwar tarunKoyalwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm !

nice work @KristinnVikar , this reduced clustering time by lot , by adding time.Now() to benchmark that particular function on my setup i found

Before

[FTL] Clustering took 3.30753675s: total: 6675

After

[FTL] Clustering took 8.236709ms: total: 6679

that's almost 402x faster !!

Note

we can see there are more clusters compared to dev/latest but that's ok and better and i just added more clustering conditions to ssl to avoid issues

@ehsandeep ehsandeep merged commit 381ebba into projectdiscovery:dev Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clustering performance improvements
5 participants