Unexpectedly large performance difference when using default vs. pre-sized #362

Open
@ilia-permiashkin

Description

When profiling my application, I noticed a significant performance difference between the following two usages of ObjectOpenHashSet:

    new ObjectOpenHashSet<>();

and

    new ObjectOpenHashSet<>(initialCapacity);

Unfortunately, I can't share the original code (it's from an actual application), but I can provide a minimal example:

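    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class WordsProvider {

        // Reads words.txt from the classpath, one word per line.
        public static List<String> getWords() {
            List<String> words = new ArrayList<>();

            InputStream inputStream = Thread.currentThread().getContextClassLoader().getResourceAsStream("words.txt");
            if (inputStream == null) {
                throw new RuntimeException("words.txt not found");
            }

            // try-with-resources closes the reader (and the underlying stream)
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    words.add(line);
                }
            } catch (IOException e) {
                throw new RuntimeException("Error reading words.txt", e);
            }

            return words;
        }
    }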

Case 1, using the default constructor:

    Set<String> words = new ObjectOpenHashSet<>(WordsProvider.getWords());

    // case 1: very slow, ~40 seconds
    Set<String> copy = new ObjectOpenHashSet<>();
    for (String word : words) {
        // skip null and blank entries
        if (word != null && !word.isBlank()) {
            copy.add(word);
        }
    }


Case 2, pre-sized with the number of source elements:

    Set<String> words = new ObjectOpenHashSet<>(WordsProvider.getWords());

    // case 2: very fast, ~80 ms
    Set<String> copy = new ObjectOpenHashSet<>(words.size());
    for (String word : words) {
        // skip null and blank entries
        if (word != null && !word.isBlank()) {
            copy.add(word);
        }
    }

I understand that the slowdown was ultimately caused by a bug in our own code. Still, the size of the difference (from ~40 seconds down to ~80 milliseconds) was surprising enough that it seemed worth reporting. I have attached the list of words I used to test both scenarios (8 MB, ~1.1M words):

words.txt
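
For anyone who wants to time the two cases side by side, here is a rough sketch of a harness. The class `CopyTiming` and its methods are hypothetical names made up for illustration, not part of fastutil. To keep the snippet free of external dependencies, it uses `java.util.HashSet` and a tiny stand-in word list. With fastutil on the classpath, the two suppliers can be swapped for the `ObjectOpenHashSet` constructors shown above. It needs Java 11 or later for `String.isBlank()`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

class CopyTiming {

    // Copies every non-null, non-blank word from source into a fresh set
    // obtained from the factory.
    static Set<String> copyInto(Set<String> source, Supplier<Set<String>> factory) {
        Set<String> copy = factory.get();
        for (String word : source) {
            if (word != null && !word.isBlank()) {
                copy.add(word);
            }
        }
        return copy;
    }

    // Runs one copy and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Set<String> source, Supplier<Set<String>> factory) {
        long start = System.nanoTime();
        Set<String> copy = copyInto(source, factory);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        // Touch the result so the copy is actually used.
        if (copy.size() > source.size()) {
            throw new IllegalStateException("copy is larger than its source");
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        // Stand-in data; the report used ~1.1M words loaded from words.txt.
        Set<String> words = new HashSet<>(List.of("alpha", "beta", " ", "gamma"));

        // Assumption: with fastutil available, replace these suppliers with
        // ObjectOpenHashSet::new and () -> new ObjectOpenHashSet<>(words.size()).
        long defaultMs = timeMillis(words, HashSet::new);
        long presizedMs = timeMillis(words, () -> new HashSet<>(words.size()));

        System.out.println("default:   " + defaultMs + " ms");
        System.out.println("pre-sized: " + presizedMs + " ms");
    }
}
```

A single timed run like this is only a coarse check, since it includes JIT warm-up. For a gap as large as 40 seconds versus 80 ms that should be enough to see the effect, but a proper benchmark harness would be needed for finer comparisons.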
