Unexpected behavior for set generator #25

Open
mak-dunkelziffer opened this issue Nov 27, 2024 · 2 comments

Comments

@mak-dunkelziffer
Contributor

The set generator behaves very naively and therefore creates an unexpected distribution of test cases.

Example 1:

items = [1, 2, 3, 4, 5]
G = PropCheck::Generators
G.set(G.one_of(*items.map { G.constant(_1) })).map(&:to_a)

This example creates very few small examples, misses plenty of cases (with n = 100), and very often produces a Set containing all 5 items.

Example 2:

items = [1, 2, 3, 4, 5]
G = PropCheck::Generators
G.set(G.one_of(*items.map { G.constant(_1) }), max: items.size).map(&:to_a)

Adding a size limitation makes the generator a bit smarter. The distribution is now better, but still skewed towards bigger sets. I think this skew is reasonable for larger input lists, so I'm not sure I'd call it a bug. I had cases where some sets were not created at all. I'm not sure whether that is expected behaviour when the number of possible values (2^5 = 32) is smaller than the number of runs (n = 100). But this feels acceptable and gets compensated for if you run the test multiple times.

Both cases seem to stem from missing deduplication. The set generator very naively uses the array generator. For arrays, order matters; for sets, it doesn't. IMO the set generator would need to deduplicate here to avoid creating equivalent sets all the time. A small illustration of the resulting skew follows below.
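
To see where that skew would come from if sets were simply converted from generated arrays, here is a plain-Ruby count (independent of PropCheck): among all ordered length-2 arrays over the items, every 2-element set is produced twice as often as every 1-element set, because two orderings collapse into the same set.

require 'set'

items = [1, 2, 3, 4, 5]

# Count how often each Set is produced when every ordered length-2 array
# over `items` is converted to a Set.
counts = Hash.new(0)
items.product(items).each { |pair| counts[Set.new(pair)] += 1 }

counts.select { |set, _| set.size == 1 }.values.uniq  # => [1]  each singleton comes from one array
counts.select { |set, _| set.size == 2 }.values.uniq  # => [2]  each pair comes from two orderings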

@Qqwy
Owner

Qqwy commented Mar 14, 2025

Thank you for your feedback! You are absolutely right. Looking at libraries in other languages, such as Elixir's StreamData or Haskell's Hedgehog, it is clear that these indeed try very hard to generate unique values instead of generating an array and dropping duplicates.

A straightforward approach to fixing this is to do what Hedgehog does: have set defer to hash_of (with nil as values) and make hash_of more clever about generating unique keys.
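
As a rough sketch of what that deferral could look like (assuming hash_of takes a key generator and a value generator, as suggested above; the cleverness about generating unique keys is left out here):

require 'set'

items = [1, 2, 3, 4, 5]
G = PropCheck::Generators

# Sketch: a set generator expressed via hash_of with nil values,
# keeping only the (necessarily unique) keys of the generated Hash.
element = G.one_of(*items.map { G.constant(_1) })
set_gen = G.hash_of(element, G.constant(nil)).map { |hash| Set.new(hash.keys) }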

@Qqwy
Owner

Qqwy commented Mar 14, 2025

Actually, looking at the implementation in detail again (it has been a while), we already generate unique elements rather than deduplicating non-unique ones, cf. Generators::make_array_uniq. This helper function is what ends up being called when the uniq: kwarg is set, including by both set and hash_of.

Maybe we can play with the size value that is passed on to the internal element generation?
For size n, we generate collections with a randomly chosen 0..n number of elements (unless further customized with the min, max or length parameters). If after many attempts (by default 100, customisable) at generating another unique element we still don't have enough, we give up and return the smaller collection. That is clearly what is happening in your first example, and it is also why your second example works somewhat better.
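
A quick plain-Ruby model of that behaviour (the numbers and names here are illustrative, not the library's actual code): uniformly pick a target size in 0..n, collect unique elements until the target is reached or the retry budget runs out, and tally the resulting set sizes.

require 'set'

ITEMS = [1, 2, 3, 4, 5]

# Model: uniformly pick a target size in 0..max_size, then collect unique
# elements, giving up after `max_tries` consecutive duplicates.
def simulated_set_size(max_size, max_tries: 100)
  target = rand(0..max_size)
  seen = Set.new
  tries = 0
  while seen.size < target && tries < max_tries
    if seen.add?(ITEMS.sample)  # Set#add? returns nil for an already-present element
      tries = 0
    else
      tries += 1
    end
  end
  seen.size
end

Array.new(100) { simulated_set_size(100) }.tally
# with a large size, almost every run maxes out at 5 elements, as in Example 1
Array.new(100) { simulated_set_size(ITEMS.size) }.tally
# with max: items.size, the sizes 0..5 all show up, as in Example 2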

Or maybe an even better alternative would be to introduce a some_of generator that combines G.set(G.one_of(...)) into a single generator that therefore knows exactly how many different kinds of unique elements can be generated and can thus provide better defaults. 🤔
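
One possible shape for such a generator, purely illustrative (some_of is a proposal here, not an existing PropCheck generator, and a real implementation would have to hook into sizing and shrinking):

require 'set'

# Since the full candidate list is known up front, a subset can be drawn
# directly: pick a size uniformly, then sample without replacement.
def some_of(items)
  size = rand(0..items.length)   # every subset size 0..n is equally likely
  Set.new(items.sample(size))    # sampling without replacement: no duplicates
end

some_of([1, 2, 3, 4, 5])  # e.g. => #<Set: {3, 1}>, #<Set: {}>, or the full set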
