Reducing temporary allocations in CClosure #400
Conversation
Unless you have observed some convincing performance gains, I would vote against such a patch, since the code is now longer and less readable. In fact, it does not even feel correct. For instance, the old code was doing
while the new code does
So you have added an extra call to |
Well, I believe it's more uniform than the previous version and thus easier to read. It's pretty clear that 0-lifting is the identity (and it is actually implemented that way). |
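For readers outside the thread, here is a generic, hypothetical sketch of the trade-off being discussed; the names are made up and this is not the actual CClosure code.

```ocaml
(* Old style: special-case a shift of 0 so that no call is made at all. *)
let old_style liftn k v =
  if k = 0 then v else liftn k v

(* New style: always call the lifting function, relying on the fact
   that lifting by 0 is the identity.  One redundant call, but a more
   uniform shape that is arguably easier to read. *)
let new_style liftn k v = liftn k v
```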
@MatejKosik would it be possible to benchmark this PR? Thanks! |
Is the benchmark code available somewhere? |
LGTM, indeed this is more uniform. I don't believe this introduces a bug. |
@MatejKosik Is pendulum busy these days? If not, it would be good to bench this. |
@maximedenes Currently pendulum is busy testing two of Matej's PRs but there is nothing preventing to add a job to the queue so I just did it for this PR. Output here. BTW I rebased this PR before testing it. The rebase is here: https://github.com/Zimmi48/coq/commits/coq-pr-400 |
FTR Travis build of the rebase is here: https://travis-ci.org/Zimmi48/coq/builds/239057025 |
Ok, so the rebase has green lights from Travis. @silene we did not discuss this PR during the WG. Are you ok with it being merged? |
Performance test results:
|
I haven't really changed my opinion on this patch. On the one side, we manually unroll Array.map for the sake of performances, and on the other side, we explicitly add dead code for the sake of uniformity. We should really make clear what our policy is with respect to performance-critical code like CClosure. |
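As an illustration of what "manually unrolling Array.map" means here, a generic sketch with made-up names (not the actual kernel code):

```ocaml
(* Combinator version: the partial application [f k] is allocated as a
   closure and handed to Array.map. *)
let map_with_combinator f k v = Array.map (f k) v

(* "Manually unrolled" version: the loop is written by hand and [f] is
   applied to both of its arguments at once, so no intermediate closure
   needs to be allocated (assuming [f] takes two arguments). *)
let map_by_hand f k v =
  let n = Array.length v in
  if n = 0 then [||]
  else begin
    let r = Array.make n (f k v.(0)) in
    for i = 1 to n - 1 do r.(i) <- f k v.(i) done;
    r
  end
```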
Let me add that performance testing didn't show any significant gain from this PR, except on the odd-order package, and even there the improvement was not that impressive. |
I'm preaching for my own parish here, but a series of one-percent-ish gains can end up making a big difference. Small streams make big rivers! |
Maybe a way to proceed is to test this patch with Flambda. |
What should be done about this? Is someone willing to do the testing of this patch with Flambda? |
I am willing but I need to find a slot to update my oexp stuff. |
If you add a "needs: benchmark" tag, I promise that at some point I will be able to send all those PRs to my bench queue. But the current ETA is > 1 month. |
FTR, Flambda is likely to have no effect here, because this patch is only about array allocations, a matter over which Flambda does nothing. |
It seems with_stats is only used in cbv after this. |
WG discussion: we rebase, benchmark again, and if no noticeable improvement, we close. |
Please also bench with |
Force-pushed from 6d4a515 to 13b9854
In several cases, an intermediate array was generated only to be mapped over immediately. We now do everything in one pass. As a by-product, this patch also provides a little code refactoring.
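A minimal, hypothetical sketch of the kind of rewrite this commit describes (generic names, not the actual kernel code):

```ocaml
(* Before: build an intermediate array, then immediately map over it,
   so two arrays are allocated. *)
let before g f (v : 'a array) =
  let tmp = Array.map g v in
  Array.map f tmp

(* After: do everything in one pass; only the final array is allocated. *)
let after g f (v : 'a array) = Array.map (fun x -> f (g x)) v
```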
Although intended as a macro, the with_stats function was calling lazy + Lazy.force, which has a non-negligible cost in the implementation, as it allocates and goes through the write barrier. We simply expand the definition twice, breaking the abstraction but gaining in performance.
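A hypothetical sketch of the pattern described in this commit, with made-up names (the real CClosure code differs in its details):

```ocaml
let stats = ref false

(* Before: every measured call goes through a lazy thunk.  Building the
   [lazy] allocates, and forcing it goes through Lazy.force and the
   write barrier. *)
let with_stats (c : 'a Lazy.t) : 'a =
  if !stats then begin
    (* ...update timers/counters here... *)
    Lazy.force c
  end else Lazy.force c

let reduce_before reduce x = with_stats (lazy (reduce x))

(* After: the body of [with_stats] is expanded at the call site, so the
   computation is invoked directly and no thunk is allocated. *)
let reduce_after reduce x =
  if !stats then begin
    (* ...update timers/counters here... *)
    reduce x
  end else reduce x
```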
Force-pushed from 13b9854 to 9ab99e3
Bench running here: https://ci.inria.fr/coq/job/benchmark-part-of-the-branch/365/. @ejgallego How do you bench with flambda? |
I'm not sure anyone has adapted the script to do a Flambda test, but it should be very easy. Steps to build Coq with Flambda:
So indeed, someone should create a Flambda test job in Jenkins with a new field for adding custom flags. By the way, it still bothers me a lot that in our benchmarks we are testing the performance of the OCaml compiler itself a lot. IMHO |
Really? On what files? |
All. While this is a constant factor, it is quite large and may distort the relative analysis. |
Well, I didn't check what |
This is incorrect. |
Indeed, only for the standard library, |
Did you do anything special? In principle, only the stdlib should generate pre-compiled native files. Other developments can get the same behavior by passing the |
Yes never mind, I was under a different setup [that replaces the full |
I think I have already made this mistake in public something like 10 times... |
No worries. I'm aware that these flags are a bit messy. But the short story is that we are not doing anything crazy with the default settings, AFAIK. |
So, if you are curious, what I was doing was taking a library [mathcomp in this case] and compiling it as if its dependencies were calculated globally, together with the stdlib. This is in anticipation of the compositional build system PR for libraries. |
@ppedrot did you bench against flambda? |
@maximedenes Nope, I have no idea how to modify the script running on Jenkins. But the last bench showed essentially no change, I think we can close this PR anyways. |
Ok, thanks for the update, so closing. |
These are two small patches that prevent CClosure from over-allocating for nothing. They also provide a little code factorization as a by-product.
This is an innocuous, trivially sound patch that can only improve performance, and it could be merged quickly, but since it touches the kernel I prefer to advertise it here.