Skip to content

Reducing temporary allocations in CClosure #400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

ppedrot
Copy link
Member

@ppedrot ppedrot commented Jan 10, 2017

This is two small patches that prevent CClosure from overallocating for nothing. It also provides a little code factorization as a byproduct.

This is an innocuous, trivially sound patch that can only enhance performances, and it could be merged quickly, but this touches the kernel so that I prefer to advertise it here.

@silene
Copy link
Contributor

silene commented Jan 20, 2017

Unless you have experienced some convincing performance gains, I would vote against such a patch, since the code is now longer and less readable. In fact, it does not even feel correct. For instance, the old code was doing

Array.map (fun b -> constr_fun lfts (mk_clos env b)) ve)

while the new code does

let lfts = el_liftn n lfts in
for i = 0 to len - 1 do
  let c = constr_fun lfts (mk_clos env (Array.unsafe_get v i)) in

So you have added an extra call to el_liftn. Are you fixing a bug there or are you introducing one? My guess is that it is just dead code, but the simple fact I had to think a few minutes about it means that the patch is not as trivial as you think.

@ppedrot
Copy link
Member Author

ppedrot commented Jan 21, 2017

Well, I believe it's more uniform than the previous version and thus easier to read. It's pretty clear that 0-lifting is identity (and it is actually implemented like so).

@maximedenes maximedenes reopened this Mar 23, 2017
@maximedenes
Copy link
Member

@MatejKosik would it be possible to benchmark this PR? Thanks!

@ejgallego
Copy link
Member

@MatejKosik would it be possible to benchmark this PR? Thanks!

Is the benchmark code available somewhere?

@mattam82
Copy link
Member

LGTM, indeed this is more uniform. I don't believe this introduces a bug.

@maximedenes maximedenes added the needs: testing Some manual testing is required. label Apr 6, 2017
@ejgallego ejgallego added this to the 8.7 milestone May 30, 2017
@maximedenes
Copy link
Member

@MatejKosik Is pendulum busy these days? If not, it would be good to bench this.

@Zimmi48
Copy link
Member

Zimmi48 commented Jun 3, 2017

@maximedenes Currently pendulum is busy testing two of Matej's PRs but there is nothing preventing to add a job to the queue so I just did it for this PR. Output here.

BTW I rebased this PR before testing it. The rebase is here: https://github.com/Zimmi48/coq/commits/coq-pr-400

@Zimmi48
Copy link
Member

Zimmi48 commented Jun 3, 2017

FTR Travis build of the rebase is here: https://travis-ci.org/Zimmi48/coq/builds/239057025

@maximedenes
Copy link
Member

Ok, so the rebase has green lights from Travis. @silene we did not discuss this PR during the WG. Are you ok with it being merged?

@Zimmi48
Copy link
Member

Zimmi48 commented Jun 5, 2017

Performance tests results:

┌────────────────────────┬─────────────────────────┬─────────────────────────────────────┬───────────────────────────────────────┐
│                        │      user time [s]      │             CPU cycles              │           CPU instructions            │
│                        │                         │                                     │                                       │
│           package_name │     NEW     OLD PDIFF   │           NEW           OLD PDIFF   │            NEW            OLD PDIFF   │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│ coq-mathcomp-odd_order │ 1390.96 1424.40 -2.35 % │ 3876607241109 3968322370870 -2.31 % │  6839747535598  6840977143926 -0.02 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│    coq-formal-topology │   37.35   37.60 -0.66 % │  100656916429  101088340716 -0.43 % │   137680031573   137853934475 -0.13 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│        coq-fiat-crypto │ 2217.71 2229.85 -0.54 % │ 6133902887340 6171106120737 -0.60 % │ 10254937269762 10272281567972 -0.17 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│   coq-mathcomp-algebra │  170.87  171.77 -0.52 % │  474394805837  474875042399 -0.10 % │   675769840774   675561303926 +0.03 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│            coq-unimath │ 1170.18 1176.22 -0.51 % │ 3245115734046 3261889179111 -0.51 % │  5582539048861  5583189132346 -0.01 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│              coq-flocq │   55.87   56.12 -0.45 % │  152653335193  152641384865 +0.01 % │   197337002483   197628133865 -0.15 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│                coq-vst │ 1301.06 1306.63 -0.43 % │ 3613238693312 3627097666422 -0.38 % │  5362991032860  5369039459559 -0.11 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│                 coq-sf │   45.10   45.26 -0.35 % │  123378591288  123663129618 -0.23 % │   160985958973   161221537409 -0.15 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│         coq-verdi-raft │ 1338.16 1341.52 -0.25 % │ 3724040820974 3733381084045 -0.25 % │  5333506691093  5335819863519 -0.04 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│             coq-geocoq │ 2164.31 2169.50 -0.24 % │ 6028298728396 6036283823845 -0.13 % │ 10087675878596 10088417060853 -0.01 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│ coq-mathcomp-character │  270.16  270.79 -0.23 % │  751805542452  752843504102 -0.14 % │  1111089710144  1110848350735 +0.02 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│  coq-mathcomp-solvable │  192.30  192.74 -0.23 % │  534171335104  535397642274 -0.23 % │   778992167653   780854297452 -0.24 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│     coq-mathcomp-field │  457.52  458.41 -0.19 % │ 1273517338036 1275788784196 -0.18 % │  2118122081507  2119269974571 -0.05 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│         coq-coquelicot │   72.46   72.58 -0.17 % │  198967708760  199272777980 -0.15 % │   251338781682   251662316588 -0.13 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│  coq-mathcomp-fingroup │   58.71   58.76 -0.09 % │  161734803584  161978440323 -0.15 % │   227516002758   227847612396 -0.15 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│        coq-bedrock-src │  907.12  907.49 -0.04 % │ 2514803116765 2513940822319 +0.03 % │  3219419059616  3220963613322 -0.05 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│ coq-mathcomp-ssreflect │   43.32   43.32 +0.00 % │  118493297825  118623145769 -0.11 % │   152753939753   152893767853 -0.09 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│              coq-color │  582.26  582.08 +0.03 % │ 1616616108943 1614210778297 +0.15 % │  2032556027306  2033485473319 -0.05 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│           coq-compcert │  811.15  810.33 +0.10 % │ 2253496654993 2249193905595 +0.19 % │  3403847313539  3402517321499 +0.04 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│     coq-bedrock-facade │  930.37  929.34 +0.11 % │ 2579269756407 2575236405768 +0.16 % │  3297089091577  3298842016965 -0.05 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│          coq-fiat-core │  128.60  128.44 +0.12 % │  336636892318  336218577362 +0.12 % │   439536975096   439890592154 -0.08 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│               coq-iris │  359.70  358.45 +0.35 % │  997073711057  996837652523 +0.02 % │  1302898066385  1303067636047 -0.01 % │
├────────────────────────┼─────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────┤
│       coq-fiat-parsers │  747.78  745.18 +0.35 % │ 2053951746940 2054420075198 -0.02 % │  3260070982515  3260448939423 -0.01 % │
└────────────────────────┴─────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────┘


PDIFF = proportional difference between measurements done for the NEW and the OLD Coq version
      = (NEW_measurement - OLD_measurement) / OLD_measurement * 100%

NEW = b19411a1ce5d834ca705a5ba7e5f1bf6c0ea207d
OLD = a2a98a4015311af83edcf8fc87aa30a5318bead8

@silene
Copy link
Contributor

silene commented Jun 6, 2017

I haven't really changed my opinion on this patch. On the one side, we manually unroll Array.map for the sake of performances, and on the other side, we explicitly add dead code for the sake of uniformity. We should really make clear what our policy is with respect to performance-critical code like CClosure.

@maximedenes maximedenes added the needs: discussion Further discussion is needed. label Jun 6, 2017
@Zimmi48
Copy link
Member

Zimmi48 commented Jun 6, 2017

Unless you have experienced some convincing performance gains

Let's add that performance testing didn't show any significant gain brought by the PR, except on the odd order package, and even there not so impressive.

@ppedrot
Copy link
Member Author

ppedrot commented Jun 6, 2017

even there not so impressive

I'm advocating for my chapel, but a series of percentish gains can end up making a big difference. Small streams make up big rivers!

@ejgallego
Copy link
Member

Maybe a way to proceed is to test this patch with -flambda and see if it makes any performance gain there?

@maximedenes maximedenes modified the milestones: 8.8, 8.7 Jun 9, 2017
@Zimmi48 Zimmi48 added the needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. label Oct 3, 2017
@maximedenes maximedenes changed the base branch from trunk to master November 7, 2017 08:22
@Zimmi48 Zimmi48 added kind: performance Improvements to performance and efficiency. and removed needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. labels Nov 29, 2017
@Zimmi48
Copy link
Member

Zimmi48 commented Nov 29, 2017

What should be done about this? Is someone willing to do the testing of this patch with Flambda?

@ejgallego
Copy link
Member

I am willing but I need to find a slot to update my oexp stuff.

@ejgallego
Copy link
Member

If you add a tag for "needs: benchmark" , I promise at some point I would be able to send all those PRs to my bench queue. But current ETA is > 1 month.

@ppedrot
Copy link
Member Author

ppedrot commented Nov 29, 2017

FTR, Flambda is likely to have no effect here, because this patch is only about array allocations, a matter over which Flambda does nothing.

@SkySkimmer SkySkimmer added the needs: benchmarking Performance testing is required. label Jan 15, 2018
@SkySkimmer
Copy link
Contributor

It seems with_stats is only used in cbv after this.
Does it even make sense there?

@maximedenes
Copy link
Member

WG discussion: we rebase, benchmark again, and if no noticeable improvement, we close.

@Zimmi48 Zimmi48 added needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. and removed needs: discussion Further discussion is needed. needs: testing Some manual testing is required. labels Feb 26, 2018
@ejgallego
Copy link
Member

Please also bench with flambda.

ppedrot added 2 commits March 6, 2018 11:37
In several cases, an intermediary array was generated to be mapped over
immediately. We now do everything in one pass. As a by-product, this
patch also provides a little code refactoring.
While thought as a macro, the with_stats function was calling lazy + Lazy.force,
which has a remarkable cost in the implementation, as it allocates and breaches
the write barrier. We simply expand the definition twice, breaking the
abstraction but gaining in performance.
@ppedrot ppedrot force-pushed the less-closure-alloc branch from 13b9854 to 9ab99e3 Compare March 6, 2018 14:38
@ppedrot
Copy link
Member Author

ppedrot commented Mar 6, 2018

Bench running here: https://ci.inria.fr/coq/job/benchmark-part-of-the-branch/365/.

@ejgallego How do you bench with flambda?

@ejgallego
Copy link
Member

ejgallego commented Mar 6, 2018

@ejgallego How do you bench with flambda?

I'm not sure anyone did adapt the script to do an flambda test, but it should be very easy. Steps to build coq with flambda:

  • install the 4.06.1+flambda switch
  • configure with -native-compiler no [this won't be needed in 4.07] Pass -flambda-specific options to configure if desired [default options are not a big gain]
  • ?
  • profit!

So indeed, someone should create a flambda test job in jenkins with a new field to add custom flags.

By the way it still bothers me a lot that in our benchmarks we are testing the performance of the ocaml compiler itself a lot. IMHO -native-compiler should be always off.

@maximedenes
Copy link
Member

By the way it still bothers me a lot that in our benchmarks we are testing the performance of the ocaml compiler itself a lot.

Really? On what files?

@ejgallego
Copy link
Member

Really? On what files?

All .v files do indeed generate the native version so a significant amount of time of every build is spent on ocamlc.

While this is a constant factor, it is quite large and may distort relative analysis.

@ejgallego
Copy link
Member

All .v files do indeed generate the native version so a significant amount of time of every build is spent on ocamlc.

Well, I didn't check what opam does tho, I am talking from my setup so I could be wrong 100%.

@maximedenes
Copy link
Member

All .v files do indeed generate the native version

This is incorrect.

@ejgallego
Copy link
Member

ejgallego commented Mar 6, 2018

This is incorrect.

Indeed, only for the standard library, coq_makefile uses a different set of COQOPTS [sight]

@maximedenes
Copy link
Member

I am talking from my setup

Did you do anything special? In principle, only the stdlib should generate pre-compiled native files. Other developments can get the same behavior by passing the -native-compiler flag to coqc. If they don't, compilation is done on demand when native_compute is called.

@ejgallego
Copy link
Member

Did you do anything special?

Yes never mind, I was under a different setup [that replaces the full .vo build system] and got confused [and I always use -native-compiler no anyways].

@ejgallego
Copy link
Member

I think I have made this mistake in public like already 10 times....

@maximedenes
Copy link
Member

I think I have made this mistake in public like already 10 times....

No worries. I'm aware that these flags are a bit messy. But the short story is we are not doing anything crazy, with the default settings, AFAIK.

@ejgallego
Copy link
Member

So if you are curious what I was doing by the way was putting a library [mathcomp in this case] and I was compiling it as if the dependencies where calculated globally together with the stdlib. This is in anticipation of the compositional build system PR for libraries.

@maximedenes maximedenes removed the needs: rebase Should be rebased on the latest master to solve conflicts or have a newer CI run. label Mar 8, 2018
@maximedenes
Copy link
Member

@ppedrot did you bench against flambda?

@ppedrot
Copy link
Member Author

ppedrot commented Mar 8, 2018

@maximedenes Nope, I have no idea how to modify the script running on Jenkins. But the last bench showed essentially no change, I think we can close this PR anyways.

@maximedenes
Copy link
Member

maximedenes commented Mar 8, 2018

Ok, thanks for the update, so closing.

@maximedenes maximedenes closed this Mar 8, 2018
@Zimmi48 Zimmi48 removed this from the 8.8+beta1 milestone Mar 8, 2018
@Zimmi48 Zimmi48 removed the needs: benchmarking Performance testing is required. label Mar 8, 2018
@ppedrot ppedrot deleted the less-closure-alloc branch July 27, 2018 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: performance Improvements to performance and efficiency.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants