
further work on making Inliner faster

5 replies
Miguel Garcia 2

After reviewing the control dependences of the inlining algorithm, I've come up with an improved version that skips work once it's known for a callsite that no inlining will be made.

The commits at
  https://github.com/magarciaEPFL/scala/commits/fasterTypeFlow
revolve around how the type-flow is computed, because that's the earliest point at which we can know that some inlining decision is bound to be negative.

Those willing to review or test it are welcome to provide feedback.

The test suite passes, and the same inlining decisions are made as before (by comparing -Ylog:inliner output, with a focus on lines starting with "[log inliner] Inlining"). There's background info on the Inliner [1] and on dataflow-analyses [2].

As to the actual improvement, we're talking of a 40% reduction in inlining time (as measured when compiling the compiler).


Miguel
http://lampwww.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/


[1] http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/2011Q4/Inliner.pdf

[2] http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/2011Q4/DFAICode.pdf

ijuma
Re: further work on making Inliner faster
On Mon, Jan 30, 2012 at 5:30 PM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
As to the actual improvement, we're talking of a 40% reduction in inlining time (as measured when compiling the compiler).

Soon inlining won't take any time at all. ;) Great stuff!
Best,
Ismael
Iulian Dragos
Re: further work on making Inliner faster
Hi Miguel,
I looked at the code and commented on github in a couple of places. In short, I am not convinced that skipping basic blocks during data-flow analysis gives the same results as before, in all cases. Imagine that the skipped basic block instantiates a closure, and a successor calls Function1.apply. Without analyzing the 'new' call, the information that the concrete class is in fact known will not be propagated to the call site. The fact that tests pass is compelling, but I'm trying to guess your reasoning. Maybe you should document that (either in the code or in a commit message).
cheers,
iulian

--
« Je déteste la montagne, ça cache le paysage » ("I hate the mountains, they hide the view")
Alphonse Allais
Miguel Garcia 2
Re: further work on making Inliner faster
Iulian,

Thanks for reviewing the code. What follows is a justification for the shortcuts the new algorithm takes, by comparison with what the standard version does.

The standard version computes a type-flow for the method being analyzed, i.e. in- and out-flows are computed for all basic blocks, thus making sure that by the time a CALL_METHOD instruction is analyzed its abstracted type-stack-slot is known. The new version also performs that analysis on CALL_METHOD candidates, with a difference:

  (a) an early check is performed while the type-flow is being computed (in an override of `mutatingInterpret`) testing a subset of the conditions that the standard version checks later. The reasoning here is: if the early check fails during some iteration, there's no chance a follow-up iteration (with a yet more abstract type-stack-slot) will succeed. Failure is sufficient to remove that particular CALL_METHOD from the typeflow's `remainingCALLs`.

  (b) in case the early check does not fail, no conclusive decision can be made, so everything goes on as in the standard version.
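The monotonicity argument in (a) can be captured in a toy model. The sketch below uses hypothetical names (`CallSite`, `earlyCheck`, `prune`); the actual compiler types and checks differ:

```scala
// Minimal model of early pruning during a dataflow pass. All names here are
// illustrative, not the actual compiler API.
final case class CallSite(id: Int, receiverType: String)

// Toy two-level lattice: a receiver slot is either "Concrete" or "Abstract".
// Across worklist iterations a slot can only become *more* abstract.
def earlyCheck(cs: CallSite): Boolean = cs.receiverType == "Concrete"

// One iteration of pruning: a callsite that fails the check now will also
// fail on every later (more abstract) iteration, so dropping it is final.
def prune(remainingCALLs: Set[CallSite]): Set[CallSite] =
  remainingCALLs.filter(earlyCheck)
```

The point is only that failure is monotone: once a candidate is removed, no subsequent widening of the abstract state can bring it back.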

In general, the contents of `remainingCALLs` shrink while the typeflow is being computed (possibly down to the empty set). How was it populated in the first place? That happens before the typeflow proper starts (in method `populateCALLs()`). As can be seen, `remainingCALLs` initially contains all the CALL_METHOD instructions that the standard inliner would consider for inlining (and only those).

In other words, `remainingCALLs` tracks at any time those (callsite, basic block) pairs that still remain candidates. Knowing the containing basic block of each candidate callsite enables a further optimization: skipping those basic blocks whose in-flow and out-flow aren't needed anyway (as explained next). A basic block with no callsite in `remainingCALLs`, when visited by the standard algorithm, will not result in any inlining. But as we know from the way a type-flow is computed, the in- and out-flow of a basic block rely on those of other basic blocks. How to keep those, while discarding the rest?

In detail, we want to focus on the sub-graph of the CFG from which control flow may reach a remaining candidate callsite. Basic blocks not in that subgraph can be skipped altogether (that's why `forwardAnalysis()` in `MethodTFA` now checks for inclusion of a basic block in `relevantBBs` before adding it to the worklist, or before considering it as a useful successor). Actually, keeping `relevantBBs` up to date (denoting the subgraph of the CFG of interest) isn't that compute-intensive: we know all basic blocks represented in `remainingCALLs` belong to that subgraph, and a sure way for the typeflow analysis to reach them is adding all of their predecessors, transitively. That's what `refresh()` is responsible for, and why it's invoked periodically (once per iteration of the worklist) after `remainingCALLs` shrinks.
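The transitive-predecessor closure can be sketched as a small fixpoint over the CFG's predecessor map. This is a simplified stand-in with hypothetical names; the real `refresh()` works over the compiler's basic-block types:

```scala
// Sketch: given the CFG's predecessor map and the set of blocks that still
// hold candidate callsites, the relevant sub-graph is that set plus all of
// its predecessors, transitively. Names are illustrative.
def relevantBlocks(preds: Map[Int, Set[Int]], candidateBlocks: Set[Int]): Set[Int] = {
  var relevant = candidateBlocks
  var frontier = candidateBlocks
  while (frontier.nonEmpty) {
    // Predecessors not seen yet become the next frontier.
    val newPreds = frontier.flatMap(b => preds.getOrElse(b, Set.empty)) -- relevant
    relevant ++= newPreds
    frontier = newPreds
  }
  relevant
}
```

By construction the result is closed under predecessors, which is what lets the worklist safely ignore every block outside it.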

One ends up overriding most methods of the dataflow analysis, but as you can see from the above, the added steps all revolve around pruning `remainingCALLs`, which in turn allows pruning `relevantBBs` and thus cutting down on the number of iterations. The rest of the story takes place in the inliner proper, which also focuses not on all of the method's basic blocks but only on those represented in `remainingCALLs`.


Miguel
http://lampwww.epfl.ch/~magarcia/ScalaCompilerCornerReloaded
Iulian Dragos
Re: further work on making Inliner faster


On Tue, Jan 31, 2012 at 6:32 PM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
Iulian,

Thanks for reviewing the code. What follows is a justification for the shortcuts the new algorithm takes, by comparison with what the standard version does.

The standard version computes a type-flow for the method being analyzed, i.e. in- and out-flows are computed for all basic blocks, thus making sure that by the time a CALL_METHOD instruction is analyzed its abstracted type-stack-slot is known. The new version also performs that analysis on CALL_METHOD candidates, with a difference:

  (a) an early check is performed while the type-flow is being computed (in an override of `mutatingInterpret`) testing a subset of the conditions that the standard version checks later. The reasoning here is: if the early check fails during some iteration, there's no chance a follow-up iteration (with a yet more abstract type-stack-slot) will succeed. Failure is sufficient to remove that particular CALL_METHOD from the typeflow's `remainingCALLs`.

This sounds good. 

  (b) in case the early check does not fail, no conclusive decision can be made, so everything goes on as in the standard version.

In general, the contents of `remainingCALLs` shrink while the typeflow is being computed (possibly down to the empty set). How was it populated in the first place? That happens before the typeflow proper starts (in method `populateCALLs()`). As can be seen, `remainingCALLs` initially contains all the CALL_METHOD instructions that the standard inliner would consider for inlining (and only those).

In other words, `remainingCALLs` tracks at any time those (callsite, basic block) pairs that still remain candidates. Knowing the containing basic block of each candidate callsite enables a further optimization: skipping those basic blocks whose in-flow and out-flow aren't needed anyway (as explained next). A basic block with no callsite in `remainingCALLs`, when visited by the standard algorithm, will not result in any inlining. But as we know from the way a type-flow is computed, the in- and out-flow of a basic block rely on those of other basic blocks. How to keep those, while discarding the rest?

In detail, we want to focus on the sub-graph of the CFG from which control flow may reach a remaining candidate callsite. Basic blocks not in that subgraph can be skipped altogether (that's why `forwardAnalysis()` in `MethodTFA` now checks for inclusion of a basic block in `relevantBBs` before adding it to the worklist, or before considering it as a useful successor).

This is the point I was wondering about. Not visiting those basic blocks (with updated information at their entry, from the relevant BBs) may change the overall results of the analysis. Imagine a graph like
-> R1 -> NR -> R2 ->
where R1 and R2 are relevant and NR is not. Can you be sure that you can skip the re-analysis of NR and still have correct solutions at the entry of R2 at the end of the analysis?
On a totally different note: some time ago I started working on a version of type-flow analysis that wouldn't visit every instruction on each iteration, but would instead build a transfer function for each basic block and use it as an abstraction of the block's effects (more or less what all the other data-flow analyses do). That should also speed up the analysis considerably. It may be interesting to follow up.
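The block-summarization idea could be sketched roughly as follows, using a toy abstract state and hypothetical names (nothing here is from the actual analysis): each instruction's effect is a function on the abstract state, and a block's transfer function is their composition, computed once and then applied per iteration:

```scala
// Toy abstract state: local variable name -> inferred type name.
type State = Map[String, String]

// Summarize a (non-empty) block's per-instruction effects into one composed
// transfer function; built once, applied on every dataflow iteration.
def summarize(instrEffects: List[State => State]): State => State =
  instrEffects.reduceLeft(_ andThen _)

// Hypothetical block: stores an Int into x, then copies x into y.
val blockEffect = summarize(List[State => State](
  s => s + ("x" -> "Int"),   // e.g. CONSTANT(1); STORE_LOCAL(x)
  s => s + ("y" -> s("x"))   // e.g. LOAD_LOCAL(x); STORE_LOCAL(y)
))
```

The per-block summary replaces the instruction-by-instruction reinterpretation that the current analysis performs on each visit.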
iulian
 



--
« Je déteste la montagne, ça cache le paysage » ("I hate the mountains, they hide the view")
Alphonse Allais
Miguel Garcia 2
Re: further work on making Inliner faster
Iulian,

The subset of the CFG being considered is always internally connected. This is a consequence of the way relevant blocks are determined (by reachability over a linearization from some start block). The iterations performed over those relevant basic blocks may prune that set, but they always maintain the property of internal connectedness (in other words, only blocks on the perimeter can become non-relevant). That's why the scenario you describe can't happen: in the subgraph R1 -> NR -> R2, either both NR and R2 are relevant or neither is.
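The invariant can be phrased as closure under predecessors: every predecessor of a relevant block is itself relevant. A small sketch (hypothetical names, toy integer block ids) of the property that rules out the R1 -> NR -> R2 scenario:

```scala
// Sketch: a relevant set is valid only if it is closed under predecessors.
// If R2 stayed relevant while its predecessor NR was pruned, this check
// would fail, so that configuration can never be produced. Names are
// illustrative, not the compiler's API.
def closedUnderPreds(preds: Map[Int, Set[Int]], relevant: Set[Int]): Boolean =
  relevant.forall(b => preds.getOrElse(b, Set.empty).subsetOf(relevant))
```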

I've added documentation (100 lines of comments) about on-the-fly pruning at:

https://github.com/magarciaEPFL/scala/blob/c1387a1e6366f7655a7f8f351a005c20597a16ad/src/compiler/scala/tools/nsc/backend/icode/analysis/TypeFlowAnalysis.scala

I agree that summarizing the typeflow of a block in a single step would most likely reduce the cost of a dataflow iteration. I'm not sure whether that would outperform on-the-fly pruning; the two approaches might even be complementary.

Miguel
http://lampwww.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/


Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland