might be worth your while

10 replies

Mon, 2011-12-12, 23:56

extempore

Joined: 2008-12-17,

On Sun, Sep 19, 2010 at 12:33 PM, martin odersky wrote:
> I have dreamed for about 5 years now that for loops should be just
> as fast as while loops, just using normal optimizations that are
> applicable everywhere. That's why I was always against special-casing
> for loops over ranges which would have been easy to do.

It took me another year and change, but maybe dreams do come true,
in a limited sort of way.

[info] length benchmark ns linear runtime
[info] 10 Foreach 10.35 =
[info] 10 TFor 5.23 =
[info] 10 While 4.94 =
[info] 100 Foreach 39.33 =
[info] 100 TFor 34.39 =
[info] 100 While 29.44 =
[info] 1000 Foreach 343.11 ===
[info] 1000 TFor 329.17 ==
[info] 1000 While 313.10 ==
[info] 10000 Foreach 3333.05 ==============================
[info] 10000 TFor 3281.15 =============================
[info] 10000 While 3096.62 ===========================

What's the catch? Actually I was hoping you guys would work that out and
let me know. Suboptimalities I have observed so far include some things
not being inlined which were inlined before, and that the performance
parity does not make it all the way into nested foreaches. I think I'm
still out in front though.

https://github.com/scala/scala/commit/4cfc633fc6

/** @note Making foreach run as fast as a while loop is a challenge.
* The key elements which I can observe making a difference are:
*
* - the inner loop should be as small as possible
* - the inner loop should be monomorphic
* - the inner loop should perform no boxing and no avoidable tests
*
* This is achieved by:
*
* - keeping initialization logic out of the inner loop
* - dispatching to custom variations based on initial conditions
* - tricking the compiler into always calling Function1#apply$mcVI$sp
*
* The last one is important and less than obvious. Even when foreach
* was specialized on Unit, only Int => Unit arguments benefited from it.
* Other function types would be accepted, but in the absence of full
* specialization the integer argument was boxed on every call. For example:
*
class A {
final def f(x: Int): Int = x + 1
// Calls Range.foreach, which calls Function1.apply
def g1 = 1 until 100 foreach { x => f(x) }
// Calls Range.foreach$mVc$sp, which calls Function1.apply$mcVI$sp
def g2 = 1 until 100 foreach { x => f(x) ; () }
}
*
* However! Since the result of the closure is always discarded, we
* simply cast it to Int => Unit, thereby executing the fast version.
* The seemingly looming ClassCastException can never arrive.
*/

Tue, 2011-12-13, 00:11

Viktor Klang

Joined: 2008-12-17,

Re: might be worth your while

Great work Paul!

What happens if you declare the extra methods inside the foreach method?
The compiler should _definitely_ be able to inline @inline-marked inner methods, no?

Cheers,
√

On Mon, Dec 12, 2011 at 11:56 PM, Paul Phillips <paulp@improving.org> wrote:

On Sun, Sep 19, 2010 at 12:33 PM, martin odersky <martin.odersky@epfl.ch> wrote:
> I have dreamed for about 5 years now that for loops should be just
> as fast as while loops, just using normal optimizations that are
> applicable everywhere. That's why I was always against special-casing
> for loops over ranges which would have been easy to do.

It took me another year and change, but maybe dreams do come true,
in a limited sort of way.

[info] length benchmark ns linear runtime
[info] 10 Foreach 10.35 =
[info] 10 TFor 5.23 =
[info] 10 While 4.94 =
[info] 100 Foreach 39.33 =
[info] 100 TFor 34.39 =
[info] 100 While 29.44 =
[info] 1000 Foreach 343.11 ===
[info] 1000 TFor 329.17 ==
[info] 1000 While 313.10 ==
[info] 10000 Foreach 3333.05 ==============================
[info] 10000 TFor 3281.15 =============================
[info] 10000 While 3096.62 ===========================

What's the catch? Actually I was hoping you guys would work that out and
let me know. Suboptimalities I have observed so far include some things
not being inlined which were inlined before, and that the performance
parity does not make it all the way into nested foreaches. I think I'm
still out in front though.

https://github.com/scala/scala/commit/4cfc633fc6

/** @note Making foreach run as fast as a while loop is a challenge.
* The key elements which I can observe making a difference are:
*
* - the inner loop should be as small as possible
* - the inner loop should be monomorphic
* - the inner loop should perform no boxing and no avoidable tests
*
* This is achieved by:
*
* - keeping initialization logic out of the inner loop
* - dispatching to custom variations based on initial conditions
* - tricking the compiler into always calling Function1#apply$mcVI$sp
*
* The last one is important and less than obvious. Even when foreach
* was specialized on Unit, only Int => Unit arguments benefited from it.
* Other function types would be accepted, but in the absence of full
* specialization the integer argument was boxed on every call. For example:
*
class A {
final def f(x: Int): Int = x + 1
// Calls Range.foreach, which calls Function1.apply
def g1 = 1 until 100 foreach { x => f(x) }
// Calls Range.foreach$mVc$sp, which calls Function1.apply$mcVI$sp
def g2 = 1 until 100 foreach { x => f(x) ; () }
}
*
* However! Since the result of the closure is always discarded, we
* simply cast it to Int => Unit, thereby executing the fast version.
* The seemingly looming ClassCastException can never arrive.
*/

--
Viktor Klang

Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang

Tue, 2011-12-13, 01:41

extempore

Joined: 2008-12-17,

Re: might be worth your while

2011/12/12 √iktor Ҡlang :
> The compiler should _definitely_ be able to inline @inline-marked inner
> methods, no?

It can't find the lifted methods to inline. I would believe it is a
bug in the inliner, which does (and doesn't do) a lot of things I
don't entirely understand. (Some of those things I will eventually
understand, but some are bugs I haven't fixed yet.)

Tue, 2011-12-13, 01:51

Viktor Klang

Joined: 2008-12-17,

Re: might be worth your while

On Tue, Dec 13, 2011 at 1:33 AM, Paul Phillips <paulp@improving.org> wrote:

2011/12/12 √iktor Ҡlang <viktor.klang@gmail.com>:
> The compiler should _definitely_ be able to inline @inline-marked inner
> methods, no?

It can't find the lifted methods to inline. I would believe it is a
bug in the inliner, which does (and doesn't do) a lot of things I
don't entirely understand. (Some of those things I will eventually
understand, but some are bugs I haven't fixed yet.)

DAmn, it's just one of those things I wish I had the time to get into and fix :(

--
Viktor Klang

Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang

Tue, 2011-12-13, 02:51

nilskp

Joined: 2009-01-30,

Re: might be worth your while

On Mon, Dec 12, 2011 at 4:56 PM, Paul Phillips <paulp@improving.org> wrote:

On Sun, Sep 19, 2010 at 12:33 PM, martin odersky <martin.odersky@epfl.ch> wrote:
> I have dreamed for about 5 years now that for loops should be just
> as fast as while loops, just using normal optimizations that are
> applicable everywhere. That's why I was always against special-casing
> for loops over ranges which would have been easy to do.

It took me another year and change, but maybe dreams do come true,
in a limited sort of way.

[info] length benchmark ns linear runtime
[info] 10 Foreach 10.35 =
[info] 10 TFor 5.23 =
[info] 10 While 4.94 =
[info] 100 Foreach 39.33 =
[info] 100 TFor 34.39 =
[info] 100 While 29.44 =
[info] 1000 Foreach 343.11 ===
[info] 1000 TFor 329.17 ==
[info] 1000 While 313.10 ==
[info] 10000 Foreach 3333.05 ==============================
[info] 10000 TFor 3281.15 =============================
[info] 10000 While 3096.62 ===========================

What's the catch? Actually I was hoping you guys would work that out and
let me know. Suboptimalities I have observed so far include some things
not being inlined which were inlined before, and that the performance
parity does not make it all the way into nested foreaches. I think I'm
still out in front though.

You don't state it outright, but my sixth sense leads me to one inescapable conclusion: Awesome!

Tue, 2011-12-13, 09:41

Ruediger Keller 2

Joined: 2010-04-30,

Re: might be worth your while

Hi Paul,

very nice indeed!

Does this depend on compiling with -optimize? Btw. I remember you
working on special casing Range.foreach some months ago with very
promising results. If I remember correctly Martin was against it, but
I think it had the advantage of not depending on -optimize and working
with nested foreachs, too. Both being very nice properties IMHO.

I wonder if Martin wouldn't change his mind now...

Regards,
Rüdiger

2011/12/12 Paul Phillips :
> On Sun, Sep 19, 2010 at 12:33 PM, martin odersky wrote:
>> I have dreamed for about 5 years now that for loops should be just
>> as fast as while loops, just using normal optimizations that are
>> applicable everywhere. That's why I was always against special-casing
>> for loops over ranges which would have been easy to do.
>
> It took me another year and change, but maybe dreams do come true,
> in a limited sort of way.
>
> [info] length benchmark ns linear runtime
> [info] 10 Foreach 10.35 =
> [info] 10 TFor 5.23 =
> [info] 10 While 4.94 =
> [info] 100 Foreach 39.33 =
> [info] 100 TFor 34.39 =
> [info] 100 While 29.44 =
> [info] 1000 Foreach 343.11 ===
> [info] 1000 TFor 329.17 ==
> [info] 1000 While 313.10 ==
> [info] 10000 Foreach 3333.05 ==============================
> [info] 10000 TFor 3281.15 =============================
> [info] 10000 While 3096.62 ===========================
>
> What's the catch? Actually I was hoping you guys would work that out and
> let me know. Suboptimalities I have observed so far include some things
> not being inlined which were inlined before, and that the performance
> parity does not make it all the way into nested foreaches. I think I'm
> still out in front though.
>
> https://github.com/scala/scala/commit/4cfc633fc6
>
> /** @note Making foreach run as fast as a while loop is a challenge.
> * The key elements which I can observe making a difference are:
> *
> * - the inner loop should be as small as possible
> * - the inner loop should be monomorphic
> * - the inner loop should perform no boxing and no avoidable tests
> *
> * This is achieved by:
> *
> * - keeping initialization logic out of the inner loop
> * - dispatching to custom variations based on initial conditions
> * - tricking the compiler into always calling Function1#apply$mcVI$sp
> *
> * The last one is important and less than obvious. Even when foreach
> * was specialized on Unit, only Int => Unit arguments benefited from it.
> * Other function types would be accepted, but in the absence of full
> * specialization the integer argument was boxed on every call. For example:
> *
> class A {
> final def f(x: Int): Int = x + 1
> // Calls Range.foreach, which calls Function1.apply
> def g1 = 1 until 100 foreach { x => f(x) }
> // Calls Range.foreach$mVc$sp, which calls Function1.apply$mcVI$sp
> def g2 = 1 until 100 foreach { x => f(x) ; () }
> }
> *
> * However! Since the result of the closure is always discarded, we
> * simply cast it to Int => Unit, thereby executing the fast version.
> * The seemingly looming ClassCastException can never arrive.
> */

Tue, 2011-12-13, 10:01

extempore

Joined: 2008-12-17,

Re: might be worth your while

On Tue, Dec 13, 2011 at 12:36 AM, Ruediger Keller
wrote:
> Does this depend on compiling with -optimize? Btw. I remember you
> working on special casing Range.foreach some months ago with very
> promising results. If I remember correctly Martin was against it, but
> I think it had the advantage of not depending on -optimize and working
> with nested foreachs, too. Both being very nice properties IMHO.
>
> I wonder if Martin wouldn't change his mind now...

Yes, it requires -optimise. If macros arrive on schedule this will be
a bit of a hollow victory since they'll offer what you're looking for.
Still, even a hollow victory is better than defeat.

Tue, 2011-12-13, 12:01

Miguel Garcia 2

Joined: 2011-01-30,

Re: might be worth your while

Paul,

There appears to be room for improvement in both Range.foreach and in the Inliner:

http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/2011Q4/Inliner.pdf
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/2011Q4/ClosureOptimiz.pdf

Regarding improving the Inliner, work in underway.

Regarding Range.foreach, as things stand, the following snippet

def main(args: Array[String]) {
    for(i <- 1 to 10) { for(j <- 1 to 10) { Console.println(i+j) } }
}

results after DeadCodeElimination in ICode with a lot of baggage, due to the closure argument being accessed four times in Range.foreach. In detail:

def main(args: Array[String] (ARRAY[REF(class String)])): Unit {

locals: value args, variable $inlThis7, variable par18, variable $inlThis15, variable par116, variable v1$11, variable $inlThis20, variable par117, variable loc21, variable v1$12, variable $inlThis21, variable loc25, variable $inlThis22, variable loc26, variable $inlThis23, variable loc27, variable $inlThis16, variable loc22, variable par113, variable $inlThis24, variable loc28, variable $inlThis14, variable par114, variable v1$13, variable loc29, variable loc210, variable loc211, variable loc23, variable par115, variable $inlThis18, variable loc212, variable par11, variable $inlThis1, variable loc24

startBlock: 1

blocks: [1,5,6,17,15,16,76,77,78,80,75,81,79,7,22,20,21,54,55,86,84,85,56,91,89,90,58,53,96,94,95,59,101,99,57,100,9,4,27,25,26,65,66,106,104,105,67,111,109,110,69,64,116,114,115,70,121,119,68,120,10,32,30,8,31]

...

and then come 500 lines of instructions.

Combining the .asInstanceOf[Int => Unit] trick and the version of Range.foreach suggested in https://issues.scala-lang.org/browse/SI-5286 , i.e.

@inline final override def foreach[U](f: Int => U) {
    if (length > 0) {
      val sentinel = last
      var closuVar = start
      var loopCond = true
      while ( loopCond ) {
        (f.asInstanceOf[Int => Unit]).apply(closuVar)
        if(closuVar == sentinel) loopCond = false
        else closuVar += step
      }
    }
}

we get both less baggage and the inlinings (all of them) we were looking for, in fewer than 140 instructions (including zero NEWs for anon-closu classes):

def main(args: Array[String] (ARRAY[REF(class String)])): Unit {

locals: value args, variable $inlThis4, variable par15, variable v11, variable $inlThis7, variable par17, variable loc21, variable $inlThis1, variable loc31, variable loc41, variable v12, variable loc22, variable $inlThis9, variable loc32, variable loc42

startBlock: 1

blocks: [1,7,9,6,5,29,31,28,8,4,27,30,26]

And, the shorter version also passes the test suite.

Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/

Tue, 2011-12-13, 12:11

extempore

Joined: 2008-12-17,

Re: Re: might be worth your while

On Tue, Dec 13, 2011 at 2:56 AM, Miguel Garcia wrote:
> we get both less baggage and the inlinings (all of them) we were looking
> for, in fewer than 140 instructions (including zero NEWs for anon-closu
> classes):

That sounds pretty promising. Can you or anyone reassure me that I
can introduce code which relies on overflow without losing any sleep?
From what I found in the jvm spec it sounds like we're good, but I'd
like to hear at least one other person say that so I have someone to
blame later. (The suggested implementation relies on Int.MinValue - N
+ N landing on Int.MinValue.)

Tue, 2011-12-13, 12:21

extempore

Joined: 2008-12-17,

Re: Re: might be worth your while

Oh, and I blended up the fact that there are two distinct proposed
implementations going in that ticket. Yours doesn't rely on overflow,
but if we can rely on the overflow, then daniel's looks simpler.

Tue, 2011-12-13, 13:51

#10

dcsobral

Joined: 2009-04-23,

Re: Re: might be worth your while

On Tue, Dec 13, 2011 at 09:02, Paul Phillips wrote:
> On Tue, Dec 13, 2011 at 2:56 AM, Miguel Garcia wrote:
>> we get both less baggage and the inlinings (all of them) we were looking
>> for, in fewer than 140 instructions (including zero NEWs for anon-closu
>> classes):
>
> That sounds pretty promising. Can you or anyone reassure me that I
> can introduce code which relies on overflow without losing any sleep?
> From what I found in the jvm spec it sounds like we're good, but I'd
> like to hear at least one other person say that so I have someone to
> blame later. (The suggested implementation relies on Int.MinValue - N
> + N landing on Int.MinValue.)

I think the spec is absolutely clear, but a comment in one blog (talk
about hearsay...
http://chasethedevil.blogspot.com/2009/06/java-int-overflow-behavior.html)
claims this behavior is tested by the JCK tests, though I don't have
access to them to verify it. So I'm just mentioning it here in case
some can check it out.

Scala Main Menu

might be worth your while

Scala Quick Links

Featured News

User login