This page is no longer maintained — Please continue to the home page at www.scala-lang.org

you may stop trying to optimize Range.foreach :)

14 replies
Miguel Garcia 2
Joined: 2011-01-30,
User offline. Last seen 42 years 45 weeks ago.


This time, I followed Daniel's hint about avoiding lazy vals accesses, and got this (plus: no duplicate inlining of loop bodies)

1 to 1010987314:,   foreachSum wins, 1.000:  foreach 315.741   while 315.840
1 to 1010987314:,   foreachSum wins, 0.997:  foreach 316.227   while 317.090
1 to 1010987314:,     whileSum wins, 0.997:  foreach 316.425   while 315.406
1 to 1155414073:,   foreachSum wins, 0.999:  foreach 361.608   while 361.958
1 to 1155414073:,     whileSum wins, 0.998:  foreach 362.152   while 361.404
1 to 1155414073:,   foreachSum wins, 0.997:  foreach 361.426   while 362.685
1 to 1320473226:,   foreachSum wins, 0.998:  foreach 412.996   while 413.626
1 to 1320473226:,   foreachSum wins, 1.000:  foreach 413.479   while 413.647
1 to 1320473226:,   foreachSum wins, 0.999:  foreach 413.193   while 413.800
1 to 1509112258:,   foreachSum wins, 0.998:  foreach 471.220   while 471.984
1 to 1509112258:,   foreachSum wins, 0.998:  foreach 472.093   while 472.912
1 to 1509112258:,     whileSum wins, 0.996:  foreach 472.886   while 471.027
1 to 1724699723:,   foreachSum wins, 0.998:  foreach 539.537   while 540.882
1 to 1724699723:,   foreachSum wins, 0.997:  foreach 538.399   while 540.080
1 to 1724699723:,     whileSum wins, 1.000:  foreach 539.316   while 539.132
1 to 1971085397:,     whileSum wins, 0.998:  foreach 617.257   while 616.036
1 to 1971085397:,   foreachSum wins, 0.999:  foreach 616.946   while 617.682
1 to 1971085397:,   foreachSum wins, 0.999:  foreach 617.565   while 617.892


The implementation:

  @inline final override def foreach[U](f: Int => U) {
    val goesDown = (step < 0)
    val isInclu  = isInclusive
    var i = start
    while (
      if(goesDown)
        if (isInclu) i >= end
        else         i >  end
      else
        if (isInclu) i <= end
        else         i <  end
    ) {
      f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it slower.
      i += step
    }
  }

And, the above also results in inlining for:

   class XXS {
     private val array = Array.range(0, 100)
     def tst = { var sum = 0; for (i <- 0 until array.length) sum += array(i); sum }
   }

As well as for:

  def main(args: Array[String]) {
    for(i <- 1 to 10) {
      Console.println(i)
      for(j <- 1 to 10) { Console.println(j) }
    }
  }

Just imagine once the Inliner has been improved, too.


Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/


Viktor Klang
Joined: 2008-12-17,
User offline. Last seen 1 year 27 weeks ago.
Re: you may stop trying to optimize Range.foreach :)
Excellent work!

Would be great to have this optimization for 2.10

On Tue, Dec 13, 2011 at 3:46 PM, Miguel Garcia <mgarcia512@yahoo.com> wrote:


This time, I followed Daniel's hint about avoiding lazy vals accesses, and got this (plus: no duplicate inlining of loop bodies)

1 to 1010987314:,   foreachSum wins, 1.000:  foreach 315.741   while 315.840
1 to 1010987314:,   foreachSum wins, 0.997:  foreach 316.227   while 317.090
1 to 1010987314:,     whileSum wins, 0.997:  foreach 316.425   while 315.406
1 to 1155414073:,   foreachSum wins, 0.999:  foreach 361.608   while 361.958
1 to 1155414073:,     whileSum wins, 0.998:  foreach 362.152   while 361.404
1 to 1155414073:,   foreachSum wins, 0.997:  foreach 361.426   while 362.685
1 to 1320473226:,   foreachSum wins, 0.998:  foreach 412.996   while 413.626
1 to 1320473226:,   foreachSum wins, 1.000:  foreach 413.479   while 413.647
1 to 1320473226:,   foreachSum wins, 0.999:  foreach 413.193   while 413.800
1 to 1509112258:,   foreachSum wins, 0.998:  foreach 471.220   while 471.984
1 to 1509112258:,   foreachSum wins, 0.998:  foreach 472.093   while 472.912
1 to 1509112258:,     whileSum wins, 0.996:  foreach 472.886   while 471.027
1 to 1724699723:,   foreachSum wins, 0.998:  foreach 539.537   while 540.882
1 to 1724699723:,   foreachSum wins, 0.997:  foreach 538.399   while 540.080
1 to 1724699723:,     whileSum wins, 1.000:  foreach 539.316   while 539.132
1 to 1971085397:,     whileSum wins, 0.998:  foreach 617.257   while 616.036
1 to 1971085397:,   foreachSum wins, 0.999:  foreach 616.946   while 617.682
1 to 1971085397:,   foreachSum wins, 0.999:  foreach 617.565   while 617.892


The implementation:

  @inline final override def foreach[U](f: Int => U) {
    val goesDown = (step < 0)
    val isInclu  = isInclusive
    var i = start
    while (
      if(goesDown)
        if (isInclu) i >= end
        else         i >  end
      else
        if (isInclu) i <= end
        else         i <  end
    ) {
      f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it slower.
      i += step
    }
  }

And, the above also results in inlining for:

   class XXS {
     private val array = Array.range(0, 100)
     def tst = { var sum = 0; for (i <- 0 until array.length) sum += array(i); sum }
   }

As well as for:

  def main(args: Array[String]) {
    for(i <- 1 to 10) {
      Console.println(i)
      for(j <- 1 to 10) { Console.println(j) }
    }
  }

Just imagine once the Inliner has been improved, too.


Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/





--
Viktor Klang

Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang
extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: you may stop trying to optimize Range.foreach :)

On Tue, Dec 13, 2011 at 6:46 AM, Miguel Garcia wrote:
> fast foreach

Excellent, I think we have a winner.

> Just imagine once the Inliner has been improved, too.

I can hardly contain myself as it is.

I wonder if there's some simple way we can harness this kind of energy
on a more frequent basis. It doesn't look as if any of us on our own
would have come up with this.

extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: you may stop trying to optimize Range.foreach :)

On Tue, Dec 13, 2011 at 7:17 AM, Ismael Juma wrote:
> Out of curiosity, what does the bytecode for the above look like?

It looks like this, calling into question basically everything I
thought might be true, since it does seem equally fast. But it isn't
even avoiding the boxing. How can it be no slower to call the
unspecialized apply?

public final void foreach(scala.Function1);
Code:
Stack=2, Locals=5, Args_size=2
0: aload_0
1: invokevirtual #141414; //Method step:()I
4: iconst_0
5: if_icmpge 12
8: iconst_1
9: goto 13
12: iconst_0
13: istore_2
14: aload_0
15: invokevirtual #1c1c1c; //Method isInclusive:()Z
18: istore_3
19: aload_0
20: invokevirtual #151515; //Method start:()I
23: istore 4
25: iload_2
26: ifeq 67
29: iload_3
30: ifeq 50
33: iload 4
35: aload_0
36: invokevirtual #111111; //Method end:()I
39: if_icmplt 46
42: iconst_1
43: goto 102
46: iconst_0
47: goto 102
50: iload 4
52: aload_0
53: invokevirtual #111111; //Method end:()I
56: if_icmple 63
59: iconst_1
60: goto 102
63: iconst_0
64: goto 102
67: iload_3
68: ifeq 88
71: iload 4
73: aload_0
74: invokevirtual #111111; //Method end:()I
77: if_icmpgt 84
80: iconst_1
81: goto 102
84: iconst_0
85: goto 102
88: iload 4
90: aload_0
91: invokevirtual #111111; //Method end:()I
94: if_icmpge 101
97: iconst_1
98: goto 102
101: iconst_0
102: ifeq 129
105: aload_1
106: iload 4
108: invokestatic #5c5c5c; //Method
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
111: invokeinterface #5c5c5c, 2; //InterfaceMethod
scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
116: pop
117: iload 4
119: aload_0
120: invokevirtual #141414; //Method step:()I
123: iadd
124: istore 4
126: goto 25
129: return

Viktor Klang
Joined: 2008-12-17,
User offline. Last seen 1 year 27 weeks ago.
Re: you may stop trying to optimize Range.foreach :)


On Tue, Dec 13, 2011 at 4:08 PM, Paul Phillips <paulp@improving.org> wrote:
On Tue, Dec 13, 2011 at 6:46 AM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
> fast foreach

Excellent, I think we have a winner.

> Just imagine once the Inliner has been improved, too.

I can hardly contain myself as it is.

I wonder if there's some simple way we can harness this kind of energy
on a more frequent basis.  It doesn't look as if any of us on our own
would have come up with this.

We need some focal point in the alps somewhere where such synergy can be synthesized.



--
Viktor Klang

Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang
Ismael Juma 2
Joined: 2011-01-22,
User offline. Last seen 42 years 45 weeks ago.
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 3:20 PM, Paul Phillips <paulp@improving.org> wrote:
It looks like this, calling into question basically everything I
thought might be true, since it does seem equally fast.  But it isn't
even avoiding the boxing.  How can it be no slower to call the
unspecialized apply?

It would be interesting to see the native code generated by HotSpot, but there is the risk that this is the kind of code that works great in micro-benchmarks, but not so well in real code.
Somewhat relevant to this, do we care about the performance of HotSpot -client? Not sure if it will handle the branches inside the loop as well as -server.
Best,Ismael
Ismael Juma 2
Joined: 2011-01-22,
User offline. Last seen 42 years 45 weeks ago.
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 2:46 PM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
The implementation:

  @inline final override def foreach[U](f: Int => U) {
    val goesDown = (step < 0)
    val isInclu  = isInclusive
    var i = start
    while (
      if(goesDown)
        if (isInclu) i >= end
        else         i >  end
      else
        if (isInclu) i <= end
        else         i <  end
    ) {
      f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it slower.
      i += step
    }
  }

Out of curiosity, what does the bytecode for the above look like?
Best,Ismael
extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: you may stop trying to optimize Range.foreach :)

Here is the bytecode without and with the cast:

// without
108: invokestatic #5c5c5c; //Method
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
111: invokeinterface #5c5c5c, 2; //InterfaceMethod
scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
116: pop

// with
108: invokeinterface #575757, 2; //InterfaceMethod
scala/Function1.apply$mcVI$sp:(I)V

I have trouble believing the second version isn't better. I can
reproduce that it doesn't seem to make a difference on the trivial
benchmark, but I can't reproduce "slower" and it's hard to see how it
could happen unless there is something seriously wrong with the
specialization of Function1.

Viktor Klang
Joined: 2008-12-17,
User offline. Last seen 1 year 27 weeks ago.
Re: you may stop trying to optimize Range.foreach :)


On Tue, Dec 13, 2011 at 4:38 PM, Paul Phillips <paulp@improving.org> wrote:
Here is the bytecode without and with the cast:

// without
 108: invokestatic    #5c5c5c; //Method
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
 111: invokeinterface #5c5c5c,  2; //InterfaceMethod
scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
 116: pop

// with
 108:  invokeinterface #575757,  2; //InterfaceMethod
scala/Function1.apply$mcVI$sp:(I)V

I have trouble believing the second version isn't better.  I can
reproduce that it doesn't seem to make a difference on the trivial
benchmark, but I can't reproduce "slower" and it's hard to see how it
could happen unless there is something seriously wrong with the
specialization of Function1.

Run it with a tiny heap and use a range that is starts at 129


--
Viktor Klang

Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts

Twitter: @viktorklang
extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: you may stop trying to optimize Range.foreach :)

Well, I take that back after looking more closely at what happens with
function specialization.

Tiark Rompf
Joined: 2009-02-18,
User offline. Last seen 42 years 45 weeks ago.
Re: you may stop trying to optimize Range.foreach :)

If the closure body is small and the apply callsite is monomorphic a possible explanation is that HotSpot inlines the closure and optimizes away the boxing/unboxing.

- Tiark

On Dec 13, 2011, at 4:38 PM, Paul Phillips wrote:

> Here is the bytecode without and with the cast:
>
> // without
> 108: invokestatic #5c5c5c; //Method
> scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
> 111: invokeinterface #5c5c5c, 2; //InterfaceMethod
> scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
> 116: pop
>
> // with
> 108: invokeinterface #575757, 2; //InterfaceMethod
> scala/Function1.apply$mcVI$sp:(I)V
>
> I have trouble believing the second version isn't better. I can
> reproduce that it doesn't seem to make a difference on the trivial
> benchmark, but I can't reproduce "slower" and it's hard to see how it
> could happen unless there is something seriously wrong with the
> specialization of Function1.

dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: you may stop trying to optimize Range.foreach :)

On Tue, Dec 13, 2011 at 12:46, Miguel Garcia wrote:
>
> The implementation:
>
>   @inline final override def foreach[U](f: Int => U) {
>     val goesDown = (step < 0)
>     val isInclu  = isInclusive
>     var i = start
>     while (
>       if(goesDown)
>         if (isInclu) i >= end
>         else         i >  end
>       else
>         if (isInclu) i <= end
>         else         i <  end
>     ) {
>       f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it
> slower.
>       i += step
>     }
>   }

So, while eating lunch my mind reminded me "last" was there for good
reason. What would happen in the following two cases?

1 to 1 by 2 foreach println
Int.MaxValue to Int.MaxValue by 2 foreach println

Miguel Garcia 2
Joined: 2011-01-30,
User offline. Last seen 42 years 45 weeks ago.
Re: you may stop trying to optimize Range.foreach :)

Daniel,

It's dinner time over here so I'll go get some ideas :) The "last" thing I needed, "Range.last" !

Miguel
Iulian Dragos
Joined: 2008-12-18,
User offline. Last seen 42 years 45 weeks ago.
Re: you may stop trying to optimize Range.foreach :)
All these discussions would be so much better with the benchmarking code attached. I am very skeptical of any claims of making foreach faster without removing boxing/unboxing. It's most probably a benchmarking artifact. Mini-benchmarks ar hard to get right.
cheers,iulian

On Tue, Dec 13, 2011 at 5:21 PM, Daniel Sobral <dcsobral@gmail.com> wrote:
On Tue, Dec 13, 2011 at 12:46, Miguel Garcia <mgarcia512@yahoo.com> wrote:
>
> The implementation:
>
>   @inline final override def foreach[U](f: Int => U) {
>     val goesDown = (step < 0)
>     val isInclu  = isInclusive
>     var i = start
>     while (
>       if(goesDown)
>         if (isInclu) i >= end
>         else         i >  end
>       else
>         if (isInclu) i <= end
>         else         i <  end
>     ) {
>       f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it
> slower.
>       i += step
>     }
>   }

So, while eating lunch my mind reminded me "last" was there for good
reason. What would happen in the following two cases?

1 to 1 by 2 foreach println
Int.MaxValue to Int.MaxValue by 2 foreach println

--
Daniel C. Sobral

I travel to the future all the time.



--
« Je déteste la montagne, ça cache le paysage »
Alphonse Allais
dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: you may stop trying to optimize Range.foreach :)

On Tue, Dec 13, 2011 at 15:12, iulian dragos wrote:
> All these discussions would be so much better with the benchmarking code
> attached. I am very skeptical of any claims of making foreach faster without
> removing boxing/unboxing. It's most probably a benchmarking artifact.
> Mini-benchmarks ar hard to get right.

I've been using
https://github.com/dcsobral/scala-benchmarking-template to benchmark.
It uses Caliper to make the measurements, and it tests ranges of
multiple sizes against a while loop, foreach, and a specialized
for-loop class. The body is small, but that's precisely when the
difference in speed makes the biggest impact.

I want to expand the benchmarks to include slow functions in addition
to fast functions, and given paul's discovery, functions that are not
Int => Unit as well. But that will wait until I can get SBT to run
multiple benchmarks -- it hasn't bent to my will so far on a couple of
issues, even though the rest of the infrastructure is working (not on
the github though, since it presently doesn't run at all).

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland