- About Scala
- Documentation
- Code Examples
- Software
- Scala Developers
you may stop trying to optimize Range.foreach :)
Tue, 2011-12-13, 15:46
This time, I followed Daniel's hint about avoiding lazy vals accesses, and got this (plus: no duplicate inlining of loop bodies)
1 to 1010987314:, foreachSum wins, 1.000: foreach 315.741 while 315.840
1 to 1010987314:, foreachSum wins, 0.997: foreach 316.227 while 317.090
1 to 1010987314:, whileSum wins, 0.997: foreach 316.425 while 315.406
1 to 1155414073:, foreachSum wins, 0.999: foreach 361.608 while 361.958
1 to 1155414073:, whileSum wins, 0.998: foreach 362.152 while 361.404
1 to 1155414073:, foreachSum wins, 0.997: foreach 361.426 while 362.685
1 to 1320473226:, foreachSum wins, 0.998: foreach 412.996 while 413.626
1 to 1320473226:, foreachSum wins, 1.000: foreach 413.479 while 413.647
1 to 1320473226:, foreachSum wins, 0.999: foreach 413.193 while 413.800
1 to 1509112258:, foreachSum wins, 0.998: foreach 471.220 while 471.984
1 to 1509112258:, foreachSum wins, 0.998: foreach 472.093 while 472.912
1 to 1509112258:, whileSum wins, 0.996: foreach 472.886 while 471.027
1 to 1724699723:, foreachSum wins, 0.998: foreach 539.537 while 540.882
1 to 1724699723:, foreachSum wins, 0.997: foreach 538.399 while 540.080
1 to 1724699723:, whileSum wins, 1.000: foreach 539.316 while 539.132
1 to 1971085397:, whileSum wins, 0.998: foreach 617.257 while 616.036
1 to 1971085397:, foreachSum wins, 0.999: foreach 616.946 while 617.682
1 to 1971085397:, foreachSum wins, 0.999: foreach 617.565 while 617.892
The implementation:
@inline final override def foreach[U](f: Int => U) {
val goesDown = (step < 0)
val isInclu = isInclusive
var i = start
while (
if(goesDown)
if (isInclu) i >= end
else i > end
else
if (isInclu) i <= end
else i < end
) {
f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it slower.
i += step
}
}
And, the above also results in inlining for:
class XXS {
private val array = Array.range(0, 100)
def tst = { var sum = 0; for (i <- 0 until array.length) sum += array(i); sum }
}
As well as for:
def main(args: Array[String]) {
for(i <- 1 to 10) {
Console.println(i)
for(j <- 1 to 10) { Console.println(j) }
}
}
Just imagine once the Inliner has been improved, too.
Miguel
http://lamp.epfl.ch/~magarcia/ScalaCompilerCornerReloaded/
Tue, 2011-12-13, 16:21
#2
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 6:46 AM, Miguel Garcia wrote:
> fast foreach
Excellent, I think we have a winner.
> Just imagine once the Inliner has been improved, too.
I can hardly contain myself as it is.
I wonder if there's some simple way we can harness this kind of energy
on a more frequent basis. It doesn't look as if any of us on our own
would have come up with this.
Tue, 2011-12-13, 16:21
#3
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 7:17 AM, Ismael Juma wrote:
> Out of curiosity, what does the bytecode for the above look like?
It looks like this, calling into question basically everything I
thought might be true, since it does seem equally fast. But it isn't
even avoiding the boxing. How can it be no slower to call the
unspecialized apply?
public final void foreach(scala.Function1);
Code:
Stack=2, Locals=5, Args_size=2
0: aload_0
1: invokevirtual #141414; //Method step:()I
4: iconst_0
5: if_icmpge 12
8: iconst_1
9: goto 13
12: iconst_0
13: istore_2
14: aload_0
15: invokevirtual #1c1c1c; //Method isInclusive:()Z
18: istore_3
19: aload_0
20: invokevirtual #151515; //Method start:()I
23: istore 4
25: iload_2
26: ifeq 67
29: iload_3
30: ifeq 50
33: iload 4
35: aload_0
36: invokevirtual #111111; //Method end:()I
39: if_icmplt 46
42: iconst_1
43: goto 102
46: iconst_0
47: goto 102
50: iload 4
52: aload_0
53: invokevirtual #111111; //Method end:()I
56: if_icmple 63
59: iconst_1
60: goto 102
63: iconst_0
64: goto 102
67: iload_3
68: ifeq 88
71: iload 4
73: aload_0
74: invokevirtual #111111; //Method end:()I
77: if_icmpgt 84
80: iconst_1
81: goto 102
84: iconst_0
85: goto 102
88: iload 4
90: aload_0
91: invokevirtual #111111; //Method end:()I
94: if_icmpge 101
97: iconst_1
98: goto 102
101: iconst_0
102: ifeq 129
105: aload_1
106: iload 4
108: invokestatic #5c5c5c; //Method
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
111: invokeinterface #5c5c5c, 2; //InterfaceMethod
scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
116: pop
117: iload 4
119: aload_0
120: invokevirtual #141414; //Method step:()I
123: iadd
124: istore 4
126: goto 25
129: return
Tue, 2011-12-13, 16:31
#4
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 4:08 PM, Paul Phillips <paulp@improving.org> wrote:
On Tue, Dec 13, 2011 at 6:46 AM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
> fast foreach
Excellent, I think we have a winner.
> Just imagine once the Inliner has been improved, too.
I can hardly contain myself as it is.
I wonder if there's some simple way we can harness this kind of energy
on a more frequent basis. It doesn't look as if any of us on our own
would have come up with this.
We need some focal point in the alps somewhere where such synergy can be synthesized.
--
Viktor Klang
Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts
Twitter: @viktorklang
Tue, 2011-12-13, 16:31
#5
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 3:20 PM, Paul Phillips <paulp@improving.org> wrote:
It would be interesting to see the native code generated by HotSpot, but there is the risk that this is the kind of code that works great in micro-benchmarks, but not so well in real code.
Somewhat relevant to this, do we care about the performance of HotSpot -client? Not sure if it will handle the branches inside the loop as well as -server.
Best,Ismael
It looks like this, calling into question basically everything I
thought might be true, since it does seem equally fast. But it isn't
even avoiding the boxing. How can it be no slower to call the
unspecialized apply?
It would be interesting to see the native code generated by HotSpot, but there is the risk that this is the kind of code that works great in micro-benchmarks, but not so well in real code.
Somewhat relevant to this, do we care about the performance of HotSpot -client? Not sure if it will handle the branches inside the loop as well as -server.
Best,Ismael
Tue, 2011-12-13, 16:41
#6
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 2:46 PM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
Out of curiosity, what does the bytecode for the above look like?
Best,Ismael
The implementation:
@inline final override def foreach[U](f: Int => U) {
val goesDown = (step < 0)
val isInclu = isInclusive
var i = start
while (
if(goesDown)
if (isInclu) i >= end
else i > end
else
if (isInclu) i <= end
else i < end
) {
f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it slower.
i += step
}
}
Out of curiosity, what does the bytecode for the above look like?
Best,Ismael
Tue, 2011-12-13, 16:41
#7
Re: you may stop trying to optimize Range.foreach :)
Here is the bytecode without and with the cast:
// without
108: invokestatic #5c5c5c; //Method
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
111: invokeinterface #5c5c5c, 2; //InterfaceMethod
scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
116: pop
// with
108: invokeinterface #575757, 2; //InterfaceMethod
scala/Function1.apply$mcVI$sp:(I)V
I have trouble believing the second version isn't better. I can
reproduce that it doesn't seem to make a difference on the trivial
benchmark, but I can't reproduce "slower" and it's hard to see how it
could happen unless there is something seriously wrong with the
specialization of Function1.
Tue, 2011-12-13, 16:51
#8
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 4:38 PM, Paul Phillips <paulp@improving.org> wrote:
Here is the bytecode without and with the cast:
// without
108: invokestatic #5c5c5c; //Method
scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
111: invokeinterface #5c5c5c, 2; //InterfaceMethod
scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
116: pop
// with
108: invokeinterface #575757, 2; //InterfaceMethod
scala/Function1.apply$mcVI$sp:(I)V
I have trouble believing the second version isn't better. I can
reproduce that it doesn't seem to make a difference on the trivial
benchmark, but I can't reproduce "slower" and it's hard to see how it
could happen unless there is something seriously wrong with the
specialization of Function1.
Run it with a tiny heap and use a range that is starts at 129
--
Viktor Klang
Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts
Twitter: @viktorklang
Tue, 2011-12-13, 16:51
#9
Re: you may stop trying to optimize Range.foreach :)
Well, I take that back after looking more closely at what happens with
function specialization.
Tue, 2011-12-13, 17:11
#10
Re: you may stop trying to optimize Range.foreach :)
If the closure body is small and the apply callsite is monomorphic a possible explanation is that HotSpot inlines the closure and optimizes away the boxing/unboxing.
- Tiark
On Dec 13, 2011, at 4:38 PM, Paul Phillips wrote:
> Here is the bytecode without and with the cast:
>
> // without
> 108: invokestatic #5c5c5c; //Method
> scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
> 111: invokeinterface #5c5c5c, 2; //InterfaceMethod
> scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
> 116: pop
>
> // with
> 108: invokeinterface #575757, 2; //InterfaceMethod
> scala/Function1.apply$mcVI$sp:(I)V
>
> I have trouble believing the second version isn't better. I can
> reproduce that it doesn't seem to make a difference on the trivial
> benchmark, but I can't reproduce "slower" and it's hard to see how it
> could happen unless there is something seriously wrong with the
> specialization of Function1.
Tue, 2011-12-13, 17:41
#11
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 12:46, Miguel Garcia wrote:
>
> The implementation:
>
> @inline final override def foreach[U](f: Int => U) {
> val goesDown = (step < 0)
> val isInclu = isInclusive
> var i = start
> while (
> if(goesDown)
> if (isInclu) i >= end
> else i > end
> else
> if (isInclu) i <= end
> else i < end
> ) {
> f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it
> slower.
> i += step
> }
> }
So, while eating lunch my mind reminded me "last" was there for good
reason. What would happen in the following two cases?
1 to 1 by 2 foreach println
Int.MaxValue to Int.MaxValue by 2 foreach println
Tue, 2011-12-13, 18:11
#12
Re: you may stop trying to optimize Range.foreach :)
Daniel,
It's dinner time over here so I'll go get some ideas :) The "last" thing I needed, "Range.last" !
Miguel
Tue, 2011-12-13, 18:21
#13
Re: you may stop trying to optimize Range.foreach :)
All these discussions would be so much better with the benchmarking code attached. I am very skeptical of any claims of making foreach faster without removing boxing/unboxing. It's most probably a benchmarking artifact. Mini-benchmarks ar hard to get right.
cheers,iulian
On Tue, Dec 13, 2011 at 5:21 PM, Daniel Sobral <dcsobral@gmail.com> wrote:
--
« Je déteste la montagne, ça cache le paysage »
Alphonse Allais
cheers,iulian
On Tue, Dec 13, 2011 at 5:21 PM, Daniel Sobral <dcsobral@gmail.com> wrote:
On Tue, Dec 13, 2011 at 12:46, Miguel Garcia <mgarcia512@yahoo.com> wrote:
>
> The implementation:
>
> @inline final override def foreach[U](f: Int => U) {
> val goesDown = (step < 0)
> val isInclu = isInclusive
> var i = start
> while (
> if(goesDown)
> if (isInclu) i >= end
> else i > end
> else
> if (isInclu) i <= end
> else i < end
> ) {
> f(i) // The cast `(f.asInstanceOf[Int => Unit]).apply(i)` made it
> slower.
> i += step
> }
> }
So, while eating lunch my mind reminded me "last" was there for good
reason. What would happen in the following two cases?
1 to 1 by 2 foreach println
Int.MaxValue to Int.MaxValue by 2 foreach println
--
Daniel C. Sobral
I travel to the future all the time.
--
« Je déteste la montagne, ça cache le paysage »
Alphonse Allais
Tue, 2011-12-13, 19:01
#14
Re: you may stop trying to optimize Range.foreach :)
On Tue, Dec 13, 2011 at 15:12, iulian dragos wrote:
> All these discussions would be so much better with the benchmarking code
> attached. I am very skeptical of any claims of making foreach faster without
> removing boxing/unboxing. It's most probably a benchmarking artifact.
> Mini-benchmarks ar hard to get right.
I've been using
https://github.com/dcsobral/scala-benchmarking-template to benchmark.
It uses Caliper to make the measurements, and it tests ranges of
multiple sizes against a while loop, foreach, and a specialized
for-loop class. The body is small, but that's precisely when the
difference in speed makes the biggest impact.
I want to expand the benchmarks to include slow functions in addition
to fast functions, and given paul's discovery, functions that are not
Int => Unit as well. But that will wait until I can get SBT to run
multiple benchmarks -- it hasn't bent to my will so far on a couple of
issues, even though the rest of the infrastructure is working (not on
the github though, since it presently doesn't run at all).
Would be great to have this optimization for 2.10
On Tue, Dec 13, 2011 at 3:46 PM, Miguel Garcia <mgarcia512@yahoo.com> wrote:
--
Viktor Klang
Akka Tech LeadTypesafe - Enterprise-Grade Scala from the Experts
Twitter: @viktorklang