optimizing simple fors
Sat, 2010-09-18, 21:31
I thought of this issue while working on the parser this morning and I
couldn't resist implementing it to see what would happen.
Background reading:
https://lampsvn.epfl.ch/trac/scala/ticket/1338
"Optimize simple for loops"
All of the following results are from the last few lines of "MatMul"
from #1338, run as shown here:
% scalac [options] MatMul.scala
% scala -Dfactor=15 MatMul
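(For readers who don't want to open the ticket: here is my reading of the four labels, sketched on a vector sum rather than the actual MatMul kernel. The exact forms are guesses -- the real code is attached to #1338.)
def sumWhile(a: Array[Double]): Double = {      // "While Loop"
  var s = 0.0
  var i = 0
  while (i < a.length) { s += a(i); i += 1 }
  s
}
def sumLimits(a: Array[Double]): Double = {     // "Limits": a literal `until`
  var s = 0.0
  for (i <- 0 until a.length) s += a(i)
  s
}
def sumRanges(a: Array[Double]): Double = {     // "Ranges": a stored Range value
  val r = 0 until a.length
  var s = 0.0
  for (i <- r) s += a(i)
  s
}
def sumIterators(a: Array[Double]): Double = {  // "Iterators": Iterator.range
  var s = 0.0
  for (i <- Iterator.range(0, a.length)) s += a(i)
  s
}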
I measured everything with and without -optimise, but I'm omitting the
optimized numbers because it never made a difference of more than a
couple percent in run time, and not necessarily for the better. (It did
make a big difference in another way, to be revealed at the end.)
We start here to appreciate why this is/was a major issue when reported:
** scala 2.7.7
Iterators 11,064,903,000ns
Limits 19,097,835,000ns
Ranges 19,956,333,000ns
While Loop 2,702,107,000ns
** scala 2.8.0
Iterators 7,253,378,000ns
Limits 12,648,149,000ns
Ranges 12,679,702,000ns
While Loop 2,688,374,000ns
A modest improvement in 2.8.0, but then:
** scala trunk, build r23026
Iterators 3,536,780,000ns
Limits 3,558,103,000ns
Ranges 3,452,037,000ns
While Loop 2,627,364,000ns
Whoa, what happened since 2.8.0? That's easy enough to see:
** scala trunk, build r23026, -no-specialization
Iterators 16,509,992,000ns
Limits 23,061,769,000ns
Ranges 23,069,826,000ns
While Loop 2,634,965,000ns
Of course 2.8.0 has specialization, but it looks like one or more bugs
fixed since release have been corkers.
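(Schematically, what specialization changes here -- a sketch of the mechanism, not the actual library source. With Function1 specialized on Int, Range can hand each index to the closure without boxing:)
// Morally what Range.foreach does once specialization kicks in; the call
// f(i) compiles to f.apply$mcVI$sp(i), so no java.lang.Integer is allocated.
def foreachSpec(start: Int, end: Int, step: Int)(f: Int => Unit): Unit = {
  var i = start
  while (i < end) {
    f(i)
    i += step
  }
}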
Now we get to me, and we do a little hack here and a little hack there
and try it again:
** scala trunk, build r23026 + paulp patch
Iterators 2,705,412,000ns
Limits 2,761,504,000ns
Ranges 3,461,005,000ns
While Loop 2,696,050,000ns
Not fully appreciable without also seeing this:
** scala trunk, build r23026 + paulp patch, -no-specialization
Iterators 2,770,781,000ns
Limits 2,794,777,000ns
Ranges 16,453,614,000ns
While Loop 2,683,085,000ns
Here is the generated code post-patch for method matMulUsingLimits.
(The unoptimized version is almost identical, with only a few
instructions eliminated.)
// r23026 + paulp patch, -optimise
public void matMulUsingLimits(double[][], double[][], double[][]);
Code:
Stack=6, Locals=16, Args_size=4
0: aload_2
1: arraylength
2: newarray double
4: astore 9
6: aload_1
7: arraylength
8: istore 6
10: aload_2
11: iconst_0
12: aaload
13: arraylength
14: istore 4
16: aload_2
17: arraylength
18: istore 7
20: iconst_0
21: istore 15
23: iload 15
25: iload 4
27: if_icmpge 146
30: iconst_0
31: istore 5
33: iload 5
35: iload 7
37: if_icmpge 61
40: aload 9
42: iload 5
44: aload_2
45: iload 5
47: aaload
48: iload 15
50: daload
51: dastore
52: iload 5
54: iconst_1
55: iadd
56: istore 5
58: goto 33
61: iconst_0
62: istore 14
64: iload 14
66: iload 6
68: if_icmpge 137
71: aload_3
72: iload 14
74: aaload
75: astore 11
77: aload_1
78: iload 14
80: aaload
81: astore 8
83: dconst_0
84: dstore 12
86: iconst_0
87: istore 10
89: iload 10
91: iload 7
93: if_icmpge 121
96: dload 12
98: aload 8
100: iload 10
102: daload
103: aload 9
105: iload 10
107: daload
108: dmul
109: dadd
110: dstore 12
112: iload 10
114: iconst_1
115: iadd
116: istore 10
118: goto 89
121: aload 11
123: iload 15
125: dload 12
127: dastore
128: iload 14
130: iconst_1
131: iadd
132: istore 14
134: goto 64
137: iload 15
139: iconst_1
140: iadd
141: istore 15
143: goto 23
146: return
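(For orientation, bytecode of that shape decompiles to roughly the source below. The names are mine, and the actual benchmark code in #1338 may differ in detail; the point is that post-patch, every `for` here becomes one of the while loops above.)
def matMul(a: Array[Array[Double]], b: Array[Array[Double]],
           out: Array[Array[Double]]): Unit = {
  val col = new Array[Double](b.length)   // the `newarray double` at offset 2
  for (j <- 0 until b(0).length) {        // outer loop over columns of b
    for (k <- 0 until b.length)           // copy column j of b into the buffer
      col(k) = b(k)(j)
    for (i <- 0 until a.length) {         // dot each row of a with the column
      val ai = a(i)
      var sum = 0.0
      for (k <- 0 until b.length) sum += ai(k) * col(k)
      out(i)(j) = sum
    }
  }
}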
Here is current trunk:
// r23026, unoptimized
public void matMulUsingLimits(double[][], double[][], double[][]);
Code:
Stack=9, Locals=8, Args_size=4
0: aload_2
1: arraylength
2: newarray double
4: astore 4
6: aload_1
7: arraylength
8: istore 5
10: aload_2
11: iconst_0
12: aaload
13: arraylength
14: istore 6
16: aload_2
17: arraylength
18: istore 7
20: getstatic #27; //Field scala/Predef$.MODULE$:Lscala/Predef$;
23: iconst_0
24: invokevirtual #31; //Method scala/Predef$.intWrapper:(I)Lscala/runtime/RichInt;
27: iload 6
29: invokevirtual #37; //Method scala/runtime/RichInt.until:(I)Lscala/collection/immutable/Range;
32: new #73; //class MatMul$$anonfun$matMulUsingLimits$1
35: dup
36: aload_1
37: aload_2
38: aload_3
39: aload 4
41: iload 5
43: iload 7
45: invokespecial #76; //Method MatMul$$anonfun$matMulUsingLimits$1."<init>":([[D[[D[[D[DII)V
48: invokevirtual #48; //Method scala/collection/immutable/Range.foreach$mVc$sp:(Lscala/Function1;)V
51: return
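(Back in source form, offsets 20-48 spell out the following; `cols` and the closure body stand in for the benchmark's actual bound and loop nest.)
def outerLoop(cols: Int)(body: Int => Unit): Unit =
  scala.Predef.intWrapper(0)   // offset 24: Predef$.intWrapper
    .until(cols)               // offset 29: RichInt.until builds a Range
    .foreach(body)             // offset 48: Range.foreach$mVc$sp(Function1)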
And here is a small portion of the optimized code, the size of which can
be inferred from the instruction labels. It's no faster and over 10x as
large, so it may bear some investigation.
// r23026, -optimise
804: invokeinterface #aaa, 2; //InterfaceMethod scala/Function1.apply$mcVI$sp:(I)V
809: iload 45
811: aload 44
813: invokevirtual #93; //Method scala/collection/immutable/Range.step:()I
816: iadd
817: istore 45
819: goto 761
822: aload 46
824: iload 48
826: invokeinterface #aaa, 2; //InterfaceMethod scala/Function1.apply$mcVI$sp:(I)V
831: iload 48
833: aload 47
835: invokevirtual #93; //Method scala/collection/immutable/Range.step:()I
838: iadd
839: istore 48
841: goto 680
844: aload 49
846: iload 51
848: invokeinterface #aaa, 2; //InterfaceMethod scala/Function1.apply$mcVI$sp:(I)V
853: iload 51
855: aload 50
857: invokevirtual #93; //Method scala/collection/immutable/Range.step:()I
860: iadd
861: istore 51
863: goto 611
If a single person makes it down to this line, you are a good candidate
for this question: can you suggest any scala projects which I can try
this out on which will let me measure the impact with somewhere in the
neighborhood of zero effort? Let me know, high attention span reader who
may or may not exist.
Sat, 2010-09-18, 22:57
#2
Re: optimizing simple fors
The optimized for loops are most welcome. Thank you!
Sun, 2010-09-19, 04:27
#3
Re: optimizing simple fors
Two questions:
1) What are "limits"?
2) The results seem to indicate you haven't optimized range loops. Why not?
--
Daniel C. Sobral
I travel to the future all the time.
Sun, 2010-09-19, 04:47
#4
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 12:26:22AM -0300, Daniel Sobral wrote:
> 1) What are "limits"?
https://lampsvn.epfl.ch/trac/scala/ticket/1338
> 2) The results seem to indicate you haven't optimized range loops. Why
> not?
I didn't want to disappoint by leaving you nothing extra to request. Of
secondary interest is that I'd like some idea whether it has any chance
of being used before I coat it with sequins. And you probably think
that line means something other than it does: I refer you to answer 1.
Sun, 2010-09-19, 05:17
#5
Re: optimizing simple fors
The easiest thing to try first might be to move from the bug report benchmark to a Computer Language Benchmarks Game benchmark. These _are_ benchmarks, but on the other hand they contain operations that are not _so_ unusual for people doing numerical work. For example, Scala's fannkuch-redux #1
http://shootout.alioth.debian.org/u32q/program.php?test=fannkuchredux&lang=scala&id=1
is chock-full of while loops, and they're there specifically because nothing else is fast enough. There are some also in these programs:
http://shootout.alioth.debian.org/u32q/program.php?test=fasta&lang=scala&id=1
http://shootout.alioth.debian.org/u32q/program.php?test=binarytrees&lang=scala&id=4
http://shootout.alioth.debian.org/u32q/program.php?test=knucleotide&lang=scala&id=4
http://shootout.alioth.debian.org/u32q/program.php?test=nbody&lang=scala&id=1
http://shootout.alioth.debian.org/u32q/program.php?test=spectralnorm&lang=scala&id=1
Also, I have rather a lot of code that would like to have its while loops removed in favor of iterators or limits, but unfortunately it tends to be pretty hard to extract (and requires some datasets of nontrivial size to run on, and since I had to use while anyway I sometimes stuck other things inside the loops that make refactoring a little harder, etc.), so I don't think this meets your "neighborhood of zero effort" criterion.
(Does rewriting two or three while loops per benchmark count as ~zero effort?)
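(To be concrete, the rewrite in question is mechanical -- something like the following, on a made-up snippet rather than actual shootout code:)
def sumWhile(perm: Array[Int]): Int = {   // the shootout style
  var sum = 0
  var i = 0
  while (i < perm.length) { sum += perm(i); i += 1 }
  sum
}
def sumFor(perm: Array[Int]): Int = {     // what the patch should compile identically
  var sum = 0
  for (i <- 0 until perm.length) sum += perm(i)
  sum
}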
--Rex
On Sat, Sep 18, 2010 at 4:31 PM, Paul Phillips <paulp@improving.org> wrote:
> If a single person makes it down to this line, you are a good candidate
> for this question: can you suggest any scala projects which I can try
> this out on which will let me measure the impact with somewhere in the
> neighborhood of zero effort? Let me know, high attention span reader who
> may or may not exist.
Who says said person didn't just skip to the end? You should have stuck some comment like "mention wildebeests in your comment to verify that you actually _did_ read all this" in the middle (maybe as a fake comment in the bytecode...).
--Rex
P.S. -optimise seems to be full of wildebeests.
Sun, 2010-09-19, 07:27
#6
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 12:09:07AM -0400, Rex Kerr wrote:
> The easiest thing to try first might be to move from the bug report
> benchmark to a Computer Languages Benchmark Game benchmark. These
> _are_ benchmarks, but on the other hand they contain operations that
> are not _so_ unusual for people doing numerical work. For example,
> Scala's fannkuch-redux #1
>
> http://shootout.alioth.debian.org/u32q/program.php?test=fannkuchredux&la...
If translating that pile of letters and numbers is something you can do
in your sleep then I applaud your lowleveliness. It's too much effort
for me. I'm going to depend on enlightened self-interest to turn up
somewhere and lower the effort barrier to zero (or failing that,
conclude this wasn't of all that much interest after all.)
> Who says said person didn't just skip to the end? You should have
> stuck some comment like "mention wildebeests in your comment to verify
> that you actually _did_ read all this" in the middle (maybe as a fake
> comment in the bytecode...).
Who says the question at the end was the real question, and not the
decoy? Who says the real question isn't encoded in the bytecode,
possibly along with a treasure map and location of a blimp upon which
you will embark on your race around the world? End questions are the
oldest trick in the book, sucker.
Sun, 2010-09-19, 07:57
#7
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 2:24 AM, Paul Phillips <paulp@improving.org> wrote:
On Sun, Sep 19, 2010 at 12:09:07AM -0400, Rex Kerr wrote:
> http://shootout.alioth.debian.org/u32q/program.php?test=fannkuchredux&lang=scala&id=1
If translating that pile of letters and numbers is something you can do
in your sleep then I applaud your lowleveliness.
I've been forced to pay too much attention to too much low level stuff for far too long. So by now I might be able to figure out how to refactor it in my sleep, but sadly, I can't type while asleep (alas, hypotonia!). Once awake again, however, I can do it in my spare time pretty trivially. Tomorrow I'll pull together some versions useful for testing.
--Rex
P.S. I find it somewhat difficult to believe that someone could simultaneously gain sizable insight from several hundred lines of Java bytecode and yet find a while-loop to for-loop conversion a non-negligible low-level effort. But I am already somewhat familiar with the code and how to test it for correct behavior, so I'm happy to be the one to create the different versions.
Sun, 2010-09-19, 08:07
#8
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 02:56:53AM -0400, Rex Kerr wrote:
> P.S. I find it somewhat difficult to believe that someone could
> simultaneously gain sizable insight from several hundred lines of Java
> bytecode and yet find a while-loop to for-loop conversion a
> non-negligible low-level effort. But I am already somewhat familiar
> with the code and how to test it for correct behavior, so I'm happy to
> be the one to create the different versions.
Something I've learned from years of studying motivation is that
anything can exceed your grasp if you put your heart out of it.
Sun, 2010-09-19, 08:17
#9
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 3:02 AM, Paul Phillips <paulp@improving.org> wrote:
On Sun, Sep 19, 2010 at 02:56:53AM -0400, Rex Kerr wrote:
> P.S. I find it somewhat difficult to believe that someone could
> simultaneously gain sizable insight from several hundred lines of Java
> bytecode and yet find a while-loop to for-loop conversion a
> non-negligible low-level effort. But I am already somewhat familiar
> with the code and how to test it for correct behavior, so I'm happy to
> be the one to create the different versions.
Something I've learned from years of studying motivation is that
anything can exceed your grasp if you put your heart out of it.
True enough. Fortunately I am motivated. Code shall follow tomorrow.
--Rex
Sun, 2010-09-19, 13:07
#10
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 4:42 AM, Paul Phillips wrote:
> And you probably think that line means something other than it does: I refer you to answer 1.
Yes, the names are misleading. I was also confused before I checked
the actual benchmark.
Best,
Ismael
Sun, 2010-09-19, 13:17
#11
Re: optimizing simple fors
I see a different pattern of results:
http://gist.github.com/586702
It shows approximately the opposite of your results for 2.8.0 and trunk.
There actually aren't a lot of simple for loops in scalala, so it might
not be a great test bench.
-jason
On Sat, Sep 18, 2010 at 8:31 PM, Paul Phillips wrote:
> ** scala 2.8.0
> Iterators 7,253,378,000ns
> Limits 12,648,149,000ns
> Ranges 12,679,702,000ns
> While Loop 2,688,374,000ns
>
> A modest improvement in 2.8.0, but then:
>
> ** scala trunk, build r23026
> Iterators 3,536,780,000ns
> Limits 3,558,103,000ns
> Ranges 3,452,037,000ns
> While Loop 2,627,364,000ns
>
> Whoa, what happened since 2.8.0? That's easy enough to see:
>
> ** scala trunk, build r23026, -no-specialization
> Iterators 16,509,992,000ns
> Limits 23,061,769,000ns
> Ranges 23,069,826,000ns
> While Loop 2,634,965,000ns
>
> Of course 2.8.0 has specialization, but it looks like one or more bugs
> fixed since release have been corkers.
Sun, 2010-09-19, 13:27
#12
Re: optimizing simple fors
Scratch that, with -nocompdaemon I see the same pattern as Paul.
-jason
On Sun, Sep 19, 2010 at 12:06 PM, Jason Zaugg wrote:
> I see a different pattern of results:
>
> http://gist.github.com/586702
>
> It shows approximately the opposite of your results for 2.8.0 and trunk.
>
> There actually aren't a lot of simple for loops in scalala, it might
> not be a great test bench.
>
> -jason
>
> On Sat, Sep 18, 2010 at 8:31 PM, Paul Phillips wrote:
>> ** scala 2.8.0
>> Iterators 7,253,378,000ns
>> Limits 12,648,149,000ns
>> Ranges 12,679,702,000ns
>> While Loop 2,688,374,000ns
>>
>> A modest improvement in 2.8.0, but then:
>>
>> ** scala trunk, build r23026
>> Iterators 3,536,780,000ns
>> Limits 3,558,103,000ns
>> Ranges 3,452,037,000ns
>> While Loop 2,627,364,000ns
>>
>> Whoa, what happened since 2.8.0? That's easy enough to see:
>>
>> ** scala trunk, build r23026, -no-specialization
>> Iterators 16,509,992,000ns
>> Limits 23,061,769,000ns
>> Ranges 23,069,826,000ns
>> While Loop 2,634,965,000ns
>>
>> Of course 2.8.0 has specialization, but it looks like one or more bugs
>> fixed since release have been corkers.
>
Sun, 2010-09-19, 15:07
#13
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 1:25 PM, Jason Zaugg wrote:
> Scratch that, with -nocompdaemon I see the same pattern as Paul.
Phew. :)
Ismael
Sun, 2010-09-19, 18:27
#14
Re: optimizing simple fors
Hi Paul,
Here are a couple of pieces of code (attached) that might be useful for testing. Some of the others already were in good enough shape with 2.8 specialization that additional improvements weren't necessary, or they required or generated huge data files where I/O could change the results from machine to machine.
scalac fannkuch.scala
scala fannkuch // Shows about 4x improvement with while
scalac nbody.scala
scala nbody // Shows about 2x improvement with while
I didn't distinguish between "iterators" and "limits" as the previous code did; I called it limits, but it was a mix of the two, really. (The two cases didn't look much different to me anyway.)
--Rex
Sun, 2010-09-19, 20:27
#15
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 01:20:33PM -0400, Rex Kerr wrote:
> Here are a couple of pieces of code (attached) that might be useful
> for testing. Some of the others already were in good enough shape
> with 2.8 specialization so additional improvements weren't necessary,
> or required or generated huge data files where I/O could change the
> results from machine to machine.
Thanks a lot, especially for them being 100% runnable out of the box and
not 90% + random(8). If I had a few people who would do that sort of
thing for me I could produce a LOT more code: it's unfortunate for all
of us how much of my time is spent on tasks which could be done by
someone who has logged fewer than 5000 hours working on scala trunk.
The first one was already there, the second one sent me back to the
grind: not that I didn't know there were a bunch of holes in it, but as
I may have mentioned, I'm disinclined to work too hard on dead ends
anymore. Still, a test is motivation!
So post-second-hack we have:
% scala fannkuch
While Pfannkuchen(10) = 38. Elapsed: 0.557 s
Limit PFannkuchen(10) = 38. Elapsed: 0.428 s
Range PFannkuchen(10) = 38. Elapsed: 1.645 s
While Pfannkuchen(10) = 38. Elapsed: 0.420 s
Limit PFannkuchen(10) = 38. Elapsed: 0.413 s
Range PFannkuchen(10) = 38. Elapsed: 1.567 s
While Pfannkuchen(10) = 38. Elapsed: 0.458 s
Limit PFannkuchen(10) = 38. Elapsed: 0.426 s
Range PFannkuchen(10) = 38. Elapsed: 1.557 s
% scala nbody
While Energy: -0.169075164; change: 0.000048878. Elapsed: 0.610 s
Limit Energy: -0.169075164; change: 0.000048878. Elapsed: 0.522 s
Range Energy: -0.169075164; change: 0.000048878. Elapsed: 1.183 s
While Energy: -0.169075164; change: 0.000048878. Elapsed: 0.520 s
Limit Energy: -0.169075164; change: 0.000048878. Elapsed: 0.508 s
Range Energy: -0.169075164; change: 0.000048878. Elapsed: 1.086 s
While Energy: -0.169075164; change: 0.000048878. Elapsed: 0.514 s
Limit Energy: -0.169075164; change: 0.000048878. Elapsed: 0.504 s
Range Energy: -0.169075164; change: 0.000048878. Elapsed: 1.098 s
As is probably obvious I haven't bothered unrolling range instances yet.
But I did kick it up a notch with steps. Here is a test method, along
with before and after bytecode.
def hello = {
  val q = 1
  for (x <- q to 9 by 2; y <- q+1 to q+9) println(x+y)
}
// trunk
public void hello();
Code:
Stack=5, Locals=2, Args_size=1
0: iconst_1
1: istore_1
2: getstatic #12; //Field scala/Predef$.MODULE$:Lscala/Predef$;
5: iload_1
6: invokevirtual #16; //Method scala/Predef$.intWrapper:(I)Lscala/runtime/RichInt;
9: ldc #17; //int 9
11: invokevirtual #23; //Method scala/runtime/RichInt.to:(I)Lscala/collection/immutable/Range$Inclusive;
14: iconst_2
15: invokevirtual #29; //Method scala/collection/immutable/Range$Inclusive.by:(I)Lscala/collection/immutable/Range;
18: new #31; //class A$$anonfun$hello$1
21: dup
22: aload_0
23: iload_1
24: invokespecial #35; //Method A$$anonfun$hello$1."<init>":(LA;I)V
27: invokevirtual #41; //Method scala/collection/immutable/Range.foreach$mVc$sp:(Lscala/Function1;)V
30: return
// me
public void hello();
Code:
Stack=3, Locals=7, Args_size=1
0: iconst_1
1: istore_1
2: iload_1
3: istore_2
4: ldc #7; //int 9
6: istore_3
7: iconst_2
8: istore 4
10: iload_2
11: iload_3
12: if_icmpgt 63
15: iload_1
16: iconst_1
17: iadd
18: istore 5
20: iload_1
21: ldc #7; //int 9
23: iadd
24: istore 6
26: iload 5
28: iload 6
30: if_icmpgt 55
33: getstatic #13; //Field scala/Predef$.MODULE$:Lscala/Predef$;
36: iload_2
37: iload 5
39: iadd
40: invokestatic #19; //Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
43: invokevirtual #23; //Method scala/Predef$.println:(Ljava/lang/Object;)V
46: iload 5
48: iconst_1
49: iadd
50: istore 5
52: goto 26
55: iload_2
56: iload 4
58: iadd
59: istore_2
60: goto 10
63: return
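(Decompiled back to source, the patched bytecode is doing approximately this -- a sketch that mirrors the instructions above.)
def helloLowered(): Unit = {
  val q = 1
  var x = q                  // istore_2
  val end = 9                // istore_3
  val step = 2               // istore 4: the `by 2` survives as a plain iadd
  while (x <= end) {         // if_icmpgt 63
    var y = q + 1            // istore 5
    val end2 = q + 9         // istore 6
    while (y <= end2) {      // if_icmpgt 55
      println(x + y)         // still boxes via BoxesRunTime.boxToInteger
      y += 1
    }
    x += step
  }
}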
Sun, 2010-09-19, 20:37
#16
Re: optimizing simple fors
Iulian will defend his thesis on Tuesday so is unlikely to respond before
that. And we are on the verge of publishing the 2.8.1 RC. So I believe there
are few chances that Paul's patch will make it into 2.8.1 RC1. But for
afterwards it looks great. I have dreamed for about 5 years now that for
loops should be just as fast as while loops, just using normal optimizations
that are applicable everywhere. That's why I was always against
special-casing for loops over ranges which would have been easy to do. It
seems we have finally arrived. Great job!
And btw, I'm dying to see your patch, Paul!
Cheers
-- Martin
Sun, 2010-09-19, 21:17
#17
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 09:33:08PM +0200, martin odersky wrote:
> Iulian will defend his thesis on Tuesday so is unlikely to respond
> before that. And we are on the verge of publishing the 2.8.1 RC. So I
> believe there are few chances that Paul's patch will make it into
> 2.8.1 RC1. But for afterwards it looks great.
That's great. I never envisioned anything going into 2.8.1: "ever"
better approximates the limit of my ambition.
> I was always against special-casing for loops over ranges which would
> have been easy to do. It seems we have finally arrived. Great job!
>
> And btw, I'm dying to see your patch, Paul!
Ha, well, er, there are reasons johnny diff hasn't come out to play. We
have not exaaaactly avoided special casing for loops over ranges, and I
don't want to spook anyone with actual details of the current
implementation, which exists only as a "conversation starter". I'll get
us to an acceptable patch, but at this point I'm laying some blacktop
between here and trunk and that way the jeep will have enough fuel and
we won't get stranded in the foreboding Tributary of Git.
Mon, 2010-09-20, 00:07
#18
Re: optimizing simple fors
On Sun, Sep 19, 2010 at 3:33 PM, martin odersky <martin.odersky@epfl.ch> wrote:
> I have dreamed for about 5 years now that for loops should be just as fast
> as while loops, just using normal optimizations that are applicable
> everywhere.
Without aggressively grabbing and inlining bytecode, I'm not sure how one could attain that goal. The JVM has to pay the cost of analyzing every run, during runtime, so it is necessarily going to apply carefully targeted and somewhat limited optimizations. But the compiler only has to pay the cost of optimizing once, so in the more-difficult-to-analyze cases, the compiler should take the lead, relying on the JVM to tidy up the details.
Take a look at these benchmark times (after my signature) comparing loops that are rather cruel to the JVM--the comparison is between while loops, for loops over a Range ("Limit"), and the specialized utility loop function "cfor" (which emulates the C for loop). Code is attached. The bottom line is that while loops are fast, whether they are a single loop (While1) or nested multiple times (While3, While5) or are a mishmash of a bunch of different loops (While?). Relying upon the JVM to optimize generic specialized code is occasionally--well, just look at the timings, especially the "Limit1" and "Cfor1" times in the second block.
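(The attachment hasn't survived in the archive; a cfor in the spirit described -- a C-style loop as a specialized helper -- might look like the sketch below, which is my guess rather than the attached code. The names n, sum, and data in the usage line are hypothetical.)
object Loops {
  def cfor[@specialized(Int) A](init: A)(cond: A => Boolean, step: A => A)
                               (body: A => Unit): Unit = {
    var a = init
    while (cond(a)) { body(a); a = step(a) }
  }
}
// usage, e.g.: Loops.cfor(0)(_ < n, _ + 1) { i => sum += data(i) }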
That performance is a great improvement over unspecialized stuff for, say, web development or database or somesuch. For serious numerical work, it's pretty intolerable to get 10x slowdowns because of the particular execution path your code takes de-optimizing a critical loop.
For loops that reduce to simple while loops _should_ be as fast as while loops are, but given how common the common cases are, why not write custom code for them?
--Rex
P.S. Paul--y'might want to try this out on your for loop also.
P.P.S. Here are the timings:
-1243309312 While1 Elapsed: 0.305 s
-1243309312 While3 Elapsed: 0.319 s
-1243309312 While5 Elapsed: 0.438 s
-17609343 While? Elapsed: 1.163 s
-1243309312 Limit1 Elapsed: 0.318 s
-1243309312 Limit3 Elapsed: 0.512 s
-1243309312 Limit5 Elapsed: 1.279 s
-17609343 Limit? Elapsed: 2.603 s
-1243309312 Cfor1 Elapsed: 0.323 s
-1243309312 Cfor3 Elapsed: 0.612 s
-1243309312 Cfor5 Elapsed: 0.883 s
-17609343 Cfor?? Elapsed: 5.648 s
-1243309312 While1 Elapsed: 0.296 s
-1243309312 While3 Elapsed: 0.315 s
-1243309312 While5 Elapsed: 0.435 s
-17609343 While? Elapsed: 1.146 s
-1243309312 Limit1 Elapsed: 3.976 s
-1243309312 Limit3 Elapsed: 0.348 s
-1243309312 Limit5 Elapsed: 1.238 s
-17609343 Limit? Elapsed: 2.695 s
-1243309312 Cfor1 Elapsed: 9.792 s
-1243309312 Cfor3 Elapsed: 0.329 s
-1243309312 Cfor5 Elapsed: 0.855 s
-17609343 Cfor?? Elapsed: 5.608 s
-1243309312 While1 Elapsed: 0.296 s
-1243309312 While3 Elapsed: 0.302 s
-1243309312 While5 Elapsed: 0.314 s
-17609343 While? Elapsed: 1.142 s
-1243309312 Limit1 Elapsed: 3.966 s
-1243309312 Limit3 Elapsed: 0.339 s
-1243309312 Limit5 Elapsed: 1.216 s
-17609343 Limit? Elapsed: 2.689 s
-1243309312 Cfor1 Elapsed: 9.774 s
-1243309312 Cfor3 Elapsed: 0.329 s
-1243309312 Cfor5 Elapsed: 0.826 s
-17609343 Cfor?? Elapsed: 5.606 s
-1243309312 While1 Elapsed: 0.296 s
-1243309312 While3 Elapsed: 0.301 s
-1243309312 While5 Elapsed: 0.316 s
-17609343 While? Elapsed: 1.140 s
-1243309312 Limit1 Elapsed: 3.963 s
-1243309312 Limit3 Elapsed: 0.340 s
-1243309312 Limit5 Elapsed: 1.193 s
-17609343 Limit? Elapsed: 2.683 s
-1243309312 Cfor1 Elapsed: 9.783 s
-1243309312 Cfor3 Elapsed: 0.342 s
-1243309312 Cfor5 Elapsed: 0.837 s
-17609343 Cfor?? Elapsed: 5.628 s
Mon, 2010-09-20, 10:07
#19
Re: optimizing simple fors
On Mon, Sep 20, 2010 at 12:04 AM, Rex Kerr <ichoran@gmail.com> wrote:
> Without aggressively grabbing and inlining bytecode, I'm not sure how one
> could attain that goal. The JVM has to pay the cost of analyzing every run,
> during runtime, so it is necessarily going to apply carefully targeted and
> somewhat limited optimizations.
That is part of it. The other part is that the Scala compiler has a better understanding of some code than the JVM. Optimising Ranges into loops is much easier for Scala than it is for a JVM where a large set of optimisations have to come together simultaneously for matching performance to be achieved.
Also, if we want Scala to be used outside of server applications, we can't use optimizations that are not even enabled by default in HotSpot's -server JIT (i.e. escape analysis) as the baseline. HotSpot's -client JIT is much more conservative in its optimizations (although Tiered Compilation[1] would help here) and Android's Dalvik is even less sophisticated (it only received a JIT in the very latest version and GC is also slower).
Best,
Ismael
[1] http://weblogs.java.net/blog/forax/archive/2010/09/04/tiered-compilation
Thu, 2010-10-14, 11:17
#20
Re: optimizing simple fors
I finally got around to reading this thread. I think there are two main ideas that I want to follow. The first one (pointed out by the original benchmark) is why the trunk compiler is about 30% slower than 2.8.1. The second has to do with this benchmark.
On Mon, Sep 20, 2010 at 1:04 AM, Rex Kerr <ichoran@gmail.com> wrote:
> But the compiler only has to pay the cost of optimizing once, so in the
> more-difficult-to-analyze cases, the compiler should take the lead, relying
> on the JVM to tidy up the details.
That's the underlying assumption of the scalac optimizer. Inlining is in fact a necessary step, but performance is gained only when boxing and closures are eliminated, which rely on having a good side-effect analysis (which we don't have yet). Maybe the future effect type system will help us there.
I ran your tests, and after a few changes to your setup (especially taking IO out of the timing loop), they look more reasonable:
While1 Elapsed: 0.452 s
While3 Elapsed: 0.556 s
While5 Elapsed: 0.695 s
While? Elapsed: 1.580 s
Limit1 Elapsed: 1.066 s
Limit3 Elapsed: 1.359 s
Limit5 Elapsed: 2.406 s
Limit? Elapsed: 2.108 s
Cfor1 Elapsed: 0.527 s
Cfor2 Elapsed: 1.033 s
Cfor4 Elapsed: 1.533 s
Cfor?? Elapsed: 10.792 s
I also added a warmup method (just run the benchmark once before measuring). Now times are monotonically increasing, as one would expect from the source code. Also, Limit1 is 2-3 times slower, while cfor has an outlier for Cfor??. So there are no 10x slowdowns, which I believe were due to IO being measured together with the benchmarks.
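(In case anyone wants to reproduce this, the harness change amounts to something like the sketch below -- my reconstruction, not the actual diff.)
def time(label: String)(body: => Unit): Unit = {
  val t0 = System.nanoTime
  body
  val elapsed = (System.nanoTime - t0) / 1e9
  printf("%s Elapsed: %.3f s%n", label, elapsed)  // IO after measurement
}
def bench(label: String)(body: => Unit): Unit = {
  body               // warmup: run once untimed so the JIT has seen the code
  time(label)(body)  // measured run
}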
I am looking now into why the optimizer is not able to do a better job (there's a nasty exception thrown while parsing Predef, so it cannot perform some interesting inlining).
iulian
Thu, 2010-10-14, 14:47
#21
Re: optimizing simple fors
Hi Iulian,
I don't know if you have seen the thread by Olivier Chafik recently.
He wrote a Scala compiler plugin for optimizing for-loops and
operations on arrays, like foreach and map.
The plugin can be found here:
http://code.google.com/p/scalacl/wiki/ScalaCLPlugin
Maybe you can find some inspiration by looking at his code.
Personally, I think it would be nice if optimizations like these made
it into the compiler itself.
Regards,
Ruediger
2010/10/14 iulian dragos:
> I finally got around to reading this thread. I think there are two main
> ideas that I want to follow. The first one (pointed by the original
> benchmark), is why the trunk compiler is about 30% slower than 2.8.1. The
> second has to do with this benchmark.
>
>
> On Mon, Sep 20, 2010 at 1:04 AM, Rex Kerr wrote:
>>
>> On Sun, Sep 19, 2010 at 3:33 PM, martin
>> odersky wrote:
>>>
>>> I have dreamed for about 5 years now that for loops should be just as
>>> fast as while loops, just using normal optimizations that are applicable
>>> everywhere.
>>
>> Without aggressively grabbing and inlining bytecode, I'm not sure how one
>> could attain that goal. The JVM has to pay the cost of analyzing every run,
>> during runtime, so it is necessarily going to apply carefully targeted and
>> somewhat limited optimizations. But the compiler only has to pay the cost
>> of optimizing once, so in the more-difficult-to-analyze cases, the compiler
>> should take the lead, relying on the JVM to tidy up the details.
>
> That's the underlying assumption of the scalac optimizer. Inlining is in
> fact a necessary step, but performance is gained only when boxing and
> closures are eliminated, which rely on having a good side-effect analysis
> (which we don't have yet). Maybe the future effect type system will help us
> there.
>
>>
>> Take a look at these benchmark times (after my signature) comparing loops
>> that are rather cruel to the JVM--the comparison is between while loops with
>> for loops using range ("Limit") and the specialized utility loop function
>> "cfor" (which emulates the C for loop). Code is attached. The bottom line
>> is that while loops are fast, whether they are a single loop (While1) or
>> nested multiple times (While3, While5) or are a mishmash of a bunch of
>> different loops (While?). Relying upon the JVM to optimize generic
>> specialized code is occasionally--well, just look at the timings, especially
>> the "Limit1" and "Cfor1" times in the second block.
>
> I ran your tests, and after a few changes to your setup (especially taking
> IO out of the timing loop), they look more reasonable:
> While1 Elapsed: 0.452 s
> While3 Elapsed: 0.556 s
> While5 Elapsed: 0.695 s
> While? Elapsed: 1.580 s
> Limit1 Elapsed: 1.066 s
> Limit3 Elapsed: 1.359 s
> Limit5 Elapsed: 2.406 s
> Limit? Elapsed: 2.108 s
> Cfor1 Elapsed: 0.527 s
> Cfor2 Elapsed: 1.033 s
> Cfor4 Elapsed: 1.533 s
> Cfor?? Elapsed: 10.792 s
> I also added a warmup method (just run the benchmark once before measuring).
> Now times are monotonically increasing, as one would expect from the source
> code. Also, Limit1 is 2-3 times slower, while cfor has an outlier, for
> cfor??. So there are no 10x slowdowns, which I believe were due to IO being
> measured together with the benchmarks.
> I am looking now into why the optimizer is not able to do a better job
> (there's a nasty exception thrown while parsing Predef, so it cannot perform
> some interesting inlining).
> iulian
On Sat, Sep 18, 2010 at 9:31 PM, Paul Phillips wrote:
> Now we get to me, and we do a little hack here and a little hack there
> and try it again:
>
> ** scala trunk, build r23026 + paulp patch
> Iterators 2,705,412,000ns
> Limits 2,761,504,000ns
> Ranges 3,461,005,000ns
> While Loop 2,696,050,000ns
>
> Not fully appreciable without also seeing this:
>
> ** scala trunk, build r23026 + paulp patch, -no-specialization
> Iterators 2,770,781,000ns
> Limits 2,794,777,000ns
> Ranges 16,453,614,000ns
> While Loop 2,683,085,000ns
Wow, this is really cool. :) I hope it gets merged to trunk.
> If a single person makes it down to this line, you are a good candidate
> for this question: can you suggest any scala projects which I can try
> this out on which will let me measure the impact with somewhere in the
> neighborhood of zero effort? Let me know, high attention span reader who
> may or may not exist.
Scalala should be a good candidate. Jason has used it for some of his
projects, so he may have something more specific to say. One issue is that
people who need this kind of performance improvement tend to manually change
the code to while loops in the critical path, reducing the measured
improvement.
Best,
Ismael