scala performance for numerical algorithms
Sat, 2010-01-23, 19:54
Hi,
I have been holding off on using Scala due to performance issues with "simple" for comprehensions. For the numerical component of what I do, tight loops are very important and they need to be as close to the metal as possible (i.e. imagine implementing a matrix operation or a convolution in Scala).
I apologize in advance for reopening a topic (in fact one that I had posted on 10 months ago). Here is the request on trac (now 16 months old):
https://lampsvn.epfl.ch/trac/scala/ticket/1338
In the trac ticket comments, one of the Scala team says: "Have you tried '-optimize'? It can help a lot. It's very unlikely this will move from library to compiler-generated loops".
I am concerned with the last statement.
Now, I recognize that an optimisation of this sort is "inelegant" in that it does not present a generalization; it only reduces the simple cases of a more general form -- but this is par for the course in compilers. So perhaps the objection is to the inelegance of special-casing an optimization? I am quite doubtful that -optimize will ever get to the while() level of performance without actually doing the remapping:
for (i <- 0 until N) { <do something> }
TO:
var i = 0
while (i < N) {
  <do something>
  i += 1
}
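For reference, the current desugaring of the first form is roughly the following (a minimal sketch, with a summing body standing in for <do something>): the body becomes a closure handed to Range.foreach, and each index passes through the generic Function1.apply, boxing along the way.
var sum = 0
// what "for (i <- 0 until N) sum += i" becomes today:
(0 until N).foreach(i => sum += i)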
I cannot stress enough how important tight loops are for numerical work. Please tell me that -optimize is not the last word on performance here. The special-case solution for this class of for comprehensions does not appear to be difficult to implement for someone very familiar with the parser AST. I and a growing list of people would be thrilled to see this fixed.
best regards
Jonathan Shore -- http://tr8dr.wordpress.com/
Sat, 2010-01-23, 20:27
#2
Re: scala performance for numerical algorithms
Hi Jonathan,
On Sat, 2010-01-23 at 13:54 -0500, Jonathan Shore wrote:
> I am quite doubtful that the -optimize will ever get to the while()
> level of performance without actually doing the remapping:
Even though I would also like to see this optimisation, this is not the
best way to convince others to implement it. :)
Create a micro-benchmark suited to the JVM that compares a for
comprehension with -optimise (and possibly specialized) and a while loop
and show that the performance difference is still large. That is more
likely to have the desired effect.
Best,
Ismael
Sat, 2010-01-23, 20:57
#3
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 2:00 PM, Ismael Juma <mlists@juma.me.uk> wrote:
I'd already done something much like this when I was working on a fast vector library. For example, comparing:
- while loop with primitives
- custom IntRange class with dedicated foreach
- regular Range (extends Projection[Int])
- for loop
- reduceLeft
gives the following answers
$ scalac -optimise DoNotUseForOrReduce.scala
$ scala DoNotUseForOrReduce
Time (seconds): while=0.56 custom=4.81 foreach=4.67 for=6.13 reduce=13.82
Time (seconds): while=0.56 custom=4.72 foreach=4.67 for=6.13 reduce=13.97
Time (seconds): while=0.56 custom=4.66 foreach=4.73 for=6.15 reduce=14.14
Time (seconds): while=0.56 custom=4.65 foreach=4.71 for=6.13 reduce=14.11
Time (seconds): while=0.56 custom=4.69 foreach=4.61 for=6.14 reduce=14.16
(etc.)
which shows that a bare while is 8-25x faster than anything else. That was for 2.7.3; 2.8.0.r20326 is a bit better:
$ bin/scala DoNotUseForOrReduce
Time (seconds): while=0.56 custom=4.74 foreach=4.72 for=5.32 reduce=6.02
Time (seconds): while=0.56 custom=4.94 foreach=4.94 for=5.30 reduce=5.98
Time (seconds): while=0.56 custom=4.72 foreach=4.71 for=5.33 reduce=6.15
Time (seconds): while=0.56 custom=4.75 foreach=4.94 for=5.37 reduce=5.96
(etc.)
Now it's "only" 8-10x slower. I don't see how to usefully -specialize this.
Code is below.
--Rex
object DoNotUseForOrReduce {
val N = 100000
def t = System.currentTimeMillis
case class IntRange(i0: Int, i1: Int) {
def foreach(f: Int=>Unit) = {
var i = i0
while (i<=i1) {
f(i)
i += 1
}
}
}
def facA(m: Int) = {
var i = 1
var fac = 1
while (i<=m) {
fac *= i
i += 1
}
fac
}
def facB(m: Int) = {
var fac = 1
IntRange(1,m).foreach( fac *= _ )
fac
}
def facC(m: Int) = {
var fac = 1
(1 to m).foreach( fac *= _ )
fac
}
def facD(m: Int) = {
var fac = 1
for (i <- 1 to m) { fac *= i; }
fac
}
def facE(m: Int) = (1 to m).reduceLeft(_*_)
def run() {
val t0 = t
var fA,fB,fC,fD,fE = 0
for (i <- 1 to 100) fA += facA(N*i)
val t1 = t
for (i <- 1 to 100) fB += facB(N*i)
val t2 = t
for (i <- 1 to 100) fC += facC(N*i)
val t3 = t
for (i <- 1 to 100) fD += facD(N*i)
val t4 = t
for (i <- 1 to 100) fE += facE(N*i)
val t5 = t
printf("Time (seconds): while=%.2f custom=%.2f foreach=%.2f for=%.2f reduce=%.2f\n",
1e-3*(t1-t0),1e-3*(t2-t1),1e-3*(t3-t2),1e-3*(t4-t3),1e-3*(t5-t4));
}
def main(args:Array[String]) {
for (j <- 1 to 10) run()
}
}
Sat, 2010-01-23, 21:07
#4
Re: scala performance for numerical algorithms
Ismael,
I had already posted a benchmark 10 months ago on this. Running now with -optimize on the latest 2.8 nightly shows no improvement:
[*] timing of 50 iterations of "[for] array calc: 1000000" total: 7892 ms, cycle: 157858740 ns or 157.84 ms, memory: 18954K
[*] timing of 50 iterations of "[while] array calc: 1000000" total: 1210 ms, cycle: 24206860 ns or 24.2 ms, memory: 0K
I see an 8x difference between these implementations. It appears not to be a big deal to fix this in the AST-processing part of the compiler; I just am not familiar enough with the AST to know how to match the situations that can be remapped or how to express the AST for the replacement. Anyone dealing with the compiler on the Scala team could implement this within a couple of days. I am familiar enough with compiler design and parsing to know that this is not a huge deal.
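Jonathan's benchmark source is not included in the thread; a minimal sketch of the comparison his output describes, with the array size and iteration count taken from that output and everything else assumed, might look like:
object ForVsWhile {
  val N = 1000000
  val array = new Array[Double](N)
  // the "[for]" variant: a for comprehension over a Range
  def withFor() {
    for (i <- 1 until array.length)
      array(i) = i / 1000.0
  }
  // the "[while]" variant: hand-written loop, no closure, no Range
  def withWhile() {
    var i = 1
    while (i < array.length) {
      array(i) = i / 1000.0
      i += 1
    }
  }
  def time(label: String)(body: => Unit) {
    val t0 = System.nanoTime
    var k = 0
    while (k < 50) { body; k += 1 }
    printf("[%s] array calc: %d, total: %d ms\n",
           label, N, (System.nanoTime - t0) / 1000000)
  }
  def main(args: Array[String]) {
    time("for") { withFor() }
    time("while") { withWhile() }
  }
}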
I'm sorry to sound impatient, but it *has* been 16 months. I would like to hear something from the Scala team as to whether they plan to do this or not, and when. Otherwise I will, frankly, have to find another solution.
If Scala is to appeal to functional programming people in the scientific arena and/or as the "next java", fundamental performance *does* need to be addressed.
Thanks and sorry if this comes out the wrong way.
regards
Jonathan
Sat, 2010-01-23, 21:17
#5
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 2:50 PM, Rex Kerr <ichoran@gmail.com> wrote:
> Now it's "only" 8-10x slower. I don't see how to usefully -specialize this.
Maybe I should think for a few more seconds before saying something like that.
$ bin/scala DoNotUseForReduce
Time (seconds): while=0.56 custom=0.58 foreach=4.65 for=5.30 reduce=5.91
Time (seconds): while=0.56 custom=0.57 foreach=4.63 for=5.36 reduce=5.92
Time (seconds): while=0.56 custom=0.57 foreach=4.64 for=5.25 reduce=6.11
Time (seconds): while=0.56 custom=0.57 foreach=4.81 for=5.31 reduce=5.91
(etc.)
Code for custom (which could be wrapped into Range's foreach with @specialize in the library):
abstract class IntUnitFunction {
def apply(i: Int):Unit
}
case class IntRange(i0: Int, i1: Int) {
def foreach(f: IntUnitFunction) = {
var i = i0
while (i<=i1) {
f(i)
i += 1
}
}
}
def facB(m: Int) = {
var fac = 1
IntRange(1,m).foreach( new IntUnitFunction{ def apply(i: Int) { fac*=i } } )
fac
}
--Rex
Sat, 2010-01-23, 21:27
#6
Re: scala performance for numerical algorithms
Hmm, I see what you are doing and it seems like a good workaround when the work unit is a function or is large enough that function overhead does not matter. However there are many algorithms that require tight nested loops over a very primitive operation (say on arrays). At the moment the only performant way is to use while loops (I think?)
Jonathan
Sat, 2010-01-23, 21:47
#7
Re: scala performance for numerical algorithms
All I'm doing is integer multiplication! An array access takes more time than that. I don't have time right now to find my old posts on the topic, but I've previously benchmarked code where
val x: Array[Float]
val y: Array[Float]
while (...)
is actually _slower_ by 20-30% (IIRC) than
val xy: Array[Float]
val b = BVecF(xy)
b.foreach(...)
where BVecF wraps an array of (x0,y0,x1,y1,x2,y2...). The win on the single array access vs. two is bigger than the overhead of a (admittedly carefully optimized) foreach call.
--Rex
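BVecF itself isn't shown in the thread; a rough sketch of the idea, with all names hypothetical, using the same manual-specialization trick as IntUnitFunction above so the callback takes unboxed floats:
// hypothetical: a non-generic function type, so no boxing on apply
abstract class FloatPairFunction {
  def apply(x: Float, y: Float): Unit
}
// hypothetical sketch: pairs stored interleaved as (x0,y0,x1,y1,...)
// in a single array, so each step touches one backing array
final class BVecF(val xy: Array[Float]) {
  final def foreach(f: FloatPairFunction) {
    var i = 0
    while (i + 1 < xy.length) {
      f(xy(i), xy(i + 1))
      i += 2
    }
  }
}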
Sat, 2010-01-23, 21:57
#8
Re: scala performance for numerical algorithms
Hi Jonathan,
On Sat, 2010-01-23 at 15:03 -0500, Jonathan Shore wrote:
> I had already posted a benchmark 10 months ago on this.
I know you did. But the comment on Trac was only a few weeks ago. It
only makes sense to say that performance is still bad after trying it
with a build more recent than that. You have done that below, good.
> I'm sorry to sound impatient, but it *has* been 16 months.
You make it sound like the Scala team has done nothing since then. ;) A
lot of improvements have been made in the last 16 months and many of
them related to performance (with more general solutions). I repeat that
I would also like to see the simple transformation you're asking for, and I
explain why here (search for ticket 1338):
http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-pe...
Also, I benchmarked something similar to the trick Rex mentions in
another thread in 2008:
Sat, 2010-01-23, 22:07
#9
Re: scala performance for numerical algorithms
(sorry, I sent the previous email prematurely)
Hi Jonathan,
On Sat, 2010-01-23 at 15:03 -0500, Jonathan Shore wrote:
> I had already posted a benchmark 10 months ago on this.
I know you did. But the comment on Trac was only a few weeks ago. It
only makes sense to say that performance is still bad after trying it
with a build more recent than that. You have done that below, good.
> I'm sorry to sound impatient, but it *has* been 16 months.
You make it sound like the Scala team has done nothing since then. ;) A
lot of improvements have been made in the last 16 months and many of
them related to performance (with more general solutions). And the
specialized stuff is not finished yet as I understand.
I repeat that I would also like to see the simple transformation you're
asking for, and I explain why here (search for ticket 1338):
http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-pe...
Also, I benchmarked something similar to the trick Rex mentions in
another thread in 2008:
http://blog.juma.me.uk/2008/09/15/efficient-scala-with-primitive-collect...
Still, that doesn't always work because of the inlining issue mentioned
in the more recent blog entry.
> If Scala is to appeal to functional programming people in the
> scientific arena and/or as the "next java", fundamental performance
> *does* need to be addressed.
You can write Scala that is as fast as Java. It's just as ugly... or
even a bit uglier. ;)
Best,
Ismael
Sat, 2010-01-23, 23:07
#10
Re: scala performance for numerical algorithms
On Jan 23, 2010, at 4:01 PM, Ismael Juma wrote:
> I know you did. But the comment on Trac was only a few weeks ago. It
> only makes sense to say that performance is still bad after trying it
> with a build more recent than that. You have done that below, good.
Yes. I decided to post again in reference to that comment. I ran the same tests again with -optimize and did not see much if any improvement.
> > I'm sorry to sound impatient, but it *has* been 16 months.
> You make it sound like the Scala team has done nothing since then. ;) A
> lot of improvements have been made in the last 16 months and many of
> them related to performance (with more general solutions). And the
> specialized stuff is not finished yet as I understand.
Sorry if this came out wrong. And I do realize that there are a lot of important improvements and new features in progress.
That said, the comment in the trac entry was rather dismissive of the problem *and* went further to state that it was unlikely ever to be addressed. I think this is "wrong" (or at a minimum is a major turnoff for people like me) and should be highlighted here in the user community.
As for the specialized stuff you mention, sounds interesting.
There are some lessons to be learned from the experience the numerical community had with Java. There was quite a bit of interest early on in the numerical community around using Java. After a couple of years it "petered out" when it became apparent that the JVM and Java were not ready. The main concerns were:
- performance
- performance (did I say that) ;)
- lack of operator abstraction
In the years since, the JVM has approached the speed of C++ / C in many respects, even with array access. I am effectively able to use Java and get pretty much the same performance my C++ colleagues get, at lower "development cost".
Scala not only fixes one of the major inconveniences (lack of operator overloading) but is functional as well. This should be a fantastic language for numerical people. In that regard it should not have performance surprises around fundamental operations. These need to be addressed early on, especially in the early stages of momentum, or they are likely to turn away adopters much as Java did in the beginning for a certain class of programmer. I.e. I want to see Scala succeed in my community, and now they are starting to look at it.
I realize it will take some time to get "there". But the dismissive comment in the trac seemed to imply that these considerations are not even on the radar.
> > If Scala is to appeal to functional programming people in the
> > scientific arena and/or as the "next java", fundamental performance
> > *does* need to be addressed.
> You can write Scala that is as fast as Java. It's just as ugly... or
> even a bit uglier. ;)
For better or worse, languages are chosen based on ergonomics, taste, performance, etc. Scala is not far from doing most things well. Why not fix something this fundamental early on?
Sat, 2010-01-23, 23:27
#11
Re: scala performance for numerical algorithms
You might find this discussion interesting:
http://stackoverflow.com/questions/1257021/suitable-functional-language-...
Sun, 2010-01-24, 00:27
#12
Re: scala performance for numerical algorithms
On Jan 23, 2010, at 4:01 PM, Ismael Juma wrote:
I repeat that I would also like to see the simple transformation you're
asking for, and I explain why here (search for ticket 1338):
mgarcia in the trac thread already points out where the transform can be done, i.e.:
private def makeFor(mapName: Name, flatMapName: Name, enums: List[Enumerator], body: Tree): Tree
The relevant file is: src/compiler/scala/tools/nsc/ast/parser/TreeBuilder.scala
I am not clear on mapName vs flatMapName, but the enumerator list in the case of the loop:
for (v <- 0 until N) <body>
would be the Range object Range(0, N, 1) and "body" our <body>. The transform would be as follows, under the following special-case conditions:
1. there is only one enumerator
2. the enumerator is a Range or something equivalent (I would probably call this Sequence)
Loosely speaking the above for comprehension without optimization could be transformed into (I'm guessing that it ends up being worse, with a call to a closure/function representing the body on each iteration):
val seq = new Range(0, N, 1)
var i = 0
while (i < seq.length) {
  val v = seq(i)
  <body>
  i += 1
}
If the above two constraints are met, it can instead map to:
val seq = new Range(0, N, 1)
var v = seq.start
while (v < seq.end) {
  <body>
  v += seq.step
}
Incidentally, I've found that the performance difference ranges between 2x to 8x depending on the size of inner loops, etc. I'm testing on a MacPro quad core with latest release of Java 6, -server mode and Scala 2.8.0 nightly (as of today) with -optimize.
I just don't know the AST well enough to know how to do it without some weeks of effort to understand ...
Jonathan
Sun, 2010-01-24, 00:37
#13
Re: scala performance for numerical algorithms
On Saturday January 23 2010, Jonathan Shore wrote:
> ...
>
> I just don't know the AST well enough to know how to do it without
> some weeks of effort to understand ...
But probably less than 40+ or 64+ weeks? (10 and 16 months, resp.)
> Jonathan
Randall Schulz
Sun, 2010-01-24, 01:07
#14
Re: scala performance for numerical algorithms
On Jan 23, 2010, at 4:01 PM, Ismael Juma wrote:
Ismael, my hat is off to you. You've done some very detailed optimisation and performance analysis. I think what you are testing, though, is a bit different (i.e. application of a function/body across a collection via foreach), but it suffers from some of the same problems, namely:
http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-performance/
- lack of inlining of function
- cost of accessing the iterator
While it is possible that the function representing the loop body can be inlined, I find it less likely that the JVM will optimise the cost of accessing the iterator away. For instance, my test has the following loops:
for (ii <- 0 until iterations) {
  for (i <- 1 until array.length)
    array(i) = i / 1000.0
  for (i <- 8 until array.length) {
    for (j <- 1 until 8)
      array(i) += array(i-j) * array(i)
  }
}
and the equivalent indexed-while rewrite of the outer loop:
val seqA = new Range(0, iterations, 1)
var iA = 0
while (iA < seqA.length) {
  val A = seqA(iA)
  ...
  iA += 1
}
The performance of the version below was nearly the same as the one above, showing that the cost of accessing the iterators dominated the loop time. The same loop without iterators was 8x faster for large array and iterations.
So at least for trivial loop bodies, I don't think that inlining is the biggest problem.
With a special case mapping to a non-iterator, non-closure based approach, the performance is the same as Java (or C++ for that matter).
Jonathan
Sun, 2010-01-24, 01:37
#15
Re: scala performance for numerical algorithms
On Jan 23, 2010, at 6:33 PM, Randall R Schulz wrote:
> On Saturday January 23 2010, Jonathan Shore wrote:
>> ...
>>
>> I just don't know the AST well enough to know how to do it without
>> some weeks of effort to understand ...
>
> But probably less than 40+ or 64+ weeks? (10 and 16 months, resp.)
>
Wish I could but am way overcommitted as part of a nascent startup.
Sun, 2010-01-24, 01:47
#16
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 06:23:40PM -0500, Jonathan Shore wrote:
> I just don't know the AST well enough to know how to do it without
> some weeks of effort to understand ...
I can do it. First you have to convince martin, then I'll do it(*).
(*) Always possible it's harder than I think for whatever reason, in
which case I reserve the right to not do it.
Sun, 2010-01-24, 04:07
#17
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 6:57 PM, Jonathan Shore <jonathan.shore@gmail.com> wrote:
I think what you are testing, though, is a bit different (i.e. application of a function/body across a collection via foreach) but suffers from some of the same problems, namely:
- lack of inlining of function
The JVM does this pretty well (and anyway it's pretty cheap--a function call is typically a couple of cycles).
- cost of accessing the iterator
The JVM does this pretty well too, and it's pretty cheap too. (A pointer dereference is also usually only a couple of cycles.)
The really expensive thing is object creation. (My benchmarks have in the past indicated that it takes at least 12 cycles.) And you have to do it in both of your tests because Range (which is not currently @specialized) extends Product[Int], which eventually extends Function1, which has a generic abstract apply. That means that Range has to autobox its apply, which means you get object creation every time you access the iterator.
This is a problem that is solved by
(1) creating mutable iterators that wrap primitives, or
(2) @specializing Function1 so that Range doesn't need to autobox.
For an example of (1), try something like
case class Indexer(i0: Int, i1: Int) extends Iterator[Indexer] {
  var i = i0 - 1               // start one below so the first next() yields i0
  def hasNext = i + 1 < i1
  def next = { i += 1 ; this }
  def apply() = i              // read the current index
}
and redo your for loop with j <- Indexer(1,8) and such.
I bet it gets at least 4x faster (i.e. 2x or less slower than the raw while loops).
(I would try it myself, but I'm away from the appropriate machine at the moment.)
--Rex
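Spelled out (as Rex's benchmark later in the thread does), Jonathan's inner loops rewritten with Indexer read the current index through i():
for (i <- Indexer(8, array.length)) {
  for (j <- Indexer(1, 8)) {
    array(i()) += array(i() - j()) * array(i())
  }
}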
Sun, 2010-01-24, 06:57
#18
Re: scala performance for numerical algorithms
Hi Paul,
On Sat, 2010-01-23 at 16:38 -0800, Paul Phillips wrote:
> On Sat, Jan 23, 2010 at 06:23:40PM -0500, Jonathan Shore wrote:
> > I just don't know the AST well enough to know how to do it without
> > some weeks of effort to understand ...
>
> I can do it.
I'd be very happy if I could remove a bunch of while loops from my
code. ;)
> First you have to convince martin, then I'll do it(*).
Has Martin mentioned what his objections are? I mean, I understand the
idea that EPFL has given priority to other things, but if someone is
willing to do it (you!), then I don't see any disadvantage. Especially
for arrays.
Best,
Ismael
Sun, 2010-01-24, 07:07
#19
Re: scala performance for numerical algorithms
On Sat, 2010-01-23 at 18:57 -0500, Jonathan Shore wrote:
> - lack of inlining of function
> - cost of accessing the iterator
>
> While it is possible that the function representing the loop body can
> be inlined, I find it less likely that the JVM will optimise the cost
> of accessing the iterator away.
I am not sure what you're saying here. You realise that a for
comprehension without a yield actually invokes foreach (as in my example
above) and an iterator doesn't have to be involved at all?
Best,
Ismael
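For reference, the rewrite Ismael describes happens in the parser (the makeFor method mentioned earlier in this thread), before any iterator is in sight:
for (i <- xs) f(i)          // rewritten to: xs.foreach(i => f(i))
for (i <- xs) yield g(i)    // rewritten to: xs.map(i => g(i))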
Sun, 2010-01-24, 07:17
#20
Re: scala performance for numerical algorithms
Hi Rex,
On Sat, 2010-01-23 at 21:57 -0500, Rex Kerr wrote:
> On Sat, Jan 23, 2010 at 6:57 PM, Jonathan Shore
> wrote:
> I think what you are testing, though, is a bit different (i.e.
> application of a function/body across a collection via
> foreach) but suffers from some of the same problems, namely:
>
>
> - lack of inlining of function
>
> The JVM does this pretty well (and anyway it's pretty cheap--a
> function call is typically a couple of cycles).
>
> - cost of accessing the iterator
>
> The JVM does this pretty well too, and it's pretty cheap too. (A
> pointer dereference is also usually only a couple of cycles.)
>
> The really expensive thing is object creation.
There is a lot of truth here, but if you have a foreach that is called
with several different anonymous functions and the foreach itself is not
inlined (both of which can happen in real-life code, as shown in my blog
entry and easily missed by simple micro-benchmarks), then you end up
with a megamorphic method call for the anonymous function which causes
performance to go to the toilet. That is the biggest issue for which
there is no solution at the moment when it comes to the standard
library. For your own code, you could make your foreach final and add
@inline to it (you'd lose the ability to override it, of course).
John Rose comments briefly on the issue and how HotSpot could solve it:
http://groups.google.com/group/jvm-languages/msg/0c7edb6a45794d97
Best,
Ismael
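A minimal sketch of that workaround for one's own code (names assumed; whether -optimise actually inlines it in a given build would need verifying):
final class MyRange(start: Int, end: Int) {
  // final + @inline so scalac's optimiser can inline foreach at each
  // call site; each call site then sees only its own closure, keeping
  // the f.apply call monomorphic
  @inline final def foreach(f: Int => Unit) {
    var i = start
    while (i < end) {
      f(i)
      i += 1
    }
  }
}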
Sun, 2010-01-24, 13:17
#21
Re: scala performance for numerical algorithms
Couldn't you cope with many of these issues by consistently inlining
stuff? I know it isn't working right now, because you can't inline
from external jars but if you could, you could make at least all the
higher-order methods from ArrayOps inlinable which should improve the
situation in many cases.
Sun, 2010-01-24, 14:07
#22
Re: scala performance for numerical algorithms
Hi Johannes,
On Sun, 2010-01-24 at 13:11 +0100, Johannes Rudolph wrote:
> Couldn't you cope with much of these issues by consequently inlining
> stuff?
You mean inlining done by scalac, right? That only works for cases where
overriding isn't possible (which I hinted at in the previous message) or
if scalac had a whole program optimisation mode. While talking about the
general case, I neglected arrays and you make a good point below.
> I know it isn't working right now, because you can't inline
> from external jars
I wasn't aware of this. Do you have a reference? In the link below,
Iulian doesn't mention that as a problem.
> but if you could, you could make at least all the
> higher-order methods from ArrayOps inlinable which should improve the
> situation in many cases.
The array case is special as allowing overriding is not needed (or even
possible with the current approach as far as I can see). So, it should
be possible to allow inlining and it should help quite a bit.
The last time this came up, Iulian mentioned that the optimiser had
issues identifying the ArrayOps leaf subclasses as practically final
(they are non-final classes defined inside an object) and he said that
he intended to fix that:
http://article.gmane.org/gmane.comp.lang.scala.internals/1830
Not sure if that has been committed since. I do wonder why they're not
simply made final though.
Still, that is just the array case, it would be nice to have something
for the rest of the collections.
Best,
Ismael
Sun, 2010-01-24, 14:07
#23
Re: scala performance for numerical algorithms
On Sun, Jan 24, 2010 at 1:47 PM, Ismael Juma wrote:
> Hi Johannes,
>
> On Sun, 2010-01-24 at 13:11 +0100, Johannes Rudolph wrote:
>> Couldn't you cope with much of these issues by consequently inlining
>> stuff?
>
> You mean inlining done by scalac, right? That only works for cases where
> overriding isn't possible (which I hinted at in the previous message) or
> if scalac had a whole program optimisation mode. While talking about the
> general case, I neglected arrays and you make a good point below.
>
>> I know it isn't working right now, because you can't inline
>> from external jars
>
> I wasn't aware of this. Do you have a reference? In the link below,
> Iulian doesn't mention that as a problem.
I thought someone once said that, but I looked into the code and it
looks like it should work.
>> but if you could, you could make at least all the
>> higher-order methods from ArrayOps inlinable which should improve the
>> situation in many cases.
>
> The array case is special as allowing overriding is not needed (or even
> possible with the current approach as far as I can see). So, it should
> be possible to allow inlining and it should help quite a bit.
>
> The last time this came up, Iulian mentioned that the optimiser had
> issues identifying the ArrayOps leaf subclasses as practically final
> (they are non-final classes defined inside an object) and he said that
> he intended fix that:
>
> http://article.gmane.org/gmane.comp.lang.scala.internals/1830
>
> Not sure if that has been committed since. I do wonder why they're not
> simply made final though.
Yes, that's what I thought, too. And thanks for the pointer to that discussion.
> Still, that is just the array case, it would be nice to have something
> for the rest of the collections.
Sure, but people needing fast loops would in many cases use arrays
anyway and it would be nice if you could tell them that using the
array higher-order functions will optimize just fine.
Sun, 2010-01-24, 14:37
#24
Re: scala performance for numerical algorithms
On Sun, 2010-01-24 at 13:57 +0100, Johannes Rudolph wrote:
> Sure, but people needing fast loops would in many cases use arrays
> anyway and it would be nice if you could tell them, that using the
> array higher-order functions will optimize just fine.
Definitely.
Ismael
Sun, 2010-01-24, 15:17
#25
Re: scala performance for numerical algorithms
You're right. I'm just showing my ignorance with Scala. I've been on the sidelines seeing how it progresses perf-wise before jumping in.
Sun, 2010-01-24, 15:47
#26
Re: scala performance for numerical algorithms
On Jan 24, 2010, at 1:09 AM, Ismael Juma wrote:
Hmm, if I understand this correctly, I would introduce my own class or specialization of Range and annotate the class or the foreach method with @inline? If I'm having performance issues with Range in the context of the series of "simple" loops:
for (ii <- 0 until iterations) {
  for (i <- 1 until array.length)
    array(i) = i / 1000.0
  for (i <- 8 until array.length) {
    for (j <- 1 until 8)
      array(i) += array(i-j) * array(i)
  }
}
Looking at the Range code:
override def foreach(f: Int => Unit) {
  if (step > 0) {
    var i = this.start
    val until = if (inInterval(end)) end + 1 else end
    while (i < until) {
      f(i)
      i += step
    }
  } else {
    ...
  }
}
it would seem that my performance issue is coming from one or both of the following:
- the cost of creating the Range object in inner loops
- the cost of calling f(i) when the JVM misses the opportunity to inline
I'll have to read your blog entry again. It is not clear to me in what situations we can expect inlining and in which situations not.
In summary though, unless we can find a way to specialize Range and make the overhead "go away", I still think we need to have a mapping to while for the simple cases, not relying on the JVM to be extra clever. The alternative would be to add the old Java / C++ for loop into the language and allow for more concise specification of loops (unnecessary for sure).
Major kudos to Phil or whoever implements it!
Jonathan
Mon, 2010-01-25, 00:07
#27
Re: scala performance for numerical algorithms
On Sun, Jan 24, 2010 at 9:42 AM, Jonathan Shore <jonathan.shore@gmail.com> wrote:
It's not the cost of calling f(i) itself; it's the cost of boxing i for that call. Here's the bytecode for Range as of r18987; hopefully they've pulled the if test out of the while loop since then.
public void foreach(scala.Function1);
Code:
0: aload_0
1: invokevirtual #4b4b4b; //Method start:()I
4: istore_2
5: aload_0
6: invokevirtual #4f4f4f; //Method step:()I
9: iconst_0
10: if_icmple 29
13: iload_2
14: aload_0
15: invokevirtual #5c5c5c; //Method limit:()I
18: if_icmpge 25
21: iconst_1
22: goto 42
25: iconst_0
26: goto 42
29: iload_2
30: aload_0
31: invokevirtual #5c5c5c; //Method limit:()I
34: if_icmple 41
37: iconst_1
38: goto 42
41: iconst_0
42: ifeq 66
45: aload_1
46: iload_2
47: invokestatic #2c2c2c; //Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
50: invokeinterface #232323, 2; //InterfaceMethod scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
55: pop
56: iload_2
57: aload_0
58: invokevirtual #4f4f4f; //Method step:()I
61: iadd
62: istore_2
63: goto 5
66: return
Note 47--it invokes boxToInteger right before the call to Function1.apply (i.e. f(i)). That's certainly not the _only_ thing going on inside that loop, but there's no way to avoid a lot of work if you're creating an object on every iteration. If you had a specialized Function1 that understood primitive ints, you wouldn't need to box, and that part of the expense would go away.
Now, inlining is of considerable help when it can happen, but in your tests that's only half the issue. For example, testing the following:
- Raw while loop as you posted
- Indexer class like I posted in for loops, but with while in innermost loop
- Indexer class like I posted in for loops
- Indexer class with implicit goodies to make it act like Range
- Range
gives the following timings:
#00004 while=0.162 forIwhile=0.374 forI=0.363 forIWI=0.451 range=0.520
#00016 while=0.358 forIwhile=0.559 forI=0.765 forIWI=0.779 range=1.210
#00256 while=0.623 forIwhile=0.839 forI=1.210 forIWI=1.217 range=2.041
#65536 while=0.633 forIwhile=0.871 forI=1.268 forIWI=1.266 range=2.330
(Array length is on the left; all are chosen so that iterations*array.length == 16M.)
The main difference between range and the others is the boxing, which is almost half the time taken. All the implicit goodies do make your code look pretty without adding _any_ overhead. And innermost loops are, as always, by far the most important (except in the array.length = 4 case, since then the innermost loop is never called).
Anyway, for right now, this suggests the following optimization tips for high-performance Scala coding: write powerful classes with while loops as the innermost loop. Avoid object creations in tight loops. Compose the rest of your problem out of these highly-tuned primitives and enjoy all the expressiveness of Scala at minimal performance overhead.
--Rex
P.S. Writing a class that wraps an array is about as efficient as the Indexer idea that I showed above; using update and apply and implicits and such, you can write stuff like
for (a <- array in (0,5)) { a() = 5 - a }
which is often even clearer than all the explicit indexing.
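The wrapper behind that syntax isn't shown in the thread; one hypothetical way to wire it up (all names assumed):
object ArraySliceSyntax {
  final class ArraySlice(arr: Array[Double], i0: Int, i1: Int)
      extends Iterator[ArraySlice] {
    private var i = i0 - 1
    def hasNext = i + 1 < i1
    def next = { i += 1; this }
    def apply() = arr(i)                  // a()     reads the current element
    def update(v: Double) { arr(i) = v }  // a() = v writes it
  }
  final class SliceMaker(arr: Array[Double]) {
    def in(bounds: (Int, Int)) = new ArraySlice(arr, bounds._1, bounds._2)
  }
  implicit def arrayCanSlice(arr: Array[Double]) = new SliceMaker(arr)
  implicit def sliceToValue(s: ArraySlice): Double = s()
}
// after "import ArraySliceSyntax._", the example above compiles:
//   for (a <- array in (0,5)) { a() = 5 - a }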
P.P.S. Benchmark code follows:
object TestArraySpeed {
def t = scala.compat.Platform.currentTime
final case class Indexer(i0: Int, i1: Int) extends Iterator[Indexer] {
var i = i0-1
def hasNext = i+1<i1
def next = { i += 1 ; this }
def apply() = i
}
def rawWhile(iterations: Int, array: Array[Double]) = {
var sum = 0.0
var ii = 0
while (ii < iterations) {
var i = 0
while (i < array.length) {
array(i) = i / 1e8
i += 1
}
i = 8
while (i < array.length) {
var j = 1
while (j < 8) {
array(i) += array(i-j) * array(i)
j += 1
}
i += 1
}
i = 8
while (i < array.length) {
sum += array(i)*ii
i += 1
}
ii += 1
}
sum
}
def forIndexerWhile(iterations: Int, array: Array[Double]) = {
var sum = 0.0
for (ii <- Indexer(0,iterations)) {
for (i <- Indexer(0,array.length)) { array(i()) = i() / 1e8 }
for (i <- Indexer(8,array.length)) {
var j = 1
while (j < 8) {
array(i()) += array(i()-j)*array(i())
j += 1
}
}
for (i <- Indexer(8,array.length)) { sum += array(i())*ii() }
}
sum
}
def forIndexer(iterations: Int, array: Array[Double]) = {
var sum = 0.0
for (ii <- Indexer(0,iterations)) {
for (i <- Indexer(0,array.length)) { array(i()) = i() / 1e8 }
for (i <- Indexer(8,array.length)) {
for (j <- Indexer(1,8)) { array(i()) += array(i()-j())*array(i()) }
}
for (i <- Indexer(8,array.length)) { sum += array(i())*ii() }
}
sum
}
def forIndexerWithImplicits(iterations: Int, array: Array[Double]) = {
final class Upto(i: Int) { def upto(j: Int) = Indexer(i,j) }
implicit def intWithUpto(i: Int) = new Upto(i)
implicit def indexer2int(ix: Indexer) = ix.i
var sum = 0.0
for (ii <- 0 upto iterations) {
for (i <- 0 upto array.length) { array(i) = i / 1e8 }
for (i <- 8 upto array.length) {
for (j <- 1 upto 8) array(i) += array(i-j) * array(i)
}
for (i <- 8 upto array.length) sum += array(i)*ii
}
sum
}
def forRange(iterations: Int, array: Array[Double]) = {
var sum = 0.0
for (ii <- 0 until iterations) {
for (i <- 0 until array.length) { array(i) = i / 1e8 }
for (i <- 8 until array.length) {
for (j <- 1 until 8) array(i) += array(i-j) * array(i)
}
for (i <- 8 until array.length) sum += array(i)*ii
}
sum
}
def run() {
val sizes = List(4,16,256,65536)
val iter = sizes map ((16*1024*1024) / _)
val arrays = sizes map ( new Array[Double](_) )
for ((i,a) <- iter zip arrays) {
val t0 = t
val sumA = rawWhile(i,a)
val t1 = t
val sumB = forIndexerWhile(i,a)
val t2 = t
val sumC = forIndexer(i,a)
val t3 = t
val sumD = forIndexerWithImplicits(i,a)
val t4 = t
val sumE = forRange(i,a)
val t5 = t
assert(sumA==sumB && sumA==sumC && sumA==sumD && sumA==sumE)
printf("#%05d while=%.3f forIwhile=%.3f forI=%.3f forIWI=%.3f range=%.3f\n",
a.length,1e-3*(t1-t0),1e-3*(t2-t1),1e-3*(t3-t2),1e-3*(t4-t3),1e-3*(t5-t4))
}
}
def main(args: Array[String]) {
for (i <- 1 to 3) run()
}
}
Mon, 2010-01-25, 00:57
#28
Re: scala performance for numerical algorithms
On Jan 24, 2010, at 5:52 PM, Rex Kerr wrote:
It's not all the cost of calling f(i); it's the cost of boxing i for that call. Here's the bytecode for Range.foreach as of r18987; hopefully they've pulled the if test (the step-sign check) out of the while loop since then.
public void foreach(scala.Function1);
Code:
0: aload_0
1: invokevirtual #4b4b4b; //Method start:()I
4: istore_2
5: aload_0
6: invokevirtual #4f4f4f; //Method step:()I
9: iconst_0
10: if_icmple 29
13: iload_2
14: aload_0
15: invokevirtual #5c5c5c; //Method limit:()I
18: if_icmpge 25
21: iconst_1
22: goto 42
25: iconst_0
26: goto 42
29: iload_2
30: aload_0
31: invokevirtual #5c5c5c; //Method limit:()I
34: if_icmple 41
37: iconst_1
38: goto 42
41: iconst_0
42: ifeq 66
45: aload_1
46: iload_2
47: invokestatic #2c2c2c; //Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
50: invokeinterface #232323, 2; //InterfaceMethod scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
55: pop
56: iload_2
57: aload_0
58: invokevirtual #4f4f4f; //Method step:()I
61: iadd
62: istore_2
63: goto 5
66: return
Note offset 47--it invokes boxToInteger right before the call to Function1.apply (i.e. f(i)). That's certainly not the _only_ thing going on inside that loop, but there's no way to avoid a lot of work if you're allocating an object on every iteration. If you had a specialized Function1 that understood primitive ints, you wouldn't need to box, and that part of the expense would go away.
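For instance, a hand-specialized function trait avoids the box entirely. Here's a minimal sketch of the IntUnitFunction idea from earlier in the thread (names here are illustrative):
object NoBoxingSketch {
  // apply takes a primitive Int, so calling f(i) below never touches
  // BoxesRunTime.boxToInteger -- the per-iteration allocation goes away.
  trait IntUnitFunction {
    def apply(i: Int): Unit
  }

  def foreachInt(i0: Int, i1: Int)(f: IntUnitFunction) {
    var i = i0
    while (i < i1) { f(i); i += 1 }
  }

  def main(args: Array[String]) {
    var sum = 0L
    foreachInt(0, 1000000)(new IntUnitFunction { def apply(i: Int) { sum += i } })
    println(sum)  // 499999500000
  }
}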
Ah, thanks for educating me. I forgot that the closure is going to be an implementation of Function, which has an Object signature under the covers.
I am only an onlooker with Scala so can't claim to know enough about the following: Would it be possible to create a specialization of Function1 for a specific primitive type? I had read that Scala 2.8 or a later version was going to allow specialization for concrete instances of a generic type.
I am getting the impression that Scala suffers from the same problem as Java in that there is no real support for primitives in generic types. I understand why this is ugly and difficult, but it can make a huge difference performance-wise. As poor as C++ is, it does allow for specialization and templating with primitive types. For better or worse, unless this sort of thing is optimised away, it is a problem for the numerical community.
This can be an inconvenience in Java when you want a collection of a primitive type, but it is worse if the fundamentals of the language (i.e. closures) offer no mechanism that avoids boxing and unboxing, unless we are confident that the VM can remove this overhead in most situations.
Working exclusively with "first-class" objects is certainly "pretty" and consistent, but unless the JVM can optimise the boxing away, it presents a big problem for performance. (It does manage to optimise it away some of the time; for instance, I found that Array[Double] access is pretty close to Java performance.)
Jon Harrop has a simple VM (HLVM) for OCaml built on top of LLVM, where he claims to have solved the boxing issue, amongst others. He's done some performance tests that put it on par with C++ on the sort of benchmark where Java would do very poorly (i.e. one that creates many small objects). He indicated to me that he had to discard the JVM as an option because of issues such as this.
Personally I want to stick with the JVM due to the huge momentum and history behind it, although I would be even happier with the CLR if it were "liberated". Short of fixing the issues in the JVM, in my view, the compiler should do the reductions where possible, emitting a reduced AST or, at a later stage, reduced bytecode for common performance problems.
Jonathan
Mon, 2010-01-25, 01:27
#29
Re: scala performance for numerical algorithms
Scala has a specialization feature (right now available by annotating generic type parameters with @specialize and invoking scalac with the -Yspecialize option) that will create separate methods for primitive types and call those when it knows that it has the primitive type. Hopefully this will be able to percolate backwards throughout much of the library so we can get partially around the boxing issue. The problem, which is obvious if you think about it for a little while, is that there are rather a few primitive types, and your most basic Function1 has an input and output type, which means that if you want to handle, say, 9 types (8 primitive + Object), then you have to create 81 copies of the method. This is code-bloat approaching the C++ level, except possibly even worse since you need to create every method that you *might ever* use.
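To make the blowup concrete, here's a sketch (using the SID's @specialized spelling; the exact annotation and flag names are still in flux):
object SpecializationBlowup {
  // Specializing both type parameters over the 8 primitives plus AnyRef
  // means the compiler may emit up to 9 * 9 = 81 variants of apply.
  trait Fn1[@specialized A, @specialized B] {
    def apply(a: A): B
  }

  def main(args: Array[String]) {
    val inc = new Fn1[Int, Int] { def apply(a: Int) = a + 1 }
    println(inc(42))  // dispatches to the Int/Int variant -- no Integer allocated
  }
}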
I don't know whether they've investigated where they can collapse primitive types. For example, one could argue that outside of arrays one should only have
long
byte->short->char->int
float->double
boolean->Object
which would give "only" 16 copies, and in certain rare cases one might even then allow trinary specialization (64 copies). (Note that boolean->Object is inexpensive because you can point all falses at java.lang.Boolean.FALSE and all trues at java.lang.Boolean.TRUE. Thus there's no object creation. I'm not sure whether java.lang.Boolean.valueOf(x), which Scala uses for its boxing, already uses this trick.)
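(For what it's worth, the Javadoc does specify that valueOf returns the cached constants, which is easy to check; the snippet below also assumes BoxesRunTime.boxToBoolean simply delegates to valueOf, as the 2.8 sources do.)
object BooleanBoxCheck {
  def main(args: Array[String]) {
    // Boolean.valueOf is specified to return the cached TRUE/FALSE instances,
    // so boxing a boolean allocates nothing:
    assert(java.lang.Boolean.valueOf(true) eq java.lang.Boolean.TRUE)
    // Scala boxes via BoxesRunTime.boxToBoolean, which uses valueOf:
    val a: AnyRef = scala.runtime.BoxesRunTime.boxToBoolean(true)
    val b: AnyRef = scala.runtime.BoxesRunTime.boxToBoolean(true)
    println(a eq b)  // true -- same cached object, no creation
  }
}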
When using primitive arrays, there's no choice but to go for all 9 (or possibly 8, since short and char are just signed/unsigned versions of each other).
Anyway, I'm hopeful that @specialize will be solidified, pushed through the library in some form, and will generate a nice speedup for a subset of the numeric applications that you're concerned with.
As I demonstrated with my earlier IntUnitFunction example (from Jan 23), this can provide equal-to-for-loop performance in certain cases, but right now it's syntactically ugly. With @specialize in the library, the default syntax would end up being the same.
I don't know whether this is enough to reach Harrop's HLVM performance. There are a variety of other ideas about how to let the JVM overcome its small-object performance penalty (e.g. when small objects wrap other small objects).
--Rex
On Sun, Jan 24, 2010 at 6:48 PM, Jonathan Shore <jonathan.shore@gmail.com> wrote:
Ah, thanks for educating me. I forgot that the closure is going to be an implementation of Function, which has an Object signature under the covers.
I am only an onlooker with Scala so can't claim to know enough about the following: Would it be possible to create a specialization of Function1 for a specific primitive type? I had read that Scala 2.8 or a later version was going to allow specialization for concrete instances of a generic type.
Mon, 2010-01-25, 02:17
#30
Re: scala performance for numerical algorithms
On Sun, Jan 24, 2010 at 07:23:04PM -0500, Rex Kerr wrote:
> The problem, which is obvious if you think about it for a little
> while, is that there are rather a few primitive types, and your most
> basic Function1 has an input and output type, which means that if you
> want to handle, say, 9 types (8 primitive + Object), then you have to
> create 81 copies of the method. This is code-bloat approaching the
> C++ level, except possibly even worse since you need to create every
> method that you *might ever* use.
You do not have to create every method that you might ever use. It is
designed to fall back on the generic implementation. According to the
specialization SID, "the plan is to specialize array-based collections
and FunctionN traits up to two parameters before the first release of
Scala 2.8."
Mon, 2010-01-25, 02:37
#31
Re: scala performance for numerical algorithms
On Sun, Jan 24, 2010 at 8:08 PM, Paul Phillips <paulp@improving.org> wrote:
On Sun, Jan 24, 2010 at 07:23:04PM -0500, Rex Kerr wrote:
> This is code-bloat approaching the
> C++ level, except possibly even worse since you need to create every
> method that you *might ever* use.
You do not have to create every method that you might ever use. It is
designed to fall back on the generic implementation.
Sorry, I didn't qualify my statement enough. I meant, "In order to get the performance improvement associated with primitives, you need to create every method that you might ever use."
--Rex
Sat, 2010-02-06, 04:17
#32
Re: scala performance for numerical algorithms
Ismael Juma wrote:
>
> On Sun, 2010-01-24 at 13:57 +0100, Johannes Rudolph wrote:
>> Sure, but people needing fast loops would in many cases use arrays
>> anyway and it would be nice if you could tell them, that using the
>> array higher-order functions will optimize just fine.
>
> Definitely.
>
> Ismael
>
>
>
I think this is another instance where a good-enough solution is knowing which code constructs to favour when performance matters, although a fast for loop would be very nice. In the meantime, ***a reasonable community effort to address issues like this in a single place would be a good idea - a real coding standard, not another formatting style war.***
I'm just saying...
(But the future for the numerical community is CUDA or just maybe OpenCL and
that huge - say 50x - speed boost from Fermi and other hardware
accelerators.)
Cheers,
Viktor Klang
| "A complex system that works is invariably
| found to have evolved from a simple system
| that worked." - John Gall
Blog: klangism.blogspot.com
Twttr: twitter.com/viktorklang
Code: github.com/viktorklang
On Sat, Jan 23, 2010 at 7:54 PM, Jonathan Shore <jonathan.shore@gmail.com> wrote: