
scala performance for numerical algorithms

32 replies
Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.

Hi,
I have been holding off on using Scala due to performance issues with "simple" for comprehensions. For the numerical component of what I do, tight loops are very important, and they need to be as close to the metal as possible (i.e. imagine implementing a matrix operation or a convolution in Scala).
I apologize in advance for reopening a topic (in fact one that I posted about 10 months ago). Here is the request on Trac (now 16 months old):
https://lampsvn.epfl.ch/trac/scala/ticket/1338
In the Trac comments, one of the Scala team says: "Have you tried '-optimize'? It can help a lot. It's very unlikely this will move from library to compiler-generated loops."
I am concerned with the last statement.
Now, I recognize that an optimisation of this sort is "inelegant" in that it does not present a generalization. It only reduces the simple cases of a more general form -- but this is par for the course in compilers. So perhaps the objection is the inelegance of special-casing an optimization? I am quite doubtful that -optimize will ever reach while() levels of performance without actually doing the remapping:
for (i <- 0 until N) { <do something> }
TO:
var i = 0
while (i < N) {
  <do something>
  i += 1
}
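For context, the reason the comprehension is slow today is that it desugars not into a loop but into a method call with a closure. A minimal sketch of the two equivalent forms (summation standing in for <do something>):

```scala
object Desugar {
  // Summing 0 until n two ways: the for comprehension, and the
  // Range.foreach call plus closure that it currently desugars to.
  def viaFor(n: Int): Int = {
    var s = 0
    for (i <- 0 until n) s += i
    s
  }
  def viaForeach(n: Int): Int = {
    var s = 0
    (0 until n).foreach(i => s += i) // one closure invocation per element
    s
  }
}
```

Both compute the same result; the difference is entirely in the per-iteration overhead of the foreach/closure machinery.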
I cannot stress enough how important tight loops are for numerical work. Please tell me that -optimize is not the last word on performance here. The special-case solution for this class of for comprehensions does not appear to be difficult to implement for someone very familiar with the parser AST. I and a growing list of people would be thrilled to see this fixed.
best regards
Jonathan Shore--http://tr8dr.wordpress.com/
Viktor Klang
Joined: 2008-12-17,
User offline. Last seen 1 year 27 weeks ago.
Re: scala performance for numerical algorithms
Jonathan, I'm thinking there'd be a lot of people who would appreciate a formalized SID on this topic.

Cheers,
Viktor Klang
| "A complex system that works is invariably
| found to have evolved from a simple system
| that worked." - John Gall

Blog: klangism.blogspot.com
Twttr: twitter.com/viktorklang
Code: github.com/viktorklang



ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

Hi Jonathan,

On Sat, 2010-01-23 at 13:54 -0500, Jonathan Shore wrote:

> I am quite doubtful that the -optimize will ever get to the while()
> level of performance without actually doing the remapping:

Even though I would also like to see this optimisation, this is not the
best way to convince others to implement it. :)

Create a micro-benchmark suited to the JVM that compares a for
comprehension with -optimise (and possibly specialized) and a while loop
and show that the performance difference is still large. That is more
likely to have the desired effect.

Best,
Ismael

ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 2:00 PM, Ismael Juma <mlists@juma.me.uk> wrote:
On Sat, 2010-01-23 at 13:54 -0500, Jonathan Shore wrote:

> I am quite doubtful that the -optimize will ever get to the while()
> level of performance without actually doing the remapping:

Create a micro-benchmark suited to the JVM that compares a for
comprehension with -optimise (and possibly specialized) and a while loop
and show that the performance difference is still large.

I'd already done something much like this when I was working on a fast vector library.  For example, comparing:
  - while loop with primitives
  - custom IntRange class with dedicated foreach
  - regular Range (extends Projection[Int])
  - for loop
  - reduceLeft
gives the following answers

$ scalac -optimise DoNotUseForOrReduce.scala
$ scala DoNotUseForOrReduce
Time (seconds): while=0.56 custom=4.81 foreach=4.67 for=6.13 reduce=13.82
Time (seconds): while=0.56 custom=4.72 foreach=4.67 for=6.13 reduce=13.97
Time (seconds): while=0.56 custom=4.66 foreach=4.73 for=6.15 reduce=14.14
Time (seconds): while=0.56 custom=4.65 foreach=4.71 for=6.13 reduce=14.11
Time (seconds): while=0.56 custom=4.69 foreach=4.61 for=6.14 reduce=14.16
(etc.)

which shows that a bare while is 8-25x faster than anything else.  That was for 2.7.3; 2.8.0.r20326 is a bit better:

$ bin/scala DoNotUseForOrReduce
Time (seconds): while=0.56 custom=4.74 foreach=4.72 for=5.32 reduce=6.02
Time (seconds): while=0.56 custom=4.94 foreach=4.94 for=5.30 reduce=5.98
Time (seconds): while=0.56 custom=4.72 foreach=4.71 for=5.33 reduce=6.15
Time (seconds): while=0.56 custom=4.75 foreach=4.94 for=5.37 reduce=5.96
(etc.)

Now it's "only" 8-10x slower.  I don't see how to usefully -specialize this.

Code is below.

  --Rex

object DoNotUseForOrReduce {
  val N = 100000
  def t = System.currentTimeMillis

  case class IntRange(i0: Int, i1: Int) {
    def foreach(f: Int=>Unit) = {
      var i = i0
      while (i<=i1) {
        f(i)
        i += 1
      }
    }
  }

  def facA(m: Int) = {
    var i = 1
    var fac = 1
    while (i<=m) {
      fac *= i
      i += 1
    }
    fac
  }
  def facB(m: Int) = {
    var fac = 1
    IntRange(1,m).foreach( fac *= _ )
    fac
  }
  def facC(m: Int) = {
    var fac = 1
    (1 to m).foreach( fac *= _ )
    fac
  }
  def facD(m: Int) = {
    var fac = 1
    for (i <- 1 to m) { fac *= i; }
    fac
  }
  def facE(m: Int) = (1 to m).reduceLeft(_*_)

  def run() {
    val t0 = t
    var fA,fB,fC,fD,fE = 0
    for (i <- 1 to 100) fA += facA(N*i)
    val t1 = t
    for (i <- 1 to 100) fB += facB(N*i)
    val t2 = t
    for (i <- 1 to 100) fC += facC(N*i)
    val t3 = t
    for (i <- 1 to 100) fD += facD(N*i)
    val t4 = t
    for (i <- 1 to 100) fE += facE(N*i)
    val t5 = t
    printf("Time (seconds): while=%.2f custom=%.2f foreach=%.2f for=%.2f reduce=%.2f\n",
           1e-3*(t1-t0),1e-3*(t2-t1),1e-3*(t3-t2),1e-3*(t4-t3),1e-3*(t5-t4));
  }
  def main(args:Array[String]) {
    for (j <- 1 to 10) run()
  }
}

Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms
Ismael,
I had already posted a benchmark 10 months ago on this.   Running now with -optimize on the latest 2.8 nightly shows no improvement:
[*] timing of 50 iterations of "[for] array calc: 1000000" total: 7892 ms, cycle: 157858740 ns or 157.84 ms, memory: 18954K
[*] timing of 50 iterations of "[while] array calc: 1000000" total: 1210 ms, cycle: 24206860 ns or 24.2 ms, memory: 0K
I see an 8x difference between these implementations. It appears not to be a big deal to fix this in the AST-processing part of the compiler. I just am not familiar enough with the AST to know how to match the situations that can be remapped and how to express the AST for the replacement. Anyone dealing with the compiler on the Scala team could implement this within a couple of days. I am familiar enough with compiler design and parsing to know that this is not a huge deal.
I'm sorry to sound impatient, but it *has* been 16 months. I would like to hear something from the Scala team as to whether they plan to do this or not, and when. Otherwise I will, frankly, have to find another solution.
If Scala is to appeal to functional programming people in the scientific arena and/or as the "next java", fundamental performance *does* need to be addressed.
Thanks and sorry if this comes out the wrong way.
regards
Jonathan




ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 2:50 PM, Rex Kerr <ichoran@gmail.com> wrote:
Now it's "only" 8-10x slower.  I don't see how to usefully -specialize this.

Maybe I should think for a few more seconds before saying something like that.

$ bin/scala DoNotUseForOrReduce
Time (seconds): while=0.56 custom=0.58 foreach=4.65 for=5.30 reduce=5.91
Time (seconds): while=0.56 custom=0.57 foreach=4.63 for=5.36 reduce=5.92
Time (seconds): while=0.56 custom=0.57 foreach=4.64 for=5.25 reduce=6.11
Time (seconds): while=0.56 custom=0.57 foreach=4.81 for=5.31 reduce=5.91
(etc.)

Code for custom (which could be wrapped into Range's foreach with @specialize in the library):

  abstract class IntUnitFunction {
    def apply(i: Int):Unit
  }
  case class IntRange(i0: Int, i1: Int) {
    def foreach(f: IntUnitFunction) = {
      var i = i0
      while (i<=i1) {
        f(i)
        i += 1
      }
    }
  }
  def facB(m: Int) = {
    var fac = 1
    IntRange(1,m).foreach( new IntUnitFunction{ def apply(i: Int) { fac*=i } } )
    fac
  }

--Rex
Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms
Hmm, I see what you are doing, and it seems like a good workaround when the work unit is a function or is large enough that function-call overhead does not matter. However, there are many algorithms that require tight nested loops over very primitive operations (say, on arrays). At the moment the only performant way is to use while loops (I think?)
Jonathan

ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
All I'm doing is integer multiplication!  An array access takes more time than that.  I don't have time right now to find my old posts on the topic, but I've previously benchmarked code where
  val x: Array[Float]
  val y: Array[Float]
  while (...)
is actually _slower_ by 20-30% (IIRC) than
  val xy: Array[Float]
  val b = BVecF(xy)
  b.foreach(...)
where BVecF wraps an array of (x0,y0,x1,y1,x2,y2...). The win of one array access vs. two is bigger than the overhead of an (admittedly carefully optimized) foreach call.
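A hypothetical sketch of an interleaved-pair wrapper along the lines of the BVecF Rex mentions (its real implementation isn't shown in the thread): xy holds (x0, y0, x1, y1, ...), and foreach hands each (x, y) pair to f with a single linear scan over one array instead of two.

```scala
// Sketch only: the name BVecF and this layout are inferred from the
// description above, not taken from the actual library.
final class BVecF(val xy: Array[Float]) {
  // Walk the interleaved array two elements at a time, calling f(x, y)
  // for each pair. One array, one bounds pattern, one linear scan.
  def foreach(f: (Float, Float) => Unit): Unit = {
    var i = 0
    while (i + 1 < xy.length) {
      f(xy(i), xy(i + 1))
      i += 2
    }
  }
}
```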

  --Rex


ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

Hi Jonathan,

On Sat, 2010-01-23 at 15:03 -0500, Jonathan Shore wrote:
> I had already posted a benchmark 10 months ago on this.

I know you did. But the comment on Trac was only a few weeks ago. It
only makes sense to say that performance is still bad after trying it
with a build more recent than that. You have done that below, good.

> I'm sorry to sound impatient, but it *has* been 16 months.

You make it sound like the Scala team has done nothing since then. ;) A
lot of improvements have been made in the last 16 months and many of
them related to performance (with more general solutions). I repeat that
I would also like to see the simple transformation you're asking and I
explain why here (search for ticket 1338):

http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-pe...

Also, I benchmarked something similar to the trick Rex mentions in
another thread in 2008:


ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

(sorry, I sent the previous email prematurely)

Hi Jonathan,

On Sat, 2010-01-23 at 15:03 -0500, Jonathan Shore wrote:
> I had already posted a benchmark 10 months ago on this.

I know you did. But the comment on Trac was only a few weeks ago. It
only makes sense to say that performance is still bad after trying it
with a build more recent than that. You have done that below, good.

> I'm sorry to sound impatient, but it *has* been 16 months.

You make it sound like the Scala team has done nothing since then. ;) A
lot of improvements have been made in the last 16 months and many of
them related to performance (with more general solutions). And the
specialized stuff is not finished yet as I understand.

I repeat that I would also like to see the simple transformation you're
asking for, and I explain why here (search for ticket 1338):

http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-pe...

Also, I benchmarked something similar to the trick Rex mentions in
another thread in 2008:

http://blog.juma.me.uk/2008/09/15/efficient-scala-with-primitive-collect...

Still, that doesn't always work because of the inlining issue mentioned
in the more recent blog entry.

> If Scala is to appeal to functional programming people in the
> scientific arena and/or as the "next java", fundamental performance
> *does* need to be addressed.

You can write Scala that is as fast as Java. It's just as ugly... or
even a bit uglier. ;)
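A minimal sketch of the trade-off Ismael means (a hypothetical dot-product example, not from the thread): the idiomatic version allocates tuples and intermediate arrays, while the "ugly" Java-style while loop touches nothing but the arrays themselves.

```scala
object DotProduct {
  // Idiomatic: zip/map/sum allocate an Array[(Double, Double)] and an
  // intermediate Array[Double] along the way.
  def pretty(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // "Ugly" but fast: plain while loop over indices, no allocation.
  def fast(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }
}
```

Both return the same value; only the allocation behavior differs.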

Best,
Ismael

Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

On Jan 23, 2010, at 4:01 PM, Ismael Juma wrote:

I know you did. But the comment on Trac was only a few weeks ago. It
only makes sense to say that performance is still bad after trying it
with a build more recent than that. You have done that below, good.

Yes. I decided to post again in reference to that comment. I ran the same tests again with -optimize and did not see much, if any, improvement.



I'm sorry to sound impatient, but it *has* been 16 months.

You make it sound like the Scala team has done nothing since then. ;) A
lot of improvements have been made in the last 16 months and many of
them related to performance (with more general solutions). And the
specialized stuff is not finished yet as I understand.


Sorry if this came out wrong. And I do realize that there are a lot of important improvements and new features in progress.
That said, the comment in the Trac entry was rather dismissive of the problem *and* went further to state that it was unlikely ever to be addressed. I think this is "wrong" (or at a minimum a major turnoff for people like me) and should be highlighted here in the user community.
As for the specialized stuff you mention, sounds interesting.      
There are some lessons to be learned from the experience the numerical community had with Java. There was quite a bit of interest early on in the numerical community around using Java. After a couple of years it "petered out" when it became apparent that the JVM and Java were not ready. The main concerns were:
- performance
- performance (did I say that) ;)
- lack of operator abstraction
In the years since, the JVM has approached the speed of C++ / C in many respects, even with array access. I am effectively able to use Java and get pretty much the same performance my C++ colleagues get, at lower "development cost".
Scala not only fixes one of the major inconveniences (lack of operator overloading) but is functional as well. This should be a fantastic language for numerical people. In that regard it should not have performance surprises around fundamental operations. These need to be addressed early, especially while momentum is building, or they are likely to turn away adopters much as Java did in the beginning for a certain class of programmer. I.e. I want to see Scala succeed in my community, and they are now starting to look at it.
I realize it will take some time to get "there". But the dismissive comment in the Trac seemed to imply that these considerations are not even on the radar.




If Scala is to appeal to functional programming people in the
scientific arena and/or as the "next java", fundamental performance
*does* need to be addressed.

You can write Scala that is as fast as Java. It's just as ugly... or
even a bit uglier. ;)


For better or worse, languages are chosen based on ergonomics, taste, performance, etc. Scala is not far from doing most things well. Why not fix something this fundamental early on?

Best,
Ismael


Jason Zaugg
Joined: 2009-05-18,
User offline. Last seen 38 weeks 5 days ago.
Re: scala performance for numerical algorithms

You might find this discussion interesting:
http://stackoverflow.com/questions/1257021/suitable-functional-language-...


Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

On Jan 23, 2010, at 4:01 PM, Ismael Juma wrote:
I repeat that I would also like to see the simple transformation you're
asking and I explain why here (search for ticket 1338):

mgarcia in the trac thread already points out where the transform can be done, i.e.:  
   private def makeFor(mapName: Name, flatMapName: Name, enums: List[Enumerator], body: Tree): Tree

The relevant file is: src/compiler/scala/tools/nsc/ast/parser/TreeBuilder.scala

I am not clear on mapName vs flatMapName, but the enumerator list in the case of the loop:
for (v <- 0 until N) <body>
would be the Range object Range(0, N, 1) and "body" our <body>. The transform would be as follows, under the following special-case conditions:
1. there is only one enumerator
2. the enumerator is a Range or something equivalent (I would probably call this Sequence)
Loosely speaking the above for comprehension without optimization could be transformed into (I'm guessing that it ends up being worse, with a call to a closure/function representing the body on each iteration):
   val seq = new Range(0, N, 1)
   var i = 0
   while (i < seq.length) {
      val v = seq(i)
      <body>
      i += 1
   }
If the above two constraints are met, it can instead map to:
   val seq = new Range(0, N, 1)
   var v = seq.start
   while (v < seq.end) {
      <body>
      v += seq.step
   }
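The two forms can be checked for equivalence with a small sketch (collecting the visited values rather than timing them; exclusive upper bound, as with `until`):

```scala
object TransformCheck {
  // The original for comprehension, collecting each visited value.
  def viaFor(n: Int): List[Int] = {
    var acc = List.empty[Int]
    for (v <- 0 until n) acc ::= v
    acc.reverse
  }
  // The proposed while-loop rewrite, driven by start/end/step of the Range.
  def viaWhile(n: Int): List[Int] = {
    val seq = Range(0, n, 1)
    var acc = List.empty[Int]
    var v = seq.start
    while (v < seq.end) { acc ::= v; v += seq.step }
    acc.reverse
  }
}
```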

Incidentally, I've found that the performance difference ranges between 2x and 8x depending on the size of the inner loops, etc. I'm testing on a Mac Pro quad-core with the latest release of Java 6 in -server mode and a Scala 2.8.0 nightly (as of today) with -optimize.
I just don't know the AST well enough to know how to do it without some weeks of effort to understand ...
Jonathan





Randall R Schulz
Joined: 2008-12-16,
User offline. Last seen 1 year 29 weeks ago.
Re: scala performance for numerical algorithms

On Saturday January 23 2010, Jonathan Shore wrote:
> ...
>
> I just don't know the AST well enough to know how to do it without
> some weeks of effort to understand ...

But probably less than 40+ or 64+ weeks? (10 and 16 months, resp.)

> Jonathan

Randall Schulz

Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

On Jan 23, 2010, at 4:01 PM, Ismael Juma wrote:

http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-performance/

Ismael, my hat is off to you. You've done some very detailed optimisation and performance analysis. I think what you are testing, though, is a bit different (i.e. application of a function/body across a collection via foreach), but it suffers from some of the same problems, namely:
- lack of inlining of the function
- cost of accessing the iterator
While it is possible that the function representing the loop body can be inlined, I find it less likely that the JVM will optimise the cost of accessing the iterator away.   For instance, my test has the following loops:

   for (ii <- 0 until iterations)   {      for (i <- 1 until array.length)         array(i) = i / 1000.0


      for (i <- 8 until array.length)      {         for (j <- 1 until 8) array(i) += array(i-j) * array(i)      }   }  You can see that there are some very tight, short-lived loops in there.    Deconstructing the test to:
   val seqA = new Range(0, iterations, 1)
   var iA = 0
   while (iA < seqA.length) {
      val A = seqA(iA)
      iA += 1

      val seqB = new Range(1, array.length, 1)
      var iB = 0
      while (iB < seqB.length) {
         val B = seqB(iB)
         array(B) = B / 1000.0
         iB += 1
      }

      val seqC = new Range(8, array.length, 1)
      var iC = 0
      while (iC < seqC.length) {
         val C = seqC(iC)
         iC += 1

         val seqD = new Range(1, 8, 1)
         var iD = 0
         while (iD < seqD.length) {
            val D = seqD(iD)
            array(C) += array(C-D) * array(C)
            iD += 1
         }
      }
   }
The performance of the deconstructed version was nearly the same as the original, showing that the cost of accessing the iterators dominated the loop time. The same loop without iterators was 8x faster for large arrays and iteration counts.
So at least for trivial loop bodies, I don't think that inlining is the biggest problem.
With a special case mapping to a non-iterator, non-closure based approach, the performance is the same as Java (or C++ for that matter).
Jonathan


Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

On Jan 23, 2010, at 6:33 PM, Randall R Schulz wrote:

> On Saturday January 23 2010, Jonathan Shore wrote:
>> ...
>>
>> I just don't know the AST well enough to know how to do it without
>> some weeks of effort to understand ...
>
> But probably less than 40+ or 64+ weeks? (10 and 16 months, resp.)
>

Wish I could but am way overcommitted as part of a nascent startup.

>
>> Jonathan
>
>
> Randall Schulz

extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: scala performance for numerical algorithms

On Sat, Jan 23, 2010 at 06:23:40PM -0500, Jonathan Shore wrote:
> I just don't know the AST well enough to know how to do it without
> some weeks of effort to understand ...

I can do it. First you have to convince martin, then I'll do it(*).

(*) Always possible it's harder than I think for whatever reason, in
which case I reserve the right to not do it.

ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
On Sat, Jan 23, 2010 at 6:57 PM, Jonathan Shore <jonathan.shore@gmail.com> wrote:
I think what you are testing, though, is a bit different (i.e. application of a function/body across a collection via foreach) but suffers from some of the same problems, namely:
- lack of inlining of function

The JVM does this pretty well (and anyway it's pretty cheap--a function call is typically a couple of cycles).
 
- cost of accessing the iterator

The JVM does this pretty well too, and it's pretty cheap too.  (A pointer dereference is also usually only a couple of cycles.)

The really expensive thing is object creation.  (My benchmarks have in the past indicated that it takes at least 12 cycles.)  And you have to do it in both of your tests because Range (which is not currently @specialized) extends Product[Int], which eventually extends Function1, which has a generic abstract apply.  That means that Range has to autobox its apply, which means you get object creation every time you access the iterator.

This is a problem that is solved by
(1) creating mutable iterators that wrap primitives, or
(2) @specializing Function1 so that Range doesn't need to autobox.

For an example of (1), try something like

case class Indexer(i0: Int, i1: Int) extends Iterator[Indexer] {
  var i = i0 - 1              // start one below i0 so the first next lands on i0
  def hasNext = i < i1 - 1
  def next = { i += 1 ; this }
  def apply() = i             // the current index, covering i0 until i1
}

and redo your for loop with j <- Indexer(1,8) and such.
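As a self-contained sketch of that suggestion (repeating the class here, with the counter started one below i0 so that apply() yields i0 until i1, so the snippet compiles on its own):

```scala
// Mutable iterator wrapping a primitive Int: next() returns the Indexer
// itself, so the index value is never boxed into a java.lang.Integer.
case class Indexer(i0: Int, i1: Int) extends Iterator[Indexer] {
  private var i = i0 - 1
  def hasNext = i < i1 - 1
  def next() = { i += 1; this }
  def apply() = i // the current index
}

object IndexerDemo {
  // Sum of i0 until i1, written as a for loop over the mutable iterator
  // (equivalent to (i0 until i1).sum, but allocation-free per step).
  def sum(i0: Int, i1: Int): Int = {
    var s = 0
    for (ix <- Indexer(i0, i1)) s += ix()
    s
  }
}
```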

I bet it gets at least 4x faster (i.e. 2x or less slower than the raw while loops).

(I would try it myself, but I'm away from the appropriate machine at the moment.)

  --Rex
 

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

Hi Paul,

On Sat, 2010-01-23 at 16:38 -0800, Paul Phillips wrote:
> On Sat, Jan 23, 2010 at 06:23:40PM -0500, Jonathan Shore wrote:
> > I just don't know the AST well enough to know how to do it without
> > some weeks of effort to understand ...
>
> I can do it.

I'd be very happy if I could remove a bunch of while loops from my
code. ;)

> First you have to convince martin, then I'll do it(*).

Has Martin mentioned what are his objections? I mean, I understand the
idea that EPFL has given priority to other things, but if someone was
willing to do it (you!), then I don't see any disadvantage. Especially
for arrays.

Best,
Ismael

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

On Sat, 2010-01-23 at 18:57 -0500, Jonathan Shore wrote:
> - lack of inlining of function
> - cost of accessing the iterator
>
> While it is possible that the function representing the loop body can
> be inlined, I find it less likely that the JVM will optimise the cost
> of accessing the iterator away.

I am not sure what you're saying here. You realise that a for
comprehension without a yield actually invokes foreach (as in my example
above) and an iterator doesn't have to be involved at all?

Best,
Ismael

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

Hi Rex,

On Sat, 2010-01-23 at 21:57 -0500, Rex Kerr wrote:
> On Sat, Jan 23, 2010 at 6:57 PM, Jonathan Shore
> wrote:
> I think what you are testing, though, is a bit different (i.e.
> application of a function/body across a collection via
> foreach) but suffers from some of the same problems, namely:
>
>
> - lack of inlining of function
>
> The JVM does this pretty well (and anyway it's pretty cheap--a
> function call is typically a couple of cycles).
>
> - cost of accessing the iterator
>
> The JVM does this pretty well too, and it's pretty cheap too. (A
> pointer dereference is also usually only a couple of cycles.)
>
> The really expensive thing is object creation.

There is a lot of truth here, but if you have a foreach that is called
with several different anonymous functions and the foreach itself is not
inlined (both which can happen in real-life code as shown in my blog
entry and easily missed by simple micro-benchmarks), then you end up
with a megamorphic method call for the anonymous function which causes
performance to go to the toilet. That is the biggest issue for which
there is no solution at the moment when it comes to the standard
library. For your own code, you could make your foreach final and add
@inline to it (you'd lose the ability to override it, of course).

John Rose comments briefly on the issue and how HotSpot could solve it:

http://groups.google.com/group/jvm-languages/msg/0c7edb6a45794d97

Best,
Ismael
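[Editor's note: Ismael's "make your foreach final and add @inline" advice, sketched on a hypothetical array wrapper. The class name is illustrative; whether scalac's -optimise actually inlines the call site depends on the compiler version.]

```scala
// Sketch of the "final + @inline" advice: the method cannot be
// overridden, so the optimiser is free to inline it at call sites.
final class DoubleArrayOps(a: Array[Double]) {
  @inline final def foreach(f: Double => Unit): Unit = {
    var i = 0
    while (i < a.length) { f(a(i)); i += 1 }
  }
}

var total = 0.0
new DoubleArrayOps(Array(1.0, 2.0, 3.0)).foreach(total += _)
// total == 6.0
```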

Johannes Rudolph
Joined: 2008-12-17,
User offline. Last seen 29 weeks 20 hours ago.
Re: scala performance for numerical algorithms

Couldn't you cope with much of these issues by consistently inlining
things? I know it isn't working right now, because you can't inline
from external jars, but if you could, you could make at least all the
higher-order methods from ArrayOps inlinable, which should improve the
situation in many cases.

On Sun, Jan 24, 2010 at 7:09 AM, Ismael Juma wrote:
> Hi Rex,
>
> On Sat, 2010-01-23 at 21:57 -0500, Rex Kerr wrote:
>> On Sat, Jan 23, 2010 at 6:57 PM, Jonathan Shore
>> wrote:
>>         I think what you are testing, though, is a bit different (i.e.
>>         application of a function/body across a collection via
>>         foreach) but suffers from some of the same problems, namely:
>>
>>
>>         - lack of inlining of function
>>
>> The JVM does this pretty well (and anyway it's pretty cheap--a
>> function call is typically a couple of cycles).
>>
>>         - cost of accessing the iterator
>>
>> The JVM does this pretty well too, and it's pretty cheap too.  (A
>> pointer dereference is also usually only a couple of cycles.)
>>
>> The really expensive thing is object creation.
>
> This is a lot of truth here, but if you have a foreach that is called
> with several different anonymous functions and the foreach itself is not
> inlined (both which can happen in real-life code as shown in my blog
> entry and easily missed by simple micro-benchmarks), then you end up
> with a megamorphic method call for the anonymous function which causes
> performance to go to the toilet. That is the biggest issue for which
> there is no solution at the moment when it comes to the standard
> library. For your own code, you could make your foreach final and add
> @inline to it (you'd lose the ability to override it, of course).
>
> John Rose comments briefly on the issue and how HotSpot could solve it:
>
> http://groups.google.com/group/jvm-languages/msg/0c7edb6a45794d97
>
> Best,
> Ismael
>
>
>

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

Hi Johannes,

On Sun, 2010-01-24 at 13:11 +0100, Johannes Rudolph wrote:
> Couldn't you cope with much of these issues by consequently inlining
> stuff?

You mean inlining done by scalac, right? That only works for cases where
overriding isn't possible (which I hinted at in the previous message) or
if scalac had a whole program optimisation mode. While talking about the
general case, I neglected arrays and you make a good point below.

> I know it isn't working right now, because you can't inline
> from external jars

I wasn't aware of this. Do you have a reference? In the link below,
Iulian doesn't mention that as a problem.

> but if you could, you could make at least all the
> higher-order methods from ArrayOps inlinable which should improve the
> situation in many cases.

The array case is special as allowing overriding is not needed (or even
possible with the current approach as far as I can see). So, it should
be possible to allow inlining and it should help quite a bit.

The last time this came up, Iulian mentioned that the optimiser had
issues identifying the ArrayOps leaf subclasses as practically final
(they are non-final classes defined inside an object) and he said that
he intended to fix that:

http://article.gmane.org/gmane.comp.lang.scala.internals/1830

Not sure if that has been committed since. I do wonder why they're not
simply made final though.

Still, that is just the array case, it would be nice to have something
for the rest of the collections.

Best,
Ismael

Johannes Rudolph
Joined: 2008-12-17,
User offline. Last seen 29 weeks 20 hours ago.
Re: scala performance for numerical algorithms

On Sun, Jan 24, 2010 at 1:47 PM, Ismael Juma wrote:
> Hi Johannes,
>
> On Sun, 2010-01-24 at 13:11 +0100, Johannes Rudolph wrote:
>> Couldn't you cope with much of these issues by consequently inlining
>> stuff?
>
> You mean inlining done by scalac, right? That only works for cases where
> overriding isn't possible (which I hinted at in the previous message) or
> if scalac had a whole program optimisation mode. While talking about the
> general case, I neglected arrays and you make a good point below.
>
>>  I know it isn't working right now, because you can't inline
>> from external jars
>
> I wasn't aware of this. Do you have a reference? In the link below,
> Iulian doesn't mention that as a problem.

I thought someone once said that, but I looked into the code and it
looks like it should work.

>> but if you could, you could make at least all the
>> higher-order methods from ArrayOps inlinable which should improve the
>> situation in many cases.
>
> The array case is special as allowing overriding is not needed (or even
> possible with the current approach as far as I can see). So, it should
> be possible to allow inlining and it should help quite a bit.
>
> The last time this came up, Iulian mentioned that the optimiser had
> issues identifying the ArrayOps leaf subclasses as practically final
> (they are non-final classes defined inside an object) and he said that
> he intended fix that:
>
> http://article.gmane.org/gmane.comp.lang.scala.internals/1830
>
> Not sure if that has been committed since. I do wonder why they're not
> simply made final though.

Yes, that's what I thought, too. And thanks for the pointer to that discussion.

> Still, that is just the array case, it would be nice to have something
> for the rest of the collections.

Sure, but people needing fast loops would in many cases use arrays
anyway and it would be nice if you could tell them, that using the
array higher-order functions will optimize just fine.

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 2 days ago.
Re: scala performance for numerical algorithms

On Sun, 2010-01-24 at 13:57 +0100, Johannes Rudolph wrote:
> Sure, but people needing fast loops would in many cases use arrays
> anyway and it would be nice if you could tell them, that using the
> array higher-order functions will optimize just fine.

Definitely.

Ismael

Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

You're right. I'm just showing my ignorance with Scala. I've been on the sidelines seeing how it progresses perf-wise before jumping in.

On Jan 24, 2010, at 12:52 AM, Ismael Juma wrote:

> On Sat, 2010-01-23 at 18:57 -0500, Jonathan Shore wrote:
>> - lack of inlining of function
>> - cost of accessing the iterator
>>
>> While it is possible that the function representing the loop body can
>> be inlined, I find it less likely that the JVM will optimise the cost
>> of accessing the iterator away.
>
> I am not sure what you're saying here. You realise that a for
> comprehension without a yield actually invokes foreach (as in my example
> above) and an iterator doesn't have to be involved at all?
>
> Best,
> Ismael
>
>

Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

On Jan 24, 2010, at 1:09 AM, Ismael Juma wrote:

This is a lot of truth here, but if you have a foreach that is called
with several different anonymous functions and the foreach itself is not
inlined (both which can happen in real-life code as shown in my blog
entry and easily missed by simple micro-benchmarks), then you end up
with a megamorphic method call for the anonymous function which causes
performance to go to the toilet. That is the biggest issue for which
there is no solution at the moment when it comes to the standard
library. For your own code, you could make your foreach final and add
@inline to it (you'd lose the ability to override it, of course).


hmm, if I understand this correctly, I would introduce my own class or specialization of Range and annotate with @inline on the class or foreach method?    If I'm having performance issues with Range in the context of the series of "simple" loops:
   for (ii <- 0 until iterations) {
      for (i <- 1 until array.length)
         array(i) = i / 1000.0

      for (i <- 8 until array.length) {
         for (j <- 1 until 8) array(i) += array(i-j) * array(i)
      }
   }

Looking at the Range code:

  override def foreach(f: Int => Unit) {
    if (step > 0) {
      var i = this.start
      val until = if (inInterval(end)) end + 1 else end
      while (i < until) {
        f(i)
        i += step
      }
    } else {
      ...
    }
  }

, it would seem to imply that my performance issue is coming from one or both of the following:

- the cost of creating the Range object in inner loops
- the cost of calling f(i) when the JVM misses the opportunity to inline

I'll have to read your blog entry again.  It is not clear to me in which situations we can expect inlining and in which we cannot.
In summary though, unless I can find a way to specialize Range and make the overhead "go away", I still think we need a mapping to while for the simple cases, rather than relying on the JVM to be extra clever.    The alternative would be to add the old Java / C++ for loop into the language and allow for more concise specification of loops (unnecessary, for sure).
Major kudos to Phil or whomever implements!  
Jonathan

ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
On Sun, Jan 24, 2010 at 9:42 AM, Jonathan Shore <jonathan.shore@gmail.com> wrote:

Looking at the Range code:
  override def foreach(f: Int => Unit) {
    if (step > 0) {
      var i = this.start
      val until = if (inInterval(end)) end + 1 else end
      while (i < until) {
        f(i)
        i += step
      }
    } else {
      ...
    }
  }

, it would seem to imply that my performance issue is coming from one or both of the following:
- the cost of creating the Range object in inner loops
- the cost of calling f(i) when the JVM misses the opportunity to inline

It's not all the cost of calling f(i).  It's the cost of boxing i for that call.  Here's the bytecode for range as of 18987; hopefully they've pulled the if out of the while loop since then.

public void foreach(scala.Function1);
  Code:
   0:    aload_0
   1:    invokevirtual    #4b; //Method start:()I
   4:    istore_2
   5:    aload_0
   6:    invokevirtual    #4f; //Method step:()I
   9:    iconst_0
   10:    if_icmple    29
   13:    iload_2
   14:    aload_0
   15:    invokevirtual    #5c; //Method limit:()I
   18:    if_icmpge    25
   21:    iconst_1
   22:    goto    42
   25:    iconst_0
   26:    goto    42
   29:    iload_2
   30:    aload_0
   31:    invokevirtual    #5c; //Method limit:()I
   34:    if_icmple    41
   37:    iconst_1
   38:    goto    42
   41:    iconst_0
   42:    ifeq    66
   45:    aload_1
   46:    iload_2
   47:    invokestatic    #2c; //Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
   50:    invokeinterface    #23,  2; //InterfaceMethod scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
   55:    pop
   56:    iload_2
   57:    aload_0
   58:    invokevirtual    #4f; //Method step:()I
   61:    iadd
   62:    istore_2
   63:    goto    5
   66:    return

Note 47--it invokes boxToInteger right before the call to Function1.apply (i.e. f(i)). That's certainly not the _only_ thing going on inside that loop, but there's no way to avoid a lot of work if you're creating an object on every iteration.  If you had a specialized Function1 that understood primitive ints, you wouldn't need to box it, and that part of the expense would go away.

Now, inlining is considerable help when it can happen, but in your tests that's only half the issue.  For example, testing the following:

  - Raw while loop as you posted
  - Indexer class like I posted in for loops, but with while in innermost loop
  - Indexer class like I posted in for loops
  - Indexer class with implicit goodies to make it act like Range
  - Range

gives the following timings:

#00004  while=0.162  forIwhile=0.374  forI=0.363  forIWI=0.451  range=0.520
#00016  while=0.358  forIwhile=0.559  forI=0.765  forIWI=0.779  range=1.210
#00256  while=0.623  forIwhile=0.839  forI=1.210  forIWI=1.217  range=2.041
#65536  while=0.633  forIwhile=0.871  forI=1.268  forIWI=1.266  range=2.330

(Array length is on the left; all are chosen so that iterations*array.length == 16M.)

The main difference between range and the others is the boxing, which is almost half the time taken.  All the implicit goodies do make your code look pretty without adding _any_ overhead.  And innermost loops are, as always, by far the most important (except in the array.length = 4 case, since then the innermost loop is never called).

Anyway, for right now, this suggests the following optimization tips for high-performance Scala coding: write powerful classes with while loops as the innermost loop.  Avoid object creations in tight loops.  Compose the rest of your problem out of these highly-tuned primitives and enjoy all the expressiveness of Scala at minimal performance overhead.
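[Editor's note: the tip above, applied to the thread's own convolution loop — keep the hot kernel as a plain while-loop method and compose from outside. `convolveInPlace` is an illustrative name; the tap count is a parameter here rather than the hard-coded 8.]

```scala
// A "highly-tuned primitive": the inner convolution as raw while loops.
def convolveInPlace(a: Array[Double], taps: Int): Unit = {
  var i = taps
  while (i < a.length) {
    var j = 1
    while (j <= taps) { a(i) += a(i - j) * a(i); j += 1 }
    i += 1
  }
}

val a = Array(1.0, 2.0, 3.0)
convolveInPlace(a, 1)
// a(1) = 2 + 1*2 = 4; a(2) = 3 + 4*3 = 15
```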

  --Rex

P.S. Writing a class that wraps an array is about as efficient as the Indexer idea that I showed above; using update and apply and implicits and such, you can write stuff like
  for (a <- array in (0,5)) { a() = 5 - a }
which is often even clearer than all the explicit indexing.

P.P.S. Benchmark code follows:

object TestArraySpeed {
  def t = scala.compat.Platform.currentTime

  final case class Indexer(i0: Int, i1: Int) extends Iterator[Indexer] {
    var i = i0-1
    def hasNext = i+1<i1
    def next = { i += 1 ; this }
    def apply() = i
  } 

  def rawWhile(iterations: Int, array: Array[Double]) = {
    var sum = 0.0
    var ii = 0
    while (ii < iterations) {
      var i = 0
      while (i < array.length) {
        array(i) = i / 1e8
        i += 1  
      }
      i = 8
      while (i < array.length) {
        var j = 1
        while (j < 8) {
          array(i) += array(i-j) * array(i)
          j += 1
        }
        i += 1
      }
      i = 8
      while (i < array.length) {
        sum += array(i)*ii
        i += 1
      }
      ii += 1
    } 
    sum
  } 

  def forIndexerWhile(iterations: Int, array: Array[Double]) = {
    var sum = 0.0
    for (ii <- Indexer(0,iterations)) {
      for (i <- Indexer(0,array.length)) { array(i()) = i() / 1e8 }
      for (i <- Indexer(8,array.length)) {
        var j = 1
        while (j < 8) {
          array(i()) += array(i()-j)*array(i())
          j += 1
        }
      }
      for (i <- Indexer(8,array.length)) { sum += array(i())*ii() }
    }
    sum
  }
  
  def forIndexer(iterations: Int, array: Array[Double]) = {
    var sum = 0.0
    for (ii <- Indexer(0,iterations)) {
      for (i <- Indexer(0,array.length)) { array(i()) = i() / 1e8 }
      for (i <- Indexer(8,array.length)) {
        for (j <- Indexer(1,8)) { array(i()) += array(i()-j())*array(i()) }
      }
      for (i <- Indexer(8,array.length)) { sum += array(i())*ii() }
    }
    sum
  }
 
  def forIndexerWithImplicits(iterations: Int, array: Array[Double]) = {
    final class Upto(i: Int) { def upto(j: Int) = Indexer(i,j) }
    implicit def intWithUpto(i: Int) = new Upto(i)
    implicit def indexer2int(ix: Indexer) = ix.i
    var sum = 0.0
    for (ii <- 0 upto iterations) {
      for (i <- 0 upto array.length) { array(i) = i / 1e8 }
      for (i <- 8 upto array.length) {
        for (j <- 1 upto 8) array(i) += array(i-j) * array(i)
      }
      for (i <- 8 upto array.length) sum += array(i)*ii
    }
    sum
  }

  def forRange(iterations: Int, array: Array[Double]) = {
    var sum = 0.0
    for (ii <- 0 until iterations) {
      for (i <- 0 until array.length) { array(i) = i / 1e8 }
      for (i <- 8 until array.length) {
        for (j <- 1 until 8) array(i) += array(i-j) * array(i)
      }
      for (i <- 8 until array.length) sum += array(i)*ii
    }
    sum
  }
 
  def run() {
    val sizes = List(4,16,256,65536)
    val iter = sizes map ((16*1024*1024) / _)
    val arrays = sizes map ( new Array[Double](_) )
    for ((i,a) <- iter zip arrays) {
      val t0 = t
      val sumA = rawWhile(i,a)
      val t1 = t
      val sumB = forIndexerWhile(i,a)
      val t2 = t
      val sumC = forIndexer(i,a)
      val t3 = t
      val sumD = forIndexerWithImplicits(i,a)
      val t4 = t
      val sumE = forRange(i,a)
      val t5 = t
      assert(sumA==sumB && sumA==sumC && sumA==sumD && sumA==sumE)
      printf("#%05d  while=%.3f  forIwhile=%.3f  forI=%.3f  forIWI=%.3f  range=%.3f\n",
             a.length,1e-3*(t1-t0),1e-3*(t2-t1),1e-3*(t3-t2),1e-3*(t4-t3),1e-3*(t5-t4))
    } 
  }

  def main(args: Array[String]) {
    for (i <- 1 to 3) run()
  }
}

Jonathan Shore
Joined: 2009-04-10,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms


On Jan 24, 2010, at 5:52 PM, Rex Kerr wrote:

It's not all the cost of calling f(i).  It's the cost of boxing i for that call.  Here's the bytecode for range as of 18987; hopefully they've pulled the if out of the while loop since then.

public void foreach(scala.Function1);
  Code:
   0:    aload_0
   1:    invokevirtual    #4b; //Method start:()I
   4:    istore_2
   5:    aload_0
   6:    invokevirtual    #4f; //Method step:()I
   9:    iconst_0
   10:    if_icmple    29
   13:    iload_2
   14:    aload_0
   15:    invokevirtual    #5c; //Method limit:()I
   18:    if_icmpge    25
   21:    iconst_1
   22:    goto    42
   25:    iconst_0
   26:    goto    42
   29:    iload_2
   30:    aload_0
   31:    invokevirtual    #5c; //Method limit:()I
   34:    if_icmple    41
   37:    iconst_1
   38:    goto    42
   41:    iconst_0
   42:    ifeq    66
   45:    aload_1
   46:    iload_2
   47:    invokestatic    #2c; //Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
   50:    invokeinterface    #23,  2; //InterfaceMethod scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;
   55:    pop
   56:    iload_2
   57:    aload_0
   58:    invokevirtual    #4f; //Method step:()I
   61:    iadd
   62:    istore_2
   63:    goto    5
   66:    return

Note 47--it invokes boxToInteger right before the call to Function1.apply (i.e. f(i)). That's certainly not the _only_ thing going on inside that loop, but there's no way to avoid a lot of work if you're creating an object on every iteration.  If you had a specialized Function1 that understood primitive ints, you wouldn't need to box it, and that part of the expense would go away.


Ah, thanks for educating me.   I forgot that the closure is going to be an implementation of Function, which has an Object signature under the covers. 
I am only an onlooker with Scala so can't claim to know enough about the following:   Would it be possible to create a specialization of Function1 for a specific primitive type?   I had read that Scala 2.8 or a later version was going to allow specialization for concrete instances of a generic type. 
I am getting the impression that Scala suffers from the same problem as Java in that there is no real support for primitives in generic types.   I understand why this is ugly and difficult, but it can make a huge difference performance-wise.    As poor as C++ is, it does allow for specialization and templating with primitive types.   For better or worse, unless this sort of thing is optimised away, it is a problem for the numerical community.
This can be an inconvenience in Java when you want a collection of a primitive type, but it is worse if the fundamentals of the language (i.e. closures) provide no mechanism that avoids boxing and unboxing, unless we are confident that the VM can remove this overhead in most situations.
Working exclusively with "first class" objects is certainly "pretty" and consistent, but unless the JVM can optimise the boxing away it presents a big problem for performance.   (It does manage to optimise some of it away.   For instance, I found that Array[Double] access is pretty close to Java performance.)
Jon Harrop has a simple VM (HLVM) for OCaml built on top of LLVM, where he claims to have solved the boxing issue amongst others.   He has done performance tests that put it on par with C++ for the sort of benchmark where Java would do very poorly (i.e. many small objects created).   He indicated to me that he had to discard the JVM as an option because of issues such as this.
Personally I want to stick with the JVM due to the huge momentum and history behind it, although I would be even happier with the CLR if it were "liberated".    Short of fixing the issues in the JVM, in my view the compiler should do the reductions where possible, emitting a reduced AST or, at a later stage, reduced bytecode for common performance problems.
Jonathan
ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
Scala has a specialize feature (right now available by annotating generic type parameters with @specialize and invoking Scala with the -Yspecialize option) that will create separate methods for primitive types and will call those when it knows that it has the primitive type.  Hopefully this will be able to percolate backwards throughout much of the library so we can get partially around the boxing issue.  The problem, which is obvious if you think about it for a little while, is that there are rather a lot of primitive types, and your most basic Function1 has an input and an output type, which means that if you want to handle, say, 9 types (8 primitive + Object), then you have to create 81 copies of the method.  This is code-bloat approaching the C++ level, except possibly even worse, since you need to create every method that you *might ever* use.
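[Editor's note: a minimal sketch of what specialization looks like from the user's side — marking type parameters tells the compiler to emit extra primitive variants of apply, and restricting the set (here just Int) is how the 81-copy blow-up is tamed. `Fn1` is an illustrative stand-in for Function1, not the library trait.]

```scala
// Sketch: specialize only for Int, so one extra primitive variant of
// apply is generated instead of all 9x9 combinations.
trait Fn1[@specialized(Int) -A, @specialized(Int) +B] {
  def apply(a: A): B
}

val double = new Fn1[Int, Int] { def apply(a: Int) = a * 2 }
// double(21) == 42, without boxing the argument
```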

I don't know whether they've investigated where they can collapse primitive types.  For example, one could argue that outside of arrays one should only have
 
  long
  byte->short->char->int
  float->double
  boolean->Object

which would give "only" 16 copies, and in certain rare cases one might even then allow trinary specialization (64 copies).  (Note that boolean->Object is inexpensive because you can point all falses at java.lang.Boolean.FALSE and all trues at java.lang.Boolean.TRUE.  Thus there's no object creation.  I'm not sure whether java.lang.Boolean.valueOf(x), which Scala uses for its boxing, already uses this trick.)
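[Editor's note: on Rex's last question — java.lang.Boolean.valueOf is in fact specified to return the cached Boolean.TRUE / Boolean.FALSE constants, so boxing a Boolean never allocates. A quick check:]

```scala
// Boolean.valueOf returns the cached constants, not a fresh object.
val a = java.lang.Boolean.valueOf(true)
val b = java.lang.Boolean.valueOf(true)
// a eq java.lang.Boolean.TRUE, and a eq b: same instance, no allocation
```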

When using primitive arrays, there's no choice but to go for all 9 (or possibly 8, since short and char are just signed/unsigned versions of each other).

Anyway, I'm hopeful that @specialize will be solidified, pushed through the library in some form, and will generate a nice speedup for a subset of the numeric applications that you're concerned with.

As I demonstrated with my earlier IntUnitFunction example (from Jan 23), this can provide equal-to-for-loop performance in certain cases, but right now it's syntactically ugly.  With @specialize in the library, the default syntax would end up being the same.

I don't know whether this is enough to reach Harrop's HLVM performance.  There are a variety of other ideas about how to let the JVM overcome its small object performance penalty (e.g. when small classes are wrapping other small classes).

  --Rex

On Sun, Jan 24, 2010 at 6:48 PM, Jonathan Shore <jonathan.shore@gmail.com> wrote:

Ah, thanks for educating me.   I forgot that the closure is going to be an implementation of Function, which has an Object signature under the covers. 
I am only an onlooker with Scala so can't claim to know enough about the following:   Would it be possible to create a specialization of Function1 for a specific primitive type?   I had read that Scala 2.8 or a later version was going to allow specialization for concrete instances of a generic type. 

extempore
Joined: 2008-12-17,
User offline. Last seen 35 weeks 3 days ago.
Re: scala performance for numerical algorithms

On Sun, Jan 24, 2010 at 07:23:04PM -0500, Rex Kerr wrote:
> The problem, which is obvious if you think about it for a little
> while, is that there are rather a few primitive types, and your most
> basic Function1 has an input and output type, which means that if you
> want to handle, say, 9 types (8 primitive + Object), then you have to
> create 81 copies of the method. This is code-bloat approaching the
> C++ level, except possibly even worse since you need to create every
> method that you *might ever* use.

You do not have to create every method that you might ever use. It is
designed to fall back on the generic implementation. According to the
specialization SID, "the plan is to specialize array-based collections
and FunctionN traits up to two parameters before the first release of
Scala 2.8."

ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: scala performance for numerical algorithms
On Sun, Jan 24, 2010 at 8:08 PM, Paul Phillips <paulp@improving.org> wrote:
On Sun, Jan 24, 2010 at 07:23:04PM -0500, Rex Kerr wrote:
> This is code-bloat approaching the
> C++ level, except possibly even worse since you need to create every
> method that you *might ever* use.

You do not have to create every method that you might ever use.  It is
designed to fall back on the generic implementation.

Sorry, I didn't qualify my statement enough.  I meant, "In order to get the performance improvement associated with primitives, you need to create every method that you might ever use."

  --Rex
jonathan mawson
Joined: 2010-01-28,
User offline. Last seen 42 years 45 weeks ago.
Re: scala performance for numerical algorithms

Ismael Juma wrote:
>
> On Sun, 2010-01-24 at 13:57 +0100, Johannes Rudolph wrote:
>> Sure, but people needing fast loops would in many cases use arrays
>> anyway and it would be nice if you could tell them, that using the
>> array higher-order functions will optimize just fine.
>
> Definitely.
>
> Ismael
>
>
>

I think this is another instance where a good enough solution is knowing
what code constructs to favour when performance matters. Although a fast for
loop would be very nice. But in the meantime ***a reasonable community
effort to address issues like this in a single place would be a good idea -
a real coding standard, not another formatting style war.***

I'm just saying...

(But the future for the numerical community is CUDA or just maybe OpenCL and
that huge - say 50x - speed boost from Fermi and other hardware
accelerators.)

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland