our spectacular builder regression

10 replies

Fri, 2010-04-23, 20:04

extempore

Joined: 2008-12-17,

I had better plans for this morning. (OK they were the same plans,
fixing stuff, but I was hoping to fix stuff I already knew about.)

Yesterday ijuma demonstrated that a Set creation which had been near
instantaneous in 2.7 was taking many minutes to complete. It took me a
strangely long time to figure out but it's a good thing I did. A doozy,
this one. This might be what underlies other performance complaints I
have heard recently, netbeans using lots more memory or something along
that line.

Not long ago we, innocently enough, altered '+' to behave consistently
on all collections. That means mutable.Set(1,2) + 3 creates a new set,
and because it is mutable, there's no sharing. That's fine, except that
mutable sets were still using AddingBuilder: oops. To create ijuma's
800K element Set required first creating Sets of size 1,2,...,400K,...
799,999 and 800,000, since each element is added with '+'.

Although I already created GrowingBuilder at some point, this was not a
one line fix because mutable/immutable sets didn't yet have separate
factories, as until recently they could all get away with claiming to be
AddingBuilderable. I remedied this by creating and using
ImmutableSetFactory and MutableSetFactory, and given the severity of the
issue and the obviousness-to-me of the right fix I intend to check it in
shortly unless someone says otherwise.

Fri, 2010-04-23, 22:17

odersky

Joined: 2008-07-29,

Re: our spectacular builder regression

Hi Paul,

Good catch. The sooner this is fixed the better.

-- Martin

On Fri, Apr 23, 2010 at 9:04 PM, Paul Phillips <paulp@improving.org> wrote:

I had better plans for this morning. (OK they were the same plans,
fixing stuff, but I was hoping to fix stuff I already knew about.)

Yesterday ijuma demonstrated that a Set creation which had been near
instantaneous in 2.7 was taking many minutes to complete. It took me a
strangely long time to figure out but it's a good thing I did. A doozy,
this one. This might be what underlies other performance complaints I
have heard recently, netbeans using lots more memory or something along
that line.

Not long ago we, innocently enough, altered '+' to behave consistently
on all collections. That means mutable.Set(1,2) + 3 creates a new set,
and because it is mutable, there's no sharing. That's fine, except that
mutable sets were still using AddingBuilder: oops. To create ijuma's
800K element Set required first creating Sets of size 1,2,...,400K,...
799,999 and 800,000, since each element is added with '+'.

Although I already created GrowingBuilder at some point, this was not a
one line fix because mutable/immutable sets didn't yet have separate
factories, as until recently they could all get away with claiming to be
AddingBuilderable. I remedied this by creating and using
ImmutableSetFactory and MutableSetFactory, and given the severity of the
issue and the obviousness-to-me of the right fix I intend to check it in
shortly unless someone says otherwise.

--
Paul Phillips | Christ died for our sins. Dare we make his martyrdom
Apatheist | meaningless by not committing them?
Empiricist | -- Jules Feiffer
i'll ship a pulp |----------* http://www.improving.org/paulp/ *----------

Fri, 2010-04-23, 22:37

extempore

Joined: 2008-12-17,

Re: our spectacular builder regression

OK, the obvious borkage is better now. It will feel incomplete unless
we put something in place which lets us enforce it in code. Right now
there is this trait:

trait HasNewBuilder[+A, +Repr] {
/** The builder that builds instances of Repr */
protected[this] def newBuilder: Builder[A, Repr]
}

For a while I started down the path of creating some more, such as:

trait HasNewAddingBuilder[+A, +Repr] extends HasNewBuilder[A, Repr] {
protected[this] def newBuilder: AddingBuilder[A, Repr]
}

But along the way I realized there's a whole little parallel universe of
collections with signatures like this:

def newBuilder[A](implicit ord: Ordering[A]): Builder[A, CC[A]]

And since I wanted to capture SortedSet, but couldn't bring myself to
start writing traits like HasNewSortedAddingBuilder, I pressed pause.

Do you have a sense of the right direction to come from?

Fri, 2010-04-23, 22:57

odersky

Joined: 2008-07-29,

Re: our spectacular builder regression

I am not sure what's best to do. Agreed that it would be nice to catch this, but we need to carefully evaluate the overhead of encoding more preperties in types.

Cheers

-- Martin

On Fri, Apr 23, 2010 at 11:33 PM, Paul Phillips <paulp@improving.org> wrote:

OK, the obvious borkage is better now. It will feel incomplete unless
we put something in place which lets us enforce it in code. Right now
there is this trait:

trait HasNewBuilder[+A, +Repr] {
/** The builder that builds instances of Repr */
protected[this] def newBuilder: Builder[A, Repr]
}

For a while I started down the path of creating some more, such as:

trait HasNewAddingBuilder[+A, +Repr] extends HasNewBuilder[A, Repr] {
protected[this] def newBuilder: AddingBuilder[A, Repr]
}

But along the way I realized there's a whole little parallel universe of
collections with signatures like this:

def newBuilder[A](implicit ord: Ordering[A]): Builder[A, CC[A]]

And since I wanted to capture SortedSet, but couldn't bring myself to
start writing traits like HasNewSortedAddingBuilder, I pressed pause.

Do you have a sense of the right direction to come from?

--
Paul Phillips | On two occasions, I have been asked, 'Mr. Babbage, if you
In Theory | put into the machine wrong figures, will the right answers
Empiricist | come out?' I am not able to rightly apprehend the kind of
up hill, pi pals! | confusion of ideas that could provoke such a question.

Fri, 2010-04-23, 23:17

extempore

Joined: 2008-12-17,

hasCheapSize or ... ?

It seems like we could make some big improvements in Array performance
if we used sizeHint more. (Or maybe not and hotspot magically optimizes
the other 17 arrays away -- at this point people could tell me hotspot
can go back in time and improve java 1.2 performance and I'd have to
accept it at least provisionally.)

Unfortunately I am not aware of any way to find out if an arbitrary
collection has a cheap size method, which makes it exceedingly difficult
to do in a straightforward way. I do not expect "hasDefiniteSize" is
reuseable for that purpose: it correlates with "hasCheapSize", but not
perfectly, so no good.

Here is an example of where we could have a single array allocation
instead of 18:

scala> (1 to 1000000).toArray map (_ + 1)
mkArray(16)
mkArray(32)
mkArray(64)
mkArray(128)
mkArray(256)
mkArray(512)
mkArray(1024)
mkArray(2048)
mkArray(4096)
mkArray(8192)
mkArray(16384)
mkArray(32768)
mkArray(65536)
mkArray(131072)
mkArray(262144)
mkArray(524288)
mkArray(1048576)
mkArray(1000000)

Fri, 2010-04-23, 23:27

dcsobral

Joined: 2009-04-23,

Re: hasCheapSize or ... ?

How about a method taking a "default" size and returning either it or the actual size? Anywhere you need an initial size but can be improved with the actual size you just call this method.
def actualSizeOr(n: Int) = n // size not cheap override def actualSizeOr(n: Int) = size // cheap size

On Fri, Apr 23, 2010 at 7:11 PM, Paul Phillips <paulp@improving.org> wrote:

It seems like we could make some big improvements in Array performance
if we used sizeHint more. (Or maybe not and hotspot magically optimizes
the other 17 arrays away -- at this point people could tell me hotspot
can go back in time and improve java 1.2 performance and I'd have to
accept it at least provisionally.)

Unfortunately I am not aware of any way to find out if an arbitrary
collection has a cheap size method, which makes it exceedingly difficult
to do in a straightforward way. I do not expect "hasDefiniteSize" is
reuseable for that purpose: it correlates with "hasCheapSize", but not
perfectly, so no good.

Here is an example of where we could have a single array allocation
instead of 18:

scala> (1 to 1000000).toArray map (_ + 1)
mkArray(16)
mkArray(32)
mkArray(64)
mkArray(128)
mkArray(256)
mkArray(512)
mkArray(1024)
mkArray(2048)
mkArray(4096)
mkArray(8192)
mkArray(16384)
mkArray(32768)
mkArray(65536)
mkArray(131072)
mkArray(262144)
mkArray(524288)
mkArray(1048576)
mkArray(1000000)

--
Paul Phillips | The important thing here is that the music is not in
Future Perfect | the piano. And knowledge and edification is not in the
Empiricist | computer. The computer is simply an instrument whose
pp: i haul pills | music is ideas. -- Alan Kay

--
Daniel C. Sobral

I travel to the future all the time.

Sat, 2010-04-24, 14:27

odersky

Joined: 2008-07-29,

Re: hasCheapSize or ... ?

On Sat, Apr 24, 2010 at 12:11 AM, Paul Phillips <paulp@improving.org> wrote:

It seems like we could make some big improvements in Array performance
if we used sizeHint more. (Or maybe not and hotspot magically optimizes
the other 17 arrays away -- at this point people could tell me hotspot
can go back in time and improve java 1.2 performance and I'd have to
accept it at least provisionally.)

Unfortunately I am not aware of any way to find out if an arbitrary
collection has a cheap size method, which makes it exceedingly difficult
to do in a straightforward way. I do not expect "hasDefiniteSize" is
reuseable for that purpose: it correlates with "hasCheapSize", but not
perfectly, so no good.

Here is an example of where we could have a single array allocation
instead of 18:

scala> (1 to 1000000).toArray map (_ + 1)
mkArray(16)
mkArray(32)
mkArray(64)
mkArray(128)
mkArray(256)
mkArray(512)
mkArray(1024)
mkArray(2048)
mkArray(4096)
mkArray(8192)
mkArray(16384)
mkArray(32768)
mkArray(65536)
mkArray(131072)
mkArray(262144)
mkArray(524288)
mkArray(1048576)
mkArray(1000000)

An easy way to achieve this is to specialize toArray on IndexedSeq's, where we know that size is cheap.

Cheers

-- Martin

--
Paul Phillips | The important thing here is that the music is not in
Future Perfect | the piano. And knowledge and edification is not in the
Empiricist | computer. The computer is simply an instrument whose
pp: i haul pills | music is ideas. -- Alan Kay

Sat, 2010-04-24, 16:07

extempore

Joined: 2008-12-17,

Re: hasCheapSize or ... ?

On Sat, Apr 24, 2010 at 03:17:11PM +0200, martin odersky wrote:
> An easy way to achieve this is to specialize toArray on IndexedSeq's,
> where we know that size is cheap.

I'm not seeing how to take much advantage of that. The problem is that
not only are most of the methods which would benefit from a size hint
implemented on TraversableLike or TraversableFactory, the hint has to
appear in the middle of the method. Here for instance is fill:

def fill[A](n: Int)(elem: => A): CC[A] = {
val b = newBuilder[A]
var i = 0
while (i < n) {
b += elem
i += 1
}
b.result
}

I can't inject a size hint without duplicating the entire method. In
the map example it's not toArray which needs modification but map
itself:

def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
val b = bf(repr)
// size hint would go here
for (x <- self)
if (p(x)) b += f(x)
b.result
}

...unless there's some other mechanism for size hinting I'm missing. In
theory CanBuildFrom could return a pre-size-hinted builder from the
apply in the first line of map:

def apply(from: From): Builder[Elem, To]

...but not with the existing interface, because there's no way for that
apply to know whether the builder is to be used for size-preserving map
or another method like filter (probably unwise to preallocate the size
of the original collection.)

The only way I see out that doesn't lead to either duplicated code or
missed opportunities for optimization is a collections method which
communicates the necessary info. To my eye GenericCompanion or
TraversableFactory would be the place for it.

Sat, 2010-04-24, 17:07

ijuma

Joined: 2008-08-20,

Re: hasCheapSize or ... ?

On Fri, Apr 23, 2010 at 11:11 PM, Paul Phillips wrote:
> Unfortunately I am not aware of any way to find out if an arbitrary
> collection has a cheap size method, which makes it exceedingly difficult
> to do in a straightforward way.

This would also simplify https://lampsvn.epfl.ch/trac/scala/ticket/2972 .

Best,
Ismael

Sun, 2010-04-25, 13:47

odersky

Joined: 2008-07-29,

Re: Re: hasCheapSize or ... ?

I a

On Sat, Apr 24, 2010 at 5:02 PM, Paul Phillips <paulp@improving.org> wrote:

On Sat, Apr 24, 2010 at 03:17:11PM +0200, martin odersky wrote:
> An easy way to achieve this is to specialize toArray on IndexedSeq's,
> where we know that size is cheap.

I'm not seeing how to take much advantage of that. The problem is that
not only are most of the methods which would benefit from a size hint
implemented on TraversableLike or TraversableFactory, the hint has to
appear in the middle of the method. Here for instance is fill:

def fill[A](n: Int)(elem: => A): CC[A] = {
val b = newBuilder[A]
var i = 0
while (i < n) {
b += elem
i += 1
}
b.result
}

I can't inject a size hint without duplicating the entire method. In
the map example it's not toArray which needs modification but map
itself:

But for fill you should always inject the sizeHint!

def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
val b = bf(repr)
// size hint would go here
for (x <- self)
if (p(x)) b += f(x)
b.result
}

That's harder because we'd need to add a sizeHint to a CanBuildFrom object . It can be done
but it's not clear we want to do this.

If it helps we could add a

def sizeHint: Option[Int]

method to Traversable. which would return the traversable's size, if it is cheap to compute.
Orelse add a hasCheapSize. Or else equate hasCheapSize with .isInstanceOf[IndexedSeq],
that would be the simplest change from an API point of view.

Cheers

-- Martin

Sun, 2010-04-25, 18:17

#10

extempore

Joined: 2008-12-17,

Re: Re: hasCheapSize or ... ?

On Sun, Apr 25, 2010 at 02:45:09PM +0200, martin odersky wrote:
> But for fill you should always inject the sizeHint!

Heh, I guess a method taking an Int size argument wasn't the best
example of the phenomenon.

> If it helps we could add a
>
> def sizeHint: Option[Int]
>
> method to Traversable. which would return the traversable's size, if
> it is cheap to compute.

That's the interface I was thinking too. We could do the instance check
and since it doesn't involve any new methods it'd be easy to refine
later; but it feels like a shortcut because although IndexedSeq implies
cheap size, it'd be easy enough for any other collection to have cheap
size. Like ijuma points out, #2972 (more efficient toSeq methods) is a
pretty good reason to do it too.

Scala Main Menu

our spectacular builder regression

Scala Quick Links

Featured News

User login