This page is no longer maintained — Please continue to the home page at www.scala-lang.org

Re: Code review this script

14 replies
Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.

>>>>> "Daniel" == Daniel Wellman writes:

Daniel> Here's the code for Stream.fromIterator:

Daniel> def fromIterator[A](it: Iterator[A]): Stream[A] = if
Daniel> (it.hasNext) cons(it.next, fromIterator(it)) else empty

Daniel> If I read this correctly, this recursively adds all the
Daniel> elements of an Iterator to a new Stream object. So this forces
Daniel> the evaluation of the entire Iterator, is this correct?

No; the second argument to cons is lazy.

I think Johannes was mistaken in his reply. It's true that Stream
caches results, but if you don't keep a reference to the head of the
stream, then the cached results become eligible for garbage collection.
So in fact, Stream is very good to use when you want to avoid keeping the
entire input in memory.

For example, this:

Stream.from(1).map(_ => new Array[Int](10000000)).foreach(println)

runs in constant space because we are not keeping a reference to the
head of the stream.

Whereas if you substitute List for Stream:

(1 to 1000).toList.map(_ => new Array[Int](10000000)).foreach(println)

you'll run out of heap before anything is printed. (You can verify this
by pasting these examples in the interpreter.)

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Code review this script

>>>>> "Johannes" == Johannes Rudolph writes:

Johannes> When I wrote this, I had the last discussion about the Stream
Johannes> class in mind.

This one, or some other?
http://www.nabble.com/Question-on-usage-of-Stream-td21321117.html

Johannes> What I took away then, was that it generally
Johannes> isn't possible to use Stream in a way which releases the
Johannes> already visited elements

I think "generally isn't possible" is much too strong. But you do need
to be careful in code that forces the stream. Until you force the
stream, everything's fine. See below.

Johannes> the only memory-safe way to use a Stream is to never safe the
Johannes> Stream into a local variable.

I think you only need to follow that advice in the part of your code
that forces the stream.

I verified experimentally that this test code:

val s = Stream.from(1).map(_ => new Array[Int](10000000))
s.foreach(println)

quickly runs out of heap, and so does this:

def foo(s:Stream[Array[Int]]) = s.foreach(println)
foo(Stream.from(1).map(_ => new Array[Int](10000000)))

so it seems your advice is good in these cases, because foreach forces
the stream.

In Daniel's code we see:

val output = parse(Stream.fromIterator(inputFile.getLines))
output.foreach(print)

I think he should follow your advice and change it to:

parse(Stream.fromIterator(inputFile.getLines)).foreach(print)

And he should be fine. It's OK for the parse method to store a Stream
in a local variable as long as that method doesn't force the stream. I
tried this test code just now, which is based on Daniel's code:

def foo(s: Stream[Int]): Stream[Array[Int]] = {
if (s.isEmpty) Stream.empty
else {
val allPairs = s.zip(s.tail)
for((x,y) <- allPairs if (x + y) % 3 == 0)
yield new Array[Int](10000000)
}
}
val it = Stream.from(1).elements
foo(Stream.fromIterator(it)).foreach(println)

and it runs fine. (I'm using 2.7.3.)

Johannes> This aside, I do not understand, how Daniel's or your new
Johannes> example in the bug ticket, Florian, is supposed to work. You
Johannes> are holding a reference to the Stream in a local variable and
Johannes> it is still in scope, why should it be gc'd?

Daniel's example, I think we've now covered.

Florian's new example is different, I think, and I interpret his
observations differently than he does. I re-closed
http://lampsvn.epfl.ch/trac/scala/ticket/692 and added a comment
explaining my reasoning.

Johannes Rudolph
Joined: 2008-12-17,
User offline. Last seen 29 weeks 22 hours ago.
Re: Code review this script

On Sun, Mar 29, 2009 at 3:06 PM, Seth Tisue wrote:
>>>>>> "Johannes" == Johannes Rudolph writes:
>  Johannes> When I wrote this, I had the last discussion about the Stream
>  Johannes> class in mind.
> This one, or some other?
> http://www.nabble.com/Question-on-usage-of-Stream-td21321117.html
Can't remember actually.

>  Johannes> What I took away then, was that it generally
>  Johannes> isn't possible to use Stream in a way wh\ich releases the
>  Johannes> already visited elements
>
> I think "generally isn't possible" is much too strong.  But you do need
> to be careful in code that forces the stream.  Until you force the
> stream, everything's fine.  See below.
Yeah, I contradicted myself on that; was too tired this morning...

>  Johannes> This aside, I do not understand, how Daniel's or your new
>  Johannes> example in the bug ticket, Florian, is supposed to work. You
>  Johannes> are holding a reference to the Stream in a local variable and
>  Johannes> it is still in scope, why should it be gc'd?
>
> Daniel's example, I think we've now covered.
I think so, too.

> Florian's new example is different, I think, and I interpret his
> observations differently than he does.  I re-closed
> http://lampsvn.epfl.ch/trac/scala/ticket/692 and added a comment
> explaining my reasoning.
I agree again.

BTW: Could the Hotspot JIT-compiler eliminate references by inlining
and optimizing some code? Makes the JVM spec/JLS any promises if an
object is to be retained if it seems to be in a local variable (but is
actually optimized away), or is reachability calculation unrelated to
JIT?

Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

ijuma
Joined: 2008-08-20,
User offline. Last seen 22 weeks 3 days ago.
Re: Code review this script

Hi,

On Sun, 2009-03-29 at 16:58 +0100, Johannes Rudolph wrote:
> BTW: Could the Hotspot JIT-compiler eliminate references by inlining
> and optimizing some code? Makes the JVM spec/JLS any promises if an
> object is to be retained if it seems to be in a local variable (but is
> actually optimized away), or is reachability calculation unrelated to
> JIT?

The following might be helpful in answering the question (also read the
comments):

http://blog.juma.me.uk/2008/10/11/local-variables-scope-in-hotspot/

Best,
Ismael

Rich Dougherty
Joined: 2008-08-20,
User offline. Last seen 2 years 38 weeks ago.
Re: Code review this script

On Mon, Mar 30, 2009 at 4:58 AM, Johannes Rudolph
wrote:
> BTW: Could the Hotspot JIT-compiler eliminate references by inlining
> and optimizing some code? Makes the JVM spec/JLS any promises if an
> object is to be retained if it seems to be in a local variable (but is
> actually optimized away), or is reachability calculation unrelated to
> JIT?

Hi folks

When we were discussing a similar issue last year, it seemed to me
that the JIT was involved. The space requirements of Stream changed
when I disabled HotSpot.

http://www.nabble.com/-scala--Stream-garbage-collection-to14560260.html#...

Rich

--
http://blog.richdougherty.com/

Johannes Rudolph
Joined: 2008-12-17,
User offline. Last seen 29 weeks 22 hours ago.
Re: Code review this script

Thx, your post and the linked bugtracker entry answers my question very well. I guessed someone from the list would know it...

Johannes

On Mar 29, 2009 8:19 PM, "Ismael Juma" <mlists@juma.me.uk> wrote:

Hi,

On Sun, 2009-03-29 at 16:58 +0100, Johannes Rudolph wrote: > BTW: Could the Hotspot JIT-compiler el...

The following might be helpful in answering the question (also read the
comments):

http://blog.juma.me.uk/2008/10/11/local-variables-scope-in-hotspot/

Best,
Ismael

Daniel Wellman
Joined: 2009-02-02,
User offline. Last seen 1 year 48 weeks ago.
Re: Code review this script

Thank you all for your feedback and advice!

First off, I'd like to confirm what you mean when you say "force". I think
it means "force evaluation of the full Stream", and it's the method from
Projection.

Why does foreach "force" the evaluation of the stream? It it a doing
Java-style for-loop (that needs to know the size of the list being looped
over?)

I think I understand why this one has a leak:

Seth Tisue wrote:
>
> val s = Stream.from(1).map(_ => new Array[Int](10000000))
> s.foreach(println)
>

I think it's because the val s holds a reference to the Stream object.

But why exactly does the Stream hold on to a reference to the head? I saw
that Stream defines a method "head", but I wasn't sure what was holding on
to that reference.

Seth Tisue wrote:
>
> def foo(s:Stream[Array[Int]]) = s.foreach(println)
> foo(Stream.from(1).map(_ => new Array[Int](10000000)))
>

Once I understand why foreach forces the stream, then I think I'll
understand why this example has a leak.

This discussion has been really helpful for me, thank you!

Cheers,
Dan

Daniel Wellman
Joined: 2009-02-02,
User offline. Last seen 1 year 48 weeks ago.
Re: Code review this script

Also, would it help if I rewrote the parse() function to take a second
argument, a function? e.g.

def parse(lines: Stream[String])(f: String => Unit): Stream[String] = {
if (lines.isEmpty) return Stream.empty;
else {
val allPairs = lines.zip(lines.tail)

for ((DataFormat(x), (DataFormat(y))) <- allPairs if x sameTimeAs y)
f((x + y).toString + "\n")
}
}

(That may not compile, I just jotted it out in this text editor -- I think I
may need to put it inside the for() block instead of being in the "yield"
position?)

Then I'd call it with something like:

LoggerFormatter.parse(Stream.fromIterator(inputFile.getLines)) { println }

or

val fileWriter = new FileWriter(outputFile)
LoggerFormatter.parse(Stream.fromIterator(inputFile.getLines)) {
fileWriter.write(_) }
fileWriter.close

I think I like this format better. Then I could just give the function an
Iterator, and clients wouldn't care that it used a Stream (purely from an
aesthetic perspective, not having to put all that code in the method call
argument).

Dan

Dan

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Code review this script

>>>>> "Daniel" == Daniel Wellman writes:

Daniel> Thank you all for your feedback and advice!
Daniel> First off, I'd like to confirm what you mean when you say
Daniel> "force". I think it means "force evaluation of the full
Daniel> Stream"

Yes.

Daniel> and it's the method from Projection.

I wasn't referring specifically to that method. But yes, that
method does force evaluation of the whole stream; so does foreach.

Daniel> Why does foreach "force" the evaluation of the stream? It it a
Daniel> doing Java-style for-loop (that needs to know the size of the
Daniel> list being looped over?)

foreach is imperative, it says "do this operation RIGHT NOW on each
element". For the operation to be performed, the elements have to be
computed.

Daniel> But why exactly does the Stream hold on to a reference to the
Daniel> head?

Well, the Stream object *is* the head of the stream.

Maybe we're having a terminology confusion here between "head" referring
to the beginning of the stream, and "head" referring to the first data
item in the stream? Informally, you might see the word "head" used to
refer to either.

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Code review this script

>>>>> "Daniel" == Daniel Wellman writes:

Daniel> Also, would it help if I rewrote the parse() function to take a
Daniel> second argument, a function?

Well, that works, but it's less flexible, because then client code
that uses parse() must always read the entire file. If you return an
Iterator or a Stream, client code can decide for itself how much of the
input to consume. In your particular program you might not care,
of course.

Daniel Wellman
Joined: 2009-02-02,
User offline. Last seen 1 year 48 weeks ago.
Re: Code review this script

Thanks for pointing out the confusion in meaning of the term 'head', I think
it's where I'm getting confused. Let's look at an example:

Seth Tisue wrote:
>
> It's true that Stream caches results, but if you don't keep a reference to
> the head of the
> stream, then the cached results become eligible for garbage collection.
> So in fact, Stream is very good to use when you want to avoid keeping the
> entire input in memory.
>

So what does head mean in this context? The object itself (a Stream), or
the first data element in the Stream?

> foreach is imperative, it says "do this operation RIGHT NOW on each
> element". For the operation to be performed, the elements have to be
> computed.
>

So that means that

val s = Stream.from(1).map(_ => new Array[Int](10000000))
s.foreach(println)

A reference to "s" exists. Foreach evaluates every item in the list, but
since there's a lingering reference to "s", all the items in the list are
held on to while evaluating the loop.

But for this one:

def foo(s:Stream[Array[Int]]) = s.foreach(println)
foo(Stream.from(1).map(_ => new Array[Int](10000000)))

I don't see how a reference to the stream is held in this case, probably
because I don't understand how this gets evaluated. I think what's
happening is that a call to foo() is recognized, the contents of the
arguments to foo() are evaluated, creating a stream, then that stream is
mapped to yield a new array. This result is then supplied to the foo()
method, which then iterates through each item and prints it.

Is the leaky reference the local variable 's' as defined in foo? A
reference to the Stream object still exists because it's being held during
the evaluation of the function?

Thanks for helping me clarify this,
Dan

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Code review this script

>>>>> "Daniel" == Daniel Wellman writes:

Daniel> Thanks for pointing out the confusion in meaning of the term
Daniel> 'head', I think it's where I'm getting confused. Let's look at
Daniel> an example:

Daniel> Seth Tisue wrote:
>> It's true that Stream caches results, but if you don't keep a
>> reference to the head of the stream, then the cached results become
>> eligible for garbage collection. So in fact, Stream is very good to
>> use when you want to avoid keeping the entire input in memory.

Daniel> So what does head mean in this context? The object itself (a
Daniel> Stream), or the first data element in the Stream?

The object itself.

>> foreach is imperative, it says "do this operation RIGHT NOW on each
>> element". For the operation to be performed, the elements have to
>> be computed.

Daniel> So that means that
Daniel> val s = Stream.from(1).map(_ => new Array[Int](10000000))
Daniel> s.foreach(println)

Daniel> A reference to "s" exists. Foreach evaluates every item in the
Daniel> list, but since there's a lingering reference to "s", all the
Daniel> items in the list are held on to while evaluating the loop.

Yes. (Though it would be more precise not to say "there's a lingering
reference to s", but rather "s contains a lingering reference to the
Stream".)

Daniel> But for this one:

Daniel> def foo(s:Stream[Array[Int]]) = s.foreach(println)
Daniel> foo(Stream.from(1).map(_ => new Array[Int](10000000)))

Daniel> I don't see how a reference to the stream is held in this case
Daniel> probably because I don't understand how this gets evaluated. I
Daniel> think what's happening is that a call to foo() is recognized,
Daniel> the contents of the arguments to foo() are evaluated, creating
Daniel> a stream, then that stream is mapped to yield a new array.

To yield a new stream of arrays, you mean. But note that at most one
array has been created thus far, since map is lazy. (I say "at most
one" because streams in Scala aren't fully lazy. The head is strict;
the tail is lazy.)

Daniel> This result is then supplied to the foo() method, which then
Daniel> iterates through each item and prints it.

Yes. Note that each array will not be created until immediately before
being printed.

Daniel> Is the leaky reference the local variable 's' as defined in
Daniel> foo? A reference to the Stream object still exists because
Daniel> it's being held during the evaluation of the function?

Yes, exactly.

ounos
Joined: 2008-12-29,
User offline. Last seen 3 years 44 weeks ago.
Re: Code review this script
2009/3/29 Johannes Rudolph <johannes.rudolph@googlemail.com>

BTW: Could the Hotspot JIT-compiler eliminate references by inlining
and optimizing some code? Makes the JVM spec/JLS any promises if an
object is to be retained if it seems to be in a local variable (but is
actually optimized away), or is reachability calculation unrelated to
JIT?

Johannes


Hotspot will reuse the register at first opportunity, so if the object is only referenced from a local field, it's finalization may run earlier than expected - even during a method to that object! This is a potential problem only for objects which have clean-up actions associated with them (e.g. via finalize() or by polling a ReferenceQueue). There is also some ongoing work to support something like "keepAlive(reference)" to solve the problem. )

I have also some info on this issue towards the bottom of this (the css is currently broken, sorry)
http://code-o-matic.blogspot.com/2009/01/subtleties-of-phantomreference-and.html

Dimitris
Florian Hars
Joined: 2008-12-18,
User offline. Last seen 42 years 45 weeks ago.
Re: Code review this script

Daniel Wellman schrieb:
> Also, would it help if I rewrote the parse() function to take a second
> argument, a function? e.g.
>
> def parse(lines: Stream[String]) ...

No, this cannot help, as per Seth's re-closing of bug 692, you are keeping
a reference to the head of the stream once you write a function that takes
a stream as an argument and so will eventually hold the whole stream in
memory. So, essentially, Johannes is right and Seth is wrong. For anything
but one-liner toy code, Streams are totally useless if you want to avoid to
hold the whole input in memory.

- Florian,

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Code review this script

>>>>> "Florian" == Florian Hars writes:

Florian> Daniel Wellman schrieb:
>> Also, would it help if I rewrote the parse() function to take a
>> second argument, a function? e.g.
>>
>> def parse(lines: Stream[String]) ...

Florian> No, this cannot help, as per Seth's re-closing of bug 692, you
Florian> are keeping a reference to the head of the stream once you
Florian> write a function that takes a stream as an argument and so
Florian> will eventually hold the whole stream in memory.

Yes, you're right -- because Daniel's proposed change would also cause
that same function to evaluate the whole stream (since the for loop
he wrote desugars to a call to Stream.foreach).

Florian> So, essentially, Johannes is right and Seth is wrong. For
Florian> anything but one-liner toy code, Streams are totally useless
Florian> if you want to avoid to hold the whole input in memory.

I wouldn't describe the situation in nearly such drastic terms, but
whatever. I guess we can all agree you have to be careful.

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland