This page is no longer maintained — Please continue to the home page at www.scala-lang.org

REPL: null-values in parallel ranges?

170 replies
Aleksandar Prokopec
Joined: 2010-04-04,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

I agree with this comment, Daniel. We now have a way to prevent (old)
code from using a parallel collection unless it explicitly states that.

Alex

> Now that old code can't be passed a parallel collection, I think it is
> a mistake to let programmers who are explicitly going for parallel
> code to get away with incorrect code. If a parallel sequence ever
> promises a sequential foreach, it will never be reverted.
>
> So, what is the _gain_ of making foreach sequential? That it will
> allow people who explicitly ask for a parallel collection to do
> non-parallel things with it without asking for a non-parallel
> collection first? I fail to see the use in that.
>
> It seems people are afraid that people trying to write parallel code
> will make mistakes, and that the solution to that is make their code
> non-parallel.
>
>

Johannes Rudolph 2
Joined: 2010-02-12,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

On Fri, Apr 8, 2011 at 5:18 AM, Bill Venners wrote:
> I think .par meaning parallel everything provides the clearest code.
> That for expression I showed earlier would do exactly what it looks
> like it does, run that for loop in parallel. There's value in that.
> And "coll.par.sfor" look like it means "a parallel collection with a
> sequential foreach" to me, which is exactly what it would mean.

Here, I'm not so sure if "the looks" are relevant here. Because, I
don't think that's the real use case. Let me say first, that I've
pretty much agreed with all of your well-reasoned contributions to
this discussion. Particularly, these points:
* even if all others operations are parallel, `foreach` seems like a
special case because it is likely that a concurrent foreach will
produce all sorts of problems
* however, there are still good use cases for both a sequential and
parallel version of `foreach`
* the question is now what `foreach` should do by default and how to
access the other version

That said, I don't see anyone using this particular construct:
`coll.par.sfor`. Why should anyone use it? Only if one wants to use
the same collection for both parallel operations and a sequential
foreach, but then again you could just use coll.seq before doing a
sequential foreach (if coll is not known to be sequential).
At the point where you would make the collection parallel you will
often not know how it is used later on. You know if you need a
sequential `foreach` when you are writing the loop body, not
necessarily when you are creating the collection. That seems like the
basic flaw in your suggestion: `sfor` looks like it would depend on
the collection if a parallel `foreach` is safe but in most (all?)
cases it depends on the loop body.

So, even without `sfor`, the rules how to handle `foreach` are pretty
simple: If your method accepts an AnySeq (or AnyWhatever) to be as
generic as possible and uses `foreach` to loop over the elements (or a
for comprehension), you have to make sure the loop body can be
executed concurrently. If it isn't safe to execute concurrently (or
you don't know), you have to use `coll.seq` to make it sequential
before. This way, the one actually writing the loop over an
AnyCollection is the one to make sure everything works concurrently.

In short:
AnyWhatever.foreach is like the proposed 'pforeach', i.e. possibly
executed in parallel
AnyWhatever.seq.foreach is like the old foreach: always executed sequential
AnyWhatever.par.foreach is always executed parallel

Do we really need more?

Bill Venners
Joined: 2008-12-18,
User offline. Last seen 31 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

Hi Johannes,

On Fri, Apr 8, 2011 at 1:35 AM, Johannes Rudolph
wrote:
> On Fri, Apr 8, 2011 at 5:18 AM, Bill Venners wrote:
>> I think .par meaning parallel everything provides the clearest code.
>> That for expression I showed earlier would do exactly what it looks
>> like it does, run that for loop in parallel. There's value in that.
>> And "coll.par.sfor" look like it means "a parallel collection with a
>> sequential foreach" to me, which is exactly what it would mean.
>
> Here, I'm not so sure if "the looks" are relevant here. Because, I
> don't think that's the real use case. Let me say first, that I've
> pretty much agreed with all of your well-reasoned contributions to
> this discussion. Particularly, these points:
>  * even if all others operations are parallel, `foreach` seems like a
> special case because it is likely that a concurrent foreach will
> produce all sorts of problems
>  * however, there are still good use cases for both a sequential and
> parallel version of `foreach`
>  * the question is now what `foreach` should do by default and how to
> access the other version
>
Agreed.

> That said, I don't see anyone using this particular construct:
> `coll.par.sfor`. Why should anyone use it? Only if one wants to use
> the same collection for both parallel operations and a sequential
> foreach, but then again you could just use coll.seq before doing a
> sequential foreach (if coll is not known to be sequential).
> At the point where you would make the collection parallel you will
> often not know how it is used later on. You know if you need a
> sequential `foreach` when you are writing the loop body, not
> necessarily when you are creating the collection. That seems like the
> basic flaw in your suggestion: `sfor` looks like it would depend on
> the collection if a parallel `foreach` is safe but in most (all?)
> cases it depends on the loop body.
>
Well in that design .par.sfor would give you a collection on which all
functional higher-order methods execute in parallel but foreach
executes sequentially. My guess is that the functional methods would
be sticky. If I call map on such a collection I'd get the same kind
back, all parallel but with a sequential foreach. So then many
functional data transformations later when you used a foreach it would
be sequential.

I'm guessing one of the styles that people will want to use with
parallel collections is parallel functional style methods with
sequential foreach, because it is easier and less error prone.

> So, even without `sfor`, the rules how to handle `foreach` are pretty
> simple: If your method accepts an AnySeq (or AnyWhatever) to be as
> generic as possible and uses `foreach` to loop over the elements (or a
> for comprehension), you have to make sure the loop body can be
> executed concurrently. If it isn't safe to execute concurrently (or
> you don't know), you have to use `coll.seq` to make it sequential
> before. This way, the one actually writing the loop over an
> AnyCollection is the one to make sure everything works concurrently.
>
> In short:
> AnyWhatever.foreach is like the proposed 'pforeach', i.e. possibly
> executed in parallel
> AnyWhatever.seq.foreach is like the old foreach: always executed sequential
> AnyWhatever.par.foreach is always executed parallel
>
> Do we really need more?
>
You can say .seq anytime you want a sequential for loop. But if
usually that's what you want it would be more convenient to get it by
default. Plus it is better to have the safer thing the default,
because that would make it harder for users to make mistakes.

But certainly if .par gives parallel everything in 2.9, including
foreach, you wouldn't need something like .sfor in 2.9. If it turned
out to be useful after people get real experience, it could be added
later. But once .par gives parallel everything it can't be changed,
and right there is the big decision for 2.9 I think.

I'm worried that if .par gives parallel everything it makes parallel
collections harder and more risky to use, and I think that trumps the
inconsistency. Scala collections are already inconsistent because
Scala itself leans towards the functional style, but allows the
imperative style. Every method in collections except foreach is a
functional style method, but foreach is imperative style. Foreach is
the odd man out and is non-uniform. Because of that difference, using
parallel foreach is an order of magnitude more difficult and
error-prone than using a parallel any other method on collections.

Parallel collections are power tools. They will make you more
productive at making things perform well, but can also give you
concurrency bugs when you make a mistake. Again this is because Scala
supports both functional and imperative styles. When you hand someone
a chain saw, you'd lock the chain so it is less likely that person
will accidentally pull the trigger and cut themselves when they grab
it. Let them unlock it. If you pass someone a gun, you'd make sure the
safety was on first. If you pass them a pocket knife, you'd close it
first. Let them open it when they are ready. I think when you pass
someone a parallel collection when they invoke .par, foreach should be
sequential. That means that .par would mean functional parallel -- all
the functional style methods run in parallel, foreach does not, unless
the user flips off the safety switch.

A pfor method could be something you call any collection (it could be
declared on AnyTraversable alongside seq and par) that gives you that
parallel foreach. So a use case would look like:

for (i <- (1 until 10000).pfor)
a(i) = a(i) * scale

That's pretty clear code. Downside is this code would be misleading:

for (i <- (1 until 10000).par)
a(i) = a(i) * scale

People might write that thinking the for loop would run in parallel.
People reading the code might think that. This would make people less
productive slightly, but I think not as less productive is they would
be if they accidentally get a parallel for loop when it looks like
sequential one.

So on AnyTraversable:

.seq would give you a fully sequential collection
.par would give you a functional parallel collection (foreach still sequential)
.pfor would give you a fully parallel collection (including foreach)

Alternatively, pfor could just give you the same thing with just a
parallel foreach. Then to get a fully parallel one you'd say .par.pfor
or .pfor.par, etc. But I don't see a use case for something with a
parallel foreach and sequential everything else, so I'd just have pfor
go fully parallel. pfor doesn't look like that so maybe there's a
better name.

Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

>>>>> "Bill" == Bill Venners writes:

Bill> But certainly if .par gives parallel everything in 2.9, including
Bill> foreach, you wouldn't need something like .sfor in 2.9. If it
Bill> turned out to be useful after people get real experience, it
Bill> could be added later. But once .par gives parallel everything it
Bill> can't be changed, and right there is the big decision for 2.9 I
Bill> think.

".par gives parallel everything" is consistent and simple, and did I
mention consistent? I'm not seeing the case for doing it any other way.

As you point out:

Bill> Downside is this code would be misleading:
Bill> for (i <- (1 until 10000).par) a(i) = a(i) * scale

For this not to run in parallel would verge on bizarre, IMO.

Parallel is parallel.

Bill Venners
Joined: 2008-12-18,
User offline. Last seen 31 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

Hi Seth,

On Fri, Apr 8, 2011 at 11:59 AM, Seth Tisue wrote:
>>>>>> "Bill" == Bill Venners writes:
>
>  Bill> But certainly if .par gives parallel everything in 2.9, including
>  Bill> foreach, you wouldn't need something like .sfor in 2.9. If it
>  Bill> turned out to be useful after people get real experience, it
>  Bill> could be added later. But once .par gives parallel everything it
>  Bill> can't be changed, and right there is the big decision for 2.9 I
>  Bill> think.
>
> ".par gives parallel everything" is consistent and simple, and did I
> mention consistent?  I'm not seeing the case for doing it any other way.
>
Simple to understand, yes, but to use?

> As you point out:
>
>  Bill> Downside is this code would be misleading:
>  Bill> for (i <- (1 until 10000).par) a(i) = a(i) * scale
>
> For this not to run in parallel would verge on bizarre, IMO.
>
> Parallel is parallel.
>
Yes, it sure looks like that should run in parallel. If .par means
fully parallel, then another method could mean "functional methods in
parallel and foreach sequential". fpar perhaps. 2.9 could provide just
.par and if experience shows people could use a second method it could
be added later.

Bill

> --
> Seth Tisue | Northwestern University | http://tisue.net
> lead developer, NetLogo: http://ccl.northwestern.edu/netlogo/
>

Ismael Juma 2
Joined: 2011-01-22,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges?

On Mon, Apr 4, 2011 at 3:45 PM, Aleksandar Prokopec
wrote:
>> I have a question. If I do seq.par.map(identity), will the resulting
>> Seq be == to the source Seq (i.e. the order of the Seq will be
>> maintained even if multiple operations are executed in parallel)?
>
> Yes, unless there is a bug somewhere I've missed, it most definitely should.

You guessed correctly that the reason for my question was that I had
seen different behaviour. My suspicion started when I saw Paul's fix
for the REPL that mentioned foreach, but looking at the code I was
only saw map, take and mkString being used. While trying to create a
test case, I ended up getting NPEs and AIOOBEs while simply using map
and mkString on a Seq created from calling par on a Range. So the
wrong order I was seeing was actually just a symptom, but the problem
seems to be deeper (unless I am missing something). See the details in
the following ticket:

https://lampsvn.epfl.ch/trac/scala/ticket/4459

Best,
Ismael

Johannes Rudolph 2
Joined: 2010-02-12,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

On Fri, Apr 8, 2011 at 7:00 PM, Bill Venners wrote:
> I'm guessing one of the styles that people will want to use with
> parallel collections is parallel functional style methods with
> sequential foreach, because it is easier and less error prone.

I understand where you are getting at, actually that's what I thought
before myself. But I changed my mind and I think we now disagree
whether one has to worry about programmers who deliberately used
'.par' to make their collection parallel and then fail to understand
the consequences of using `foreach` on this collection. Especially
when the choice is to give up consistency, instead, I'd wager for
consistency. Shooting yourself in the foot with concurrency is easy
enough anyways.

The rules I posted before were for the other cases (which I'm more
worried about to get things right from the beginning): the cases where
the person calling `.par` is not the one writing the code using the
collection in the end. Here we should have some advice how to handle
those cases correctly without running into the parallel foreach trap.

BTW: Given we had `.pfor`: In a library method you want to loop over a
collection passed into the method. Now you write the loop body and
know it is thread-safe. Still you don't want to make the traversal
parallel in every case (as parallel calculations come with a price)
but you want the caller of the method to decide which that is. How
would you do that elegantly with `.pfor`?

dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges?
On Sun, Apr 10, 2011 at 08:47, Ismael Juma <ismael@juma.me.uk> wrote:

You guessed correctly that the reason for my question was that I had
seen different behaviour. My suspicion started when I saw Paul's fix
for the REPL that mentioned foreach, but looking at the code I was
only saw map, take and mkString being used. While trying to create a
test case, I ended up getting NPEs and AIOOBEs while simply using map
and mkString on a Seq created from calling par on a Range. So the
wrong order I was seeing was actually just a symptom, but the problem
seems to be deeper (unless I am missing something). See the details in
the following ticket:

https://lampsvn.epfl.ch/trac/scala/ticket/4459


That code won't compile once the hierarchy changes are committed, btw.

-- Daniel C. Sobral

I travel to the future all the time.
Ismael Juma 2
Joined: 2011-01-22,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges?


On 10 Apr 2011 17:26, "Daniel Sobral" <dcsobral@gmail.com> wrote:
> That code won't compile once the hierarchy changes are committed, btw.

I don't think that matters. After the refactoring,  change Seq to AnySeq.

Best,
Ismael

Bill Venners
Joined: 2008-12-18,
User offline. Last seen 31 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

Hi Johannes,

On Sun, Apr 10, 2011 at 9:24 AM, Johannes Rudolph
wrote:
> On Fri, Apr 8, 2011 at 7:00 PM, Bill Venners wrote:
>> I'm guessing one of the styles that people will want to use with
>> parallel collections is parallel functional style methods with
>> sequential foreach, because it is easier and less error prone.
>
> I understand where you are getting at, actually that's what I thought
> before myself. But I changed my mind and I think we now disagree
> whether one has to worry about programmers who deliberately used
> '.par' to make their collection parallel and then fail to understand
> the consequences of using `foreach` on this collection. Especially
> when the choice is to give up consistency, instead, I'd wager for
> consistency. Shooting yourself in the foot with concurrency is easy
> enough anyways.
>
There may still be a design that gives you both consistency,
convenience, and safety that hasn't been suggested yet. That would be
ideal.

> The rules I posted before were for the other cases (which I'm more
> worried about to get things right from the beginning): the cases where
> the person calling `.par` is not the one writing the code using the
> collection in the end. Here we should have some advice how to handle
> those cases correctly without running into the parallel foreach trap.
>
> BTW: Given we had `.pfor`: In a library method you want to loop over a
> collection passed into the method. Now you write the loop body and
> know it is thread-safe. Still you don't want to make the traversal
> parallel in every case (as parallel calculations come with a price)
> but you want the caller of the method to decide which that is. How
> would you do that elegantly with `.pfor`?
>
AnySeq would mean a collection that could be either parallel or
sequential, and whose foreach could be parallel or sequential. So that
does let the caller decide. I'm not sure how often people will want to
use AnySeq, AnyMap, etc., on APIs versus letting the parallelism be an
implementation decision taken to improve performance. I'm certainly
not planning on changing the collection types up to AnyX in APIs any
time soon, though I will try and see what kind of performance benefits
I can get from them internally.

The trouble with pfor, though, is not just that it makes .par
inconsistent but that it also complicates the space of possibilities.
If you do take an AnySeq, you may want to make sure it has a
sequential for. So you'd want to be able to call .sfor on it. So
you'd need both sfor and pfor. I'm guessing we might want to use the
Parallel marker trait to recognize a parallel collection in a pattern
match. Well you might reasonably want to be able to recognize a
parallel collection with a sequential foreach also. So you'd need
ParallelForeach and SequentialForeach traits? There are too many
combinations.

Perhaps the best best design seen so far is simply the original one
EPFL had at some point prior to 2.9.0.RC1. Foreach is always
sequential, and parallel collections also have a pforeach that gives
you a parallel foreach. Now the space is just two kinds of collection,
sequential and parallel. Though the contract of foreach still would
allow for parallel collections, as it always has, foreach would always
be sequential on any collection coming from the Scala API.

To those worried about inconsistency of .par, I'd suggest you look at
your existing code and count how many uses of these methods are
already ready to be run in parallel. In my code I'd guess more than 99
out of 100 times my uses of functional-style methods like map,
flatMap, and filter, etc., are already ready to be run in parallel.
But on foreach, more than 99 times out of 100 times it would NOT be
ready to be run in parallel. I think that's not just because I haven't
been thinking about it. It's because that's how I want to use foreach.
99 times out of 100 I actually want a sequential foreach no matter
whether the rest of the collection is sequential or parallel. So to
force people to say .seq .seq .seq .seq .seq .seq .seq in their for
loops 99 times just so they don't have to say it the one time they
want a parallel for lop is inconvenient as well as error prone. Or put
another way, a .par that gives you a parallel foreach is inconsistent
with how people will most often want to *use* parallel collections.

I've seen three examples shown so far that justify the need for a
parallel foreach. There are others, but these are cherry picked. How
often do you think you'd need to use a parallel foreach to scale an
array?

for (i <- (0 until a.length).par)
a(i) = a(i) * scale

Probably in less than 1 out of a 100 for loops. And frankly, except on
large arrays for which you actually have evidence it matters to
performance, it would probably be premature optimization to do it that
way instead of:

a map (_ * scale)

How often would you feel the need to do an exists like this:

@volatile var flag = false
for (i <- (0 until a.length).par)
if (a(i) < 0) flag = true

Instead of:

a exists (_ < 0)

The earlier one is a performance optimization, which if you don't have
actual evidence of a performance problem I'd call a premature
optimization. It is more complicated to read than the latter, and more
error prone. Or how often would you feel the need to do this:

@volatile var sum = 0
for (i <- (0 until a.length).par)
sum += a(i)

Instead of:

a.sum

Again it is a performance optimization that is more verbose and error
prone than the functional alternative. And speaking of error prone,
are you sure you'd want to run that for loop in parallel?

This seems useful:

val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- (0 until n).par)
cmap.put(i, i)

But looking at your existing code again, how often have you been
populating a concurrent collection like this? From time to time, but
likely less than 1 in a 100 for loops. Moreover, if you don't actually
have evidence there's a performance problem, I'd say a sequential
population would be just as good:

val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- 0 until n)
cmap.put(i, i)

That said, one thing that bugged me about pforeach is I couldn't use a
parallel foreach via a for expression. But truth is, making pforeach a
tad uglier might be good from human factors perspective. It would
discourage its use. But on the other hand, if it is just being used
now and then when there is evidence of a performance problem, this
code isn't too bad looking:

(0 until a.length) pforeach { i =>
a(i) = a(i) * scale
}

Moreover, the pforeach design doesn't mean you absolutely can't get a
parallel foreach out of a for expression, just that you couldn't get
one out of the collections coming from the Scala API. The 2.8
collections design makes it quite straightforward to create your own
collection implementations, and the contract for foreach allows a
parellel implementation. If you're doing some image processing where
you need to do a bunch of scaling on large arrays, with just a few
lines of code you could create your own AnySeq on which foreach
forwards to pforeach. Then you could do this:

val a = MyParallelForeachAnySeq(originalA)
for (i <- 0 until a.length)
a(i) = a(i) * scale

So it would be possible. But requiring people to say pforeach or write
their own wrapper class to get a parallel for loop would be enough
resistence to push people towards using sequential foreach most of the
time, and I think that's important, because people make mistakes.
Making a parallel foreach the default for .par is kind of like putting
the pilot eject button right next to the intercom button on an
airplane cockpit. You probably want to put the eject button behind a
little door at the very least.

Everything has software in it nowadays, and over time more and more
apps will be running on multi-core CPUs, so I expect languages like
Scala, Clojure, Fortress, etc., are going to be used more and more.
The more Scala gets used, the greater chance that programming errors
will do serious harm. Scala code could show up in an airplane someday,
or an nuclear power plant, or a spacecraft, or a chain saw. Scala code
will be written by all kinds of programmers, including ones who don't
quite understand threads well. Even the programmers that do understand
threads well are human and will from time to time make mistakes. If
the convenience argument doesn't sway you, then hopefully the safety
one will.

Bill

>
> --
> Johannes
>
> -----------------------------------------------
> Johannes Rudolph
> http://virtual-void.net
>

Bill Venners
Joined: 2008-12-18,
User offline. Last seen 31 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

Hi Johannes,

On Sun, Apr 10, 2011 at 9:24 AM, Johannes Rudolph
wrote:
> On Fri, Apr 8, 2011 at 7:00 PM, Bill Venners wrote:
>> I'm guessing one of the styles that people will want to use with
>> parallel collections is parallel functional style methods with
>> sequential foreach, because it is easier and less error prone.
>
> I understand where you are getting at, actually that's what I thought
> before myself. But I changed my mind and I think we now disagree
> whether one has to worry about programmers who deliberately used
> '.par' to make their collection parallel and then fail to understand
> the consequences of using `foreach` on this collection. Especially
> when the choice is to give up consistency, instead, I'd wager for
> consistency. Shooting yourself in the foot with concurrency is easy
> enough anyways.
>
There may still be a design that gives you both consistency,
convenience, and safety that hasn't been suggested yet. That would be
ideal.

> The rules I posted before were for the other cases (which I'm more
> worried about to get things right from the beginning): the cases where
> the person calling `.par` is not the one writing the code using the
> collection in the end. Here we should have some advice how to handle
> those cases correctly without running into the parallel foreach trap.
>
> BTW: Given we had `.pfor`: In a library method you want to loop over a
> collection passed into the method. Now you write the loop body and
> know it is thread-safe. Still you don't want to make the traversal
> parallel in every case (as parallel calculations come with a price)
> but you want the caller of the method to decide which that is. How
> would you do that elegantly with `.pfor`?
>
AnySeq would mean a collection that could be either parallel or
sequential, and whose foreach could be parallel or sequential. So that
does let the caller decide. I'm not sure how often people will want to
use AnySeq, AnyMap, etc., on APIs versus letting the parallelism be an
implementation decision taken to improve performance. I'm certainly
not planning on changing the collection types up to AnyX in APIs any
time soon, though I will try and see what kind of performance benefits
I can get from them internally.

The trouble with pfor, though, is not just that it makes .par
inconsistent but that it also complicates the space of possibilities.
If you do take an AnySeq, you may want to make sure it has a
sequential for. So you'd want to be able to call .sfor on it. So
you'd need both sfor and pfor. I'm guessing we might want to use the
Parallel marker trait to recognize a parallel collection in a pattern
match. Well you might reasonably want to be able to recognize a
parallel collection with a sequential foreach also. So you'd need
ParallelForeach and SequentialForeach traits? There are too many
combinations.

Perhaps the best best design seen so far is simply the original one
EPFL had at some point prior to 2.9.0.RC1. Foreach is always
sequential, and parallel collections also have a pforeach that gives
you a parallel foreach. Now the space is just two kinds of collection,
sequential and parallel. Though the contract of foreach still would
allow for parallel collections, as it always has, foreach would always
be sequential on any collection coming from the Scala API.

To those worried about inconsistency of .par, I'd suggest you look at
your existing code and count how many uses of these methods are
already ready to be run in parallel. In my code I'd guess more than 99
out of 100 times my uses of functional-style methods like map,
flatMap, and filter, etc., are already ready to be run in parallel.
But on foreach, more than 99 times out of 100 times it would NOT be
ready to be run in parallel. I think that's not just because I haven't
been thinking about it. It's because that's how I want to use foreach.
99 times out of 100 I actually want a sequential foreach no matter
whether the rest of the collection is sequential or parallel. So to
force people to say .seq .seq .seq .seq .seq .seq .seq in their for
loops 99 times just so they don't have to say it the one time they
want a parallel for lop is inconvenient as well as error prone. Or put
another way, a .par that gives you a parallel foreach is inconsistent
with how people will most often want to *use* parallel collections.

I've seen three examples shown so far that justify the need for a
parallel foreach. There are others, but these are cherry picked. How
often do you think you'd need to use a parallel foreach to scale an
array?

for (i <- (0 until a.length).par)
a(i) = a(i) * scale

Probably in less than 1 out of a 100 for loops. And frankly, except on
large arrays for which you actually have evidence it matters to
performance, it would probably be premature optimization to do it that
way instead of:

a map (_ * scale)

How often would you feel the need to do an exists like this:

@volatile var flag = false
for (i <- (0 until a.length).par)
if (a(i) < 0) flag = true

Instead of:

a exists (_ < 0)

The earlier one is a performance optimization, which if you don't have
actual evidence of a performance problem I'd call a premature
optimization. It is more complicated to read than the latter, and more
error prone. Or how often would you feel the need to do this:

@volatile var sum = 0
for (i <- (0 until a.length).par)
sum += a(i)

Instead of:

a.sum

Again it is a performance optimization that is more verbose and error
prone than the functional alternative. And speaking of error prone,
are you sure you'd want to run that for loop in parallel?

This seems useful:

val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- (0 until n).par)
cmap.put(i, i)

But looking at your existing code again, how often have you been
populating a concurrent collection like this? From time to time, but
likely less than 1 in a 100 for loops. Moreover, if you don't actually
have evidence there's a performance problem, I'd say a sequential
population would be just as good:

val cmap = new java.util.concurrent.ConcurrentHashMap[Int, Int]
for (i <- 0 until n)
cmap.put(i, i)

That said, one thing that bugged me about pforeach is I couldn't use a
parallel foreach via a for expression. But truth is, making pforeach a
tad uglier might be good from human factors perspective. It would
discourage its use. But on the other hand, if it is just being used
now and then when there is evidence of a performance problem, this
code isn't too bad looking:

(0 until a.length) pforeach { i =>
a(i) = a(i) * scale
}

Moreover, the pforeach design doesn't mean you absolutely can't get a
parallel foreach out of a for expression, just that you couldn't get
one out of the collections coming from the Scala API. The 2.8
collections design makes it quite straightforward to create your own
collection implementations, and the contract for foreach allows a
parellel implementation. If you're doing some image processing where
you need to do a bunch of scaling on large arrays, with just a few
lines of code you could create your own AnySeq on which foreach
forwards to pforeach. Then you could do this:

val a = MyParallelForeachAnySeq(originalA)
for (i <- 0 until a.length)
a(i) = a(i) * scale

So it would be possible. But requiring people to say pforeach or write
their own wrapper class to get a parallel for loop would be enough
resistence to push people towards using sequential foreach most of the
time, and I think that's important, because people make mistakes.
Making a parallel foreach the default for .par is kind of like putting
the pilot eject button right next to the intercom button on an
airplane cockpit. You probably want to put the eject button behind a
little door at the very least.

Everything has software in it nowadays, and over time more and more
apps will be running on multi-core CPUs, so I expect languages like
Scala, Clojure, Fortress, etc., are going to be used more and more.
The more Scala gets used, the greater chance that programming errors
will do serious harm. Scala code could show up in an airplane someday,
or an nuclear power plant, or a spacecraft, or a chain saw. Scala code
will be written by all kinds of programmers, including ones who don't
quite understand threads well. Even the programmers that do understand
threads well are human and will from time to time make mistakes. If
the convenience argument doesn't sway you, then hopefully the safety
one will.

Bill

>
> --
> Johannes
>
> -----------------------------------------------
> Johannes Rudolph
> http://virtual-void.net
>

ichoran
Joined: 2009-08-14,
User offline. Last seen 2 years 3 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Sun, Apr 10, 2011 at 4:34 PM, Bill Venners <bill@artima.com> wrote:

I've seen three examples shown so far that justify the need for a
parallel foreach. There are others, but these are cherry picked. How
often do you think you'd need to use a parallel foreach to scale an
array?

for (i <- (0 until a.length).par)
 a(i) = a(i) * scale

Probably in less than 1 out of a 100 for loops. And frankly, except on
large arrays for which you actually have evidence it matters to
performance, it would probably be premature optimization to do it that
way instead of:

a map (_ * scale)

Right, so what you're saying is that we don't need parallel collections for this at all unless we do performance testing.  Sequential is fine.
 
How often would you feel the need to do an exists like this:

@volatile var flag = false
for (i <- (0 until a.length).par)
 if (a(i) < 0) flag = true

Instead of:

a exists (_ < 0)

The earlier one is a performance optimization, which if you don't have
actual evidence of a performance problem I'd call a premature
optimization.

I don't think the earlier one is a performance optimization, not without a break anyway.

But again, your point is that parallel collections aren't necessary.
 
It is more complicated to read than the latter, and more
error prone. Or how often would you feel the need to do this:

@volatile var sum = 0
for (i <- (0 until a.length).par)
 sum += a(i)

Instead of:

a.sum

Again it is a performance optimization that is more verbose and error
prone than the functional alternative. And speaking of error prone,
are you sure you'd want to run that for loop in parallel?

I'm sure I wouldn't.  += isn't atomic, and even if it were, the accumulating is likely to be the bottleneck so you wouldn't speed it up anyway.  a.par.sum could properly divide-and-conquer the problem.  But why use parallel collections if you're not performance-bound?
 
On the other hand, if it is just being used
now and then when there is evidence of a performance problem, this
code isn't too bad looking:

(0 until a.length) pforeach { i =>
 a(i) = a(i) * scale
}

Yet again, why use parallel collections _at all_ unless there's evidence of a performance problem?

Here's something I surely would rather not do slowly sequentially:

for (datum <- data.par) {
  writeToFile( horriblySlowConversionToXML( datum ) )
}

Or maybe this:

for (imagename <- names.par) {
  writeNewImage( processImage( readOldImage( imagename ) ) )
}

Right now, you can do these things with futures, but it's awkward.  A simple for loop would be much easier.  Thus, I disagree that sequential for loops are the overwhelmingly common use case.  When you have sequential collections, of course your for loops look sequential.  With parallel collections, different use cases arise.

Again, and I don't think I can stress this point enough, there is no reason to use a parallel collection in any form unless one has some sort of performance issue.  How often are you going to have a performance problem when you use "map" but not when you use "foreach"?  If you don't have a performance problem, why are you using a parallel collection with its concomitant issues with debugging, memory usage, profiling, reasoning about order of operations, and so on?

The principle, I think, should be: when you need parallel collections, know that you are using them, and then be careful to do it sensibly.  Maybe a compiler plugin that that warned on using mutable structures in closures would help for those people who are still learning what one has to know when writing parallel code.

  --Rex

Aleksandar Prokopec
Joined: 2010-04-04,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges?

You're right, it shouldn't matter. I'll take a look at this.

Cheers,
Alex

6D-0KQdt-jygPFwFA [at] mail [dot] gmail [dot] com" type="cite">

On 10 Apr 2011 17:26, "Daniel Sobral" <dcsobral [at] gmail [dot] com" rel="nofollow">dcsobral@gmail.com> wrote:
> That code won't compile once the hierarchy changes are committed, btw.

I don't think that matters. After the refactoring,  change Seq to AnySeq.

Best,
Ismael


Bill Venners
Joined: 2008-12-18,
User offline. Last seen 31 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

Hi Rex,

> Right now, you can do these things with futures, but it's awkward.  A simple
> for loop would be much easier.  Thus, I disagree that sequential for loops
> are the overwhelmingly common use case.  When you have sequential
> collections, of course your for loops look sequential.  With parallel
> collections, different use cases arise.
>
> Again, and I don't think I can stress this point enough, there is no reason
> to use a parallel collection in any form unless one has some sort of
> performance issue.  How often are you going to have a performance problem
> when you use "map" but not when you use "foreach"?  If you don't have a
> performance problem, why are you using a parallel collection with its
> concomitant issues with debugging, memory usage, profiling, reasoning about
> order of operations, and so on?
>
I'm wondering this may be a main source of the different opinions
here. If very pointed solutions to performance problems is the main
way parallel collections will be used, then I agree that .par giving a
parallel foreach is fine. I'm imagining people will want to use
parallel collections more generally because the future is more and
more cores, and Scala will make it pretty easy to use them. Not
without risk, but pretty low risk. I think if you're careful, you can
avoid side effects when using parallel collections, other than with
foreach.

For example, imagine you're writing a library routine that takes a
Seq[Int] and does a half a page of of code processing on it, then
returns a Seq[String] say. This will be used by different apps. You're
not sure if it is a bottleneck, and you're not sure how big the Seqs
you get will usually be, but sometimes it could be large. You might
think you'd be shirking your duties to not run at least large datasets
in parallel. Maybe you'd check the size at the top of your routine. If
the size is greater than some threshold, you call .par on what was
passed, otherwise not, and the rest of your method uses that.

val anySeq = if (seq.size > 100) seq.par else seq

At the end you call .seq on what you return.

The parallelness of the collection passed in would be sticky. Each
time you transform a sequential collection you'd get another
sequential one as a result. But each time you transform a parallel
collection you'd get a parallel one as a result. So in the parallel
case, the entire body of your routine would be doing things in
parallel, and in there you may want to sprinkle a few for loops. Why?
Because they make for clearer code (or easier to write code) than
recursion, less error prone than a while loop, etc. If you need to
write .seq on them to make sure they execute sequentially, it's a pain
and error prone.

So that's where I'm coming from. I'm expecting that in the age of
multicore the parallel collections (excluding foreach) will be used
more generally than just as pointed performance optimization. But that
a parallel foreach should be used only as a performance optimization
because it is so much more difficult to get right. Some may consider
it premature optimization to use parallel collections more generally,
but the more cores that show up I expect some programmers will use
them more generally like this. This may be how we need to program in
the multi-core world.

And by the way, one other thing I realized this morning that would be
nice about having a different method name like pforeach or pareach is
that you can more easily search for them when you want to inspect the
code for potential errors.

Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com

dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Mon, Apr 11, 2011 at 14:49, Bill Venners <bill@artima.com> wrote:
Hi Rex,

> Right now, you can do these things with futures, but it's awkward.  A simple
> for loop would be much easier.  Thus, I disagree that sequential for loops
> are the overwhelmingly common use case.  When you have sequential
> collections, of course your for loops look sequential.  With parallel
> collections, different use cases arise.
>
> Again, and I don't think I can stress this point enough, there is no reason
> to use a parallel collection in any form unless one has some sort of
> performance issue.  How often are you going to have a performance problem
> when you use "map" but not when you use "foreach"?  If you don't have a
> performance problem, why are you using a parallel collection with its
> concomitant issues with debugging, memory usage, profiling, reasoning about
> order of operations, and so on?
>
I'm wondering this may be a main source of the different opinions
here. If very pointed solutions to performance problems is the main
way parallel collections will be used, then I agree that .par giving a
parallel foreach is fine. I'm imagining people will want to use
parallel collections more generally because the future is more and
more cores, and Scala will make it pretty easy to use them. Not
without risk, but pretty low risk. I think if you're careful, you can
avoid side effects when using parallel collections, other than with
foreach.

For that matter, I'm more inclined to work as you say than to resort to parallel only as optimization. However, I do not agree with how you follow it up.
 
The parallelness of the collection passed in would be sticky. Each
time you transform a sequential collection you'd get another
sequential one as a result. But each time you transform a parallel
collection you'd get a parallel one as a result. So in the parallel
case, the entire body of your routine would be doing things in
parallel, and in there you may want to sprinkle a few for loops. Why?
Because they make for clearer code (or easier to write code) than
recursion, less error prone than a while loop, etc. If you need to
write .seq on them to make sure they execute sequentially, it's a pain
and error prone.

And this is where I disagree with. If the goal is to get used to write multicore code, then having the code depend on sequential behavior is a problem. That dependency is what is error prone, not the parallelism. To paraphrase Miyagi, if you do guess-so parallelism, you are screwed. Either do it, or don't.

Or, in other words, if you need code to be sequential, mark it so. If you mark code to be parallel, make sure it truly is.

--
Daniel C. Sobral

I travel to the future all the time.
Matthew Pocock 3
Joined: 2010-07-30,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
I'm with Daniel on this one. If you've chosen to use a signature that admits parallel collections, rather than exercising your right to restrict it to the sequential implementations then you should expect parallelism and code accordingly. If this is too tricky, then explicitly convert your parallel collection to a sequential one for the operations that you want to execute sequentially, or re-write your signature to that only the sequential collections are admitted.
Matthew

On 11 April 2011 22:01, Daniel Sobral <dcsobral@gmail.com> wrote:

And this is where I disagree with. If the goal is to get used to write multicore code, then having the code depend on sequential behavior is a problem. That dependency is what is error prone, not the parallelism. To paraphrase Miyagi, if you do guess-so parallelism, you are screwed. Either do it, or don't.

Or, in other words, if you need code to be sequential, mark it so. If you mark code to be parallel, make sure it truly is.

--
Daniel C. Sobral

I travel to the future all the time.



--
Matthew Pocockmailto: turingatemyhamster@gmail.comgchat: turingatemyhamster@gmail.com msn: matthew_pocock@yahoo.co.ukirc.freenode.net: drdozer(0191) 2566550
bmaso
Joined: 2009-10-04,
User offline. Last seen 2 years 40 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti


On Mon, Apr 11, 2011 at 2:01 PM, Daniel Sobral <dcsobral@gmail.com> wrote:
On Mon, Apr 11, 2011 at 14:49, Bill Venners <bill@artima.com> wrote:
Hi Rex,

> Right now, you can do these things with futures, but it's awkward.  A simple
> for loop would be much easier.  Thus, I disagree that sequential for loops
> are the overwhelmingly common use case.  When you have sequential
> collections, of course your for loops look sequential.  With parallel
> collections, different use cases arise.
>
> Again, and I don't think I can stress this point enough, there is no reason
> to use a parallel collection in any form unless one has some sort of
> performance issue.  How often are you going to have a performance problem
> when you use "map" but not when you use "foreach"?  If you don't have a
> performance problem, why are you using a parallel collection with its
> concomitant issues with debugging, memory usage, profiling, reasoning about
> order of operations, and so on?
>
I'm wondering this may be a main source of the different opinions
here. If very pointed solutions to performance problems is the main
way parallel collections will be used, then I agree that .par giving a
parallel foreach is fine. I'm imagining people will want to use
parallel collections more generally because the future is more and
more cores, and Scala will make it pretty easy to use them. Not
without risk, but pretty low risk. I think if you're careful, you can
avoid side effects when using parallel collections, other than with
foreach.

For that matter, I'm more inclined to work as you say than to resort to parallel only as optimization. However, I do not agree with how you follow it up.
 
The parallelness of the collection passed in would be sticky. Each
time you transform a sequential collection you'd get another
sequential one as a result. But each time you transform a parallel
collection you'd get a parallel one as a result. So in the parallel
case, the entire body of your routine would be doing things in
parallel, and in there you may want to sprinkle a few for loops. Why?
Because they make for clearer code (or easier to write code) than
recursion, less error prone than a while loop, etc. If you need to
write .seq on them to make sure they execute sequentially, it's a pain
and error prone.

And this is where I disagree with. If the goal is to get used to write multicore code, then having the code depend on sequential behavior is a problem. That dependency is what is error prone, not the parallelism. To paraphrase Miyagi, if you do guess-so parallelism, you are screwed. Either do it, or don't.


Mr Miyagi said that? Was that in the Karate Kid I or II? I'm vaguely remembering something like that from the "wax-on, wax-off" sequence...but maybe you're paraphrasing?

For sure "Either do it, or don't" you are confusing with Yoda: "Do or do not! There is no try." You really need to get your pop cultural references straight!

Brian Maso
 
Or, in other words, if you need code to be sequential, mark it so. If you mark code to be parallel, make sure it truly is.

--
Daniel C. Sobral

I travel to the future all the time.



--
Best regards,
Brian Maso
(949) 395-8551
brian@blumenfeld-maso.com

dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti
On Mon, Apr 11, 2011 at 18:27, Brian Maso <brian@blumenfeld-maso.com> wrote:


Mr Miyagi said that? Was that in the Karate Kid I or II? I'm vaguely remembering something like that from the "wax-on, wax-off" sequence...but maybe you're paraphrasing?

For sure "Either do it, or don't" you are confusing with Yoda: "Do or do not! There is no try." You really need to get your pop cultural references straight!

http://www.imdb.com/title/tt0087538/quotes

Search for "road".

--
Daniel C. Sobral

I travel to the future all the time.
Ismael Juma 2
Joined: 2011-01-22,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

On Mon, Apr 11, 2011 at 6:49 PM, Bill Venners wrote:
> Some may consider it premature optimization to use parallel collections more generally,
> but the more cores that show up I expect some programmers will use
> them more generally like this. This may be how we need to program in
> the multi-core world.

Something to consider is that not all applications are "take-over the
machine" type. Just because there are many cores doesn't mean that
your app should use them all as they may be shared among many
applications (in some cases via a virtualisation layer).

Best,
Ismael

Bill Venners
Joined: 2008-12-18,
User offline. Last seen 31 weeks 5 days ago.
Re: Re: REPL: null-values in parallel ranges - Seq.foreach acti

Hi Daniel,

On Mon, Apr 11, 2011 at 3:25 PM, Daniel Sobral wrote:
> On Mon, Apr 11, 2011 at 18:27, Brian Maso wrote:
>>
>>
>> Mr Miyagi said that? Was that in the Karate Kid I or II? I'm vaguely
>> remembering something like that from the "wax-on, wax-off" sequence...but
>> maybe you're paraphrasing?
>>
>> For sure "Either do it, or don't" you are confusing with Yoda: "Do or do
>> not! There is no try." You really need to get your pop cultural references
>> straight!
>
> http://www.imdb.com/title/tt0087538/quotes
>
> Search for "road".
>
Ah, Daniel-san, you misunderstand what road mean. Left side of road is
imperative style. Program sequentially on left side. Safe. Right side
of road is functional style. Program parallel on right side. Safe.
Walk down middle, program parallel in functional style except for
foreach, which you program parallel in imperative style, sooner or
later get squished like grape.

A .par that provides a sequential foreach is not guess-so parallelism,
it is functional-style parallelism. The safe kind.

I think it is interesting that everyone else in this discussion seems
to prefer consistency in the sense that all methods on a collection
produced by .par are parallel. That is nice, but produces an
inconsistency in how people use the collections, because foreach is
already very inconsistent with the other methods. In the parallel code
you'd either need to call .seq or actually make the passed-in function
thread safe. In sequential code you don't need to worry about it. I
think that's a worse inconsistency than the one you're concerned
about.

With one .par you enter the world of parallel programming, but the
code thereafter looks exactly the same as sequential code. If .par
gives a parallel foreach, then every time I do a for without a yield
I'll feel a nudge to check to make sure I'm sure I'm sure this isn't
parallel code, because otherwise I need a .seq. That makes Scala
programming less smooth, and will produce errors. The two styles of
programming look identical after the .par.

I think what I'm hoping for as a Scala user is that I can pretty much
think and program the same way when using sequential and parallel
collections: program in a functional style when using the functional
style methods; program in an imperative style when using the lone
imperative style method, foreach.

There's no such thing as transparent parallelism. We do need to be
more careful when programming parallel that our passed-in functions
have no side effects. But it isn't that hard to pass in side-effect
free functions in sequential code for non-foreach methods. If you
remove the special case of foreach is different between sequential and
parallel, then I think programming in Scala post 2.9 will be a lot
more pleasant and less error prone.

Bill
----
Bill Venners
Artima, Inc.
http://www.artima.com

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland