This page is no longer maintained — Please continue to the home page at www.scala-lang.org

Rethinking filter

10 replies
odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.

We have a number of open tickets which all want that

for (x <- xs if guard) body

should expand to

for (x <- xs if (guard) body

rather than the current expansion

for (x <- xs filter (x => guard)) body

The motivations are

(1) you might want to write a guard that depends on side effects in body.
(2) you might want to avoid constructing a secondary list or other
data structure containing the filtered elements.

One problem is how to integrate this with for-yield collections. It
would be unacceptable to have different behaviors of guards depending
on whether we are in a for-loop or a for-yield.
The other problem is how to do this for all kinds of generators,
collections and others, in a uniform way.

Unlike in Haskell, there's no unit element that we can use in a
for-expression. So the only alternative seems to be to make use of
filterForeach, filterMap, filterFlatMap operations in the generator
classes. That is, for-expression expansion would still work as
described by Jorge:
>>>>

for (x <- c) f(x)

simply translates to:

c.foreach(x => f(x))

Likewise:

for (x <- c) yield f(x)

simply translates to:

c.map(x => f(x))

Something with a guard:

for (x <- c if p(x)) yield f(x)

translates to:

c.filter(x => p(x)).map(x => f(x))

And a nested for:

for (x1 <- c1; x2 <- c2) yield f(x1, x2)

translates to:

c1.flatMap(x1 => c2.map(x2 => f(x1, x2)))

<<<

But, there would be a post-processing step, where

c1.filter(p).map(f) is rewritten to c1.filterMap(p, f)
c1.filter(p).flatMap(f) is rewritten to c1.filterFlatMap(p, f)
c1.filter(p).foreach(f) is rewritten to c1.filterForeach(p, f)

These rewritings would only happen for filters/maps/flatMaps generated
in for-expressions and only if the c1 type had defined the filterXXX
operations. We don't want to demand that for every generator type,
because that would make it too complicated to define generators. But
collections would uniformly have those operations, and implement them
with lazy filters. For instance, in class TraversableLike:

def filterForeach[U](p: A => Boolean, f: A => U): Unit =
this foreach (x => if (p(x)) f(x))

def filterMap[That](p: A => Boolean, f: A => B)(implicit bf:
BuilderFactory[B, That, Repr]): Repr = {
val b = bf(repr)
this foreach (x => if (p(x)) b += f(x))
b.result
}

To deal with multiple filters, we should also extend the rewriting rules to

c1.filter(p).filterMap(q, f) is rewritten to c1.filterMap(p && q, f)
c1.filter(p).filterFlatMap(q, f) is rewritten to c1.filterFlapMap(p && q, f)
c1.filter(p).filterForeach(f) is rewritten to c1.filterForeach(p && q, f)

and apply rewritings as long as possible.

Question: Should we do this? Advantages I can see: (1) More efficient
operations through some amount of deforestation. (2) Less surprises
for users expecting imperative behavior.
Disadvantages: (1) It complicates the rules, and possibly the behavior
because now filterMap, etc would decide whether filter was strict or
lazy. (2) It would break some code.

Btw, one could also consider to go further with deforestation, for
instance with rules like this:

c1.flatMap(x => f(x).map(g)) rewrites to c1.mapFlatMap(g, f)

Cheers

Eastsun 2
Joined: 2008-12-20,
User offline. Last seen 42 years 45 weeks ago.
Re: Re thinking filter

+1
I'm looking for this so long time

It would make for expression more powerful and useful, and encourage us to
use for expression rather than map,filter,... etc.

Martin Odersky wrote:
>
> We have a number of open tickets which all want that
>
> for (x <- xs if guard) body
>
> should expand to
>
> for (x <- xs if (guard) body
>
> rather than the current expansion
>
> for (x <- xs filter (x => guard)) body
>
> The motivations are
>
> (1) you might want to write a guard that depends on side effects in body.
> (2) you might want to avoid constructing a secondary list or other
> data structure containing the filtered elements.
>
> One problem is how to integrate this with for-yield collections. It
> would be unacceptable to have different behaviors of guards depending
> on whether we are in a for-loop or a for-yield.
> The other problem is how to do this for all kinds of generators,
> collections and others, in a uniform way.
>
> Unlike in Haskell, there's no unit element that we can use in a
> for-expression. So the only alternative seems to be to make use of
> filterForeach, filterMap, filterFlatMap operations in the generator
> classes. That is, for-expression expansion would still work as
> described by Jorge:
>>>>>
>
> for (x <- c) f(x)
>
> simply translates to:
>
> c.foreach(x => f(x))
>
> Likewise:
>
> for (x <- c) yield f(x)
>
> simply translates to:
>
> c.map(x => f(x))
>
> Something with a guard:
>
> for (x <- c if p(x)) yield f(x)
>
> translates to:
>
> c.filter(x => p(x)).map(x => f(x))
>
> And a nested for:
>
> for (x1 <- c1; x2 <- c2) yield f(x1, x2)
>
> translates to:
>
> c1.flatMap(x1 => c2.map(x2 => f(x1, x2)))
>
> <<<
>
> But, there would be a post-processing step, where
>
> c1.filter(p).map(f) is rewritten to c1.filterMap(p, f)
> c1.filter(p).flatMap(f) is rewritten to c1.filterFlatMap(p, f)
> c1.filter(p).foreach(f) is rewritten to c1.filterForeach(p, f)
>
> These rewritings would only happen for filters/maps/flatMaps generated
> in for-expressions and only if the c1 type had defined the filterXXX
> operations. We don't want to demand that for every generator type,
> because that would make it too complicated to define generators. But
> collections would uniformly have those operations, and implement them
> with lazy filters. For instance, in class TraversableLike:
>
> def filterForeach[U](p: A => Boolean, f: A => U): Unit =
> this foreach (x => if (p(x)) f(x))
>
> def filterMap[That](p: A => Boolean, f: A => B)(implicit bf:
> BuilderFactory[B, That, Repr]): Repr = {
> val b = bf(repr)
> this foreach (x => if (p(x)) b += f(x))
> b.result
> }
>
> To deal with multiple filters, we should also extend the rewriting rules
> to
>
> c1.filter(p).filterMap(q, f) is rewritten to c1.filterMap(p && q, f)
> c1.filter(p).filterFlatMap(q, f) is rewritten to c1.filterFlapMap(p &&
> q, f)
> c1.filter(p).filterForeach(f) is rewritten to c1.filterForeach(p && q,
> f)
>
> and apply rewritings as long as possible.
>
> Question: Should we do this? Advantages I can see: (1) More efficient
> operations through some amount of deforestation. (2) Less surprises
> for users expecting imperative behavior.
> Disadvantages: (1) It complicates the rules, and possibly the behavior
> because now filterMap, etc would decide whether filter was strict or
> lazy. (2) It would break some code.
>
> Btw, one could also consider to go further with deforestation, for
> instance with rules like this:
>
> c1.flatMap(x => f(x).map(g)) rewrites to c1.mapFlatMap(g, f)
>
> Cheers
>

Jesper Nordenberg
Joined: 2008-12-27,
User offline. Last seen 42 years 45 weeks ago.
Re: Rethinking filter

martin odersky wrote:
> We have a number of open tickets which all want that
>
> for (x <- xs if guard) body
>
> should expand to
>
> for (x <- xs if (guard) body

(I assume this should read "for (x <- xs) if (guard) body")

>
> rather than the current expansion
>
> for (x <- xs filter (x => guard)) body
>
> The motivations are
>
> (1) you might want to write a guard that depends on side effects in body.
> (2) you might want to avoid constructing a secondary list or other
> data structure containing the filtered elements.

I'm against this suggestion if these are the only reasons for change.
It's not hard to manually move the guard inside the for-body.

Regarding the yield-case, the only motivation I can see is performance,
but it's not worth complicating the rewrite rules for that. It's better
to put more work into the optimizer in that case.

/Jesper Nordenberg

loverdos
Joined: 2008-11-18,
User offline. Last seen 2 years 27 weeks ago.
Re: Rethinking filter
Hi Martin,

On Thu, Oct 15, 2009 at 13:31, martin odersky <martin.odersky@epfl.ch> wrote:
We have a number of open tickets which all want that

 for (x <- xs if guard) body

should expand to

 for (x <- xs if (guard) body

rather than the current expansion

 for (x <- xs filter (x => guard)) body

The motivations are

(1) you might want to write a guard that depends on side effects in body.
(2) you might want to avoid constructing a secondary list or other
data structure containing the filtered elements.

One problem is how to integrate this with for-yield collections. It
would be unacceptable to have different behaviors of guards depending
on whether we are in a for-loop or a for-yield.
The other problem is how to do this for all kinds of generators,
collections and others, in a uniform way.

Unlike in Haskell, there's no unit element that we can use in a
for-expression. So the only alternative seems to be to make use of
filterForeach, filterMap, filterFlatMap operations in the generator
classes. That is, for-expression expansion would still work as
described by Jorge:
>>>>

 for (x <- c) f(x)

simply translates to:

 c.foreach(x => f(x))

Likewise:

 for (x <- c) yield f(x)

simply translates to:

 c.map(x => f(x))

Something with a guard:

 for (x <- c if p(x)) yield f(x)

translates to:

 c.filter(x => p(x)).map(x => f(x))

And a nested for:

 for (x1 <- c1; x2 <- c2) yield f(x1, x2)

translates to:

 c1.flatMap(x1 => c2.map(x2 => f(x1, x2)))

<<<

But, there would be a post-processing step, where

 c1.filter(p).map(f)   is rewritten to c1.filterMap(p, f)
 c1.filter(p).flatMap(f)   is rewritten to c1.filterFlatMap(p, f)
 c1.filter(p).foreach(f)   is rewritten to c1.filterForeach(p, f)

These rewritings would only happen for filters/maps/flatMaps generated
in for-expressions and only if the c1 type had defined the filterXXX
operations. We don't want to demand that for every generator type,
because that would make it too complicated to define generators. But
collections would uniformly have those operations, and implement them
with lazy filters. For instance, in class TraversableLike:

 def filterForeach[U](p: A => Boolean, f: A => U): Unit =
    this foreach (x => if (p(x)) f(x))

 def filterMap[That](p: A => Boolean, f: A => B)(implicit bf:
BuilderFactory[B, That, Repr]): Repr = {
    val b = bf(repr)
    this foreach (x => if (p(x)) b += f(x))
    b.result
 }

To deal with multiple filters, we should also extend the rewriting rules to

 c1.filter(p).filterMap(q, f)   is rewritten to c1.filterMap(p && q, f)
 c1.filter(p).filterFlatMap(q, f)   is rewritten to c1.filterFlapMap(p && q, f)
 c1.filter(p).filterForeach(f)   is rewritten to c1.filterForeach(p && q, f)

and apply rewritings as long as possible.

Question: Should we do this? Advantages I can see: (1) More efficient
operations through some amount of deforestation. (2) Less surprises
for users expecting imperative behavior.
Disadvantages: (1) It complicates the rules, and possibly the behavior
because now filterMap, etc would decide whether filter was strict or
lazy. (2) It would break some code.

Btw, one could also consider to go further with deforestation, for
instance with rules like this:

 c1.flatMap(x => f(x).map(g))  rewrites to    c1.mapFlatMap(g, f)

Cheers

Iulian Dragos 2
Joined: 2009-02-10,
User offline. Last seen 42 years 45 weeks ago.
Re: Re: Rethinking filter


On Thu, Oct 15, 2009 at 1:47 PM, Jesper Nordenberg <megagurka@yahoo.com> wrote:

The motivations are

(1) you might want to write a guard that depends on side effects in body.
(2) you might want to avoid constructing a secondary list or other
data structure containing the filtered elements.

I'm against this suggestion if these are the only reasons for change. It's not hard to manually move the guard inside the for-body.

Regarding the yield-case, the only motivation I can see is performance, but it's not worth complicating the rewrite rules for that. It's better to put more work into the optimizer in that case.


The optimizer needs to finds out that 'filter' is side-effect free, that map builds the exact same collection type, that the predicate and mapping function are again side-effect free and can be reordered... Maybe when an effect type-system is in place this should be done, but I wouldn't hold my breath.

If it was only for efficiency, I'd agree with you, but having guards that rely on side-effects in the mapping function is more intuitive for most people. The current behavior is a source of subtle bugs. And no matter who does the rewrite, the optimizer or the 'rewrite' rules, some complexity (and arguably, a lot more if we rely on the optimizer) will appear somewhere. Speccing it allows the programmer to rely on it, and actually simplifies the implementation. The added complexity in the specification doesn't seem that great to me, so I'm in favor of this change.

cheers,
iulian
 

/Jesper Nordenberg




--
« Je déteste la montagne, ça cache le paysage »
Alphonse Allais
Jesper Nordenberg
Joined: 2008-12-27,
User offline. Last seen 42 years 45 weeks ago.
Re: Rethinking filter

Iulian Dragos wrote:
> The optimizer needs to finds out that 'filter' is side-effect free, that
> map builds the exact same collection type, that the predicate and
> mapping function are again side-effect free and can be reordered...
> Maybe when an effect type-system is in place this should be done, but I
> wouldn't hold my breath.

Yes, an effect tracking system is required to make this "stream fusion"
work properly (although it doesn't have to be integrated with the type
system).

> If it was only for efficiency, I'd agree with you, but having guards
> that rely on side-effects in the mapping function is more intuitive for
> most people. The current behavior is a source of subtle bugs. And no
> matter who does the rewrite, the optimizer or the 'rewrite' rules, some
> complexity (and arguably, a lot more if we rely on the optimizer) will
> appear somewhere. Speccing it allows the programmer to rely on it, and
> actually simplifies the implementation. The added complexity in the
> specification doesn't seem that great to me, so I'm in favor of this change.

Well, I think the key question is: are functions passed to filter, map
and flatMap allowed to have side effects? And, if so, what restrictions
apply and what properties can be relied upon? If we start by specifying
this, I think the implementations details will fall out naturally.

/Jesper Nordenberg

Seth Tisue
Joined: 2008-12-16,
User offline. Last seen 34 weeks 3 days ago.
Re: Re: Rethinking filter

>>>>> "Jesper" == Jesper Nordenberg writes:

>> The motivations are
>> (1) you might want to write a guard that depends on side effects in
>> body. (2) you might want to avoid constructing a secondary list or
>> other data structure containing the filtered elements.

Jesper> I'm against this suggestion if these are the only reasons for
Jesper> change. It's not hard to manually move the guard inside the
Jesper> for-body.

Agree.

(1) doesn't seem important enough to do for 2.8, given everything else
that's going on, especially given that only imperative code (and rather
tricky imperative code, too) benefits from the additional complication.

As for (2), I'd point out that it's not hard to add ".view" to avoid
constructing intermediates.

Jorge Ortiz
Joined: 2008-12-16,
User offline. Last seen 29 weeks 4 days ago.
Re: Re: Rethinking filter
I agree with Seth and Jesper.

I rather like the simplicity of the current transformation. The "combine if it looks like this AND this method is defined" rule seems overly complicated and a can of worms. Would it apply the filterXXX methods based on the static type or the run-time type? How "deep" does the deforestation go?

That said, I haven't seen the open tickets regarding this issue. Which tickets are they? Perhaps they contain compelling arguments for change.

--j

On Thu, Oct 15, 2009 at 6:47 AM, Seth Tisue <seth@tisue.net> wrote:
>>>>> "Jesper" == Jesper Nordenberg <megagurka@yahoo.com> writes:

 >> The motivations are
 >> (1) you might want to write a guard that depends on side effects in
 >> body.  (2) you might want to avoid constructing a secondary list or
 >> other data structure containing the filtered elements.

 Jesper> I'm against this suggestion if these are the only reasons for
 Jesper> change.  It's not hard to manually move the guard inside the
 Jesper> for-body.

Agree.

(1) doesn't seem important enough to do for 2.8, given everything else
that's going on, especially given that only imperative code (and rather
tricky imperative code, too) benefits from the additional complication.

As for (2), I'd point out that it's not hard to add ".view" to avoid
constructing intermediates.

--
Seth Tisue @ Northwestern University / http://tisue.net
lead developer, NetLogo: http://ccl.northwestern.edu/netlogo/

ewilligers
Joined: 2008-08-20,
User offline. Last seen 3 years 17 weeks ago.
Re: Rethinking filter

Eastsun wrote:
> +1

> It would make for expression more powerful and useful, and encourage us to
> use for expression rather than map,filter,... etc.

+1

filterMap is an important performance win.

odersky
Joined: 2008-07-29,
User offline. Last seen 45 weeks 6 days ago.
Re: Re: Rethinking filter

I have become convinced that filter in for comprehensions needs to be lazy. Why?
Consider this program fragment:

var limit = 2
for (i <- List(0, 1, 2, 3) if i <= limit; j <- 0 until limit) yield
{ limit = 4; (i, j) }

What does it return?

List((0,0), (0,1), (1,0), (1,1), (1,2), (1,3), (2,0), (2,1), (2,2), (2,3))

So you see that the first index is computed with the previously given
limit. It is not affected by the assignment in the yield. No surprise
because we know filter is strict. However, the second index does go up
to 3 after the first loop iteration, so it does take the changed limit
into account. If you look at the translation it's clear why it has to
be so:

List(0, 1, 2, 3).filter(_ <= limit).flatMap(i => (0 unitl limit).map(j => {
limit = 4;
(i, j)
}))

You see that the second map is in the closure passed to the first
flatMap, so its evaluation gets interleaved with the flatMap
iteration. But the filter is in the generator of that flatMap, so gets
computed before the flatMap is even started.

But for someone who has not 100% internalized that translation (and
who has?), it's very weird indeed that _some_ parts of the for
comprehension are evaluated before the body is started, but other
parts are interleaved.

So I think we need a better translation of `for'. I have found a
scheme which is different from filterMap, filterFlatMap, filterForeach
and which looks simpler. We postulate a new method withFilter which
like filter takes a predicate as argument. The `for' translation is
exactly as it is now, but using withFilter instead of filter. This
means that one has the guarantee that the result of withFilter is
further taken apart using one of map, flatMap, forEach, or another
withFilter. For instance, the for expression above would translate to:

List(0, 1, 2, 3).withFilter(_ <= limit).flatMap(i => (0 unitl
limit).map(j => {
limit = 4;
(i, j)
}))

It is recommended (but not enforced) that withFilter is lazy. That is,
it should return a proxy object with methods for map, flatMap,
foreach, and withFilter. Each of these would combine a guard with the
original operation. I append an implementation of withFilter for class
TraversableLike which would be inherited by all collection classes.

To ensure a smooth transition, I propose that if the `for' translation
scheme cannot find the `withFilter' method, it uses `filter' instead,
but emits a deprecated warning.

I believe this proposal has a lot going for it: It's not significantly
more complex than the previous translation, it leads to more intuitive
behavior, and it can be a win in performance because it reduces the
number of intermediate structures that are built.

The only possible downside is that it might break some code. But I
doubt that would be very likely, as code that relies on a filter
seeing a snapshot of the state before the loop body is executed seems
rather weird.

I think this also settles the case for Range. Range should be strict,
and we do not need an exception for its filter anymore.

Cheers

dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: Re: Rethinking filter
I like this much better than the previous proposals. It doesn't change "filter" in unexpected ways, makes guards more intuitive, it is rather simple and has performance benefits.   And I have no queasy feelings about it, which I had about the other proposals, in which I found myself grudgingly agreeing that there was a problem, and yet disliked the proposed alternative.   So, +1.

On Fri, Oct 16, 2009 at 7:24 AM, martin odersky <martin.odersky@epfl.ch> wrote:
I have become convinced that filter in for comprehensions needs to be lazy. Why?
Consider this program fragment:

 var limit = 2
 for (i <- List(0, 1, 2, 3) if i <= limit; j <- 0 until limit) yield
{ limit = 4; (i, j) }

What does it return?

 List((0,0), (0,1), (1,0), (1,1), (1,2), (1,3), (2,0), (2,1), (2,2), (2,3))

So you see that the first index is computed with the previously given
limit. It is not affected by the assignment in the yield. No surprise
because we know filter is strict. However, the second index does go up
to 3 after the first loop iteration, so it does take the changed limit
into account. If you look at the translation it's clear why it has to
be so:

 List(0, 1, 2, 3).filter(_ <= limit).flatMap(i => (0 unitl limit).map(j => {
   limit = 4;
   (i, j)
 }))

You see that the second map is in the closure passed to the first
flatMap, so its evaluation gets interleaved with the flatMap
iteration. But the filter is in the generator of that flatMap, so gets
computed before the flatMap is even started.

But for someone who has not 100% internalized that translation (and
who has?), it's very weird indeed that _some_ parts of the for
comprehension are evaluated before the body is started, but other
parts are interleaved.

So I think we need a better translation of `for'. I have found a
scheme which is different from filterMap, filterFlatMap, filterForeach
and which looks simpler. We postulate a new method withFilter which
like filter takes a predicate as argument. The `for' translation is
exactly as it is now, but using withFilter instead of filter. This
means that one has the guarantee that the result of withFilter is
further taken apart using one of map, flatMap, forEach, or another
withFilter. For instance, the for expression above would translate to:

 List(0, 1, 2, 3).withFilter(_ <= limit).flatMap(i => (0 unitl
limit).map(j => {
   limit = 4;
   (i, j)
 }))

It is recommended (but not enforced) that withFilter is lazy. That is,
it should return a proxy object with methods for map, flatMap,
foreach, and withFilter. Each of these would combine a guard with the
original operation. I append an implementation of withFilter for class
TraversableLike which would be inherited by all collection classes.

To ensure a smooth transition, I propose that if the `for' translation
scheme cannot find the `withFilter'  method, it uses `filter' instead,
but emits a deprecated warning.

I believe this proposal has a lot going for it: It's not significantly
more complex than the previous translation, it leads to more intuitive
behavior, and it can be a win in performance because it reduces the
number of intermediate structures that are built.

The only possible downside is that it might break some code. But I
doubt that would be very likely, as code that relies on a filter
seeing a snapshot of the state before the loop body is executed seems
rather weird.

I think this also settles the case for Range. Range should be strict,
and we do not need an exception for its filter anymore.

Cheers

 -- Martin

Here's the proposed implementation of withFilter in class TraversableLike:

 def withFilter(p: A => Boolean) = new WithFilter(p)

 class WithFilter(p: A => Boolean) {

   def map[B, That](f: A => B)(implicit bf: BuilderFactory[B, That,
Repr]): That = {
     val b = bf(repr)
     for (x <- self)
       if (p(x)) b += f(x)
     b.result
   }

   def flatMap[B, That](f: A => Traversable[B])(implicit bf:
BuilderFactory[B, That, Repr]): That = {
     val b = bf(repr)
     for (x <- self)
       if (p(x)) b ++= f(x)
     b.result
   }

   def foreach[U](f: A => U): Unit =
     for (x <- self)
       if (p(x)) f(x)

   def withFilter(q: A => Boolean): WithFilter =
     new WithFilter(x => p(x) && q(x))
 }



--
Daniel C. Sobral

Something I learned in academia: there are three kinds of academic reviews: review by name, review by reference and review by value.

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland