Dealing with map and friends
However, there is another class of methods in collections that are
not dealt with yet. These methods do not always return the collection
type exactly. They might return the same kind of collection, but
with a different element type. The classical example of this is the
map method. If s is a Seq[Int], and f is a function from
Int to String, then s.map(f) would return a
Seq[String]. So the element type changes between the receiver and
the result, but the kind of collection stays the same.
There are a number of other methods that behave like map. For some
of them you would expect this (e.g., flatMap, collect), but for
others you might not. For instance, the append method, ++, also might
return a result of different type as its arguments--appending a list of
String to a list of Int would give a list of Any. How should these
methods be adapted to RNA strands? Ideally we'd expect that mapping
bases to bases over an RNA strand would yield again an RNA strand:
scala> val rna = RNA(A, U, G, G, T) |
rna: RNA = RNA(A, U, G, G, T) |
|
scala> rna map { case A => T case b => b } |
res7: RNA = RNA(T, U, G, G, T)
|
Likewise, appending two RNA strands with ++ should yield again another RNA strand:
scala> rna ++ rna |
res8: RNA = RNA(A, U, G, G, T, A, U, G, G, T)
|
On the other hand, mapping bases to some other type over an RNA strand
cannot yield another RNA strand because the new elements have the
wrong type. It has to yield a sequence instead. In the same vein
appending elements that are not of type Base to an RNA strand can
yield a general sequence, but it cannot yield another RNA strand.
scala> rna map Base.toInt |
res2: IndexedSeq[Int] = Vector(0, 3, 2, 2, 1) |
|
scala> rna ++ List("missing", "data") |
res3: IndexedSeq[java.lang.Object] = |
Vector(A, U, G, G, T, missing, data)
|
This is what you'd expect in the ideal case. But this is not what the
RNA2 class provides. In fact,
if you ran the first two examples above with instances of this class
you would obtain:
scala> val rna2 = RNA2(A, U, G, G, T) |
rna2: RNA2 = RNA2(A, U, G, G, T) |
|
scala> rna2 map { case A => T case b => b } |
res0: IndexedSeq[Base] = Vector(T, U, G, G, T) |
|
scala> rna2 ++ rna2 |
res1: IndexedSeq[Base] = Vector(A, U, G, G, T, A, U, G, G, T)
|
So the result of map and ++ is never an RNA strand, even if the
element type of the generated collection is a Base.
To see how to do better, it pays to have a close look at the signature
of the map method (or of ++, which has a similar signature).
The map method is originally defined in class scala.collection.TraversableLike
with the following signature:
def map[B, That](f: A => B) |
(implicit cbf: CanBuildFrom[Repr, B, That]): That
|
Here A is the type of elements of the collection, and Repr is the
type of the collection itself, that is, the second type parameter that gets
passed to implementation classes such as TraversableLike and
IndexedSeqLike. The map method takes two more type parameters, B and
That. The B parameter stands for the result type of the mapping function,
which is also the element type of the new collection. The That
appears as the result type of map, so it represents the type of the new collection
that gets created.
How is the That type determined? In fact it is linked to the other types by an
implicit parameter cbf, of type CanBuildFrom[Repr, B, That].
These CanBuildFrom implicits are defined by the individual collection classes. In essence, an implicit value of type CanBuildFrom[From, Elem, To] says: "Here is a way, given a collection of type
From, to build with elements of type Elem a collection of type To."
final class RNA private (val groups: Array[Int], val length: Int) |
extends IndexedSeq[Base] with IndexedSeqLike[Base, RNA] { |
|
import RNA._ |
|
// Mandatory re-implementation of `newBuilder` in `IndexedSeq` |
override protected[this] def newBuilder: Builder[Base, RNA] = |
RNA.newBuilder |
|
// Mandatory implementation of `apply` in `IndexedSeq` |
def apply(idx: Int): Base = { |
if (idx < 0 || length <= idx) |
throw new IndexOutOfBoundsException |
Base.fromInt(groups(idx / N) >> (idx % N * S) & M) |
} |
|
// Optional re-implementation of foreach, |
// to make it more efficient. |
override def foreach[U](f: Base => U): Unit = { |
var i = 0 |
var b = 0 |
while (i < length) { |
b = if (i % N == 0) groups(i / N) else b >>> S |
f(Base.fromInt(b & M)) |
i += 1 |
} |
} |
}
|
RNA strands class, final version.
object RNA { |
|
private val S = 2 // number of bits in group |
private val M = (1 << S) - 1 // bitmask to isolate a group |
private val N = 32 / S // number of groups in an Int |
|
def fromSeq(buf: Seq[Base]): RNA = { |
val groups = new Array[Int]((buf.length + N - 1) / N) |
for (i <- 0 until buf.length) |
groups(i / N) |= Base.toInt(buf(i)) << (i % N * S) |
new RNA(groups, buf.length) |
} |
|
def apply(bases: Base*) = fromSeq(bases) |
|
def newBuilder: Builder[Base, RNA] = |
new ArrayBuffer mapResult fromSeq |
|
implicit def canBuildFrom: CanBuildFrom[RNA, Base, RNA] = |
new CanBuildFrom[RNA, Base, RNA] { |
def apply(): Builder[Base, RNA] = newBuilder |
def apply(from: RNA): Builder[Base, RNA] = newBuilder |
} |
}
|
RNA companion object--final version.
Now the behavior of map and ++ on RNA2 sequences becomes
clearer. There is no CanBuildFrom instance that creates RNA2
sequences, so the next best available CanBuildFrom was found in the
companion object of the inherited trait IndexedSeq. That implicit creates IndexedSeqs,
and that's what you saw when applying map to rna2.
To address this shortcoming, you need to define an implicit instance of
CanBuildFrom in the companion object of the RNA class.
That instance should have type CanBuildFrom[RNA, Base, RNA].
Hence, this instance states that, given an RNA strand and a new
element type Base, you can build another collection which is again an
RNA strand. The two listings above on class RNA and its companion object
show the details. Compared to
class RNA2 there are two important differences. First, the newBuilder
implementation has moved from the RNA class to its companion
object. The newBuilder method in class RNA simply forwards to
this definition. Second, there is now an implicit CanBuildFrom
value in object RNA. To create such an object you need to define
two apply methods in the CanBuildFrom trait. Both create
a new builder for an RNA collection, but they differ in their
argument list. The apply() method simply creates a new builder
of the right type. By contrast, the apply(from) method
takes the original collection as argument. This can be
useful to adapt the dynamic type of builder's return type to be the
same as the dynamic type of the receiver. In the case of RNA this
does not come into play because RNA is a final class, so any receiver
of static type RNA also has RNA as its dynamic type. That's why
apply(from) also simply calls newBuilder, ignoring its argument.
That is it. The final RNA class implements
all collection methods at their natural types. Its implementation
requires a little bit of protocol. In essence, you need to know where
to put the newBuilder factories and the canBuildFrom implicits.
On the plus side, with relatively little code you get a large number
of methods automatically defined. Also, if you don't intend to do bulk
operations like take, drop, map, or ++ on your collection
you can choose to not go the extra length and stop at the
implementation shown in for class RNA1.
The discussion so far centered on the minimal amount of definitions
needed to define new sequences with methods that obey certain types.
But in practice you might also want to add new functionality to your
sequences or to override existing methods for better efficiency. An
example of this is the overridden foreach method in class RNA.
foreach is an important method in its own right because it
implements loops over collections. Furthermore, many other collection
methods are implemented in terms of foreach. So it makes sense to
invest some effort optimizing the method's implementation. The standard
implementation of foreach in IndexedSeq will simply select every
i'th element of the collection using apply, where i ranges
from 0 to the collection's length minus one. So this standard
implementation selects an array element and unpacks a base from it once
for every element in an RNA strand. The overriding foreach in class
RNA is smarter than that. For every selected array element it
immediately applies the given function to all bases contained in
it. So the effort for array selection and bit unpacking is much reduced.
Next: Integrating new sets and maps