The Architecture of Scala Collections -- Dealing with map and friends

However, there is another class of methods in collections that are not dealt with yet. These methods do not always return the collection type exactly. They might return the same kind of collection, but with a different element type. The classical example of this is the map method. If s is a Seq[Int], and f is a function from Int to String, then s.map(f) would return a Seq[String]. So the element type changes between the receiver and the result, but the kind of collection stays the same.

There are a number of other methods that behave like map. For some of them you would expect this (e.g., flatMap, collect), but for others you might not. For instance, the append method, ++, also might return a result of different type as its arguments--appending a list of String to a list of Int would give a list of Any. How should these methods be adapted to RNA strands? Ideally we'd expect that mapping bases to bases over an RNA strand would yield again an RNA strand:

scala> val rna = RNA(A, U, G, G, T)

rna: RNA = RNA(A, U, G, G, T)

scala> rna map { case A => T case b => b }

res7: RNA = RNA(T, U, G, G, T)

scala> rna ++ rna

res8: RNA = RNA(A, U, G, G, T, A, U, G, G, T)

scala> rna map Base.toInt

res2: IndexedSeq[Int] = Vector(0, 3, 2, 2, 1)

scala> rna ++ List("missing", "data")

res3: IndexedSeq[java.lang.Object] =

Vector(A, U, G, G, T, missing, data)

scala> val rna2 = RNA2(A, U, G, G, T)

rna2: RNA2 = RNA2(A, U, G, G, T)

scala> rna2 map { case A => T case b => b }

res0: IndexedSeq[Base] = Vector(T, U, G, G, T)

scala> rna2 ++ rna2

res1: IndexedSeq[Base] = Vector(A, U, G, G, T, A, U, G, G, T)

def map[B, That](f: A => B)

(implicit cbf: CanBuildFrom[Repr, B, That]): That

How is the That type determined? In fact it is linked to the other types by an implicit parameter cbf, of type CanBuildFrom[Repr, B, That]. These CanBuildFrom implicits are defined by the individual collection classes. In essence, an implicit value of type CanBuildFrom[From, Elem, To] says: "Here is a way, given a collection of type From, to build with elements of type Elem a collection of type To."

final class RNA private (val groups: Array[Int], val length: Int)

extends IndexedSeq[Base] with IndexedSeqLike[Base, RNA] {

import RNA._

// Mandatory re-implementation of `newBuilder` in `IndexedSeq`

override protected[this] def newBuilder: Builder[Base, RNA] =

RNA.newBuilder

// Mandatory implementation of `apply` in `IndexedSeq`

def apply(idx: Int): Base = {

if (idx < 0 || length <= idx)

throw new IndexOutOfBoundsException

Base.fromInt(groups(idx / N) >> (idx % N * S) & M)

}

// Optional re-implementation of foreach,

// to make it more efficient.

override def foreach[U](f: Base => U): Unit = {

var i = 0

var b = 0

while (i < length) {

b = if (i % N == 0) groups(i / N) else b >>> S

f(Base.fromInt(b & M))

i += 1

}

object RNA {

private val S = 2 // number of bits in group

private val M = (1 << S) - 1 // bitmask to isolate a group

private val N = 32 / S // number of groups in an Int

def fromSeq(buf: Seq[Base]): RNA = {

val groups = new Array[Int]((buf.length + N - 1) / N)

for (i <- 0 until buf.length)

groups(i / N) |= Base.toInt(buf(i)) << (i % N * S)

new RNA(groups, buf.length)

}

def apply(bases: Base*) = fromSeq(bases)

def newBuilder: Builder[Base, RNA] =

new ArrayBuffer mapResult fromSeq

implicit def canBuildFrom: CanBuildFrom[RNA, Base, RNA] =

new CanBuildFrom[RNA, Base, RNA] {

def apply(): Builder[Base, RNA] = newBuilder

def apply(from: RNA): Builder[Base, RNA] = newBuilder

}

Now the behavior of map and ++ on RNA2 sequences becomes clearer. There is no CanBuildFrom instance that creates RNA2 sequences, so the next best available CanBuildFrom was found in the companion object of the inherited trait IndexedSeq. That implicit creates IndexedSeqs, and that's what you saw when applying map to rna2.

To address this shortcoming, you need to define an implicit instance of CanBuildFrom in the companion object of the RNA class. That instance should have type CanBuildFrom[RNA, Base, RNA]. Hence, this instance states that, given an RNA strand and a new element type Base, you can build another collection which is again an RNA strand. The two listings above on class RNA and its companion object show the details. Compared to class RNA2 there are two important differences. First, the newBuilder implementation has moved from the RNA class to its companion object. The newBuilder method in class RNA simply forwards to this definition. Second, there is now an implicit CanBuildFrom value in object RNA. To create such an object you need to define two apply methods in the CanBuildFrom trait. Both create a new builder for an RNA collection, but they differ in their argument list. The apply() method simply creates a new builder of the right type. By contrast, the apply(from) method takes the original collection as argument. This can be useful to adapt the dynamic type of builder's return type to be the same as the dynamic type of the receiver. In the case of RNA this does not come into play because RNA is a final class, so any receiver of static type RNA also has RNA as its dynamic type. That's why apply(from) also simply calls newBuilder, ignoring its argument.

That is it. The final RNA class implements all collection methods at their natural types. Its implementation requires a little bit of protocol. In essence, you need to know where to put the newBuilder factories and the canBuildFrom implicits. On the plus side, with relatively little code you get a large number of methods automatically defined. Also, if you don't intend to do bulk operations like take, drop, map, or ++ on your collection you can choose to not go the extra length and stop at the implementation shown in for class RNA1.

The discussion so far centered on the minimal amount of definitions needed to define new sequences with methods that obey certain types. But in practice you might also want to add new functionality to your sequences or to override existing methods for better efficiency. An example of this is the overridden foreach method in class RNA. foreach is an important method in its own right because it implements loops over collections. Furthermore, many other collection methods are implemented in terms of foreach. So it makes sense to invest some effort optimizing the method's implementation. The standard implementation of foreach in IndexedSeq will simply select every i'th element of the collection using apply, where i ranges from 0 to the collection's length minus one. So this standard implementation selects an array element and unpacks a base from it once for every element in an RNA strand. The overriding foreach in class RNA is smarter than that. For every selected array element it immediately applies the given function to all bases contained in it. So the effort for array selection and bit unpacking is much reduced.