- About Scala
- Documentation
- Code Examples
- Software
- Scala Developers
Code review this script
Fri, 2009-03-27, 23:41
Hello all,
After much advice from the folks on #scala (thank you!), I've got a first
pass at a script to perform some text manipulation. Would anyone be able to
offer some suggestions on how to improve this code or use better Scala
style?
The problem is to write a script to take data from a text file and transform
it into another format. In particular, this data comes from a data logger
monitoring some information about an air compressor. The data logger writes
out readings in a two-line format:
time,month,day,reading1, reading2, reading3, reading4
time,month,day,reading5, reading6, reading7, reading8
The program should combine the output into a single line:
time, month, day, reading1, reading2, reading3, reading4, reading5,
reading6, reading7, reading8
For example:
11:45,03,11,900,899,872,841,
11:45,03,11,787,821,0,0,
11:46,03,11,899,895,873,839,
11:46,03,11,781,818,0,0,
Would become:
11:45,03,11,900,899,872,841,787,821,0,0
11:46,03,11,899,895,873,839,781,818,0,0
The caveats are that at any point the data logger can lose power and a line
will be written in an incomplete form. These lines should be skipped.
My strategy:
- Create a List of tuples of line pairs
- Combine the pairs into a new structure if they represent the same time
- Process the data in a stream-like fashion so as to avoid loading the
entire file in memory first (one sample input file is 690k). (Of course,
RAM is cheap, but I was interested to try out Scala's Stream class)
- Try to avoid using variables
Here's what I came up with:
object LoggerFormatter {
def parse(lines: Stream[String]): Stream[String] = {
if (lines.isEmpty) return Stream.empty;
val allPairs = lines.zip(lines.tail)
for ((DataFormat(x), (DataFormat(y))) <- allPairs if x sameTimeAs y)
yield (x + y).toString + "\n"
}
def main(args: Array[String]) {
if (args.size == 1) {
import io.Source
val inputFile = Source.fromFile(args(0))
val output = parse(Stream.fromIterator(inputFile.getLines))
output.foreach(print)
}
else
Console.err.println("Please specify an input file")
}
}
Here's the objects which understand the line formats:
object DataFormat {
def unapply(line: String): Option[DataLine] = {
val parts = line.trim split ","
if (parts.length == 7) Some(DataLine(parts(0), parts(1), parts(2),
parts.drop(3).toList))
else None
}
}
case class DataLine(time: String, month: String, day: String, data:
List[String]) {
def +(other:DataLine) = DataLine(this.time, this.month, this.day,
this.data ++ other.data)
def sameTimeAs(other: DataLine) = this.time == other.time && this.month ==
other.month && this.day == other.day
override def toString = List(this.time, this.month, this.day,
this.data.mkString(",")).mkString(",")
}
One note - I appended the "\n" to the end of every line so I wouldn't have
to remember to do it in my unit tests, the console version (which prints to
the console), and the GUI (which will save to a File).
Thank you for your advice!
Cheers,
Dan
Sat, 2009-03-28, 15:37
#2
Re: Code review this script
Hi Johannes,
Thanks for your feedback. I had thought that a Stream class lazily
evaluated its tail, but looking at the code I see the following:
Here's the code for Stream.fromIterator:
def fromIterator[A](it: Iterator[A]): Stream[A] =
if (it.hasNext) cons(it.next, fromIterator(it)) else empty
If I read this correctly, this recursively adds all the elements of an
Iterator to a new Stream object. So this forces the evaluation of the
entire Iterator, is this correct?
As for the lazy part of Streams, here's the code for Stream.cons's apply()
method:
def apply[A](hd: A, tl: => Stream[A]) = new Stream[A] {
override def hasDefiniteSize = if (tlDefined) tlVal.hasDefiniteSize
else super.hasDefiniteSize
override def isEmpty = false
def head = hd
private var tlVal: Stream[A] = _
private def tlDefined = tlVal ne null
def tail: Stream[A] = {
if (!tlDefined) { tlVal = tl }
tlVal
}
As I understand it, tl is a by-name parameter, which simply wraps a function
around the argument so it is not evaluated at the call site, but rather when
it is needed in the method. Is this right?
If that's the case, will sticking with an Iterator read just enough from the
file without reading it all in at once? It looks like Source's getLines()
method returns an Iterator[String] which still uses the underlying
Iterator[Character] form the FileInputStream to lazily produce lines of
text.
I was concerned that zip() on Iterator would evaluate the whole sequence,
but it looks like it also returns an Iterator which lazily evaluates the
next entries.
Thanks,
Dan
Johannes Rudolph-2 wrote:
>
> I think scala.Stream is not doing, what you think it does (because
> it's unlike the java.io Stream classes). It is an implementation of a
> lazy list with a possibly infinite number of elements which calculates
> each element only once and then stores them for later reuse.
> Memory-wise you gain little.
>
> Johannes
>
Sun, 2009-03-29, 00:27
#3
Re: Code review this script
Johannes Rudolph wrote:
> I think scala.Stream is not doing, what you think it does (because
> it's unlike the java.io Stream classes). It is an implementation of a
> lazy list with a possibly infinite number of elements which calculates
> each element only once and then stores them for later reuse.
It should only store them as long as you need them, there was a bug some
time ago that prevented the collection of the start of a stream, but it
should be fixed by now: https://lampsvn.epfl.ch/trac/scala/ticket/692
> Memory-wise you gain little.
Except that your memory consumption should go from O(N) to O(1). If it
doesn't you are either holding onto the start of the Stream, or you
should reopen #7c7c7c.
- Florian
Sun, 2009-03-29, 01:47
#4
Re: Code review this script
I wrote:
> or you should reopen #7c7c7c.
I did it.
- Florian
Sun, 2009-03-29, 08:17
#5
Re: Code review this script
When I wrote this, I had the last discussion about the Stream class in
mind. What I took away then, was that it generally isn't possible to
use Stream in a way which releases the already visited elements,
because there is always the 'this'-parameter pointing to the head of
the list, and I know of no Scala language feature which would allow to
null out 'this'. (It is possible to do this on the bytecode level, but
there is as far as I know no Scala language-level syntax for
ALOAD_X
ACONST_NULL
ASTORE_X)
Now, I read in the bug comments, that the Scala compiler can somehow
(in this particular case?) null out the 'this'-parameter. I can accept
that (although it should be stated somewhere in the documentation,
because this is not what you would expect from a Java invocation
call).
This aside, I do not understand, how Daniel's or your new example in
the bug ticket, Florian, is supposed to work. You are holding a
reference to the Stream in a local variable and it is still in scope,
why should it be gc'd?
IMO in absence of a language primitve like that shown above, the only
memory-safe way to use a Stream is to never safe the Stream into a
local variable.
Johannes
Mon, 2009-03-30, 16:27
#6
Re: Code review this script
Johannes Rudolph schrieb:
> This aside, I do not understand, how Daniel's or your new example in
> the bug ticket, Florian, is supposed to work.
I assumed the compiler to be smarter than it is. In principle, the compiler
could find out that the Stream passed in is not actually used anywhere in
the method after the fist call and null out the reference. In fact, it must
do this if Streams are supposed to be usable as Streams but it doesn't.
- Florian
Mon, 2009-03-30, 18:57
#7
Re: Code review this script
On 03/30/2009 08:25 AM, Florian Hars wrote:
> I assumed the compiler to be smarter than it is. In principle, the
> compiler
> could find out that the Stream passed in is not actually used anywhere in
> the method after the fist call and null out the reference. In fact, it must
> do this if Streams are supposed to be usable as Streams but it doesn't.
>
Reminds me of this nice hack from the OpenQuark/CAL guys:
http://groups.google.com/group/cal_language/browse_thread/thread/728a3d4...
-0xe1a
I think scala.Stream is not doing, what you think it does (because
it's unlike the java.io Stream classes). It is an implementation of a
lazy list with a possibly infinite number of elements which calculates
each element only once and then stores them for later reuse.
Memory-wise you gain little.
Johannes
-----------------------------------------------
Johannes Rudolph
http://virtual-void.net
On Fri, Mar 27, 2009 at 11:41 PM, Daniel Wellman wrote:
>
> Hello all,
>
> After much advice from the folks on #scala (thank you!), I've got a first
> pass at a script to perform some text manipulation. Would anyone be able to
> offer some suggestions on how to improve this code or use better Scala
> style?
>
> The problem is to write a script to take data from a text file and transform
> it into another format. In particular, this data comes from a data logger
> monitoring some information about an air compressor. The data logger writes
> out readings in a two-line format:
>
> time,month,day,reading1, reading2, reading3, reading4
> time,month,day,reading5, reading6, reading7, reading8
>
> The program should combine the output into a single line:
>
> time, month, day, reading1, reading2, reading3, reading4, reading5,
> reading6, reading7, reading8
>
> For example:
>
> 11:45,03,11,900,899,872,841,
> 11:45,03,11,787,821,0,0,
> 11:46,03,11,899,895,873,839,
> 11:46,03,11,781,818,0,0,
>
> Would become:
> 11:45,03,11,900,899,872,841,787,821,0,0
> 11:46,03,11,899,895,873,839,781,818,0,0
>
> The caveats are that at any point the data logger can lose power and a line
> will be written in an incomplete form. These lines should be skipped.
>
> My strategy:
> - Create a List of tuples of line pairs
> - Combine the pairs into a new structure if they represent the same time
> - Process the data in a stream-like fashion so as to avoid loading the
> entire file in memory first (one sample input file is 690k). (Of course,
> RAM is cheap, but I was interested to try out Scala's Stream class)
> - Try to avoid using variables
>
> Here's what I came up with:
>
> object LoggerFormatter {
> def parse(lines: Stream[String]): Stream[String] = {
> if (lines.isEmpty) return Stream.empty;
>
> val allPairs = lines.zip(lines.tail)
>
> for ((DataFormat(x), (DataFormat(y))) <- allPairs if x sameTimeAs y)
> yield (x + y).toString + "\n"
> }
>
> def main(args: Array[String]) {
> if (args.size == 1) {
> import io.Source
>
> val inputFile = Source.fromFile(args(0))
> val output = parse(Stream.fromIterator(inputFile.getLines))
> output.foreach(print)
> }
> else
> Console.err.println("Please specify an input file")
> }
> }
>
> Here's the objects which understand the line formats:
>
> object DataFormat {
> def unapply(line: String): Option[DataLine] = {
> val parts = line.trim split ","
> if (parts.length == 7) Some(DataLine(parts(0), parts(1), parts(2),
> parts.drop(3).toList))
> else None
> }
> }
>
> case class DataLine(time: String, month: String, day: String, data:
> List[String]) {
> def +(other:DataLine) = DataLine(this.time, this.month, this.day,
> this.data ++ other.data)
>
> def sameTimeAs(other: DataLine) = this.time == other.time && this.month ==
> other.month && this.day == other.day
>
> override def toString = List(this.time, this.month, this.day,
> this.data.mkString(",")).mkString(",")
> }
>
>
>
> One note - I appended the "\n" to the end of every line so I wouldn't have
> to remember to do it in my unit tests, the console version (which prints to
> the console), and the GUI (which will save to a File).
>
>
> Thank you for your advice!
>
> Cheers,
> Dan
> --
> View this message in context: http://www.nabble.com/Code-review-this-script-tp22751256p22751256.html
> Sent from the Scala - User mailing list archive at Nabble.com.
>
>