
IO and parsing with idiomatic style?

28 replies
Anthony Accioly 2

Hi guys,

As an exercise yesterday I wrote a simple Scala program to split srt
files (a text-based subtitle format) and readjust the intervals
accordingly. For that I resorted mostly to standard Java APIs and code
style (not proudly, just because these are the tools I know). In a lot
of places I couldn't find a way to get rid of the mutability.

If you guys could kindly help me rewrite the code in a more
idiomatic / functional way this would really help my studies. I'm
especially interested in learning the Scala API and abstractions.
The only requirement is that I don't want to read the entire input file
into memory and make a big list (I could come up with something like
Source -> getLines -> mkString -> split -> map -> filter -> map), but
this wouldn't help me with real-life big-file parsing code.

So here it is:

import java.io.{File, PrintWriter}
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, Scanner}
import java.util.regex.Pattern

import Main.timeFormat // Entry is defined outside Main

case class Entry(id: Long, start: Date, end: Date, text: String) {

  override def toString = {
    // Anything better than StringBuilder with ok performance?
    val sb = new StringBuilder;
    sb.append(id)
    sb.append("\n")
    sb.append(timeFormat.format(start))
    sb.append(" --> ")
    sb.append(timeFormat.format(end))
    sb.append(text)
    sb.append("\n\n")

    sb.toString()
  }

}

object Main {

  val timeFormat = new SimpleDateFormat("HH:mm:ss,SSS")
  private val timePattern = Pattern.compile("\\d\\d:\\d\\d:\\d\\d,\\d\\d\\d")
  private val splitTime = Calendar.getInstance();
  private var counter = 0; // How can I eliminate the mutable counter?

  /**
   * Parameters
   */
  def main(args: Array[String]) {
    val inputFile = new File(args(0))
    val outputFile = new File(args(1))
    splitTime.setTime(timeFormat.parse(args(2)))
    val encoding = args(3);

    val sc = new Scanner(inputFile, encoding)
    val pw = new PrintWriter(outputFile, encoding)

    sc.useDelimiter("\n\r\n");
    // Any chance to get rid of the imperative style code without
    // Having to read the entire file at once?
    while (sc.hasNext) {
      val line = sc.next()
      val originalEntry = readEntry(line)
      if (originalEntry.start.after(splitTime.getTime)) {
        pw.print(buildEntryForSplittedFile(originalEntry))
      }
    }
    sc.close() // Ok, I know about try and finally, this was pure laziness
    pw.close()
  }

  private def readEntry(string: String)= {
    val sc = new Scanner(string)
    sc.nextLine // Skip original id
    val start = timeFormat.parse(sc.findInLine(timePattern))
    val end = timeFormat.parse(sc.findInLine(timePattern))
    sc.useDelimiter("\\Z")
    val text = sc.next()

    Entry(0, start, end, text)
  }

  private def buildEntryForSplittedFile (originalEntry: Entry) = {
    val id = counter + 1;
    counter = counter + 1;
    val newStart = subtractTimes(originalEntry.start, splitTime)
    val newEnd = subtractTimes(originalEntry.end, splitTime)

    Entry(id, newStart, newEnd, originalEntry.text)
  }

  private def subtractTimes(t1 : Date, t2: Calendar) = {
    // And again, can I get rid of the mutable calls without resorting to
    // external libraries such as Joda-time?
    val t3 = Calendar.getInstance()
    t3.setTime(t1)
    t3.add(Calendar.MILLISECOND, - t2.get(Calendar.MILLISECOND))
    t3.add(Calendar.SECOND, - t2.get(Calendar.SECOND))
    t3.add(Calendar.MINUTE, - t2.get(Calendar.MINUTE))
    t3.add(Calendar.HOUR_OF_DAY, - t2.get(Calendar.HOUR_OF_DAY))

    t3.getTime
  }

}

hohonuuli
Re: IO and parsing with idiomatic style?
Hey Scala compiler guys,
Sun's (i.e. Oracle's) compiler uses StringBuilder 'under the hood' (See http://stackoverflow.com/questions/1296571/what-happens-when-java-compiler-sees-many-string-concatenations-in-one-line)
So this code ...
id + "\n" + timeFormat.format(start) + " --> " + timeFormat.format(end) + text + "\n\n"
would basically compile to the code below (in javac). Does Scala's compiler do the same optimization?
  override def toString = {
   // Anything better than StringBuilder with ok performance?
   val sb = new StringBuilder;
   sb.append(id)
   sb.append("\n")
   sb.append(timeFormat.format(start))
   sb.append(" --> ")
   sb.append(timeFormat.format(end))
   sb.append(text)
   sb.append("\n\n")

   sb.toString()
 }


 --
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Brian Schlining
bschlining@gmail.com
Randall R Schulz
Re: IO and parsing with idiomatic style?

On Monday 15 August 2011, Brian Schlining wrote:
> Hey Scala compiler guys,
>
> Sun's (i.e. Oracle's) compiler uses StringBuilder 'under the hood'
> (See
> http://stackoverflow.com/questions/1296571/what-happens-when-java-com
>piler-sees-many-string-concatenations-in-one-line )
>
> So this code ...
>
> id + "\n" + timeFormat.format(Start) + " --> " + timeFormat(end) +
> text + "\n\n"
>
> would basically compile to the code below (in javac). Does Scala's
> compiler do the same optimization?

The real question is: Why would you want to write code like that?

"%s%n%s --> %s%s%n%n".format(id,
                             timeFormat.format(start),
                             timeFormat.format(end),
                             text)

Randall Schulz
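
For comparison, here is a minimal sketch of the styles discussed so far, side by side. EntryText and its method names are made up for illustration, and start and end are assumed to be already formatted strings. The s"..." interpolator only exists from Scala 2.10 on, so it was not available when this thread was written. The format version uses \n rather than %n so that all three produce the same output on every platform.

```scala
object EntryText {
  // Plain concatenation, as in the first reply.
  def concat(id: Long, start: String, end: String, text: String): String =
    id.toString + "\n" + start + " --> " + end + text + "\n\n"

  // Format string, as in the reply above (with \n instead of %n).
  def formatted(id: Long, start: String, end: String, text: String): String =
    "%d\n%s --> %s%s\n\n".format(id, start, end, text)

  // String interpolation (Scala 2.10 and later).
  def interpolated(id: Long, start: String, end: String, text: String): String =
    s"$id\n$start --> $end$text\n\n"
}
```

Which one is fastest depends on the compiler and JVM, so it is worth measuring rather than guessing.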

E. Labun
Re: IO and parsing with idiomatic style?

On 2011-08-16 07:04, Randall R Schulz wrote:
> "%s%n%s --> %s%s%n%n".format(id,
> timeFormat.format(Start),
> timeFormat(end),
> text)

And you could define toString as a "val" to avoid re-evaluating it each time it's used:

override val toString = the code above

--
Eugen

E. Labun
Re: IO and parsing with idiomatic style?

For the parsing itself I would try combinator parsers.

--
Eugen

Alex Repain
Re: IO and parsing with idiomatic style?


2011/8/16 Eugen Labun <labun@gmx.net>
> On 2011-08-16 07:04, Randall R Schulz wrote:
> > "%s%n%s --> %s%s%n%n".format(id,
> >                              timeFormat.format(start),
> >                              timeFormat.format(end),
> >                              text)
>
> And you could define toString as a "val" to avoid reevaluation each time when it's used:
>
> override val toString = the code above

Not sure if this has changed since Scala 2.8.1, but until then, the code above was EXTREMELY (by a factor of 100 or 1000) slower than the concatenation style:

id + "\n" + timeFormat.format(start) + " --> " + timeFormat.format(end) +
text + "\n\n"



--
Alex REPAIN
ENSEIRB-MATMECA - student
TECHNICOLOR R&D - intern
BORDEAUX I      - master's student
SCALA           - enthusiast


Aydjen
Re: IO and parsing with idiomatic style?

Hi,

 


Alex Repain <alex.repain@gmail.com> 2011/08/16 10:03:
> Not sure if this has changed since Scala 2.8.1, but until then, this code above was EXTREMELY (factor 100 or 1000) slower than the concatenation style :

In a warmed-up VM (64-bit server VM), the factor should be more like 30. At least that's what my benchmarks tell me.

 

Kind regards

Andreas 

Philippe Lhoste
Re: IO and parsing with idiomatic style?

On 16/08/2011 01:54, Anthony Accioly wrote:
> If you guys could kindly help me rewrite the code in a more
> idiomatic / functional way this would really help my studies. I'm

I fear I am not the man to show you how to get rid of mutability, as I
am a Scala beginner myself. And I think, like Martin Odersky, that a
little mutability here and there doesn't necessarily hurt, even more so if
it is hidden somewhere behind the API: some imperative style in the
internal gears can make them spin smoother (i.e. faster).

I noticed some minor points in your code that can be slightly improved;
these remarks probably apply to equivalent Java code as well. I hope you
won't mind this advice.

> case class Entry(id: Long, start: Date, end: Date, text: String) {
>
> override def toString = {
> // Anything better than StringBuilder with ok performance?
> val sb = new StringBuilder;
> sb.append(id)
> sb.append("\n")
> sb.append(timeFormat.format(start))
> sb.append(" --> ")
> sb.append(timeFormat.format(end))
> sb.append(text)
> sb.append("\n\n")
>
> sb.toString()
> }

As pointed out, you can use simple concatenation for such string
building: it ends up being compiled to your code above, but it will be more
readable, IMHO. StringBuilder is better suited to use in a loop.
And as Eugen pointed out, making it a val is better (I wouldn't have
thought of it, overriding a def with a val isn't intuitive to me yet...).

> }
>
> object Main {
>
> val timeFormat = new SimpleDateFormat("HH:mm:ss,SSS")
> private val timePattern = Pattern.compile("\\d\\d:\\d\\d:\\d\\d,\\d\
> \d\\d")

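Aah, regexes, I love them! The comma needs no escaping, and using
triple quotes avoids the backslash proliferation that makes me cringe in
Java:
private val timePattern = """\d\d:\d\d:\d\d,\d{3}""".r
(.r gives a scala.util.matching.Regex; use timePattern.pattern where a
java.util.regex.Pattern is expected, as in Scanner.findInLine.)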

> private val splitTime = Calendar.getInstance();

You can drop (); there...

> private var counter = 0; // How can I eliminate the mutable counter?

Not sure if it is really necessary here. I saw a similar counter in
Akka, in the FSM implementation (counting the number of generations),
for example.

> /**
> * Parameters time>
> */
> def main(args: Array[String]) {

Even for simple test code, you should check the number of args and
report nicely (showing a usage string) to the user if it isn't what is
expected. Better than showing a stack trace...

> val inputFile = new File(args(0))
> val outputFile = new File(args(1))
> splitTime.setTime(timeFormat.parse(args(2)))
> val encoding = args(3);
>
> val sc = new Scanner(inputFile, encoding)
> val pw = new PrintWriter(outputFile, encoding)
>
> sc.useDelimiter("\n\r\n");
> // Any chance to get rid of the imperative style code without
> // Having to read the entire file at once?
> while (sc.hasNext) {
> val line = sc.next()
> val originalEntry = readEntry(line)
> if (originalEntry.start.after(splitTime.getTime)) {
> pw.print(buildEntryForSplittedFile(originalEntry))
> }
> }

Again, imperative style can be OK, particularly here where you produce
only side effects. But if you want something more idiomatic, you can use
something like:

import collection.JavaConverters._
for (line <- sc.asScala) { // Converts Java iterator to Scala one
  val originalEntry = readEntry(line)
  // ...
}

> sc.close() // Ok, I know about try and finally, this was pure
> laziness

Well, if your command line tool crashes, no need for clean closing of
files, I think... But indeed, nicer error messages can be good.

> pw.close()
> }
>
> private def readEntry(string: String)= {
> val sc = new Scanner(string)
> sc.nextLine // Skip original id
> val start = timeFormat.parse(sc.findInLine(timePattern))
> val end = timeFormat.parse(sc.findInLine(timePattern))
> sc.useDelimiter("\\Z")
> val text = sc.next()
>
> Entry(0, start, end, text)
> }
>
> private def buildEntryForSplittedFile (originalEntry: Entry) = {
> val id = counter + 1;
> counter = counter + 1;

counter += 1
val id = counter // or just use counter in the constructor

> val newStart = subtractTimes(originalEntry.start, splitTime)
> val newEnd = subtractTimes(originalEntry.end, splitTime)
>
> Entry(id, newStart, newEnd, originalEntry.text)
> }
>
> private def subtractTimes(t1 : Date, t2: Calendar) = {
> // And again, can I ger rid of the mutable calls with resorting to
> external libraries such as Joda-time?
> val t3 = Calendar.getInstance()
> t3.setTime(t1)
> t3.add(Calendar.MILLISECOND, - t2.get(Calendar.MILLISECOND))
> t3.add(Calendar.SECOND, - t2.get(Calendar.SECOND))
> t3.add(Calendar.MINUTE, - t2.get(Calendar.MINUTE))
> t3.add(Calendar.HOUR_OF_DAY, - t2.get(Calendar.HOUR_OF_DAY))

Can't you just subtract the timestamps?

> t3.getTime
> }
>
> }

HTH.

E. Labun
Re: IO and parsing with idiomatic style?

>> Not sure if this has changed since Scala 2.8.1, but until then, this code above was EXTREMELY
>> (factor 100 or 1000) slower than the concatenation style :
>
>
>
> in a warmed-up VM (64bit server VM), the factor should be more like 30. At least that's what my
> benchmarks tell me.

@ String.format vs StringBuilder.append performance: good to know! I didn't expect such a huge
difference.

@ val toString vs def toString:

on second thought, if toString will be used only once (if at all), it's better to leave it a "def"
to speed up object creation and decrease the object size.

--
Eugen

Philippe Lhoste
Re: IO and parsing with idiomatic style?

On 16/08/2011 14:02, Eugen Labun wrote:
> @ val toString vs def toString:
>
> on second thought, if toString will be used only once (if at all), it's better to leave it a "def"
> to speed up object creation and decrease the object size.

toString is rarely used only once: it is often used in traces, for example,
when debugging with an IDE (showing its value), when serializing to a
text format, etc.

Perhaps it can be a lazy val, to avoid the pitfall you describe? Not
sure if it can be combined with an override.
Or perhaps the value can be memoized with a lazy val.
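
On the question of combining lazy val with an override: Scala accepts override lazy val toString, so the string is built on first use and cached afterwards. A minimal sketch follows; the Stamp class and its evaluations counter are hypothetical and exist only to make the caching visible.

```scala
// Hypothetical class for illustration, not from the thread.
final class Stamp(val id: Long, val text: String) {
  // Counts how often the string is actually built.
  var evaluations = 0

  // Built on first access only, then cached for later calls.
  override lazy val toString: String = {
    evaluations += 1
    id.toString + "\n" + text + "\n\n"
  }
}
```

The trade-off mentioned in the thread still applies: a cached string costs an extra field per object, so this only pays off when toString is called more than once.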

Jim McBeath
Re: IO and parsing with idiomatic style?

On Mon, Aug 15, 2011 at 04:54:10PM -0700, Anthony Accioly wrote:
> Date: Mon, 15 Aug 2011 16:54:10 -0700 (PDT)
> From: Anthony Accioly
> To: scala-user
> Subject: [scala-user] IO and parsing with idiomatic style?
>
> If you guys could kindly help me rewrite the code in a more
> idiomatic / functional way this would really help my studies.

> object Main {

> private var counter = 0; // How can I eliminate the mutable counter?

> def main(args: Array[String]) {

> // Any chance to get rid of the imperative style code without
> // Having to read the entire file at once?
> while (sc.hasNext) {
> val line = sc.next()
> val originalEntry = readEntry(line)
> if (originalEntry.start.after(splitTime.getTime)) {
> pw.print(buildEntryForSplittedFile(originalEntry))
> }
> }

1. Use a for-comprehension (or foreach) on sc (assuming Scanner
implements Iterator).
2. Use map rather than the line and originalEntry variables.
3. Use filter rather than the if clause.
4. Use zipWithIndex and pass the index value into your builder method
rather than using a global counter.

So, replace the while loop with this:
(sc map { readEntry(_) } filter (_.start.after(splitTime.getTime))
zipWithIndex foreach { case (e,n) => buildEntryForSplittedFile(e,n) })

I have not compiled the above, so you may need to fiddle with the syntax
to get it to compile.

--
Jim
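
The four steps above can be sketched in compilable form as follows. This is a hedged illustration, not the poster's code: LazyPipeline and trace are made-up names, integers stand in for subtitle entries, and the import assumes Scala 2.13 or later (older versions have scala.collection.JavaConverters._ instead). The returned event list shows that each token travels through the whole pipeline before the next one is read.

```scala
import java.util.Scanner
import scala.collection.mutable.ListBuffer
import scala.jdk.CollectionConverters._ // Scala 2.13+

object LazyPipeline {
  // Runs map -> filter -> zipWithIndex -> foreach over a Scanner and
  // returns the order in which the stages touched each token.
  def trace(input: String): List[String] = {
    val events = ListBuffer.empty[String]
    val sc = new Scanner(input) // default delimiter: whitespace
    try {
      // Scanner is a java.util.Iterator[String]; asScala wraps it lazily.
      (sc: java.util.Iterator[String]).asScala
        .map { token => events += s"read $token"; token.toInt }
        .filter(_ % 2 == 0)
        .zipWithIndex // the index replaces the mutable counter
        .foreach { case (n, i) => events += s"write ${i + 1}: $n" }
    } finally sc.close()
    events.toList
  }
}
```

Because every stage is an Iterator, no intermediate collection is built; only the current element is held at any time.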

Lanny Ripple 2
Re: IO and parsing with idiomatic style?

I don't know if it's idiomatic Scala but

* move construction of Entry case classes into Entry object.
* hide fiddly work behind well-known interfaces

import java.io._
import java.text._
import java.util.{Calendar, Date, Scanner}
import java.util.regex._

object Main {

  val timeFormat = new SimpleDateFormat("HH:mm:ss,SSS")
  val timePattern = Pattern.compile("""\d\d:\d\d:\d\d,\d\d\d""")
  private val splitTime = Calendar.getInstance

  case class Entry(id: Long, start: Date, end: Date, text: String) {

    override lazy val toString: String = {
      id.toString + "\n" + timeFormat.format(start) + " --> " +
        timeFormat.format(end) + text + "\n\n"
    }
  }

  object Entry {

    protected val idIter = Iterator.from(0)

    def apply(orig: Entry): Entry = {
      val t0: Long = splitTime.getTime.getTime
      val start = new Date(orig.start.getTime - t0)
      val end = new Date(orig.end.getTime - t0)

      Entry(idIter.next.toLong, start, end, orig.text)
    }

    def apply(line: String): Entry = {
      val sc = new Scanner(line)
      sc.nextLine // skip original id
      val start = timeFormat.parse(sc.findInLine(timePattern))
      val end = timeFormat.parse(sc.findInLine(timePattern))
      sc.useDelimiter("""\Z""")
      val text = sc.next

      Entry(0L, start, end, text)
    }
  }

  def main(args: Array[String]) {
    // See https://github.com/sellmerfud/optparse for better command-line handling
    val inputFile = new File(args(0))
    val outputFile = new File(args(1))
    splitTime.setTime(timeFormat.parse(args(2)))
    val encoding = args(3)

    val lineIter = {
      val sc = new Scanner(inputFile, encoding)
      sc.useDelimiter("\n\r\n")

      new Iterator[String] {
        def hasNext: Boolean = { val hn = sc.hasNext; if (!hn) sc.close(); hn }
        def next: String = try { sc.next } finally { sc.close() }
      }
    }

    val pw = new PrintWriter(outputFile, encoding)

    try {
      for {
        line <- lineIter
        originalEntry = Entry(line)
        if originalEntry.start after splitTime.getTime
      } {
        pw.print( Entry(originalEntry) )
      }
    }
    finally {
      pw.close()
    }
  }
}


Anthony Accioly 2
Re: IO and parsing with idiomatic style?
Hi Eugen,

Thank you very much for the tip.
I've changed the toString implementation like you suggested.

Although I've read a couple of articles about parser combinators to see if I could grasp the technique (such as: http://www.codecommit.com/blog/scala/the-magic-behind-parser-combinators and http://debasishg.blogspot.com/2008/04/external-dsls-made-easy-with-scala.html), I don't think I understood how to use ^^ and ^^^ to output the results without creating a big list of every subtitle entry in the file. If it is not too much trouble, could you please enlighten me about how to use a parser like this and read the file entry by entry? My aim here is to avoid loading the entire file at once or creating a big data structure in memory that holds the contents of the entire file.

Cheers,


On Tue, Aug 16, 2011 at 3:48 AM, Eugen Labun <labun@gmx.net> wrote:
For the parsing itself I would try combinator parsers.

--
Eugen



--
Anthony Accioly
Anthony Accioly 3
Re: Re: IO and parsing with idiomatic style?
Hi Philippe.

> private val timePattern = """\d\d:\d\d:\d\d,\d{3}""".r
Great! That's a good improvement indeed.

> Even for simple test code, you should check the number of args and report nicely (showing an usage string) to the user if it isn't what is expected. Better than showing a stack trace...
Yeah, I wrote lazy code for myself, but I will write some validation code, it doesn't hurt anyway.

> counter += 1
> val id = counter // or just use counter in the constructor
Great, counter++ didn't work, so I assumed this would not work either.

> Can't you just subtract the timestamps?
You mean something like new Date(date1.getTime() - date2.getTime())? It didn't do the trick for me (the results are wrong). Or were you thinking about something else?

Cheers,

--
Anthony Accioly
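
One likely reason the plain subtraction gave wrong results (an assumption, the thread does not confirm it): SimpleDateFormat parses and formats in the default time zone. The zone offset cancels out in the subtraction and is then applied again when the difference is formatted, so the output is off by that offset. A minimal sketch that pins the format to UTC, with SubtractTimes and shift as made-up names:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object SubtractTimes {
  // A fresh UTC-pinned format per call; SimpleDateFormat is not thread-safe.
  private def utcFormat(): SimpleDateFormat = {
    val f = new SimpleDateFormat("HH:mm:ss,SSS")
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }

  // Returns `time` moved back by `split`, both given as HH:mm:ss,SSS.
  def shift(time: String, split: String): String = {
    val f = utcFormat()
    val millis = f.parse(time).getTime - f.parse(split).getTime
    f.format(new Date(millis))
  }
}
```

This only makes sense when time is not earlier than split, which the filter on the start time in the original program already guarantees.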

Anthony Accioly 2
Re: IO and parsing with idiomatic style?
Hi Jim,


> So, replace the while loop with this:
>    (sc map { readEntry(_) } filter (_.start.after(splitTime.getTime))
>        zipWithIndex foreach { case (e,n) => buildEntryForSplittedFile(e,n) })


Neat suggestion (I didn't know about the zipWithIndex method, which can solve a lot of things I currently do with mutable variables). The problem with this approach is the same as with the approach that I mentioned in my original e-mail: this kind of processing will read everything into a big list. I'm looking for an approach which could read big files in chunks (I think that srt files are not the best example, so think of this program as something that would have to parse log files with potentially GBs of data).

Cheers,


--
Anthony Accioly
Anthony Accioly 3
Re: Re: IO and parsing with idiomatic style?
Hi Lanny,


>    object Entry {
>
>        protected val idIter = Iterator.from(0)
>
>        def apply(orig: Entry): Entry = {
>            val t0: Long = splitTime.getTime.getTime
>            val start = new Date(orig.start.getTime - t0)
>            val end = new Date(orig.end.getTime - t0)
>
>            Entry(idIter.next.toLong, start, end, orig.text)
>        }
>
>        def apply(line: String): Entry = {
>            val sc = new Scanner(line)
>            sc.nextLine // skip original id
>            val start = timeFormat.parse(sc.findInLine(timePattern))
>            val end = timeFormat.parse(sc.findInLine(timePattern))
>            sc.useDelimiter("""\Z""")
>            val text = sc.next
>
>            Entry(0L, start, end, text)
>        }
>    }

Wow! Powerful. I think I'm beginning to understand the apply method.

>        // See https://github.com/sellmerfud/optparse for better

Nice library ;).

>        val lineIter = {
>            val sc = new Scanner(inputFile, encoding)
>            sc.useDelimiter("\n\r\n")
>
>            new Iterator[String] {
>                def hasNext: Boolean = { val hn = sc.hasNext; if (!hn) sc.close(); hn }
>                def next: String = try { sc.next } finally { sc.close() }
>            }
>        }

Ok. This one got me confused. You are passing the Iterator to lineIter (like an anonymous inner class in Java) right? Why are you closing the Scanner in the next method?

Sincerely,

--
Anthony Accioly

Jim McBeath
Re: IO and parsing with idiomatic style?
Anthony Accioly 2
Re: IO and parsing with idiomatic style?
Hi Jim,

> The map, filter, zipWithIndex and foreach methods are all evaluated lazily,
> so in fact it processes each item from sc all the way through the pipeline
> before handling the next item from sc.

Ohhh... I just learned something new :D. I always thought this kind of chain would read the entire file into a List[Entry], filter to produce a new List[Entry], zip to a List[(Entry, Index)] and finally pattern match to produce a new List[Entry].

The only thing is... Even if the entire chain is running for each item, isn't the final result a List of every entry in the file?

What I mean is:

   * Scanner will return every number from 1 to a googol.
   * Let's say that readEntry and buildEntry both return... say... a BigInt.
   * acceptNumber will only accept... say... odd numbers.

Suppose that I do something like this:
   ((sc map { readEntry(_) } filter { acceptNumber(_) }
       zipWithIndex) foreach { case (e,n) => println(buildEntry(e,n)) })

Even though this code clearly discards the values after the iteration, wouldn't this produce a List[BigInt] of 10^100/2 elements in memory? Or will every BigInt simply be discarded after the print?

Sincerely,
--
Anthony Accioly
E. Labun
Re: IO and parsing with idiomatic style?

On 2011-08-16 19:29, Anthony Accioly wrote:
> Hi Eugen,
>
> Thank you very much for the tip.
> I've changed the toString implementation like you suggested.
>
> Although I've read a couple of articles about Parse Combinators to see if I could grasp the
> technique (such as: http://www.codecommit.com/blog/scala/the-magic-behind-parser-combinators and
> http://debasishg.blogspot.com/2008/04/external-dsls-made-easy-with-scala...), I don't think I
> understood how to use the ^^ and ^^^ to output the results without creating a big list of every
> subtitle entry in the file. If that is not a big trouble, could you please enlighten me about how to
> use a parser like this and read the file entry by entry? My angle here is avoiding to load the
> entire file at once or creating a big data structure containing the contents of the entire file into
> memory.

Hi Anthony,

I reread your requirements (don't read the entire input file, don't make a big list of everything) and
unfortunately I have to say that I see no (simple) way to achieve those goals with combinators.

The SRT format (http://www.matroska.org/technical/specs/subtitles/srt.html) is a repetition of
entries. So the 'rep' combinator should be used. But it creates a List of parsed items and implies
reading the entire input file.

We could write to the output file inside the rep combinator (as a side effect) and transform the
inner parsed structure into something as simple as "()" (Unit), to reduce the memory usage. Or we could
implement another combinator ('loop') that behaves like 'rep' except that it explicitly discards
the parsing results (i.e. it exists only for its side effects) and produces a Unit instead of a List. But
this requires more work and does not seem to be a particularly elegant solution.

Sorry if this doesn't help further.

--
Eugen

jibal
Re: Re: IO and parsing with idiomatic style?

On Tue, Aug 16, 2011 at 11:37 AM, Anthony Accioly
wrote:
> Hi Lanny,
>
>
>>    object Entry {
>>
>>        protected val idIter = Iterator.from(0)
>>
>>        def apply(orig: Entry): Entry = {
>>            val t0: Long = splitTime.getTime.getTime
>>            val start = new Date(orig.start.getTime - t0)
>>            val end = new Date(orig.end.getTime - t0)
>>
>>            Entry(idIter.next.toLong, start, end, orig.text)
>>        }
>>
>>        def apply(line: String): Entry = {
>>            val sc = new Scanner(line)
>>            sc.nextLine // skip original id
>>            val start = timeFormat.parse(sc.findInLine(timePattern))
>>            val end = timeFormat.parse(sc.findInLine(timePattern))
>>            sc.useDelimiter("""\Z""")
>>            val text = sc.next
>>
>>            Entry(0L, start, end, text)
>>        }
>>    }
>
> Wow! Powerful. I think I'm beginning to understand the apply method.
>
>>
>>        // See https://github.com/sellmerfud/optparse for better
>
> Nice library ;).
>
>>        val lineIter = {
>>            val sc = new Scanner(inputFile, encoding)
>>            sc.useDelimiter("\n\r\n")
>>
>>            new Iterator[String] {
>>                def hasNext: Boolean = { val hn = sc.hasNext; if (!hn)
>> sc.close(); hn }
>>                def next: String = try { sc.next } finally
>> { sc.close() }
>>            }
>>        }
>
>
> Ok. This one got me confused. You are passing the Iterator to lineIter (like
> an anonymous inner class in Java) right? Why are you closing the Scanner in
> the next method?

Indeed, that close upon every invocation of next is certainly wrong.
Perhaps he meant catch rather than finally.
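
A minimal sketch of what the wrapper might look like with the close moved out of finally. ClosingScannerIterator is a hypothetical name, not the original poster's code: it closes the Scanner once, either when the input is exhausted or when next() fails.

```scala
import java.util.Scanner

// Hypothetical helper: exposes a Scanner as a Scala Iterator and closes it
// when the input runs out or when reading a token throws.
final class ClosingScannerIterator(sc: Scanner) extends Iterator[String] {
  private var closed = false

  private def closeOnce(): Unit =
    if (!closed) { closed = true; sc.close() }

  def isClosed: Boolean = closed

  def hasNext: Boolean = {
    // Never touch the Scanner after closing; it would throw IllegalStateException.
    val more = !closed && sc.hasNext
    if (!more) closeOnce()
    more
  }

  def next(): String =
    try sc.next()
    catch { case e: RuntimeException => closeOnce(); throw e }
}
```

If the consumer stops iterating early, the Scanner stays open, so a surrounding try/finally in the caller is still worth having for that case.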

Jim McBeath
Re: IO and parsing with idiomatic style?

On Tue, Aug 16, 2011 at 04:59:12PM -0300, Anthony Accioly wrote:
> Hi Jim,
>
> > The map, filter, zipWithIndex and foreach methods are all evaluated lazily,
> > so in fact it processes each item from sc all the way through the pipeline
> > before handling the next item from sc.
> >
> Ohhh... I just learned somethign new :D. I always though this kind of chain
> would read the entire file to a List[Entry], filter and produce a new
> List[Entry] zip to a List[(Entry,Index)] and finally pattern match to
> produce a new List[Entry].
>
> The only thing is... Even if the entire chain is running for each item,
> isn't the final result a List of every entry in the file?
>
> What I mean is:
>
> * Scanner will return every number from 1 to Googol.
> * Lets say that readEntry and buildEntry both returns... say... A BigInt.
>
> * acceptNumber will only accept... says ... odd numbers.
>
> Suppose that I do something like this:
> ((sc map { readEntry(_) } filter { acceptNumber(_) }
> zipWithIndex) foreach { case (e,n) => println(buildEntry(e,n)) })
>
> Even as this code clearly discards the values after the Iteration, wouldn't
> this produce a List[BigInt] of 1^100/2 elements in memory? Or every BigInt
> will be simple discarded after the print?
>
> Sincerely,

Anthony Accioly 2
Re: IO and parsing with idiomatic style?
Hi Jim,

> The foreach call does not return a value, so there is no list being
> created.

I tested the solution on a big file (generated in Scala :D) and received an OutOfMemoryError.
So I reduced the code even further, using the same steps (map -> filter -> zipWithIndex -> foreach -> println). I also implemented an imperative version.
Here is the code:

  def imperativeStyle() {
    var count = 1
    for (i <- 1 to 20000000) {
      val big = BigInt(i)
      if (big % 2 == 0) {
        println(count + " - " + big * 3)
        count += 1
      }
    }
  }

  def functionalStyle() {
    ((1 to 20000000) map { BigInt(_) }
      filter { _ % 2 == 0 } zipWithIndex) foreach {
      case(value, index) => println((index + 1) + " - " + (value * 3))
    }
  }

The first one runs without a problem, the second one throws an OutOfMemoryError.
I'm under the strong impression that a List is being created in the background... Or at least that the intermediate state is being kept by Scala somehow, and that this is the cause of the OutOfMemoryError... Or am I doing something wrong in the code that is breaking the expected behavior?

Cheers,

--
Anthony Accioly
Bill La Forge
Joined: 2011-07-13,
Re: IO and parsing with idiomatic style?
1. Doesn't (1 to 20000000) take a whole lot of memory on its own? I'll bet if you played around with it you could get out of memory just with this.

2. I keep reading on this list that map caches. But then there's something called a view that does not.

Bill La Forge


Anthony Accioly 2
Joined: 2010-12-14,
Re: IO and parsing with idiomatic style?

> 1. Doesn't (1 to 20000000) take a whole lot of memory on its own? I'll bet if you played around with it you could get out of memory just with this.

Taking a look at the source code I don't think so:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/Range.scala?view=markup

> 2. I keep reading on this list that map caches. But then there's something called a view that does not.

Using a view delayed the OutOfMemoryError. It printed the first million numbers easily, then slowed down a lot and finally threw an OutOfMemoryError. Can anyone give further input?

  def functionalStyle() {
    ((1 to 20000000).view.map { BigInt(_) }
      filter { _ % 2 == 0 } zipWithIndex) foreach {
      case(value, index) => println((index + 1) + " - " + (value * 3))
    }
  }

Cheers,

--
Anthony Accioly
Brian Maso
Joined: 2011-07-21,
Re: IO and parsing with idiomatic style?

Scala collections are strict by default, all except Streams. Use the "view" method to make the range non-strict:

scala> ((1 to 20000000).view map {BigInt(_)} filter {_ % 1000 == 0} zipWithIndex) foreach {case(v, i) => println(v + " - " + i)}
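
A small sketch (not part of the original message) can make the effect of "view" visible by logging the order in which elements are evaluated. It assumes Scala 2.13 or later, where views were reworked; behaviour on the 2.9 release used in this thread may differ. The names ViewSketch and trace are made up for illustration.

```scala
import scala.collection.mutable.ListBuffer

object ViewSketch {
  // Runs the map -> filter -> zipWithIndex -> foreach pipeline over a view
  // and records each step in a log instead of printing it.
  def trace(n: Int): List[String] = {
    val log = ListBuffer.empty[String]
    (1 to n).view
      .map { i => log += s"map $i"; BigInt(i) }
      .filter { _ % 2 == 0 }
      .zipWithIndex
      .foreach { case (value, index) => log += s"out ${index + 1} - ${value * 3}" }
    log.toList
  }
}
```

If the pipeline were strict, all "map" entries would appear before the first "out" entry. With the view, each element should travel through the whole pipeline before the next one is mapped.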

Brian Maso

Philippe Lhoste
Joined: 2010-09-02,
Re: IO and parsing with idiomatic style?

On 16/08/2011 19:47, Anthony Accioly wrote:
> Yeah, I wrote lazy code for myself, but I will write some validation
> code, it doesn't hurt anyway.

Exactly. Most of the code I write is for myself (even if it lives in a
public repository), but I try to discipline myself to write friendly
code. It is a good exercise anyway. And you will soon be the naive user
yourself ("what did I want to do there?")... ^_^

> > Can't you just subtract the timestamps?
> You mean something like new Date(date1.getTime() - date2.getTime())? It
> didn't do the trick for me (the results are wrong). Or were you
> thinking about something else?

No, it was just a naive/lazy suggestion... I don't even know what you do
with splitTime, actually.
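
One possible reason the subtraction looked wrong (an assumption, since the failing code is not shown here): a SimpleDateFormat formats a Date in the default time zone, so a difference in milliseconds wrapped in a new Date gets shifted by the zone offset when printed. A sketch that avoids Date altogether and works on plain milliseconds could look like this. The names SrtTime, toMillis, fromMillis and shiftBack are hypothetical, and the sketch assumes the timestamp is not earlier than the offset.

```scala
object SrtTime {
  // Matches the srt timestamp layout HH:mm:ss,SSS used in this thread.
  private val Stamp = """(\d\d):(\d\d):(\d\d),(\d\d\d)""".r

  // Converts a timestamp to a duration in milliseconds, independent of time zones.
  def toMillis(stamp: String): Long = stamp match {
    case Stamp(h, m, s, ms) =>
      ((h.toLong * 60 + m.toLong) * 60 + s.toLong) * 1000 + ms.toLong
    case _ => throw new IllegalArgumentException(s"bad timestamp: $stamp")
  }

  // Converts a non-negative duration in milliseconds back to HH:mm:ss,SSS.
  def fromMillis(millis: Long): String = {
    val ms = millis % 1000
    val totalSeconds = millis / 1000
    val s = totalSeconds % 60
    val m = (totalSeconds / 60) % 60
    val h = totalSeconds / 3600
    "%02d:%02d:%02d,%03d".format(h, m, s, ms)
  }

  // Moves a timestamp back by the given offset (assumes stamp >= offset).
  def shiftBack(stamp: String, offset: String): String =
    fromMillis(toMillis(stamp) - toMillis(offset))
}
```

For example, SrtTime.shiftBack("00:10:05,500", "00:10:00,000") re-bases an entry from the second half of a split file to start five and a half seconds in.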

Jim McBeath
Joined: 2009-01-02,
Re: IO and parsing with idiomatic style?

On Wed, Aug 17, 2011 at 12:04:22AM -0300, Anthony Accioly wrote:
> Hi Jim,
>
> > The foreach call does not return a value, so there is no list being created.
>
> I tested the solution for a big file (generated in Scala :D) and received an
> OutOfMemoryError.
> So I reduced the code even further, using the same steps (map -> filter ->
> zipWithIndex -> foreach -> println). I also implemented an imperative
> version.
> Here is the code.
>
>   def imperativeStyle() {
>     var count = 1
>     for (i <- 1 to 20000000) {
>       val big = BigInt(i)
>       if (big % 2 == 0) {
>         println(count + " - " + big * 3)
>         count += 1
>       }
>     }
>   }
>
>   def functionalStyle() {
>     ((1 to 20000000) map { BigInt(_) }
>       filter { _ % 2 == 0 } zipWithIndex) foreach {
>       case(value, index) => println((index + 1) + " - " + (value * 3))
>     }
>   }
>
> The first one runs without a problem, the second one throws an
> OutOfMemoryError.
> I'm under the strong impression that a List is being created in the
> background... Or at least that the intermediate state is being kept by Scala
> somehow, and that this is the cause of the OutOfMemoryError... Or am I doing
> something wrong in the code that is breaking the expected behavior?
>
> Cheers,
>

Anthony Accioly 2
Joined: 2010-12-14,
Re: IO and parsing with idiomatic style?
Hi Jim,

> Sorry, let me be a bit more precise.  Scala's collection operators such
> as map and filter return a collection type which is the same as the
> input type (when possible).  So applying map to a List returns a List,
> and applying map to an Iterator returns an Iterator.  If the type is
> strict, such as List, then the map method will execute strictly, so
> it will process the whole input List before returning a result List.
> If the type is lazy, such as Iterator or Stream, it will return a new
> Iterator or Stream, which is lazy, so it won't actually process the
> pipeline (or pull items from the input Iterator or Stream) until you
> request items from the output.
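
The difference between strict and lazy evaluation described here can be observed with a minimal sketch that logs the order of the steps. This is an illustration, not code from the thread; the names StrictVsLazy, strictTrace and lazyTrace are made up.

```scala
import scala.collection.mutable.ListBuffer

object StrictVsLazy {
  // map on a List is strict: every element is mapped before foreach starts.
  def strictTrace(xs: List[Int]): List[String] = {
    val log = ListBuffer.empty[String]
    xs.map { x => log += s"map $x"; x * 10 }
      .foreach { y => log += s"use $y" }
    log.toList
  }

  // map on an Iterator is lazy: each element is mapped only when foreach asks for it.
  def lazyTrace(xs: List[Int]): List[String] = {
    val log = ListBuffer.empty[String]
    xs.iterator.map { x => log += s"map $x"; x * 10 }
      .foreach { y => log += s"use $y" }
    log.toList
  }
}
```

The strict version has to hold the whole intermediate List in memory, while the Iterator version only needs the element currently in flight.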

Thanks for the very detailed explanation and for the examples.
I've learned a lot from you.
About lazy vs. strict: you are completely right! The Java Scanner implements java.util.Iterator (http://download.oracle.com/javase/7/docs/api/java/util/Scanner.html), but I guess the Scala Iterator and the Java Iterator are two different animals. I will try to wrap either Source or the Java Scanner inside a Scala Iterator and test the solution again.
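
One way such a wrapper might look is sketched below, assuming Scala 2.13 or later, where the scala.jdk.CollectionConverters adapters are available (older releases shipped different converter objects). The object name ScannerSketch and the method tokens are hypothetical, and the sketch reads from a String to stay self-contained; a File plus encoding could be passed to the Scanner constructor instead.

```scala
import java.util.Scanner
import scala.jdk.CollectionConverters._

object ScannerSketch {
  // Adapts a delimiter-based java.util.Scanner to a Scala Iterator.
  // The adapter pulls tokens from the Scanner on demand, so map, filter and
  // zipWithIndex on the result stay lazy.
  def tokens(input: String, delimiter: String): Iterator[String] = {
    val sc = new Scanner(input).useDelimiter(delimiter)
    (sc: java.util.Iterator[String]).asScala
  }
}
```

Note that useDelimiter interprets its argument as a regular expression, so a delimiter such as a blank line has to be written accordingly.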

For now, I'm happy to report that, for our simplified example, the imperative and functional styles have about the same performance.

Here's a flawed micro-benchmark:

  def main(args: Array[String]) {
    time(imperativeStyle, 200)
    val imperativeTime = time(imperativeStyle, 20000000)
    time(functionalStyle, 200)
    val functionalTime = time(functionalStyle, 20000000)
    println("--------------------------------")
    println("Imperative Time: " + formatTime(imperativeTime))
    println("Functional Time: " + formatTime(functionalTime))
  }

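  // Runs f once and returns the elapsed wall-clock time in nanoseconds.
  def time(f: Int => Unit, iterations: Int) = {
    val initialTime = System.nanoTime
    f(iterations)
    System.nanoTime - initialTime
  }

  // Formats a duration given in nanoseconds as milliseconds.
  def formatTime(time: Long) = {
    "%.2f" format (time.toDouble / 1000000)
  }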

  def imperativeStyle(iterations: Int) {
    var count = 1
    for (i <- 1 to iterations) {
      val big = BigInt(i)
      if (big % 2 == 0) {
        println(count + " - " + big * 3)
        count += 1
      }
    }
  }

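  def functionalStyle(iterations: Int) {
    // Note: java.util.Scanner has no Int constructor; a Scala Iterator
    // over the range is assumed here as the lazy source.
    ((1 to iterations).iterator map { BigInt(_) }
      filter { _ % 2 == 0 } zipWithIndex) foreach {
      case(value, index) => println((index + 1) + " - " + (value * 3))
    }
  }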

And here are the outputs of 3 runs (times in milliseconds):

Imperative Time: 47128.33
Functional Time: 48372.39

Imperative Time: 48247.83
Functional Time: 47555.95

Imperative Time: 46280.32
Functional Time: 48317.91

In my eager quest to eliminate every val from the code, I even hit a rare JVM error (at least, it is the first time I have hit that error in 5 years of programming Java).

  def functionalStyle2(iterations: Int) {
    for ((value,index) <- ((1 to iterations).view.map { BigInt(_) }
           filter { _ % 2 == 0 } zipWithIndex)) {
        println((index + 1) + " - " + (value * 3))
    }
  }
  functionalStyle2(20000000)

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Philippe Lhoste, point taken: "a little mutability here and there doesn't necessarily hurt" :).

PS: Is there an easy way to set a delimiter on Source and obtain an Iterator? Or would I have to implement an Iterator that wraps Source.getLines() and checks for my own delimiter?

Cheers,

--
Anthony Accioly
Jim McBeath
Joined: 2009-01-02,
Re: IO and parsing with idiomatic style?

On Wed, Aug 17, 2011 at 05:57:53PM -0300, Anthony Accioly wrote:
> PS: Is there an easy way to set a delimiter on Source and obtain an
> Iterator? Or would I have to implement an Iterator that wraps
> Source.getLines() and checks for my own delimiter?

I'm not sure quite what you are doing, but this code will split first on
line breaks and then on another delimiter within the line:

scala> val a = Source.fromFile("/etc/passwd").getLines.flatMap(_.split(":"))
a: Iterator[java.lang.String] = non-empty iterator

scala> a take 20 toList
res1: List[java.lang.String] = List(root, x, 0, 0, root, /root, /bin/bash, bin, x, 1, 1, bin, /bin, /sbin/nologin, daemon, x, 2, 2, daemon, /sbin)

The "toList" is just to get it to print; if you just do "a take 20" you
will get back an iterator.
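
For a delimiter that spans lines, such as the blank line between srt entries, a wrapper around the line iterator is another option. The following is only a sketch under that assumption; RecordSketch and records are made-up names, and the input can be any Iterator[String], for example the result of Source.fromFile(...).getLines.

```scala
object RecordSketch {
  // Groups consecutive non-blank lines into records and treats runs of blank
  // lines as the delimiter. Lines are pulled from the input only on demand,
  // so at most one record is held in memory at a time.
  def records(lines: Iterator[String]): Iterator[List[String]] =
    new Iterator[List[String]] {
      private val in = lines.buffered

      // Drops any blank lines in front of the next record.
      private def skipBlank(): Unit =
        while (in.hasNext && in.head.trim.isEmpty) in.next()

      def hasNext: Boolean = { skipBlank(); in.hasNext }

      def next(): List[String] = {
        skipBlank()
        if (!in.hasNext) throw new NoSuchElementException("no more records")
        val buf = List.newBuilder[String]
        while (in.hasNext && in.head.trim.nonEmpty) buf += in.next()
        buf.result()
      }
    }
}
```

Because the result is itself an Iterator, it can feed the same map, filter and zipWithIndex pipeline discussed earlier in the thread without reading the whole file first.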

--
Jim

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland