This page is no longer maintained — Please continue to the home page at www.scala-lang.org

Combinator parsing… library bug?

4 replies
Andrew Forrest
Joined: 2008-12-20,
User offline. Last seen 42 years 45 weeks ago.
Hi all,
I’ve been writing some Scala code to parse a simple programming language. The language’s syntax is similar to JavaScript, so it has block comments (/* like this */), line comments (// like this) and strings ("in double quotes" or in 'single quotes'.) It has a lexer component and a parser component. The lexer was originally based on scala.util.parsing.combinator.lexical.StdLexical, specifically these bits:
def token: Parser[Token] =   ( letter ~ rep( letter | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }  | digit ~ rep( digit ) ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }  | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }  | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }  | EofCh ^^^ EOF  | '\'' ~> failure("unclosed string literal")          | '\"' ~> failure("unclosed string literal")          | delim                                               | failure("illegal character")  )
// see `whitespace in `Scanners' def whitespace: Parser[Any] = rep(    whitespaceChar  | '/' ~ '*' ~ comment  | '/' ~ '/' ~ rep( chrExcept(EofCh, '\n') )  | '/' ~ '*' ~ failure("unclosed comment")  )
protected def comment: Parser[Any] = (    '*' ~ '/'  ^^ { case _ => ' '  }  | chrExcept(EofCh) ~ comment  )
Note the 3 'failure' lines for unclosed comments or strings. I was seeing odd behaviour in my own code in that none of these ‘unclosed’ failures seemed to be firing. For the strings, I got 
`"' expected but X found (where X was a newline or the end-of-file character); for the block-comments, I had other code which was picking up the ‘/*’ as an identifier. (Like Scala, my language allows symbolic identifiers, so if it wasn't the start of a comment, "/*" would be a legal identifier.)
I think there are two problems here:
1. The 'failure's should be 'err's. Otherwise, 
a. because of the way Parser.Failures compose, the last parse failure is the one returned, so in the case of strings, the fact it's found the end of the file (or the end of the line) is returned rather than the 'failure' at the start of the unclosed string; b. inside the 'rep' in 'whitespace', the failure is just treated as a failure to match, and it's merrily ignored!

However, even this does not solve the problem. If a parser returns a Parsers.Failure, any attempt to backtrack will never return a Parsers.Error (only Parsers.Success, if one is found).
To make parsing code work with ‘err’s, you need to positively search for the error condition (an unterminated string/comment) BEFORE anything which could return a parse failure later in the stream.
I’m on shakier ground here, in terms of my understanding of the whole thing, but I believe that scala.util.parsing.combinator.Parsers.Failure could (should?) be changed:
case class Failure(override val msg: String, override val next: Input) extends NoSuccess(msg, next) {  /** The toString method of a Failure yields an error message */  override def toString = "["+next.pos+"] failure: "+msg+"\n\n"+next.pos.longString    def append[U >: Nothing](a: => ParseResult[U]): ParseResult[U] = { val alt = a; alt match {    case _: Success => alt    case _: Error => alt    case _: Failure => if (alt.next.pos < next.pos) this else alt  }} }
I believe this would allow the parsing to backtrack from a ‘Failure’ to an ‘Error’. Which makes sense, because otherwise ‘Error’s don’t stop the parse, like they’re supposed to. I don’t have my head round combinator parsing enough to know if this would have other horrible side effects, though.
In my case, rather than modifying the scala-library, I found that I could achieve what I wanted by making my ‘Failure’s into ‘Success’es, and returning ‘ErrorToken’s directly, (instead of trying to indirectly cause them through ‘err’):
| '\'' ^^^ ErrorToken("unclosed string literal") | '\"' ^^^ ErrorToken("unclosed string literal")

Anybody any comments?
–Andrew
dcsobral
Joined: 2009-04-23,
User offline. Last seen 38 weeks 5 days ago.
Re: Combinator parsing… library bug?
I'm not sure I understand you -- I have a hell of a headache right now :-) -- but shouldn't you be using ~! before the failures, to prevent backtracking?

On Fri, Jul 10, 2009 at 12:26 PM, Andrew Forrest <andrew@dysphoria.net> wrote:
Hi all,
I’ve been writing some Scala code to parse a simple programming language. The language’s syntax is similar to JavaScript, so it has block comments (/* like this */), line comments (// like this) and strings ("in double quotes" or in 'single quotes'.) It has a lexer component and a parser component. The lexer was originally based on scala.util.parsing.combinator.lexical.StdLexical, specifically these bits:
def token: Parser[Token] =   ( letter ~ rep( letter | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }  | digit ~ rep( digit ) ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }  | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }  | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }  | EofCh ^^^ EOF  | '\'' ~> failure("unclosed string literal")          | '\"' ~> failure("unclosed string literal")          | delim                                               | failure("illegal character")  )
// see `whitespace in `Scanners' def whitespace: Parser[Any] = rep(    whitespaceChar  | '/' ~ '*' ~ comment  | '/' ~ '/' ~ rep( chrExcept(EofCh, '\n') )  | '/' ~ '*' ~ failure("unclosed comment")  )
protected def comment: Parser[Any] = (    '*' ~ '/'  ^^ { case _ => ' '  }  | chrExcept(EofCh) ~ comment  )
Note the 3 'failure' lines for unclosed comments or strings. I was seeing odd behaviour in my own code in that none of these ‘unclosed’ failures seemed to be firing. For the strings, I got 
`"' expected but X found (where X was a newline or the end-of-file character); for the block-comments, I had other code which was picking up the ‘/*’ as an identifier. (Like Scala, my language allows symbolic identifiers, so if it wasn't the start of a comment, "/*" would be a legal identifier.)
I think there are two problems here:
1. The 'failure's should be 'err's. Otherwise, 
a. because of the way Parser.Failures compose, the last parse failure is the one returned, so in the case of strings, the fact it's found the end of the file (or the end of the line) is returned rather than the 'failure' at the start of the unclosed string; b. inside the 'rep' in 'whitespace', the failure is just treated as a failure to match, and it's merrily ignored!

However, even this does not solve the problem. If a parser returns a Parsers.Failure, any attempt to backtrack will never return a Parsers.Error (only Parsers.Success, if one is found).
To make parsing code work with ‘err’s, you need to positively search for the error condition (an unterminated string/comment) BEFORE anything which could return a parse failure later in the stream.
I’m on shakier ground here, in terms of my understanding of the whole thing, but I believe that scala.util.parsing.combinator.Parsers.Failure could (should?) be changed:
case class Failure(override val msg: String, override val next: Input) extends NoSuccess(msg, next) {  /** The toString method of a Failure yields an error message */  override def toString = "["+next.pos+"] failure: "+msg+"\n\n"+next.pos.longString     def append[U >: Nothing](a: => ParseResult[U]): ParseResult[U] = { val alt = a; alt match {    case _: Success => alt    case _: Error => alt    case _: Failure => if (alt.next.pos < next.pos) this else alt  }} }
I believe this would allow the parsing to backtrack from a ‘Failure’ to an ‘Error’. Which makes sense, because otherwise ‘Error’s don’t stop the parse, like they’re supposed to. I don’t have my head round combinator parsing enough to know if this would have other horrible side effects, though.
In my case, rather than modifying the scala-library, I found that I could achieve what I wanted by making my ‘Failure’s into ‘Success’es, and returning ‘ErrorToken’s directly, (instead of trying to indirectly cause them through ‘err’):
| '\'' ^^^ ErrorToken("unclosed string literal") | '\"' ^^^ ErrorToken("unclosed string literal")

Anybody any comments?
–Andrew



--
Daniel C. Sobral

Something I learned in academia: there are three kinds of academic reviews: review by name, review by reference and review by value.
Antonio Cunei
Joined: 2008-12-16,
User offline. Last seen 3 years 22 weeks ago.
Re: Combinator parsing… library bug? - OT

Sorry, but the "scala-internals" list is devoted to the core internals and
the development of the Scala compiler: we need to keep it on focus.

In short:
1) I am developing the parser combinator -> ok on scala-internals
2) I think I found a bug in the parser combinator -> not ok

Kindly continue this thread on either "scala" or "scala-user". If, after
the discussion, you think you encountered a bug, please submit a ticket on
Trac.

Thanks!
Toni

Daniel Sobral wrote:
> I'm not sure I understand you -- I have a hell of a headache right now :-)

Andrew Forrest
Joined: 2008-12-20,
User offline. Last seen 42 years 45 weeks ago.
Re: Combinator parsing… library bug? - OT

Okay, no worries; I wasn’t sure how ‘internal’ internals covered; will
repost on scala-user.

On 10 Jul 2009, at 23:19, Antonio Cunei wrote:

> Sorry, but the "scala-internals" list is devoted to the core
> internals and the development of the Scala compiler: we need to keep
> it on focus.
>
> In short:
> 1) I am developing the parser combinator -> ok on scala-internals
> 2) I think I found a bug in the parser combinator -> not ok
>
> Kindly continue this thread on either "scala" or "scala-user". If,
> after the discussion, you think you encountered a bug, please submit
> a ticket on Trac.
>
> Thanks!
> Toni
>
> Daniel Sobral wrote:
>> I'm not sure I understand you -- I have a hell of a headache right
>> now :-)

Andrew Forrest
Joined: 2008-12-20,
User offline. Last seen 42 years 45 weeks ago.
Re: Combinator parsing… library bug?

On 10 Jul 2009, at 18:53, Daniel Sobral wrote:
I'm not sure I understand you -- I have a hell of a headache right now :-) -- but shouldn't you be using ~! before the failures, to prevent backtracking?

Yeah... that doesn't actually help: It's not my code (it's copied straight from “scala.util.parsing.combinator.lexical.StdLexical”), and it also doesn't fix the problem, because by the time you reach the ‘failure’s, you’ve already backtracked.
The problem is that the attempts at successful string/comment parsing occur BEFORE the attempts to match a failure:
1 def token: Parser[Token] = 2 ( letter ~ rep( letter | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }3 | digit ~ rep( digit ) ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }4 | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }5 | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }6 | EofCh ^^^ EOF7 | '\'' ~> failure("unclosed string literal")        8 | '\"' ~> failure("unclosed string literal")        9 | delim                                             A | failure("illegal character")B )
So, line 4 attempts to match a quote, followed by a string, followed by an end quote, where the string cannot contain an end-of-file, or a newline. If, say, it finds a newline before the end quote, that particular match (line 4) will fail. Line 7 will ALSO fail, but the message generated on line 7 will never be reported to the user, since line 4 failed further along the input, and the way failures compose, the LATER failure is always the one reported.
However, line 7’s message is the one we want, and in fact the only reason that line 7 exists is to give a user-friendly error message.
Which is why I suggested that perhaps it should be ‘err’ rather than ‘failure’, and perhaps the composition rules for (the) “Failure” (class) should be changed to admit an “Error” earlier in the stream.

--Andrew

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland