parser combinators vs. regex question
Thu, 2009-05-21, 19:10
I'm trying to do a CSV parser: http://gist.github.com/115557
A theoretical question:
The following regex: /(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r\n | \z )/
has the nice property of ignoring any double quotes in the file which aren't
followed by either a semicolon, an end-of-line or an end-of-file.
In particular, I can pass "\"\"\";" - and it will be a valid CSV file
containing a double-quote in the first column.
If I'm not mistaken, it's called backtracking:
the regular expression engine can see that the second double-quote wasn't
the proper match and will skip to the third double-quote, which is followed by
a semicolon.
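The backtracking described above can be checked with a few lines of Scala. This is only a sketch: it uses the same regex without the free-spacing flag, makes the outer groups non-capturing so that groups 1 and 2 are the two column values, and the names LazyQuoteDemo and columns are made up for illustration.

```scala
import scala.util.matching.Regex

object LazyQuoteDemo {
  // Two optionally quoted columns separated by ';', ended by CRLF or end of input.
  // (?s) lets '.' match line breaks; '.*?' is lazy, so the engine extends the
  // match one character at a time until the rest of the pattern fits.
  val Line: Regex = """(?s)(?:"(.*?)"|);(?:"(.*?)"|)(?:\r\n|\z)""".r

  // Returns the two column values, or None if the input is not a valid line.
  // A group that did not take part in the match is null, hence Option(...).
  def columns(input: String): Option[(String, String)] =
    Line.findFirstMatchIn(input).map { m =>
      (Option(m.group(1)).getOrElse(""), Option(m.group(2)).getOrElse(""))
    }
}
```

For the input "\"\"\";" the lazy group first tries the empty string, finds that the next character after the closing quote is not a semicolon, and only then grows to include the second double-quote.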
I have tried to do something similar with parser combinators,
e.g. with
def stringInQuotes = "" | ('"' ~ rep ("(?s).".r) ~ '"'
^^ {case _ ~ chars ~ _ => chars.mkString ("")})
or
def stringInQuotes = opt ('"' ~ rep (elem("value", (c: Char) => true)) ~ '"')
^^ {case None => ""; case Some (_ ~ chars ~ _) => chars.mkString ("")}
but the first doesn't do what I expected
and the second goes into an infinite loop.
Is it possible to do that kind of backtracking with the current (2.7.4) release
of the combinators and if so, then what am I doing wrong?
Thu, 2009-05-21, 19:57
#2
Re: parser combinators vs. regex question
On Thu, May 21, 2009 at 05:53:35PM +0000, ArtemGr wrote:
> The following regex: /(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r\n | \z )/
> has the nice property of ignoring any double quotes in the file which aren't
> followed by either a semicolon, an end-of-line or an end-of-file.
> In particular, I can pass "\"\"\";" - and it will be a valid CSV file
> containing a double-quote in the first column.
For 2.8 the guard combinator has been added, so you could say:
'"' ~> rep(chars) <~ '"' <~ guard("\r\n" | EOF)
but in 2.7 you can achieve the same effect with double negation:
'"' ~> rep(chars) <~ '"' <~ not(not("\r\n" | EOF))
This is pseudocode but you should be able to get it going. Neither
guard nor not consume any input, they only place constraints on a match.
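To make the "consume no input" point concrete, here is a toy model of lookahead. It is not the library's implementation: the type P and the helpers lit, eof, or, not and guard are hypothetical stand-ins written from scratch, and only the names not and guard are borrowed from the combinators discussed above.

```scala
object LookaheadSketch {
  // A toy parser: given the input and an offset, return the parsed value
  // and the next offset, or None on failure.
  type P[A] = (String, Int) => Option[(A, Int)]

  // Match a literal string at the current offset.
  def lit(s: String): P[String] =
    (in, i) => if (in.startsWith(s, i)) Some((s, i + s.length)) else None

  // Succeed only at the end of the input.
  val eof: P[String] =
    (in, i) => if (i == in.length) Some(("", i)) else None

  // Ordered choice: try p, and if it fails retry q from the same offset.
  def or[A](p: P[A], q: P[A]): P[A] =
    (in, i) => p(in, i).orElse(q(in, i))

  // Negative lookahead: succeed, consuming nothing, exactly when p fails.
  def not[A](p: P[A]): P[Unit] =
    (in, i) => if (p(in, i).isEmpty) Some(((), i)) else None

  // Positive lookahead: run p but report the original offset.
  def guard[A](p: P[A]): P[A] =
    (in, i) => p(in, i).map { case (a, _) => (a, i) }

  // Stand-in for ("\r\n" | EOF) from the pseudocode.
  val lineEnd: P[String] = or(lit("\r\n"), eof)
}
```

In this model guard(lineEnd) and not(not(lineEnd)) succeed on exactly the same inputs and both leave the offset untouched, which is why the double negation can stand in for guard.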
Thu, 2009-05-21, 21:07
#3
Re: parser combinators vs. regex question
Paul Phillips writes:
> For 2.8 the guard combinator has been added, so you could say:
>
> '"' ~> rep(chars) <~ '"' <~ guard("\r\n" | EOF)
>
> but in 2.7 you can achieve the same effect with double negation:
>
> '"' ~> rep(chars) <~ '"' <~ not(not("\r\n" | EOF))
>
> This is pseudocode but you should be able to get it going. Neither
> guard nor not consume any input, they only place constraints on a match.
Thanks for the hint.
I think the guard is a look-ahead construct,
it doesn't answer the backtracking question in any way.
Intuitively, since there is a disjunction which effectively does backtracking
(e.g. if the first Parser didn't work, the disjunction should return to the same
position and try the second Parser), I think replacing
def stringInQuotes = """(?xs) ".*?" |""".r ^^ {
case qstr => if (qstr.length != 0) qstr.substring (1, qstr.length - 1) else ""}
def line = stringInQuotes ~ ';' ~ stringInQuotes ~ (CRLF | EOF) ^^ {
case col1 ~ _ ~ col2 ~ _ => col1 :: col2 :: Nil}
with
def chars1 = rep ("(?s).".r) ^^ {case chars_ => chars_ mkString ""}
def chars2 = rep (elem("value", (c: Char) => true)) ^^ {
case chars_ => chars_ mkString ""}
def col1 = ('"' ~ chars1 ~ "\";" ^^ {case _ ~ value ~ _ => value}
| ";" ^^ {case _ => ""})
def col2 = ('"' ~ chars1 ~ ("\"" ~ (CRLF | EOF)) ^^ {
case _ ~ value ~ _ => value}
| (CRLF | EOF) ^^ {case _ => ""})
def line = col1 ~ col2 ^^ {case v1 ~ v2 => v1 :: v2 :: Nil}
should just work.
However, like I said earlier, with chars1 it gives wrong results, e.g.
[1.1] failure: `";' expected but `' found
with input "\"qq\nqq\";"
and with chars2 it goes into an infinite loop...
Thu, 2009-05-21, 22:07
#4
Re: Re: parser combinators vs. regex question
On Thu, May 21, 2009 at 08:01:28PM +0000, ArtemGr wrote:
> [snip]
The degree to which you are overcomplicating this by trying to apply
regexp backtracking defies description. Your code is too difficult to
read, but what you're trying to do can be done in one or two lines.
> I think the guard is a look-ahead construct, it doesn't answer the
> backtracking question in any way.
What do you think happens when the guard fails? Backtracking is what the
combinators do if you use the backtracking ops (that's | among others.)
Attempting to reimplement backtracking with regexps inside the
combinator framework is like throwing away your sword so you can fight
off the marauders with a lampshade.
Thu, 2009-05-21, 22:47
#5
Re: parser combinators vs. regex question
Paul Phillips writes:
> On Thu, May 21, 2009 at 08:01:28PM +0000, ArtemGr wrote:
> > [snip]
>
> The degree to which you are overcomplicating this by trying to apply
> regexp backtracking defies description.
I'm not "trying to apply" regex backtracking, i've already said it isn't
very convenient without access to subgroups.
> Your code is too difficult to
> read, but what you're trying to do can be done in one or two lines.
> > I think the guard is a look-ahead construct, it doesn't answer the
> > backtracking question in any way.
>
> What do you think happens when the guard fails? Backtracking is what the
> combinators do if you use the backtracking ops (that's | among others.)
That's what I implied in the previous post, by saying that disjunction should
use backtracking. Thanks for clarifying. That answers the theoretical
half of my question.
> Attempting to reimplement backtracking with regexps inside the
> combinator framework is like throwing away your sword so you can fight
> off the marauders with a lampshade.
I think the comparison would rather be throwing away the spare parts of a
calculator in order to fight the marauders with the good old regex sword.
After all, the simple and intuitive regex
"""(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r?\n | \z ) """
which took me one minute to write - works, and parser combinators, after 12
to 16 hours of tweaking, either fail unexpectedly or spiral into an
infinite loop.
Fri, 2009-05-22, 09:57
#6
Re: parser combinators vs. regex question
ArtemGr writes:
> Is it possible to do that kind of backtracking with the current (2.7.4)
> release of the combinators and if so, then what i'm doing wrong?
I have found an interesting comment
in the javadoc of disjunction method "Parser.|":
`p | q' succeeds if `p' succeeds or `q' succeeds
Note that `q' is only tried if `p's failure is non-fatal
(i.e., back-tracking is allowed).
It implies that backtracking is not always allowed.
I wonder what kind of parsers might produce a "fatal failure" and why?
Fri, 2009-05-22, 10:07
#7
Re: Re: parser combinators vs. regex question
A not-matching branch of an alternative is non-fatal.
For better error handling you might anticipate and introduce error
branches (alternatives) to give better error messages. These are
fatal, since you don't want parsing to continue in these cases.
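The difference between a non-fatal and a fatal failure can be sketched with a toy result type. This is a from-scratch model, not the library's code: the names Success, Failure, Error and commit mirror the ones in the combinator library, while P, lit and or are hypothetical helpers for illustration.

```scala
object FatalSketch {
  sealed trait Result[+A]
  case class Success[A](value: A, next: Int) extends Result[A]
  // Non-fatal: the enclosing alternative may still try another branch.
  case class Failure(msg: String) extends Result[Nothing]
  // Fatal: no further alternatives are tried.
  case class Error(msg: String) extends Result[Nothing]

  type P[A] = (String, Int) => Result[A]

  // Match a literal string; a mismatch is an ordinary (non-fatal) failure.
  def lit(s: String): P[String] = (in, i) =>
    if (in.startsWith(s, i)) Success(s, i + s.length) else Failure("expected " + s)

  // Alternative: q is tried from the same offset only if p failed non-fatally.
  def or[A](p: P[A], q: P[A]): P[A] = (in, i) => p(in, i) match {
    case Failure(_) => q(in, i)
    case other      => other
  }

  // Turn p's failures into errors, which switches off backtracking past p.
  def commit[A](p: P[A]): P[A] = (in, i) => p(in, i) match {
    case Failure(msg) => Error(msg)
    case other        => other
  }
}
```

With these definitions or(lit("a"), lit("b")) accepts "b", while or(commit(lit("a")), lit("b")) stops with an error on the same input because the second branch is never tried.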
On Fri, May 22, 2009 at 10:49 AM, ArtemGr wrote:
> ArtemGr writes:
>> Is it possible to do that kind of backtracking with the current (2.7.4)
>> release of the combinators and if so, then what i'm doing wrong?
>
> I have found an interesting comment
> in the javadoc of disjunction method "Parser.|":
>
>
> `p | q' succeeds if `p' succeeds or `q' succeeds
> Note that `q' is only tried if `p's failure is non-fatal
> (i.e., back-tracking is allowed).
>
> It implies that backtracking is not always allowed.
>
> I wonder what kind of parsers might produce a "fatal failure" and why?
Fri, 2009-05-22, 14:17
#8
Re: Re: parser combinators vs. regex question
On Friday May 22 2009, Johannes Rudolph wrote:
> A not-matching branch of an alternative is non-fatal.
> For better error handling you might anticipate and introduce error
> branches (alternatives) to give better error messages. These are
> fatal, since you don't want parsing to continue in these cases.
Unless, of course, you do. I tend to think that nothing is worse than a
prematurely terminated parse. It's at least as bad as poor error
messages.
There are typically some (often many) errors that don't render the rest
of the parse impossible or meaningless. When I write parsers, I always
try to include as many error productions as possible.
Randall Schulz
Tue, 2009-09-29, 14:47
#9
Regex question
He,
I am trying to strip off potential leading/trailing quotation marks
("), but can't get my regex right:
val s1 = "\"hello world\""
val s2 = "hello world"
val StrValue = """[^"]((\w*\s*)*)""".r
(StrValue findFirstIn s1) foreach (println)
(StrValue findFirstIn s2) foreach (println)
val StrValue(_, ss1) = s1 // XXX Match Error
println(ss1)
val StrValue(ss2) = s2 // XXX Match Error
println(ss2)
Can anyone help me out, please?
Cheers,
--
Normen Müller
Tue, 2009-09-29, 15:17
#10
Re: Regex question
I just realized that
val Decimal = """(-)?(\d+)(\.\d*)?""".r
val Decimal(d) = "1.0"
println(d)
out of ``Programming in Scala'' also doesn't work. But I am pretty
sure it worked some time before 2.7.6. :\
On Sep 29, 2009, at 3:39 PM, Normen Müller wrote:
> He,
>
> I am trying to strip off potential leading/trailing quotation marks
> ("), but can't get my regex right:
>
> val s1 = "\"hello world\""
> val s2 = "hello world"
>
> val StrValue = """[^"]((\w*\s*)*)""".r
>
> (StrValue findFirstIn s1) foreach (println)
> (StrValue findFirstIn s2) foreach (println)
>
> val StrValue(_, ss1) = s1 // XXX Match Error
> println(ss1)
>
> val StrValue(ss2) = s2 // XXX Match Error
> println(ss2)
>
> Can anyone help me out, please?
>
> Cheers,
> --
> Normen Müller
>
Cheers,
--
Normen Müller
Tue, 2009-09-29, 15:27
#11
Re: Re: Regex question
On Tuesday September 29 2009, Normen Müller wrote:
> I just recognized that
>
> val Decimal = """(-)?(\d+)(\.\d*)?""".r
> val Decimal(d) = "1.0"
> println(d)
>
> out of ``Programming in Scala'' also doesn't work. But I am pretty
> sure it worked some time before 2.7.6. :\
On this machine (where I don't do development and hence don't
bother updating Scala) I have 2.7.4.
But your result is to be expected. There are three capturing groups in
your RE, so you have to bind three values in the combined match /
declaration syntax:
scala> val Decimal(signPart, intPart, fracPart) = "1.0"
signPart: String = null
intPart: String = 1
fracPart: String = .0
Randall Schulz
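The same point, one bound variable per capturing group, can be written as a small function. This is a sketch only: the object name DecimalDemo and the function parts are made up, and wrapping the optional groups in Option is a choice made here to avoid the null seen in the REPL output above.

```scala
object DecimalDemo {
  val Decimal = """(-)?(\d+)(\.\d*)?""".r

  // The extractor pattern needs exactly three variables, one per group.
  // Groups that did not take part in the match come back as null,
  // so the optional ones are wrapped in Option.
  def parts(s: String): Option[(Option[String], String, Option[String])] = s match {
    case Decimal(signPart, intPart, fracPart) =>
      Some((Option(signPart), intPart, Option(fracPart)))
    case _ => None
  }
}
```

The catch-all case also avoids a MatchError for input that is not a decimal at all.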
Tue, 2009-09-29, 15:37
#12
Re: Re: Regex question
On Sep 29, 2009, at 4:19 PM, Randall R Schulz wrote:
> On Tuesday September 29 2009, Normen Müller wrote:
>> I just recognized that
>>
>> val Decimal = """(-)?(\d+)(\.\d*)?""".r
>> val Decimal(d) = "1.0"
>> println(d)
>>
>> out of ``Programming in Scala'' also doesn't work. But I am pretty
>> sure it worked some time before 2.7.6. :\
>
> On this machine (where I don't do development and hence don't
> bother updating Scala) I have 2.7.4.
>
> But your result is to be expected. There are three capturing groups in
> your RE, so you have to bind three values in the combined match /
> declaration syntax:
>
> scala> val Decimal(signPart, intPart, fracPart) = "1.0"
> signPart: String = null
> intPart: String = 1
> fracPart: String = .0
My fault … :(
But what about
val s1 = "\"hello world\""
val s2 = "hello world"
val StrValue = """[^"]((\w*\s*)*)""".r
val StrValue(_, ss1) = s1 // XXX Match Error
println(ss1)
>
>
> Randall Schulz
Cheers,
--
Normen Müller
Tue, 2009-09-29, 15:47
#13
Re: Re: Regex question
Normen Müller wrote:
> But what about
>
> val s1 = "\"hello world\""
> val s2 = "hello world"
>
> val StrValue = """[^"]((\w*\s*)*)""".r
>
> val StrValue(_, ss1) = s1 // XXX Match Error
> println(ss1)
You're telling it that a double-quote character at the start of your
string must NOT match. Try this:
val StrValue = """"*([\w\s]*)"*""".r
val StrValue(ss1) = s1
Ciao,
Gordon
Tue, 2009-09-29, 15:57
#14
Re: Re: Regex question
On Sep 29, 2009, at 4:38 PM, Gordon Tyler wrote:
> val StrValue = """"*([\w\s]*)"*""".r
And once more … my fault! You are absolutely right!
Thank you!!
Cheers,
--
Normen Müller
Tue, 2009-09-29, 16:27
#15
Re: Regex question
There are many problems. First, when you do pattern matching such as in the lines where you indicated match errors, the string must be _exactly_ equal to what is matched by findFirstIn. In other words:
(StrValue findFirstIn s1).get == s1
(StrValue findFirstIn s2).get == s2
The second one is true, the first one is not, and that's why the first one gives an error.
The next problem is that, when doing pattern matching, one parameter will be returned for each parenthesis group (aside from those you explicitly flag not to -- see the API docs on Java's Pattern class). Now, StrValue, as you defined it, has two parenthesis groups, but you are only passing one parameter when matching against s2, and that's why that line gives an error.
Finally, the pattern itself is inefficient: ((\w*\s*)*). The problem is that this gives multiple ways of interpreting the same pattern. For instance, "abc" can be interpreted as (\w{1}\s{0}){3} or (\w{3}\s{0}){1}, or various other combinations. You must strive to have your patterns have only one possible match.
What I recommend, to deal with all the problems, is
val StrValue = """^(?:[^"]*")?([^"]*)(?:"[^"]*)?$""".r
scala> val StrValue(ss1) = s1
ss1: String = hello world
scala> val StrValue(ss2) = s2
ss2: String = hello world
This is a rather complex pattern. You may have some fun (or not! :) figuring out what it means, and feeding it sample strings to see how it works.
On Tue, Sep 29, 2009 at 10:39 AM, Normen Müller <normen.mueller@googlemail.com> wrote:
> He,
> I am trying to strip off potential leading/trailing quotation marks ("), but can't get my regex right:
> val s1 = "\"hello world\""
> val s2 = "hello world"
> val StrValue = """[^"]((\w*\s*)*)""".r
> (StrValue findFirstIn s1) foreach (println)
> (StrValue findFirstIn s2) foreach (println)
> val StrValue(_, ss1) = s1 // XXX Match Error
> println(ss1)
> val StrValue(ss2) = s2 // XXX Match Error
> println(ss2)
> Can anyone help me out, please?
> Cheers,
> --
> Normen Müller
--
Daniel C. Sobral
Something I learned in academia: there are three kinds of academic reviews: review by name, review by reference and review by value.
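The two rules from this post (the extractor needs a whole-string match, and one variable per capturing group) can be seen side by side in a short sketch. The object name WholeMatchDemo and the helper strip are made up for illustration; the two patterns are the ones quoted in the thread.

```scala
object WholeMatchDemo {
  // The original pattern: two capturing groups, and it cannot match a leading quote.
  val Original = """[^"]((\w*\s*)*)""".r

  // The recommended pattern: the (?:...) groups are non-capturing,
  // so exactly one value is bound.
  val Fixed = """^(?:[^"]*")?([^"]*)(?:"[^"]*)?$""".r

  // findFirstIn looks for a match anywhere in the string ...
  def found(s: String): Option[String] = Original.findFirstIn(s)

  // ... whereas an extractor pattern only succeeds if the whole string matches.
  def strip(s: String): Option[String] = s match {
    case Fixed(inner) => Some(inner)
    case _            => None
  }
}
```

For the quoted input, found returns only the part between the quotes, which is why the extractor form of the original pattern cannot match the full string.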
Tue, 2009-09-29, 17:27
#16
Re: Regex question
On Tuesday September 29 2009, Daniel Sobral wrote:
> There are many problems. First, when you do pattern matching such as
> the match errors you indicated, the string must be _exactly_ equal to
> what is matched by findFirstIn. In other words:
>
> (StrValue findFirstIn s1).get == s1
> (StrValue findFirstIn s2).get == s2
That would make the "find" part of the method name very poorly chosen.
You describe a complete match, not a find.
This seems to contradict what you say, though:
scala> val s1 = "123abcXYZ"
s1: java.lang.String = 123abcXYZ
scala> val re1 = "abc".r
re1: scala.util.matching.Regex = abc
scala> re1.findFirstIn(s1)
res0: Option[String] = Some(abc)
> ...
>
> For instance, "abc" can be interpreted as (\w{1}\s{0}){3} or
> (\w{3}\s{0}){1}, or various multiple combinations. You must strive to
> have your patterns have only one possible match.
That is technically true, but REs disambiguate these through "maximum
bite" semantics.
> ...
Randall Schulz
Tue, 2009-09-29, 18:27
#17
Re: Regex question
On Tue, Sep 29, 2009 at 1:20 PM, Randall R Schulz <rschulz@sonic.net> wrote:
> On Tuesday September 29 2009, Daniel Sobral wrote:
> > There are many problems. First, when you do pattern matching such as
> > the match errors you indicated, the string must be _exactly_ equal to
> > what is matched by findFirstIn. In other words:
> >
> > (StrValue findFirstIn s1).get == s1
> > (StrValue findFirstIn s2).get == s2
>
> That would make the "find" part of the method name very poorly chosen.
> You describe a complete match, not a find.
I'm describing the condition that must hold true for a pattern match to work.
> This seems to contradict what you say, though:
>
> scala> val s1 = "123abcXYZ"
> s1: java.lang.String = 123abcXYZ
>
> scala> val re1 = "abc".r
> re1: scala.util.matching.Regex = abc
>
> scala> re1.findFirstIn(s1)
> res0: Option[String] = Some(abc)
scala> val re1 = "(abc)".r
re1: scala.util.matching.Regex = (abc)
scala> (re1 findFirstIn s1).get == s1
res30: Boolean = false
scala> re1.findFirstIn(s1)
res31: Option[String] = Some(abc)
scala> val re1(ss1) = s1
scala.MatchError: 123abcXYZ
at .<init>(<console>:9)
at .<clinit>(<console>)
at RequestResult$.<init>(<console>:4)
at RequestResult$.<clinit>(<console>)
at RequestResult$result(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invo...
scala> val re1(ss1) = "abc"
ss1: String = abc
> > ...
> >
> > For instance, "abc" can be interpreted as (\w{1}\s{0}){3} or
> > (\w{3}\s{0}){1}, or various multiple combinations. You must strive to
> > have your patterns have only one possible match.
>
> That is technically true, but REs disambiguate these through "maximum
> bite" semantics.
Which degenerates into exponential-time searches if the match fails, as the engine backtracks through each combination one by one.
> > ...
>
> Randall Schulz
--
Daniel C. Sobral
Something I learned in academia: there are three kinds of academic reviews: review by name, review by reference and review by value.
Tue, 2009-09-29, 18:37
#18
Re: Regex question
He Daniel,
On Sep 29, 2009, at 5:20 PM, Daniel Sobral wrote:
> StrValue= """^(?:[^"]*")?([^"]*)(?:"[^"]*)?$""".r
that's a very complex regex … I don't understand a word ;)
BUT, it works perfectly for me (@thanks to Randall as well … his one was
fine for me as well!)!!! All I want to do is to strip off potential
quotation marks before or after a string. Between the quotation marks
any character is allowed. I just tested your regex and it does
exactly that: THANK YOU!!!
Cheers,
--
Normen Müller
Wed, 2009-09-30, 07:27
#19
Re: Regex question
On Tue, Sep 29, 2009 at 8:20 AM, Daniel Sobral wrote:
>
> val StrValue= """^(?:[^"]*")?([^"]*)(?:"[^"]*)?$""".r
What result is expected with the following inputs?
"r"b"
r"b"
"r"b
"rb""
rb"
"rb
\""rb"
The above regex results with capture group 1:
"r"b" - No match
r"b" - Some(b)
"r"b - Some(r)
"rb"" - No match
rb" - No match
"rb - Some(rb)
\""rb" - Some()
These are all unappealing results to me. I suggest this pattern is
closer to the goal of stripping one leading and/or trailing quote
(plus whitespace):
"""^(?:\s*")?(.*?)(?:"\s*)?$""".r
"r"b" - Some(r"b)
r"b" - Some(r"b)
"r"b - Some(r"b)
"rb"" - Some(rb")
rb" - Some(rb)
"rb - Some(rb)
\""rb" - Some(\""rb)
Requiring the surrounding quotes be balanced (on both sides) gets uglier.
Testing methodology:
scala> """^(?:\s*")?(.*?)(?:"\s*)?$""".r.findFirstMatchIn("\\\"\"rb").map(_.group(1))
res51: Option[String] = Some(\""rb)
Wed, 2009-09-30, 10:07
#20
Re: Regex question
He Robert,
On Sep 30, 2009, at 8:21 AM, J Robert Ray wrote:
> On Tue, Sep 29, 2009 at 8:20 AM, Daniel Sobral
> wrote:
>>
>> val StrValue= """^(?:[^"]*")?([^"]*)(?:"[^"]*)?$""".r
>
> What result is expected with the following inputs?
Actually, these are very good questions. In my scenario a string can
be quoted or not. If the string is quoted, then the quotes have to be
balanced and I just want to extract the string between the quotes no
matter what characters are in between.
I guess the grammar could be something like this, assuming that
``StringLiteral'' accepts any character.
Name ::= '"' StringLiteral '"' | StringLiteral
> "r"b"
>
> r"b"
>
> "r"b
>
> "rb""
>
> rb"
>
> "rb
>
> \""rb"
>
> The above regex results with capture group 1:
>
> "r"b" - No match
>
> r"b" - Some(b)
>
> "r"b - Some(r)
>
> "rb"" - No match
>
> rb" - No match
>
> "rb - Some(rb)
>
> \""rb" - Some()
>
> These are all unappealing results to me. I suggest this pattern is
> closer to the goal of stripping one leading and/or trailing quote
> (plus whitespace):
>
> """^(?:\s*")?(.*?)(?:"\s*)?$""".r
>
> "r"b" - Some(r"b)
>
> r"b" - Some(r"b)
>
> "r"b - Some(r"b)
>
> "rb"" - Some(rb")
>
> rb" - Some(rb)
>
> "rb - Some(rb)
>
> \""rb" - Some(\""rb)
>
> Requiring the surrounding quotes be balanced (on both sides) gets
> uglier.
>
> Testing methodology:
>
> scala> """^(?:\s*")?(.*?)(?:"\s*)?$""".r.findFirstMatchIn("\\
> \"\"rb").map(_.group(1))
> res51: Option[String] = Some(\""rb)
Cheers,
--
Normen Müller
Wed, 2009-09-30, 16:57
#21
Re: Regex question
>>>>> "Normen" == Normen Müller writes:
Normen> Actually, these are very good questions. In my scenario a
Normen> string can be quoted or not. If the string is quoted, then the
Normen> quotes have to be balanced and I just want to extract the
Normen> string between the quotes no matter what characters are in
Normen> between.
dunno if your goal is to improve your understanding of regexes or just
get the job done. if the latter, I think
def unquoted(x: String): String =
  // the length check avoids an exception from head/last on an empty string
  if (x.length >= 2 && x.head == '"' && x.last == '"')
    x.drop(1).dropRight(1)
  else x
is 10x more readable than a regex.
(note: this is 2.8 code, not sure what the most elegant 2.7 version
would be. hooray for the String improvements in 2.8.)
Mon, 2009-11-02, 20:47
#22
Re: Re: parser combinators vs. regex question
On Thu, May 21, 2009 at 10:44 AM, ArtemGr <artemciy@gmail.com> wrote:
> As a side note, it would be good if the
> implicit def regex(r: Regex): Parser[String]
> method in RegexParsers produced a matcher with access to the matched
> groups, instead of just a string.
I have also found the need for this, and have come up with the following solution as a makeshift change, as you can define a parser that does this for you:
def regexMatch(r: Regex): Parser[Match] = Parser { in =>
  regex(r)(in) match {
    case Success(aString, theRest) => Success(r.findFirstMatchIn(aString).get, theRest)
    case f @ Failure(_, _) => f
    case e @ Error(_, _) => e
  }
}
The other solution is to change the implicit regex method in RegexParsers itself.
I was also wondering if there is a better solution to parsing different groups.
Thanks,
Manohar
Tue, 2012-01-10, 04:31
#23
Regex Question
Hi Everyone,
I wasn't able to find an example for what I'm trying to do. How can I convert the following Java code to Scala (and make it more Scala like)?
Pattern mapParser = Pattern.compile("\u0000([^:]*):([^\u0000]*)\u0000");
Map map = new LinkedHashMap();
Matcher matcher = mapParser.matcher(decrypted);
while (matcher.find()) {
map.put(matcher.group(1), matcher.group(2));
}
Thanks!
Tue, 2012-01-10, 18:31
#24
Re: Regex Question
On Tue, Jan 10, 2012 at 01:26, Drew Kutcharian wrote:
> Hi Everyone,
>
> I wasn't able to find an example for what I'm trying to do. How can I convert the following Java code to Scala (and make it more Scala like)?
>
> Pattern mapParser = Pattern.compile("\u0000([^:]*):([^\u0000]*)\u0000");
> Map map = new LinkedHashMap();
> Matcher matcher = mapParser.matcher(decrypted);
> while (matcher.find()) {
> map.put(matcher.group(1), matcher.group(2));
> }
This kind of question works better on codereview.stackexchange.com or
stackoverflow.com. Regardless, here's how you'd do it:
val mapParser = "\u0000([^:]*):([^\u0000]*)\u0000".r
val pairs = for (mapParser(key, value) <- mapParser findAllIn
decrypted) yield key -> value
val map = pairs.toMap
There are some ways in which you can improve the performance, but they
decrease readability a bit.
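One possible variant, sketched under assumptions: it reads the groups straight from each match via findAllMatchIn instead of re-matching every matched substring, and it collects into a mutable LinkedHashMap so that insertion order is kept as in the Java original. Whether this is the kind of performance tweak meant above is a guess, and the names MapParserDemo and parse are made up.

```scala
import scala.collection.mutable.LinkedHashMap
import scala.util.matching.Regex

object MapParserDemo {
  // Same pattern as in the Java code: NUL, key, ':', value, NUL.
  val MapParser: Regex = "\u0000([^:]*):([^\u0000]*)\u0000".r

  // Collect key/value pairs in the order they appear in the input.
  def parse(decrypted: String): LinkedHashMap[String, String] = {
    val map = LinkedHashMap.empty[String, String]
    // Each Match already carries its groups, so nothing is matched twice.
    for (m <- MapParser.findAllMatchIn(decrypted))
      map(m.group(1)) = m.group(2)
    map
  }
}
```

As in the Java loop, a later entry with the same key overwrites the earlier value.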
Thu, 2012-01-12, 03:11
#25
Re: Regex Question
Thanks Daniel.
On Jan 10, 2012, at 9:29 AM, Daniel Sobral wrote:
> On Tue, Jan 10, 2012 at 01:26, Drew Kutcharian wrote:
>> Hi Everyone,
>>
>> I wasn't able to find an example for what I'm trying to do. How can I convert the following Java code to Scala (and make it more Scala like)?
>>
>> Pattern mapParser = Pattern.compile("\u0000([^:]*):([^\u0000]*)\u0000");
>> Map map = new LinkedHashMap();
>> Matcher matcher = mapParser.matcher(decrypted);
>> while (matcher.find()) {
>> map.put(matcher.group(1), matcher.group(2));
>> }
>
> This kind of question works better on codereview.stackexchange.com or
> stackoverflow.com. Regardless, here's how you'd do it:
>
> val mapParser = "\u0000([^:]*):([^\u0000]*)\u0000".r
> val pairs = for (mapParser(key, value) <- mapParser findAllIn
> decrypted) yield key -> value
> val map = pairs.toMap
>
> There are some ways in which you can improve the performance, but it
> decreases a readability a bit.
>
ArtemGr writes:
> I have tried to do something similar with parser combinators,
> e.g. with
> def stringInQuotes = "" | ('"' ~ rep ("(?s).".r) ~ '"'
> ^^ {case _ ~ chars ~ _ => chars.mkString ("")})
> or
> def stringInQuotes = opt ('"' ~ rep (elem("value", (c: Char) => true)) ~ '"')
> ^^ {case None => ""; case Some (_ ~ chars ~ _) => chars.mkString ("")}
>
> but the first doesn't do what expected
> and the second goes into an infinite cycle.
As a side note, it would be good if the
implicit def regex(r: Regex): Parser[String]
method in RegexParsers produced a matcher with access to the matched
groups, instead of just a string.
Then it will be possible to use the regex backtracking features locally inside
the parser combinators.