This page is no longer maintained — Please continue to the home page at www.scala-lang.org

Of possible interest: Regular expressions library

1 reply
Kenneth McDonald
Joined: 2009-01-11,
User offline. Last seen 42 years 45 weeks ago.

Appended is the code for a regular expressions library. It's quite
crude, with incomplete documentation and minimal testing, but I
thought it might be useful to some. For a fuller understanding of what
it does, read the preamble and check out the tests at the end.
Comments welcome, but I'm time-constrained, so probably can't do a lot
more on it. (It was mostly an exercise for learning Scala.)

Cheers,
Ken

==========================================
/**
Before becoming more formal, let me give you an example of how rex
makes it
easy to build complex regular expressions from simpler ones. Here's
how to build
a regular expression that recognizes complex numbers:

val posInt = CharRange('0','9')*1
val sign = Lit("+")|"-"
val optionalSign = sign.optional
val floatPat = optionalSign + posInt + (Lit(".") + posInt).optional
val complex = floatPat.group("re") + sign.group("op") +
floatPat.group("im") + "i"

When the complex pattern is used to find a complex number, the real
and imaginary parts,
and operator, can be pulled out of the MatchResult by name: 're',
'im', and 'op'.

By the way, I haven't run the above code, so it may have some
errors. But you can
see real code in the test suite at the end of this file.

Hopefully that's whetted your appetite. Here's a more formal
description of the
package.

The rex package provides an enhanced interface to regular
expressions. It
has four major goals:

1) Allow regular expressions to be built in small pieces and then
easily
composed into larger regular expression. This enhances readability,
and makes
it easy to test the subparts of a complex regular expression as
standalone entities.
Separate regular expressions can be used to build new regex's with
various operators
such as *, +, and |, and further operators exist for character
classes.

2) Allow groups to be named when they are defined, and then accessed
by their
names. This is different from Scala, where the onus is on the
programmer to ensure
that (if provided) a list of group names correctly corresponds to
the parentheses
in a regular expression.

3) Provide a more flexible mechanism for iterating through a
"findAllIn" regular
expression search. In particular, when a regular expression is used
to iterate
over a target string, both matches and non-matches are returned as
MatchResult
objects; the value of a {@literal matched} boolean variable
determines whether
this represents a successful match or not. This permits one to
easily process
all of a string, or just those parts of it that matched, or just
those parts that
didn't.

4) Provide useful predefined regular expressions. For example, there
is a
predefined {@literal PatFloat} that will match floating point
numbers. Since
regular expressions can be easily combined, the regex for a complex
number now
becomes just (@literal PatFloat.group("re") +
(Lit("-")|"+").group("sign") + PatFloat.group("im") + "i"}.
*/

package rex

import scala.util.matching.Regex
import scala.collection.immutable.HashMap
import scala.collection.mutable.ArrayBuffer
import org.scalatest.Suite

//== Matcher
===================================================================
protected object Matcher {
def anonGroup(m:Matcher) = "(?:" + m.pattern + ")"

/** Add a backquote immediately before any characters that have a
special
meaning in character classes. */
def backQuoteCharClassSpecials(s:String):String =
(for (c <- s) yield (if ("[]-^&\\.{}".indexOf(c) >= 0) "\\" + c
else c)).mkString("")

def backQuoteLiteralSpecials(s:String):String =
(for (c <- s) yield (if ("[]^$\\.{,}|+*?".indexOf(c) >= 0) "\\" +
c else c)).mkString("")

def assembleCharClass(set1:String, negated:Boolean, set2:String)
= {
val sign = if (negated) "^" else ""
val clause2 = if (set2.length==0) "" else "&&[" + sign + set2
+ "]"
"[" + set1 + clause2 + "]"
}
}

/** A class to make it easier to build and use regular expressions */
protected class Matcher(string:String) {
val regex = new Regex(string)
//lazy val engine = java.util.regex.Pattern.compile(string)
//---------------------------
def nameToGroupNumber = Map[String, Int]()
def groupCount = 0

def pattern = regex.pattern.pattern()
def mkString = pattern

/** Matches instances of this followed by other */
def +(other:Matcher) = new BinopMatcher(this, "", other)
def +(other:String):Matcher = this + Lit(other)

/** Matches instances of this or other. */
def |(other:Matcher) = new BinopMatcher(this, "|", other)
def |(other:String):Matcher = this | Lit(other)

/** Matches count or more instances of this */
def *(count:Int) = new Matcher(anonGroup + "{" + count + ",}")
def *?(count:Int) = new Matcher(anonGroup + "{" + count + ",}?")
def *+(count:Int) = new Matcher(anonGroup + "{" + count + ",}+")
/** Matches at least m and at most n instances of this, where m
and n are
the two elements of the tuple argument. */
def *(count:Tuple2[Int,Int]) = new Matcher(anonGroup + "{" +
count._1 + "," + count._2 + "}")
def *?(count:Tuple2[Int,Int]) = new Matcher(anonGroup + "{" +
count._1 + "," + count._2 + "}?")
def *+(count:Tuple2[Int,Int]) = new Matcher(anonGroup + "{" +
count._1 + "," + count._2 + "}+")

def optional = this * (0,1)

/** Make this pattern into a named group; the contents of the
group,
from a match, will be retrieved by its name, not by a number. */
def group(name:String) = new GroupMatcher(this, name)

def replaceAllIn(target:String, replacement:String) =
regex.replaceAllIn(target, replacement)

/** Returns true if this exactly matches the string, false
otherwise. */
def ~~=(target:String):Boolean = {
regex.findFirstIn(target) match {
case None => false
case Some(m) => m == target
}
}

/** Returns true if this matches somewhere in target, false
otherwise. */
def ~=(target:String):Boolean = {
regex.findFirstIn(target) match {
case None => false
case Some(m) => true
}
}

def !~=(target:String) = !(this ~= target)
def !~~=(target:String) = !(this ~~= target)

def findFirst(target:String) : Option[MatchResult] = {
regex.findFirstMatchIn(target) match {
case None => None
case Some(m) => Some(new MatchResult(true, m, null, this))
}
}

private def anonGroup = Matcher.anonGroup(this)

/** This gives an iterator that iterates over both successful and
failed
matches. (A failed match is the text between two successful
matches.) In this
way, you have access to all the text in the target string.

To see how to access results look at MatchResult.
*/
def findAllIn(target:String):MatchResultIterator =
new MatchResultIterator(target,
regex.findAllIn(target).matchData, this)
}

protected class BinopMatcher(val m1:Matcher, op:String, val
m2:Matcher) extends
Matcher(Matcher.anonGroup(m1) + op + Matcher.anonGroup(m2)) {

/** Number of non-anonymous (counting) groups in this regular
expression. */
override def groupCount = m1.groupCount + m2.groupCount

/** Given a group name, return its corresponding group number.

@see Matcher.group(name)
*/
override def nameToGroupNumber = m1.nameToGroupNumber ++
m2.nameToGroupNumber.map(x => (x._1, x._2 + m1.groupCount))
}

protected class GroupMatcher(pat:Matcher, val name:String) extends
Matcher("("+pat.pattern+")") {
override val nameToGroupNumber:Map[String, Int] = Map(name -> 1) ++
pat.nameToGroupNumber.map(x => (x._1, x._2 + 1))

override def groupCount = 1 + pat.groupCount
}

/** Define a literal pattern. This pattern is quoted internally, so no
characters
have special meanings in it. */
case class Lit(lit:String) extends
Matcher(Matcher.backQuoteLiteralSpecials(lit)) //
java.util.regex.Pattern.quote(lit)) //
Matcher.backQuotePatternLiterals(lit)) //

case object BndryLineStart extends Matcher("^")
case object BndryLineEnd extends Matcher("$")
case object BndryWord extends Matcher("\\b")
case object BndryNonWord extends Matcher("\\B")
case object BndryStringStart extends Matcher("\\A")
case object BndryStringEndExceptTerminator extends Matcher("\\Z")
case object BndryStringEnd extends Matcher("\\z")
case object BndryPreviousMatchEnd extends Matcher("\\G")

//== CHARACTER CLASSES
=========================================================

/** A final char class is typically of form [...&&[...]] or
[...&&[^...]], and
once formed, cannot be further combined with any other char classes.
*/
protected case class FinalCharClass(set1:String, negated:Boolean,
set2:String) extends
Matcher(Matcher.assembleCharClass(set1, negated, set2)) {
def this(set1:String) = this(set1, false, "")
}

/** Defines a character class that matches any of the characters in
the string.
The string is quoted internally, so no characters have special
meanings. */
case class CharSet(set:String) extends
RawCharClass(Matcher.backQuoteCharClassSpecials(set)) {

}

/** Defines a range of characters. Characters with special meanings
are backquoted
internally, so you do not need to backquote them */
case class CharRange(start:Char, end:Char) extends
RawCharClass(Matcher.backQuoteCharClassSpecials(start.toString) +
"-" + Matcher.backQuoteCharClassSpecials(end.toString))

/** A RawCharClass does not perform escaping on the string passed to it.
When used as a pattern, it simply wraps the string in "[]" */
protected class RawCharClass(val chars:String) extends Matcher("[" +
chars + "]") {
/** Char class subtraction; things in this but not in 'notin' */
def -(notin:RawCharClass):FinalCharClass =
FinalCharClass(this.chars, true, notin.chars)
def -(notin:String):FinalCharClass = this - CharSet(notin)

/** Char class intersection. This produces a char class that
matches
*chars that are in the intersection of 'this' and 'alsoIn'. For
reasons
*to complex to go into, once this operation has been used to
produce a char
*class, that char class cannot be used in other char class
operations (but
*it can of course be used in normal rex operations. */
def /\(alsoIn:RawCharClass) = FinalCharClass(this.chars, false,
alsoIn.chars)

/** Char class union. Use this to get a char class that has
"special" character symbols in it.

eg. CharSet("abc") \/ CharSpace */
def \/(orIn:RawCharClass) = new RawCharClass(this.chars +
orIn.chars)
}
case object CharUpper extends RawCharClass("\\p{Upper}")
case object CharLower extends RawCharClass("\\p{Lower}")
case object CharASCII extends RawCharClass("\\p{ASCII}")
case object CharAlpha extends RawCharClass("\\p{Alpha}")
case object CharDigit extends RawCharClass("\\p{Digit}")
case object CharAlnum extends RawCharClass("\\p{Alnum}")
case object CharPunct extends RawCharClass("\\p{Punct}")
case object CharGraph extends RawCharClass("\\p{Graph}")
case object CharPrint extends RawCharClass("\\p{Print}")
case object CharBlank extends RawCharClass("\\p{Blank}")
case object CharCntrl extends RawCharClass("\\p{Cntrl}")
case object CharXDigit extends RawCharClass("\\p{XDigit}")
case object CharSpace extends RawCharClass("\\p{Space}")

case object CharNonDigit extends RawCharClass("\\D")
case object CharWhitespace extends RawCharClass("\\s")
case object CharNonWhitespace extends RawCharClass("\\S")
case object CharWord extends RawCharClass("\\w")
case object CharNonWord extends RawCharClass("\\W")
/** Matches any character, including newlines. */
case object CharAny extends RawCharClass("\\s\\S")
/** Consume as many characters as possible. */
case object PatMaxChars extends Matcher("[\\s\\S]*")
/** Consume as few characters as possible */
case object PatMinChars extends Matcher("[\\s\\S]*?")
/** Consume as much whitespace as possible (possibly none). */
case object PatMaxWhitespace extends Matcher("[\\s]*")
/** Consume as much whitespace as possible, ensuring there is at least
some whitespace. */
case object PatSomeWhitespace extends Matcher("[\\s]+")
case object PatPosInt extends Matcher("[0-9]+")
case object PatSignedInt extends Matcher("(?:\\+|-)?[0-9]+")
case object PatPosFloat extends Matcher("[0-9]+(?:.[0-9]*)?")
case object PatFloat extends Matcher("(?:\\+|-)?[0-9]+(?:.[0-9]*)?")

//== MatchResult
===============================================================

/** Returns the results of a match. This can either be a successful or
unsuccessful match,
which permits MatchResults to return the strings "between" the matches
when iterating a
Matcher over a target string.

@param matched true if this was a successful match, false otherwise.
@see The testing code for simple examples.
*/
class MatchResult(val matched:Boolean, private val m:Regex.Match,
private val s:String, private val matcher:Matcher) {

/** Retrieve a group by its name */
def group(name:String) = m.group(matcher.nameToGroupNumber(name))
def group(groupNum:Int) =
if (!matched && groupNum == 0) s
else m.group(groupNum)
override def toString = this.group(0)
def string = this.group(0)
}

class MatchResultIterator(val target:String, val
matches:Iterator[Regex.Match], val matcher:Matcher) extends
Iterator[MatchResult] {
private var nextStart = 0
private var lastEnd = 0
private var nextMatch:Regex.Match = null
def hasNext = nextMatch != null || matches.hasNext || lastEnd <
target.length
private def substring(start:Int, end:Int) = {
lastEnd = end
new MatchResult(false, null, target.substring(start, end),
matcher)
}
private def nextmatch() = {
lastEnd = nextMatch.end
val tmp = nextMatch
nextMatch = null
new MatchResult(true, tmp, null, matcher)
}
def next = {
if (nextMatch != null) {
if (nextMatch.start > lastEnd) substring(lastEnd,
target.length)
else nextmatch()
} else if (!matches.hasNext) {
// At this point, we know that lastEnd < target.length,
otherwise
// hasNext would have returned false and next would
therefore not have
// been called.
substring(lastEnd, target.length)
} else {
nextMatch = matches.next
if (lastEnd < nextMatch.start) substring(lastEnd,
nextMatch.start)
else nextmatch
}
}
}

class ElementSuite extends Suite {
def testMatchingOperators() {
assert(Lit("Hello")~~="Hello")
// ~~= denotes exact match
assert (!(Lit("Hello")~~="Hey, Hello"))
// ~= denotes match within the target string
assert(Lit("ello")~="Hello")
}

def testEscaping() {
assert(Lit("[]^$\\.{,}|+*?") ~~= "[]^$\\.{,}|+*?")
assert(CharSet("[]-^&\\.{}")*0 ~~= "[]-^&\\.{}")
}

def testCharSet() {
assert(CharSet("abc") ~= "cde")
assert(CharSet("abc") !~= "de")
// "Special" characters are internally escaped.
assert(CharSet("-") ~= "-")
assert((CharSet("abc") - CharSet("cde")) !~= "c")
assert((CharSet("abc") - "c") !~= "c")
assert((CharAny * 0) ~~= "\na.,*")
assert("[\\s\\S&&[^\\p{ASCII}]]" === (CharAny -
CharASCII).pattern)
}

def testComplexFloatPat() {
// A complex is a float followed by a + or - followed by a
float, followed by an "i"
// The two numeric parts and the sign are named for access.
val complexMatcher = PatFloat.group("re") +
(Lit("-")|"+").group("sign") + PatFloat.group("im") + "i"
/* Match against a floating-point complex number and print
the result. */
val result = complexMatcher.findFirst("3.2+4.5i") match {
case None => None
case Some(m) => Some(m.group("re") + " " +
m.group("sign") + " " + m.group("im") + "i")
}
assert(Some("3.2 + 4.5i") === result)
}

def testBoundaryObjects() {
assert(BndryStringStart + Lit("a")*0 + BndryStringEnd ~~=
"aaa")
assert(BndryStringStart + Lit("a")*(0,2) + BndryStringEnd !
~~= "aaa")
}

def testHTMLTagPat() {
val tagPat = (Lit("<") + CharAny *? 1 + ">").group("tag")
assert(tagPat ~~= "")
val minFind = tagPat.findFirst("
") match {
case None => ""
case Some(m) => m.group("tag")
}
assert(minFind === "
")
}

def testFindAllIn() {
assert( (for(m <- Lit("a").findAllIn("aabbaba")) yield
m.string).mkString("") === "aabbaba")
assert( (for(m <- Lit("a").findAllIn("babbaba")) yield
m.string).mkString("") === "babbaba")
assert( (for(m <- Lit("a").findAllIn("aabbabb")) yield
m.string).mkString("") === "aabbabb")
assert( (for(m <- Lit("a").findAllIn("aabbabb") if
(m.matched)) yield m.string).mkString("") === "aaa")
assert( (for(m <- Lit("a").findAllIn("aabbabb") if (!
m.matched)) yield m.string).mkString("") === "bbbb")
}

def testReplaceAllIn() {
val tagPat = Lit(""
val target = "HelloGoodbyeNow"
assert (tagPat.replaceAllIn(target, "") ===
"HelloGoodbyeNow")
}
}

hrj
Joined: 2008-09-23,
User offline. Last seen 4 years 3 weeks ago.
Re: Of possible interest: Regular expressions library

Kenneth McDonald wrote:

> Appended is the code for a regular expressions library. It's quite
> crude, with incomplete documentation and minimal testing, but I
> thought it might be useful to some. For a fuller understanding of what
> it does, read the preamble and check out the tests at the end.
> Comments welcome, but I'm time-constrained, so probably can't do a lot
> more on it. (It was mostly an exercise for learning Scala.)

Looks good, though haven't tested it yet. Thanks for sharing it.

You could avoid the littering of Lit, by defining an implicit conversion of String => Lit

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland