
Text part vanishes when XML parsed with RNC

4 replies
Ryszard Kubiak

Hello All,

I am experimenting with parsing XML files in the Scala language.
I am particularly interested in parsing XML files in a schema-aware
fashion. I would like to parse XML files using Relax NG schemas.

Unfortunately, I have run into a serious problem when parsing with Relax NG.
Below, I quote a small program in which I adapt the method of RNG-based
validation described at:

http://weblogs.java.net/blog/kohsuke/archive/2006/02/validate_xml_us.html

The problem can be explained on this minimalistic test.rnc grammar:

start = element x {text}

and this little test.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<x>abc</x>

The problem is that the text part 'abc' of the input file disappears.
This happens only when an RNC (or RNG) grammar is used. When the grammar is
converted to XSD with trang, and also when Scala's default parsing method is
used, the text part does end up in the XML document. Here are my results:

[SCL]$ java -jar /home/rysiek/XML/trang-20091111/trang.jar test.rnc test.xsd
[SCL]$ scala -cp .:/home/rysiek/XML/jing-20091111/bin/jing.jar ParseXML test.xml test.rnc test.xsd
xmlRNC=
xmlXSD=abc
xmlDefault=abc

I am reluctant to convert my RNC grammars to XSD using the 'trang'
converter, because some information gets lost during conversion: Relax NG
is more expressive than XSD.

I would appreciate it if you could comment on the situation. Is Scala's
way of interacting with the parser responsible for the effect, or is it
the current Java libraries?

Best Regards
Ryszard Kubiak, Gdańsk, Poland

===

ParseXML.scala
object ParseXML {

  import javax.xml.validation.{SchemaFactory}
  import javax.xml.parsers.{SAXParserFactory, SAXParser}
  import javax.xml.XMLConstants
  import javax.xml.transform.stream.{StreamSource}
  import java.io.{FileInputStream}
  import scala.xml._

  def parseWithRNC(xmlFileName: String, rncFileName: String): Elem = {
    // Register Jing's compact-syntax factory for the RELAX NG schema language.
    val prop = classOf[SchemaFactory].getName + ":" + XMLConstants.RELAXNG_NS_URI
    System.setProperty(prop, "com.thaiopensource.relaxng.jaxp.CompactSyntaxSchemaFactory")
    //System.setProperty(prop, "com.thaiopensource.relaxng.jaxp.XMLSyntaxSchemaFactory")
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.RELAXNG_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new FileInputStream(rncFileName)))
    // Attach the schema to the SAX parser that Scala's loader will use.
    val parserFactory = SAXParserFactory.newInstance()
    parserFactory.setSchema(schema)
    val parser = parserFactory.newSAXParser()
    XML.loadXML(new InputSource(xmlFileName), parser)
  }

  def parseWithXSD(xmlFileName: String, xsdFileName: String): Elem = {
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new FileInputStream(xsdFileName)))
    val parserFactory = SAXParserFactory.newInstance()
    parserFactory.setSchema(schema)
    val parser = parserFactory.newSAXParser()
    XML.loadXML(new InputSource(xmlFileName), parser)
  }

  def main(args: Array[String]): Unit = {
    val xmlRNC = parseWithRNC(args(0), args(1))
    println("xmlRNC=" + xmlRNC.toString)
    val xmlXSD = parseWithXSD(args(0), args(2))
    println("xmlXSD=" + xmlXSD.toString)
    val xmlDefault = XML.loadFile(args(0))
    println("xmlDefault=" + xmlDefault.toString)
  }

}

DaveScala
Re: Text part vanishes when XML parsed with RNC

>I would appreciate it if you could comment on the situation. Is Scala's
>way of interacting with the parser responsible for the effect, or is it
>the current Java libraries?
It is obviously the interaction between Scala's XML loader and the SAX
parser in the method loadXML.

A work-around is to replace this line (it also works for xsd):
XML.loadXML(new InputSource(xmlFileName), parser)

with these two lines:
XML.withSAXParser(parser)
XML.loadFile(xmlFileName)

See my output:
{{{
C:\scala-2.9.0.1\examples>scala jing.Main test.xml test.rnc test.xs
xmlRNC=abc
xmlXSD=abc
xmlDefault=abc
}}}
(Note: the <x> XML tags may have been filtered out by the forum/mailing
system.)

Source:
=====
package jing
import javax.xml.validation.SchemaFactory
import javax.xml.parsers.SAXParserFactory
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import java.io.FileInputStream
import scala.xml._

object Main extends App {

  val xmlRNC = parseWithRNC(args(0), args(1))
  println("xmlRNC=" + xmlRNC.toString)
  val xmlXSD = parseWithXSD(args(0), args(2))
  println("xmlXSD=" + xmlXSD.toString)
  val xmlDefault = XML.loadFile(args(0))
  println("xmlDefault=" + xmlDefault.toString)

  def parseWithRNC(xmlFileName: String, rncFileName: String) = {
    val prop = classOf[SchemaFactory].getName + ":" + XMLConstants.RELAXNG_NS_URI
    System.setProperty(prop, "com.thaiopensource.relaxng.jaxp.CompactSyntaxSchemaFactory")
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.RELAXNG_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new FileInputStream(rncFileName)))
    val parserFactory = SAXParserFactory.newInstance
    parserFactory.setSchema(schema)
    val parser = parserFactory.newSAXParser
    XML.withSAXParser(parser)
    XML.loadFile(xmlFileName)
  }

  def parseWithXSD(xmlFileName: String, xsdFileName: String) = {
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new FileInputStream(xsdFileName)))
    val parserFactory = SAXParserFactory.newInstance
    parserFactory.setSchema(schema)
    val parser = parserFactory.newSAXParser
    XML.withSAXParser(parser)
    XML.loadFile(xmlFileName)
  }
}

DaveScala
Re: Text part vanishes when XML parsed with RNC

>XML.withSAXParser(parser)
>XML.loadFile(xmlFileName)

Probably better:
val xml2 = XML.withSAXParser(parser)
xml2.loadFile(xmlFileName)

But apart from this, I doubt that XML is validating at all...
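
The reason the second form is preferable can be shown with a small sketch. It assumes the scala.xml library is on the classpath (it was bundled with Scala 2.9 and is a separate module in later versions); the object name LoaderSketch and the helper loaderFor are made up for illustration. XML.withSAXParser returns a new loader and leaves the XML object itself untouched, so a call whose result is discarded has no effect on a later XML.loadFile.

```scala
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import scala.xml.{Elem, XML}
import scala.xml.factory.XMLLoader

object LoaderSketch {
  // Hypothetical helper: wraps XML.withSAXParser so the returned loader is kept.
  def loaderFor(parser: SAXParser): XMLLoader[Elem] = XML.withSAXParser(parser)

  def main(args: Array[String]): Unit = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    val loader = loaderFor(parser)
    // The returned loader is a distinct object that holds the given parser.
    println(loader ne XML)
    println(loader.parser eq parser)
    // Parsing has to go through the returned loader to use that parser.
    println(loader.loadString("<x>abc</x>"))
  }
}
```

If this reading is right, the earlier two-line workaround parsed with the default parser, which would explain why the text part reappeared.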

DaveScala
Re: Text part vanishes when XML parsed with RNC

Yep, that's it.
parserFactory.setValidating(true)
brings it back to the same situation. :(

C:\scala-2.9.0.1\examples>scala jing.Main test.xml test.rnc test.xsd
http://relaxng.org/ns/structure/1.0, xmlRNC=
http://www.w3.org/2001/XMLSchema, xmlXSD=abc
xmlDefault=abc

I now strongly suspect the implementation of the handler of
http://relaxng.org/ns/structure/1.0 in the library.

Used libraries:
===========
isorelax.jar 2004/11/11, Java1.2 bytecode version 46.0
jing.jar 2009/11/11, Java1.5 bytecode version 49.0
saxon.jar 2005/11/24, Java 1.0.2 bytecode version 45.3
xercesImpl.jar 2007/09/14, Java 1.0.2 bytecode version 45.3
xml-api.jar 2006/11/19, Java 1.0.2 bytecode version 45.3

DaveScala
Re: Text part vanishes when XML parsed with RNC

setValidating(true) is only for DTD validation! I can hardly believe
this. So it is in accordance with the specification that it does not
validate against the schema. It is better to leave this set to false, use
only the schema validator, and catch the error.

==========
see API:
http://download.oracle.com/docs/cd/E17802_01/webservices/webservices/doc...
setValidating
public void setValidating(boolean validating)

Specifies that the parser produced by this code will validate documents
as they are parsed. By default the value of this is set to false.

Note that "the validation" here means a validating parser as defined
in the XML recommendation. In other words, it essentially just
controls the DTD validation. (except the legacy two properties defined
in JAXP 1.2.)

To use modern schema languages such as W3C XML Schema or RELAX NG
instead of DTD, you can configure your parser to be a non-validating
parser by leaving the setValidating(boolean) method false, then use
the setSchema(Schema) method to associate a schema to a parser.

Parameters:
validating - true if the parser produced by this code will validate
documents as they are parsed; false otherwise.
==========
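
The quoted Javadoc can be illustrated with a minimal sketch. It uses only the JDK's built-in XSD support rather than Jing, and the inline schema, the object name SetSchemaSketch, and the helper schemaErrors are assumptions made up for illustration. The factory is left non-validating, the schema is attached with setSchema, and schema violations are collected through the handler's error callback.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.parsers.SAXParserFactory
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

object SetSchemaSketch {
  // Hypothetical inline schema: a single element x holding plain text.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="x" type="xs:string"/>
      |</xs:schema>""".stripMargin

  // Parses the document and returns the messages of all schema errors.
  def schemaErrors(xml: String): List[String] = {
    val schema = SchemaFactory
      .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
      .newSchema(new StreamSource(new StringReader(xsd)))
    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(true) // needed for schema validation
    factory.setValidating(false)    // false = no DTD validation
    factory.setSchema(schema)       // schema validation is enabled here
    val errors = ListBuffer.empty[String]
    val handler = new DefaultHandler {
      // Recoverable errors, including schema violations, arrive here.
      override def error(e: SAXParseException): Unit = errors += e.getMessage
    }
    factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler)
    errors.toList
  }

  def main(args: Array[String]): Unit = {
    println(schemaErrors("<x>abc</x>"))
    println(schemaErrors("<x><y/></x>"))
  }
}
```

Without an error callback, DefaultHandler silently ignores recoverable errors, so an invalid document would parse without any visible complaint.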

The updated version then. This is the way I would do it in Scala:

package jing
import javax.xml.validation.SchemaFactory
import javax.xml.parsers.SAXParserFactory
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import java.io.FileInputStream
import scala.xml._

object Main extends App {

  val xmlRNC = parseWithRNC(args(0), args(1))
  print(XMLConstants.RELAXNG_NS_URI)
  println(", xmlRNC=" + xmlRNC.toString)
  val xmlXSD = parseWithXSD(args(0), args(2))
  print(XMLConstants.W3C_XML_SCHEMA_NS_URI)
  println(", xmlXSD=" + xmlXSD.toString)
  val xmlDefault = XML.loadFile(args(0))
  println("xmlDefault=" + xmlDefault.toString)

  def parseWithRNC(xmlFileName: String, rncFileName: String) = {
    val prop = classOf[SchemaFactory].getName + ":" + XMLConstants.RELAXNG_NS_URI
    System.setProperty(prop, "com.thaiopensource.relaxng.jaxp.CompactSyntaxSchemaFactory")
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.RELAXNG_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new FileInputStream(rncFileName)))
    val parserFactory = SAXParserFactory.newInstance
    parserFactory.setSchema(schema)
    val validator = schema.newValidator
    try {
      // Validate first; load with the default parser only if validation passes.
      validator.validate(new StreamSource(xmlFileName))
      XML.loadFile(xmlFileName)
    } catch {
      // On failure the exception text is returned instead of the document.
      case e: Exception => e.toString
    }
  }

  def parseWithXSD(xmlFileName: String, xsdFileName: String) = {
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema = schemaFactory.newSchema(new StreamSource(new FileInputStream(xsdFileName)))
    val parserFactory = SAXParserFactory.newInstance
    parserFactory.setSchema(schema)
    val validator = schema.newValidator
    try {
      validator.validate(new StreamSource(xmlFileName))
      XML.loadFile(xmlFileName)
    } catch {
      case e: Exception => e.toString
    }
  }
}

Output (note: xml tags are filtered away by the forum/mailing system)
==================================================
C:\scala-2.9.0.1\examples>scala jing.Main test.xml test.rnc test.xsd
http://relaxng.org/ns/structure/1.0, xmlRNC=org.xml.sax.SAXParseException: element "y" not allowed anywhere; expected the element end-tag or text
http://www.w3.org/2001/XMLSchema, xmlXSD=org.xml.sax.SAXParseException: cvc-type.3.1.2: Element 'x' is a simple type, so it must have no element information item [children].
xmlDefault=

C:\scala-2.9.0.1\examples>scala jing.Main test.xml test.rnc test.xsd
http://relaxng.org/ns/structure/1.0, xmlRNC=abc
http://www.w3.org/2001/XMLSchema, xmlXSD=abc
xmlDefault=abc
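
The validate-first approach can also be sketched so that the result has a single type instead of being either a document or an error string. This is only an illustration under stated assumptions: it uses the JDK's XSD support rather than Jing, and the inline schema, the object name ValidateFirstSketch, and the helper validated are hypothetical. With no error handler installed, Validator.validate throws a SAXException on the first schema violation, which is what the catch relies on.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.{Schema, SchemaFactory}
import org.xml.sax.SAXException

object ValidateFirstSketch {
  // Hypothetical inline schema: a single element x holding plain text.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="x" type="xs:string"/>
      |</xs:schema>""".stripMargin

  val schema: Schema = SchemaFactory
    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    .newSchema(new StreamSource(new StringReader(xsd)))

  // Right(xml) if the document is valid, Left(message) otherwise.
  def validated(xml: String): Either[String, String] =
    try {
      // A fresh validator per call; validators are not thread-safe.
      schema.newValidator().validate(new StreamSource(new StringReader(xml)))
      Right(xml)
    } catch {
      case e: SAXException => Left(e.getMessage)
    }

  def main(args: Array[String]): Unit = {
    println(validated("<x>abc</x>"))
    println(validated("<x><y/></x>"))
  }
}
```

A caller can then hand the Right value to XML.loadString or XML.loadFile, so that parsing happens only for documents that passed validation.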

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland