XML design, part 4: Additional features

Jürgen Purtz

Here is a short list of possible additional features:

- Validating parsers with schemas coming from DTD, XML Schema, RELAX NG,
Schematron, ...
- Data binding (JAXB) for schemas coming from XML Schema, RELAX NG,
Schematron, ...
- XQuery support
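
As an illustration of the first item, here is a minimal sketch of schema validation from Scala using the JDK's javax.xml.validation API. It covers W3C XML Schema only; RELAX NG or Schematron would need a third-party implementation. The object name, the helper `isValid`, and the tiny `note` schema are assumptions made up for this example.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import scala.util.Try

object SchemaValidationSketch {
  // Hypothetical schema: a <note> element with exactly one <to> child holding a string.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="note">
      |    <xs:complexType>
      |      <xs:sequence>
      |        <xs:element name="to" type="xs:string"/>
      |      </xs:sequence>
      |    </xs:complexType>
      |  </xs:element>
      |</xs:schema>""".stripMargin

  /** Returns true if `xml` is well-formed and valid against the schema above. */
  def isValid(xml: String): Boolean = {
    val factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema = factory.newSchema(new StreamSource(new StringReader(xsd)))
    // With no ErrorHandler installed, validate() throws on the first validation error.
    Try(schema.newValidator().validate(new StreamSource(new StringReader(xml)))).isSuccess
  }
}
```

A Scala-level API would presumably wrap this so that validation errors come back as values rather than exceptions.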

Cheers, Jürgen

Anthony B. Coates
XML design: my thoughts

Some good thoughts have been posted today. I thought I would add a few
comments.

People often underrate the complexity of writing a fully correct XML
parser. The basic angle bracket stuff is easy, but things around
character encodings and namespace handling are harder, and certainly easy
to get wrong. For that reason, I have a lot of sympathy for the idea that
Scala should consider using well-tested parsers like Xerces-J or the .NET
parser. For Java, that would probably mean bundling a copy of Xerces, as
the one built into the JDK has been tinkered with by Sun, and all reports
that I've heard are that the tinkering has introduced problems. Certainly
if taking this approach, you would want a JAXP-style mechanism that allows
users to specify an alternative parser implementation.
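
To make the JAXP point concrete, here is a minimal sketch of calling a JAXP-resolved SAX parser from Scala. The object and method names are hypothetical; only the javax.xml.parsers and org.xml.sax calls are the real JDK API.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

object JaxpSketch {
  /** Parses `xml` with whichever SAX parser JAXP resolves and returns the
    * (namespace URI, local name) of every start tag, in document order. */
  def elementNames(xml: String): List[(String, String)] = {
    // newInstance() consults the JAXP lookup rules (system property, service
    // provider files) before falling back to the JDK's built-in parser.
    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(true)
    val seen = List.newBuilder[(String, String)]
    val handler = new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit = {
        seen += ((uri, localName))
        ()
      }
    }
    factory.newSAXParser().parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler)
    seen.result()
  }
}
```

With Xerces-J on the classpath, setting the system property javax.xml.parsers.SAXParserFactory to org.apache.xerces.jaxp.SAXParserFactoryImpl should switch the implementation without touching this code, which is the kind of pluggability described above.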

Does Scala need its own XML parser, written in Scala? To be honest, I
think one of the strengths of Scala is that it interoperates with existing
Java and .NET code, and doesn't require you to rewrite everything from
scratch. I don't think most people would expect Scala to be rewriting
everything that is out there, any more than they would expect a complete
Scala re-implementation of the full Java or .NET API. So nobody is going
to mind much, I would argue, if Scala is focussed on making XML easier to
use than Java/C#, rather than focussed on providing a different parsing
codebase.

The desire for Java/.NET cross-platform consistency and compatibility is a
very worthy one. If it were to seem too hard to make, for example,
Xerces-J and the .NET XML parser work sufficiently consistently, another
alternative would be to convert Xerces-J into .NET CLR code using iKVM.
That is what Mike Kay does for the Saxon XSLT/XQuery engine, and by all
accounts the results are very good; iKVM apparently removes any business
case there may ever have been for a hand-coded .NET implementation of
Saxon, so good is the conversion. Perhaps that would be a better route
for Scala to take, using iKVM to port existing Java libraries to .NET,
rather than trying to reimplement those libraries in Scala.

There was some discussion of the DOM API. DOM has the advantage of
ubiquity (it's available everywhere), but it's not a good choice as an
internal API, as it is a well-known memory hog. If you look at XSLT
engines like Xalan and Saxon, which have been dealing with this issue for
a long time, they have developed alternative tree APIs that provide a
similar kind of API to the DOM, but with a significantly smaller memory
footprint. It might be better to look at re-using an implementation like
that, or otherwise to look at XOM, perhaps the most correct of the
"traditional" Java XML APIs.

As well as ease of use when working with XML, it is clear that Scala needs
to provide an immutable XML API as well as the usable mutable APIs. That
might require some work when re-using existing mutable APIs, either to
make them appear immutable, or to modify them to provide the necessary
immutability, without completely re-implementing them in Scala. I guess
it's worth noting that .NET does provide a read-only XML API that provides
XPath access to information, and that read-only API has a notably smaller
memory footprint than .NET's DOM API.
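
One simple way to get immutability on top of an existing mutable API is to copy the parsed DOM into immutable case classes once, then discard the DOM. The sketch below assumes that approach; the types `XNode`, `XElem` and `XText` are hypothetical and are not part of any Scala library. Copying costs a second pass over the tree, so it is only one of the options mentioned above, not a wrapper that makes the DOM itself appear immutable.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Text}

object ImmutableTreeSketch {
  // Hypothetical immutable node types for this example.
  sealed trait XNode
  final case class XText(value: String) extends XNode
  final case class XElem(name: String,
                         attrs: Map[String, String],
                         children: List[XNode]) extends XNode

  // Copies an element and its descendants; comments and processing
  // instructions are dropped to keep the sketch short.
  private def fromElement(e: Element): XElem = {
    val as = e.getAttributes
    val attrs = (0 until as.getLength).map { i =>
      val a = as.item(i)
      a.getNodeName -> a.getNodeValue
    }.toMap
    val kids = e.getChildNodes
    val children: List[XNode] = (0 until kids.getLength).toList.flatMap { i =>
      kids.item(i) match {
        case c: Element => Some(fromElement(c))
        case t: Text    => Some(XText(t.getData))
        case _          => None
      }
    }
    XElem(e.getTagName, attrs, children)
  }

  /** Parses with the default JAXP DOM parser, then converts to the immutable tree. */
  def parse(xml: String): XElem = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    fromElement(doc.getDocumentElement)
  }
}
```

"Updates" are then ordinary `copy` calls, for example `elem.copy(attrs = elem.attrs + ("id" -> "2"))`, which share the unchanged children with the original tree.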

The idea that Scala should provide a more powerful path syntax, perhaps
full XPath, is one I would certainly support. At the same time, it would
be good if that XPath syntax was not just XML-specific, but could also be
applied to sequence structures generally, e.g. to JSON in Scala map form.
XPath for Java already exists in terms of a couple of APIs, such as Jaxen,
so it wouldn't be breaking new ground to look at something like this.
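
For reference, this is roughly what XPath access looks like today when the JDK's javax.xml.xpath API is called from Scala. It is a sketch of the existing XML-only API, not of the generalised path syntax suggested above, and it uses the JDK implementation rather than Jaxen. The helper name `select` is an assumption for this example.

```scala
import java.io.StringReader
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.w3c.dom.NodeList
import org.xml.sax.InputSource

object XPathSketch {
  /** Evaluates `expr` against `xml` and returns the text content of each
    * selected node, in document order. */
  def select(xml: String, expr: String): List[String] = {
    val xpath = XPathFactory.newInstance().newXPath()
    val nodes = xpath
      .evaluate(expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET)
      .asInstanceOf[NodeList] // evaluate() returns Object; NODESET yields a NodeList
    (0 until nodes.getLength).map(i => nodes.item(i).getTextContent).toList
  }
}
```

The cast and the index-based loop are the kind of friction a Scala-level path syntax could remove.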

One area I would like to get the chance to deal with, at some stage, is
making it convenient to work with huge XML documents. More and more I see
people trying to process XML documents that are 10 MB, 100 MB, 1 GB, 10 GB, ...
In many cases, there shouldn't be a real technical problem in processing
documents of this size, any more than there would be in processing a text
file of that size. The problems that people often have are due to using
tools that insist on loading all of the XML into memory before it can be
processed. Providing some kind of memory-paged XML API would be a benefit
to Java and .NET users generally, and might convince a few to make the
jump to Scala.
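
A memory-paged tree API does not exist in the JDK, but the underlying point (that size alone need not be a problem) can be illustrated with the StAX pull parser, which reads one event at a time and never builds a tree. The sketch below assumes a simple counting task; the object and method names are hypothetical.

```scala
import java.io.InputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StreamingSketch {
  /** Counts start tags with the given local name while streaming through `in`.
    * Memory use stays roughly constant regardless of document size. */
  def countElements(in: InputStream, localName: String): Long = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    try {
      var count = 0L
      while (reader.hasNext) {
        // next() advances to the following event and returns its type code.
        if (reader.next() == XMLStreamConstants.START_ELEMENT &&
            reader.getLocalName == localName) count += 1
      }
      count
    } finally reader.close()
  }
}
```

Streaming gives up random access, which is exactly what a paged tree API would add back, so this is a substitute technique rather than the feature proposed above.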

Anyway, that's enough for now. I just wanted to throw out a few ideas and
thoughts, and see what resonates with people and what doesn't.

Cheers, Tony.

Meredith Gregory
Re: XML design: my thoughts
Dear SX'ers,
Has anyone taken a good look at CDuce and OCamlDuce? One of the most important aspects of this work is the integration of the types expressible in XML Schema with the types expressible in the host language. This would make schema validation and type checking converge. This is a really, really good thing, imho.
Best wishes,
--greg

On Thu, Dec 10, 2009 at 1:37 PM, Anthony B. Coates (Londata) <abcoates@londata.com> wrote:
[Full quote of the preceding message snipped.]

Stefan Zeiger
Re: XML design: my thoughts

Anthony B. Coates (Londata) wrote:
> People often underrate the complexity of writing a fully correct XML
> parser. The basic angle bracket stuff is easy, but things around
> character encodings and namespace handling are harder, and certainly
> easy to get wrong. For that reason, I have a lot of sympathy for the
> idea that Scala should consider using well-tested parsers like
> Xerces-J or the .NET parser. For Java, that would probably mean
> bundling a copy of Xerces, as the one built into the JDK has been
> tinkered with by Sun, and all reports that I've heard are that the
> tinkering has introduced problems. Certainly if taking this approach,
> you would want a JAXP-style mechanism that allows users to specify an
> alternative parser implementation.
Or just use JAXP directly and default to the parser that comes with
Java. That should be good enough for most users. If someone needs a
different parser, I don't think a cross-platform Scala API for setting
it would help much. Most likely, the parser is platform-specific, and
with a custom API you'd need an additional adapter to use it with Scala.

-sz

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland