
which package to use to process large xml files (>50M)

9 replies
Tobias Wunner
Joined: 2010-06-01

Hi,

Using "scala.xml.XML.loadFile" I have reached the limits of my machine. Assuming that I have increased the
heap space correctly [1], I am constantly getting "java.lang.OutOfMemoryError: Java heap space". The XML
file I am trying to process is very large (70M). Is there any other XML parser I could use? Ideally one
which provides just as simple a syntax for formulating queries :) Or is there any way to estimate how much
memory I would actually need to process an XML file of such size (it has roughly 3 million XML elements)?

Cheers,
Tobias

References
[1] scala -DJAVA_OPTS="-Xmx3g"

Stephen Tu
Joined: 2010-02-24
Re: which package to use to process large xml files (>50M)
hi,

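JAVA_OPTS is an environment variable that the scala launcher reads, not a JVM system property, so passing it with -D does not change the heap size. The right way to set it (at least in the bash shell) is: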

export JAVA_OPTS="-Xmx3g"
scala [...]

Randall R Schulz
Joined: 2008-12-16
Re: which package to use to process large xml files (>50M)

On Tuesday July 13 2010, Stephen Tu wrote:
> hi,
>
> JAVA_OPTS is an env var which scala reads. so the right way to do it
> is (at least in the bash shell):
>
> export JAVA_OPTS="-Xmx3g"
> scala [...]

Or, for per-command environment setting / overriding:

% JAVA_OPTS="..." scala ...

Randall Schulz

Naftoli Gugenheim
Joined: 2008-12-17
Re: which package to use to process large xml files (>50M)
Isn't there a pull parser?


Seth Tisue
Joined: 2008-12-16
Re: which package to use to process large xml files (>50M)

>>>>> "Naftoli" == Naftoli Gugenheim writes:

Naftoli> Isn't there a pull parser?

Indeed, in scala.xml.pull.

The 2.7 one is pretty much hosed, but the situation is greatly improved in 2.8.
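
To illustrate what pull-style processing buys you, here is a minimal sketch. Note the swap: it uses the JDK's StAX API (javax.xml.stream) rather than scala.xml.pull, because StAX ships with every JVM; the idea (iterate over events, never build the tree) is the same. The object name ElementCounter and the sample document are made up for illustration.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object ElementCounter {
  /** Counts start tags by streaming over the input; no document tree is built. */
  def countElements(in: Reader): Long = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    var count = 0L
    try {
      // Each call to next() advances to the following parse event.
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) count += 1
      }
    } finally reader.close()
    count
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input: root, two rows, two a's.
    val sample = "<root><row><a>1</a></row><row><a>2</a></row></root>"
    println(countElements(new StringReader(sample)))
  }
}
```

Memory use stays roughly constant regardless of file size, since only the current event is held. For a real file, XMLInputFactory also offers createXMLStreamReader(InputStream), which lets the parser detect the encoding itself.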

Tony Morris 2
Joined: 2009-03-20
Re: which package to use to process large xml files (>50M)

I have used Haskell's HXT to parse a 160GB XML file successfully and
considered porting the library to Scala; however, that task would be
quite difficult without laziness. I might give it a shot some day, but I
think you'll have to make a pretty big sacrifice without something similar.


Mohamed Bana 2
Joined: 2009-10-21
Re: which package to use to process large xml files (>50M)
Can you please give some examples? I'd like to understand what exactly isn't doable.

 —Mohamed



Joshua.Suereth
Joined: 2008-09-02
Re: which package to use to process large xml files (>50M)

Ah, but too much laziness and you'd never implement it! ;)

Doesn't scalaz have iteratees now? Is there an XML parser implementation that uses them?

- Josh


James.Strachan
Joined: 2009-07-08
Re: which package to use to process large xml files (>50M)

As an aside, you can usually use DOM-ish or object-XML mapping tools to
parse massive XML files. The one trick is that you need a hook so that,
as the parser works through the XML, you process each 'row' and then
remove it from the root element, so the garbage collector can discard
those subtrees of the document as you chug through.

e.g. if your XML looks like this...

...
...
...
...

then you just need a hook so your handler gets called for each row
with a document something like:

...

Many DOM-ish APIs in Java are mutable so you can just add a hook to
remove the child node you've just processed from the root and you're
good to go.

It might be possible to create a modified Scala XML parser which
reuses the scala.xml DOM model (Document, Elem etc) but which calls a
function as each 'row' is parsed and returns the final child-less
document so you can process massive XML documents using the same Scala
DOM model.
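
A rough sketch of the per-row hook idea, under some loud assumptions: it is built on the JDK's StAX reader rather than on a modified scala.xml parser, it hands each row to the callback as a plain Map of child-element name to text instead of a scala.xml Elem, and it assumes flat rows (children containing only text). The names RowStreamer, foreachRow, "row", "id" and "name" are all hypothetical.

```scala
import java.io.{Reader, StringReader}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable

object RowStreamer {
  /** Calls `handle` once per `rowName` element with its child-name -> text pairs. */
  def foreachRow(in: Reader, rowName: String)(handle: Map[String, String] => Unit): Unit = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    // Some(fields) while we are inside a row, None otherwise.
    var row: Option[mutable.LinkedHashMap[String, String]] = None
    val text = new StringBuilder
    try {
      while (reader.hasNext()) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            if (row.isEmpty && reader.getLocalName == rowName)
              row = Some(mutable.LinkedHashMap.empty[String, String])
            text.clear()
          case XMLStreamConstants.CHARACTERS =>
            if (row.isDefined) text.append(reader.getText)
          case XMLStreamConstants.END_ELEMENT =>
            row.foreach { fields =>
              if (reader.getLocalName == rowName) {
                handle(fields.toMap) // the hook: process the finished row
                row = None           // drop it so it can be garbage collected
              } else fields(reader.getLocalName) = text.toString.trim
            }
            text.clear()
          case _ => // ignore comments, processing instructions, etc.
        }
      }
    } finally reader.close()
  }

  def main(args: Array[String]): Unit = {
    val sample = "<root><row><id>1</id><name>a</name></row><row><id>2</id><name>b</name></row></root>"
    foreachRow(new StringReader(sample), "row")(fields => println(fields))
  }
}
```

Only one row is alive at a time, which is the property the post describes; the price is that queries across rows have to be expressed inside the callback.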


huynhjl
Joined: 2009-10-27
Re: which package to use to process large xml files (>50M)

That does not seem right. Assuming you have increased the heap space
properly, loading the file would take maybe up to 8 times the size of the
original file (about 560 MB for a 70 MB file). So you should be able to
load it with less than 1 GB of memory.

If you are getting the out of memory error during the load phase, then it's
probably just that the -D option is incorrect.

If you are on Unix, ps aux would show you how much virtual memory is used
(the VSZ column). The Windows Task Manager can show you the same.
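
One way to check from inside the JVM whether the heap setting actually took effect is to compare the rule-of-thumb estimate with the limit the runtime reports. This is only a sketch: the factor of 8 is the rough upper bound quoted in the post above, not a measured value, and the names HeapCheck, estimatedTreeBytes and fitsInHeap are invented for the example.

```scala
object HeapCheck {
  // Rule of thumb from the post above: the loaded tree takes up to ~8x the file size.
  val ExpansionFactor = 8L

  def estimatedTreeBytes(fileBytes: Long): Long = fileBytes * ExpansionFactor

  /** True if the estimate fits under the heap limit (by default, this JVM's own limit). */
  def fitsInHeap(fileBytes: Long,
                 maxHeapBytes: Long = Runtime.getRuntime().maxMemory()): Boolean =
    estimatedTreeBytes(fileBytes) < maxHeapBytes

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // If this prints far less than the requested 3 GB, the option never reached the JVM.
    println(s"max heap: ${Runtime.getRuntime().maxMemory() / mb} MB")
    println(s"estimate for a 70 MB file: ${estimatedTreeBytes(70 * mb) / mb} MB")
    println(s"fits: ${fitsInHeap(70 * mb)}")
  }
}
```

Running it once under each way of passing the option should show whether the reported max heap changes.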

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland