which package to use to process large xml files (>50M)
Wed, 2010-07-14, 01:44
Hi,
using "scala.xml.XML.loadFile" I reached the limitations of my machine. Assuming that I have increased the
heapspace correctly [1] I am constantly getting "java.lang.OutOfMemoryError: Java heap space". The XML
file I am trying to process is very large (70M). Is there any other other XML parser I could use? Ideally one
which provides just as simple syntax for formulating queries :) Or is there any way to estimate how much
memory I would actually need to process an XML file of such size (it has roughly 3 million XML elements)?
Cheers,
Tobias
References
[1] scala -DJAVA_OPTS="-Xmx3g"
Wed, 2010-07-14, 02:17
#2
Re: which package to use to process large xml files (>50M)
On Tuesday July 13 2010, Stephen Tu wrote:
> hi,
>
> JAVA_OPTS is an env var which scala reads, so the right way to do it
> is (at least in the bash shell):
>
> export JAVA_OPTS="-Xmx3g"
> scala [...]
Or, for per-command environment setting / overriding:
% JAVA_OPTS="..." scala ...
Randall Schulz
Wed, 2010-07-14, 03:17
#3
Re: which package to use to process large xml files (>50M)
Isn't there a pull parser?
On Tue, Jul 13, 2010 at 8:49 PM, Tobias Wunner <tobias.wunner@deri.org> wrote:
Hi,
using "scala.xml.XML.loadFile" I reached the limitations of my machine. Assuming that I have increased the
heapspace correctly [1] I am constantly getting "java.lang.OutOfMemoryError: Java heap space". The XML
file I am trying to process is very large (70M). Is there any other other XML parser I could use? Ideally one
which provides just as simple syntax for formulating queries :) Or is there any way to estimate how much
memory I would actually need to process an XML file of such size (it has roughly 3 million XML elements)?
Cheers,
Tobias
References
[1] scala -DJAVA_OPTS="-Xmx3g"
Wed, 2010-07-14, 04:57
#4
Re: which package to use to process large xml files (>50M)
>>>>> "Naftoli" == Naftoli Gugenheim writes:
>> using "scala.xml.XML.loadFile" I reached the limitations of my
>> machine. Assuming that I have increased the heapspace correctly [1]
>> I am constantly getting "java.lang.OutOfMemoryError: Java heap
>> space". The XML file I am trying to process is very large (70M). Is
>> there any other other XML parser I could use? Ideally one which
>> provides just as simple syntax for formulating queries :) Or is
>> there any way to estimate how much memory I would actually need to
>> process an XML file of such size (it has roughly 3 million XML
>> elements)?
Naftoli> Isn't there a pull parser?
Indeed, in scala.xml.pull.
The 2.7 one is pretty much hosed, but the situation is greatly improved in 2.8.
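For illustration, here is a minimal 2.8 sketch using the pull parser that streams events instead of building the whole tree in memory (the file name and the "row" element are made up):

import scala.io.Source
import scala.xml.pull._

// XMLEventReader is an Iterator[XMLEvent]; the document is never held in memory as a whole.
val reader = new XMLEventReader(Source.fromFile("big.xml"))
var rows = 0
for (event <- reader) event match {
  case EvElemStart(_, "row", _, _) => rows += 1   // a <row> element starts
  case _                           => ()          // text, end tags, comments, ...
}
println("rows seen: " + rows)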
Wed, 2010-07-14, 05:07
#5
Re: which package to use to process large xml files (>50M)
I have used Haskell's HXT to parse a 160GB XML file successfully and
considered porting the library to Scala; however, that task would be
quite difficult without laziness. I might give it a shot some day, but I
think you'll have to make a pretty big sacrifice without something similar.
Tobias Wunner wrote:
> Hi,
>
> using "scala.xml.XML.loadFile" I reached the limitations of my machine. Assuming that I have increased the
> heapspace correctly [1] I am constantly getting "java.lang.OutOfMemoryError: Java heap space". The XML
> file I am trying to process is very large (70M). Is there any other other XML parser I could use? Ideally one
> which provides just as simple syntax for formulating queries :) Or is there any way to estimate how much
> memory I would actually need to process an XML file of such size (it has roughly 3 million XML elements)?
>
> Cheers,
> Tobias
>
>
> References
> [1] scala -DJAVA_OPTS="-Xmx3g"
>
Wed, 2010-07-14, 09:17
#6
Re: which package to use to process large xml files (>50M)
Can you please give some examples? I'd like to understand what exactly isn't doable.
—Mohamed
On 14 July 2010 05:01, Tony Morris <tonymorris@gmail.com> wrote:
I have used Haskell's HXT to parse a 160GB XML file successfully and
considered porting the library to Scala; however, that task would be
quite difficult without laziness. I might give it a shot some day, but I
think you'll have to make a pretty big sacrifice without something similar.
Tobias Wunner wrote:
> Hi,
>
> using "scala.xml.XML.loadFile" I reached the limitations of my machine. Assuming that I have increased the
> heapspace correctly [1] I am constantly getting "java.lang.OutOfMemoryError: Java heap space". The XML
> file I am trying to process is very large (70M). Is there any other other XML parser I could use? Ideally one
> which provides just as simple syntax for formulating queries :) Or is there any way to estimate how much
> memory I would actually need to process an XML file of such size (it has roughly 3 million XML elements)?
>
> Cheers,
> Tobias
>
>
> References
> [1] scala -DJAVA_OPTS="-Xmx3g"
>
--
Tony Morris
http://tmorris.net/
Wed, 2010-07-14, 11:47
#7
Re: which package to use to process large xml files (>50M)
Ah, but too much laziness and you'd never implement it! ;)
Doesn't scalaz have iteratees now? Is there an XML parser implementation that uses them?
- Josh
On Jul 14, 2010, at 12:01 AM, Tony Morris wrote:
> I have used Haskell's HXT to parse a 160GB XML file successfully and
> considered porting the library to Scala; however, that task would be
> quite difficult without laziness. I might give it a shot some day, but I
> think you'll have to make a pretty big sacrifice without something similar.
>
>
> Tobias Wunner wrote:
>> Hi,
>>
>> using "scala.xml.XML.loadFile" I reached the limitations of my machine. Assuming that I have increased the
>> heapspace correctly [1] I am constantly getting "java.lang.OutOfMemoryError: Java heap space". The XML
>> file I am trying to process is very large (70M). Is there any other other XML parser I could use? Ideally one
>> which provides just as simple syntax for formulating queries :) Or is there any way to estimate how much
>> memory I would actually need to process an XML file of such size (it has roughly 3 million XML elements)?
>>
>> Cheers,
>> Tobias
>>
>>
>> References
>> [1] scala -DJAVA_OPTS="-Xmx3g"
>>
>
Wed, 2010-07-14, 13:37
#8
Re: which package to use to process large xml files (>50M)
As an aside, you can usually use DOM-ish or object-XML mapping tools to
parse massive XML files - the one trick is that you need to use a hook so
that, as the parser is parsing the XML, you process and remove each 'row'
from the root element, so the garbage collector can discard subtrees of
the XML document as you chug through.
e.g. if your XML looks something like this (element names illustrative):
  <rows>
    <row>...</row>
    <row>...</row>
    <row>...</row>
  </rows>
then you just need a hook so your handler gets called for each row
with a document something like:
  <rows>
    <row>...</row>
  </rows>
Many DOM-ish APIs in Java are mutable so you can just add a hook to
remove the child node you've just processed from the root and you're
good to go.
It might be possible to create a modified Scala XML parser which
reuses the scala.xml DOM model (Document, Elem etc) but which calls a
function as each 'row' is parsed and returns the final child-less
document so you can process massive XML documents using the same Scala
DOM model.
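For instance, dom4j's ElementHandler provides exactly this kind of hook; a rough sketch in Scala, assuming dom4j is on the classpath and that the document uses the illustrative /rows/row layout from above:

import java.io.File
import org.dom4j.{Element, ElementHandler, ElementPath}
import org.dom4j.io.SAXReader

val reader = new SAXReader()
// onEnd fires every time a </row> closes: process the completed row,
// then detach it so nothing above it keeps the subtree alive.
reader.addHandler("/rows/row", new ElementHandler {
  def onStart(path: ElementPath): Unit = ()
  def onEnd(path: ElementPath): Unit = {
    val row: Element = path.getCurrent
    // ... process the row here ...
    row.detach()
  }
})
val doc = reader.read(new File("big.xml"))  // ends up nearly empty

The same idea works with any mutable DOM API that exposes per-element callbacks.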
On 14 July 2010 03:15, Naftoli Gugenheim wrote:
> Isn't there a pull parser?
>
> On Tue, Jul 13, 2010 at 8:49 PM, Tobias Wunner
> wrote:
>>
>> Hi,
>>
>> using "scala.xml.XML.loadFile" I reached the limitations of my machine.
>> Assuming that I have increased the
>> heapspace correctly [1] I am constantly getting
>> "java.lang.OutOfMemoryError: Java heap space". The XML
>> file I am trying to process is very large (70M). Is there any other other
>> XML parser I could use? Ideally one
>> which provides just as simple syntax for formulating queries :) Or is
>> there any way to estimate how much
>> memory I would actually need to process an XML file of such size (it has
>> roughly 3 million XML elements)?
>>
>> Cheers,
>> Tobias
>>
>>
>> References
>> [1] scala -DJAVA_OPTS="-Xmx3g"
>
Wed, 2010-07-14, 14:57
#9
Re: which package to use to process large xml files (>50M)
That does not seem right, assuming you have increased the heap space
properly. Loading the file would take maybe up to 8 times the size of the
original file, i.e. roughly 560M for a 70M file, so you should be able to
load it with less than 1 GB of memory.
If you are getting the out-of-memory error during the load phase, then it's
probably just that the -D option is incorrect.
If you are on Unix, ps aux will show you how much virtual memory is being
used; the Windows Task Manager can show you the same.
JAVA_OPTS is an env var which scala reads, so the right way to do it is (at least in the bash shell):
export JAVA_OPTS="-Xmx3g"
scala [...]