- About Scala
- Documentation
- Code Examples
- Software
- Scala Developers
List serialization and 5374
Mon, 2012-01-23, 13:08
https://issues.scala-lang.org/browse/SI-5374
This is due to an optimization in the way lists are serialized - instead
of serializing the nodes, values in the nodes are serialized directly:
private def writeObject(out: ObjectOutputStream) {
var xs: List[B] = this
while (!xs.isEmpty) { out.writeObject(xs.head); xs = xs.tail }
out.writeObject(ListSerializeEnd)
}
private def readObject(in: ObjectInputStream) {
hd = in.readObject.asInstanceOf[B]
assert(hd != ListSerializeEnd)
var current: ::[B] = this
while (true) in.readObject match {
case ListSerializeEnd =>
current.tl = Nil
return
case a : Any =>
val list : ::[B] = new ::(a.asInstanceOf[B], Nil)
current.tl = list
current = list
}
}
The problem with this is that any object that has multiple fields
referring to the same list object will be deserialized so that it has
references to multiple list objects.
I believe this is wrong, since the tail of the list is not private to
the list, and Java serialization is supposed to guarantee that
serializing an object A which points to B and C, where B transitively
points to C, will yield the same structure in memory after deserializing A.
Furthermore, an object having N references to the same list object of
length M will need N*M memory to be serialized.
And finally, serializing a list in which some elements point to the list
itself doesn't end
Instead of patching ListBuffers to serialize lists more smartly, I plan to:
1) remove custom serialization of '::' to serialize the entire list
structure with nodes
2) implement list buffer serialization more efficiently so that values
are serialized instead of nodes
Question is - does anyone see a problem with this?
Wed, 2012-01-25, 00:01
#2
Re: List serialization and 5374
I believe the other immutable data structures (HashMap, HashSet, Vector) behave the same way List did. Should they be changed, too? One thing to consider are compatibility issues down the road because implementation classes would be exposed in the binary representation.
- Tiark
On Jan 24, 2012, at 5:34 PM, Aleksandar Prokopec wrote:
> Just for the record - this change is now in trunk. It will increase the footprint of serialized lists, but will avoid other issues as discussed below.
>
> On 1/23/12 1:08 PM, Aleksandar Prokopec wrote:
>> From:
>>
>> https://issues.scala-lang.org/browse/SI-5374
>>
>> This is due to an optimization in the way lists are serialized - instead of serializing the nodes, values in the nodes are serialized directly:
>>
>> private def writeObject(out: ObjectOutputStream) {
>> var xs: List[B] = this
>> while (!xs.isEmpty) { out.writeObject(xs.head); xs = xs.tail }
>> out.writeObject(ListSerializeEnd)
>> }
>>
>> private def readObject(in: ObjectInputStream) {
>> hd = in.readObject.asInstanceOf[B]
>> assert(hd != ListSerializeEnd)
>> var current: ::[B] = this
>> while (true) in.readObject match {
>> case ListSerializeEnd =>
>> current.tl = Nil
>> return
>> case a : Any =>
>> val list : ::[B] = new ::(a.asInstanceOf[B], Nil)
>> current.tl = list
>> current = list
>> }
>> }
>>
>> The problem with this is that any object that has multiple fields referring to the same list object will be deserialized so that it has references to multiple list objects.
>> I believe this is wrong, since the tail of the list is not private to the list, and Java serialization is supposed to guarantee that serializing an object A which points to B and C, where B transitively points to C, will yield the same structure in memory after deserializing A.
>> Furthermore, an object having N references to the same list object of length M will need N*M memory to be serialized.
>> And finally, serializing a list in which some elements point to the list itself doesn't end
>>
>> Instead of patching ListBuffers to serialize lists more smartly, I plan to:
>> 1) remove custom serialization of '::' to serialize the entire list structure with nodes
>> 2) implement list buffer serialization more efficiently so that values are serialized instead of nodes
>>
>> Question is - does anyone see a problem with this?
>>
>>
>
>
Wed, 2012-01-25, 00:11
#3
Re: List serialization and 5374
On Tue, Jan 24, 2012 at 2:58 PM, Tiark Rompf <tiark.rompf@epfl.ch> wrote:
While we're on the subject, can someone make sure that DefaultMap and DefaultKeySet are serializable? :)
-0xe1a
I believe the other immutable data structures (HashMap, HashSet, Vector) behave the same way List did. Should they be changed, too? One thing to consider are compatibility issues down the road because implementation classes would be exposed in the binary representation.
While we're on the subject, can someone make sure that DefaultMap and DefaultKeySet are serializable? :)
-0xe1a
Wed, 2012-01-25, 09:21
#4
Re: List serialization and 5374
That's true, but I believe that List guarantees that list.tail always returns the same list object. ListBuffer was relying on this property. This is a difference between lists and other immutable data structures that we have - their structure is a part of their public api. This is not the case with immutable.HashMap for example - you cannot obtain a reference into the middle of the hash trie. While serializing an object with several updated versions of the same hash trie would currently serialize the same objects multiple times, it cannot lead to correctness problems as was the case with lists.
Wed, 2012-01-25, 10:31
#5
Re: List serialization and 5374
How will Lists serialized under the old scheme keep working when deserialized under the new one?
Wed, 2012-01-25, 10:41
#6
Re: List serialization and 5374
It won't.
The same goes for list buffers.
On 1/25/12 10:23 AM, Simon Ochsenreither wrote:
The same goes for list buffers.
On 1/25/12 10:23 AM, Simon Ochsenreither wrote:
How will Lists serialized under the old scheme keep working when deserialized under the new one?
-- Aleksandar Prokopec, Doctoral Assistant LAMP, IC, EPFL http://people.epfl.ch/aleksandar.prokopec
Wed, 2012-01-25, 10:51
#7
Re: List serialization and 5374
Isn't that like ... the main idea ... of serialization?
And it is not like List and ListBuffer are classes nobody uses.
And it is not like List and ListBuffer are classes nobody uses.
Wed, 2012-01-25, 10:51
#8
Re: List serialization and 5374
No, it isn't - even in JDK, not all classes are guaranteed to be
serializable between different major releases.
But it may be possible to make this backwards compatible.
On 1/25/12 10:38 AM, Simon Ochsenreither wrote:
But it may be possible to make this backwards compatible.
On 1/25/12 10:38 AM, Simon Ochsenreither wrote:
Isn't that like ... the main idea ... of serialization?
And it is not like List and ListBuffer are classes nobody uses.
-- Aleksandar Prokopec, Doctoral Assistant LAMP, IC, EPFL http://people.epfl.ch/aleksandar.prokopec
Wed, 2012-01-25, 10:51
#9
Re: List serialization and 5374
On Wed, Jan 25, 2012 at 9:43 AM, Aleksandar Prokopec <aleksandar.prokopec@epfl.ch> wrote:
In the JDK, classes that do not guarantee compatibility specify it in the javadoc (Swing is an example). The collection classes do guarantee compatibility across releases as far as I know.
Best,Ismael
No, it isn't - even in JDK, not all classes are guaranteed to be serializable between different major releases.
In the JDK, classes that do not guarantee compatibility specify it in the javadoc (Swing is an example). The collection classes do guarantee compatibility across releases as far as I know.
Best,Ismael
Wed, 2012-01-25, 11:01
#10
Re: List serialization and 5374
It's probably just a check of the serialVersionUID and then doing "the right thing".
Wed, 2012-01-25, 16:31
#11
Re: List serialization and 5374
That's a good point.
Here's one way how backwards compatible serialization could be done:
1) The `::` class can have a smart readObject and writeObject which
figure out in which version the list was serialized, and deserialize it
appropriately.
2) ListBuffers didn't have writeObject and readObject specified before
the fix for SI-5374. This means that I can't write a custom readObject
which is smart about figuring out the version of the serialized
ListBuffer, because I cannot assume the order in which the fields have
been serialized (or am I wrong and this is specified somewhere?).
If we remove `writeObject` and `readObject` methods which were
introduced in the fix for SI-5374, then deserializing a ListBuffer which
was serialized prior to the fix will still create an invalid ListBuffer,
as before.
So, since the serialization for ListBuffers was broken before, it makes
sense to keep writeObject and readObject introduced in the fix.
>
> In the JDK, classes that do not guarantee compatibility specify it in
> the javadoc (Swing is an example). The collection classes do guarantee
> compatibility across releases as far as I know.
>
> Best,
> Ismael
Wed, 2012-01-25, 19:31
#12
Re: List serialization and 5374
Slightly unrelated…. I wonder if it is possible to use the scala collection builder infrastructure to generically serialize/deserialize collections without exposing internal implementation details? This would probably improve serialization/deserialization speed/size and make it easier to refactor internals (like my TreeMap/TreeSet changes) without breaking compatibility with older serialized data.
Something like (pseudo-code):
trait DefaultSerialization extends GenTraversable {
private def writeObject(out: ObjectOutputStream) {
out.writeInt(this.size)
for (element <- this) out.writeObject(element)
}
private def readObject(in: ObjectInputStream) {
val size = in.readInt
val builder = companion.newBuilder
builder.sizeHint(size)
for (_ <- 1 to size) builder += in.readObject()
builder.result()
}
Obviously some collections may still need to do standard/customized serialization (anything without a definite size and/or BitSet, Stream, List, etc).
Regards,
Erik
My changes to TreeMap/TreeSet
On 25 jan. 2012, at 16:23, Aleksandar Prokopec wrote:
> That's a good point.
>
> Here's one way how backwards compatible serialization could be done:
>
> 1) The `::` class can have a smart readObject and writeObject which figure out in which version the list was serialized, and deserialize it appropriately.
>
> 2) ListBuffers didn't have writeObject and readObject specified before the fix for SI-5374. This means that I can't write a custom readObject which is smart about figuring out the version of the serialized ListBuffer, because I cannot assume the order in which the fields have been serialized (or am I wrong and this is specified somewhere?).
> If we remove `writeObject` and `readObject` methods which were introduced in the fix for SI-5374, then deserializing a ListBuffer which was serialized prior to the fix will still create an invalid ListBuffer, as before.
> So, since the serialization for ListBuffers was broken before, it makes sense to keep writeObject and readObject introduced in the fix.
>
>
>
>
>>
>> In the JDK, classes that do not guarantee compatibility specify it in the javadoc (Swing is an example). The collection classes do guarantee compatibility across releases as far as I know.
>>
>> Best,
>> Ismael
>
>
Just for the record - this change is now in trunk. It will increase the
footprint of serialized lists, but will avoid other issues as discussed
below.
On 1/23/12 1:08 PM, Aleksandar Prokopec wrote:
> From:
>
> https://issues.scala-lang.org/browse/SI-5374
>
> This is due to an optimization in the way lists are serialized -
> instead of serializing the nodes, values in the nodes are serialized
> directly:
>
> private def writeObject(out: ObjectOutputStream) {
> var xs: List[B] = this
> while (!xs.isEmpty) { out.writeObject(xs.head); xs = xs.tail }
> out.writeObject(ListSerializeEnd)
> }
>
> private def readObject(in: ObjectInputStream) {
> hd = in.readObject.asInstanceOf[B]
> assert(hd != ListSerializeEnd)
> var current: ::[B] = this
> while (true) in.readObject match {
> case ListSerializeEnd =>
> current.tl = Nil
> return
> case a : Any =>
> val list : ::[B] = new ::(a.asInstanceOf[B], Nil)
> current.tl = list
> current = list
> }
> }
>
> The problem with this is that any object that has multiple fields
> referring to the same list object will be deserialized so that it has
> references to multiple list objects.
> I believe this is wrong, since the tail of the list is not private to
> the list, and Java serialization is supposed to guarantee that
> serializing an object A which points to B and C, where B transitively
> points to C, will yield the same structure in memory after
> deserializing A.
> Furthermore, an object having N references to the same list object of
> length M will need N*M memory to be serialized.
> And finally, serializing a list in which some elements point to the
> list itself doesn't end
>
> Instead of patching ListBuffers to serialize lists more smartly, I
> plan to:
> 1) remove custom serialization of '::' to serialize the entire list
> structure with nodes
> 2) implement list buffer serialization more efficiently so that values
> are serialized instead of nodes
>
> Question is - does anyone see a problem with this?
>
>