This page is no longer maintained — Please continue to the home page at www.scala-lang.org

Maps not using MurmurHash.stringHash()?

6 replies

Thu, 2012-01-05, 18:46

tcn 2

Joined: 2011-11-06,

Hi!

I used to think that scala's collections were using MurmurHash:

"This is the hash used by collections and case classes (including
tuples)."
https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/...

However when testing...

scala> val r = new BufferedReader( new FileReader( "/tmp/
colliding_strings" ) )
scala> val m = scala.collection.mutable.Map[String,String]()
scala> Iterator.continually(r.readLine).takeWhile(_ != null).foreach(l
=> m += l -> l)

...took forever.

So, java.lang.String.hashCode() is used? That's a bad idea, isn't it?

-tcn

Thu, 2012-01-05, 19:41

#1

ichoran

Joined: 2009-08-14,

Re: Maps not using MurmurHash.stringHash()?

Scala uses whatever hash is appropriate for the class in question. So yes, it uses Java hashes on strings. What I meant is that the collections and tuples use MurmurHash to create a new hash value based on their contents, so you don't get colliding collections. But if you put colliding objects (i.e. the objects' .hashCode methods return the same values) into a hash map, yes, they'll collide.

One could remedy this problem by creating a wrapper object that returned the MurmurHash of a string instead of its default hash, and use those in maps instead of strings; with implicit conversions it would be almost painless.

I'm not sure whether Scala ought to address the string hashing issue, as it did with the number hashing issue. Strings are hashed a lot, and Java's is a lot faster, but you need boxing to store a hash...so there are no great solutions.

--Rex

On Thu, Jan 5, 2012 at 12:46 PM, tcn <timo.nentwig@gmail.com> wrote:

Hi!

I used to think that scala's collections were using MurmurHash:

"This is the hash used by collections and case classes (including
tuples)."
https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src//library/scala/util/MurmurHash.scala#L1

However when testing...

scala> val r = new BufferedReader( new FileReader( "/tmp/
colliding_strings" ) )
scala> val m = scala.collection.mutable.Map[String,String]()
scala> Iterator.continually(r.readLine).takeWhile(_ != null).foreach(l
=> m += l -> l)

...took forever.

So, java.lang.String.hashCode() is used? That's a bad idea, isn't it?

-tcn

Fri, 2012-01-06, 06:21

#2

tcn 2

Joined: 2011-11-06,

Re: Maps not using MurmurHash.stringHash()?

I agree that performance is very important.

This actually caught my attention when watching a 28c3 talk on
attacking web container this way. It is very easy to effectively
shutting down a server just by dispatching a ton of colliding strings
(which end up being stored in a hash map).

String.hashCode collisions are very easy to generate. Maybe there's
some algorithm that's still fast but at least harder to generate
collisions for?

On Jan 5, 7:36 pm, Rex Kerr wrote:
> Scala uses whatever hash is appropriate for the class in question. So yes,
> it uses Java hashes on strings. What I meant is that the collections and
> tuples use MurmurHash to create a new hash value based on their contents,
> so you don't get colliding collections. But if you put colliding objects
> (i.e. the objects' .hashCode methods return the same values) into a hash
> map, yes, they'll collide.
>
> One could remedy this problem by creating a wrapper object that returned
> the MurmurHash of a string instead of its default hash, and use those in
> maps instead of strings; with implicit conversions it would be almost
> painless.
>
> I'm not sure whether Scala ought to address the string hashing issue, as it
> did with the number hashing issue. Strings are hashed a lot, and Java's is
> a lot faster, but you need boxing to store a hash...so there are no great
> solutions.
>
> --Rex
>
>
>
>
>
>
>
> On Thu, Jan 5, 2012 at 12:46 PM, tcn wrote:
> > Hi!
>
> > I used to think that scala's collections were using MurmurHash:
>
> > "This is the hash used by collections and case classes (including
> > tuples)."
>
> >https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/s...
>
> > However when testing...
>
> > scala> val r = new BufferedReader( new FileReader( "/tmp/
> > colliding_strings" ) )
> > scala> val m = scala.collection.mutable.Map[String,String]()
> > scala> Iterator.continually(r.readLine).takeWhile(_ != null).foreach(l
> > => m += l -> l)
>
> > ...took forever.
>
> > So, java.lang.String.hashCode() is used? That's a bad idea, isn't it?
>
> > -tcn

Sat, 2012-01-07, 09:51

#3

tcn 2

Joined: 2011-11-06,

Re: Maps not using MurmurHash.stringHash()?

[SECURITY] Apache Tomcat and the hashtable collision DoS vulnerability

http://mail-archives.apache.org/mod_mbox/www-announce/201112.mbox/<4EFB9800.5010106@apache.org>

The existence of a wrapper would at least make people aware that is a
real issue and not only a theoretical one.

On Jan 6, 6:09 am, tcn wrote:
> I agree that performance is very important.
>
> This actually caught my attention when watching a 28c3 talk on
> attacking web container this way. It is very easy to effectively
> shutting down a server just by dispatching a ton of colliding strings
> (which end up being stored in a hash map).
>
> String.hashCode collisions are very easy to generate. Maybe there's
> some algorithm that's still fast but at least harder to generate
> collisions for?
>
> On Jan 5, 7:36 pm, Rex Kerr wrote:
>
>
>
>
>
>
>
> > Scala uses whatever hash is appropriate for the class in question. So yes,
> > it uses Java hashes on strings. What I meant is that the collections and
> > tuples use MurmurHash to create a new hash value based on their contents,
> > so you don't get colliding collections. But if you put colliding objects
> > (i.e. the objects' .hashCode methods return the same values) into a hash
> > map, yes, they'll collide.
>
> > One could remedy this problem by creating a wrapper object that returned
> > the MurmurHash of a string instead of its default hash, and use those in
> > maps instead of strings; with implicit conversions it would be almost
> > painless.
>
> > I'm not sure whether Scala ought to address the string hashing issue, as it
> > did with the number hashing issue. Strings are hashed a lot, and Java's is
> > a lot faster, but you need boxing to store a hash...so there are no great
> > solutions.
>
> > --Rex
>
> > On Thu, Jan 5, 2012 at 12:46 PM, tcn wrote:
> > > Hi!
>
> > > I used to think that scala's collections were using MurmurHash:
>
> > > "This is the hash used by collections and case classes (including
> > > tuples)."
>
> > >https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/s...
>
> > > However when testing...
>
> > > scala> val r = new BufferedReader( new FileReader( "/tmp/
> > > colliding_strings" ) )
> > > scala> val m = scala.collection.mutable.Map[String,String]()
> > > scala> Iterator.continually(r.readLine).takeWhile(_ != null).foreach(l
> > > => m += l -> l)
>
> > > ...took forever.
>
> > > So, java.lang.String.hashCode() is used? That's a bad idea, isn't it?
>
> > > -tcn

Sat, 2012-01-07, 10:31

#4

Henry Story

Joined: 2011-03-26,

Re: Re: Maps not using MurmurHash.stringHash()?

On 7 Jan 2012, at 09:49, tcn wrote:

> [SECURITY] Apache Tomcat and the hashtable collision DoS vulnerability
>
> http://mail-archives.apache.org/mod_mbox/www-announce/201112.mbox/<4EFB9800.5010106@apache.org>
>
> The existence of a wrapper would at least make people aware that is a
> real issue and not only a theoretical one.

I suppose it would be useful to know where these bug reports are placed now (they used to be on bugs.sun.com) because then we could at least vote for the issue. In the 28c3 video the hackers mentioned someone at Oracle who stated that this was not an issue to be fixed. I'd have to look through the video to find the name again.
http://www.youtube.com/watch?v=R2Cq3CLI6H8

But the issue is larger I suppose than forms. In a LinkedData web where each web site is crawling other sites for information (from RSS feeds to Friend of a Friend (foaf) files, to DOAP statements, to shopping information ( http://purl.org/goodrelations/ ) each site crawling the web needs to parse XML, JSON, or other formats found on the web. It would be interesting to check if these parsers are not also using hashtables in their parsing work. Because if they are then this can also be used as a denial of service attack. Of course one would know the name of the URL that was evil, but because it so easy to place documents on the web that in name spaces that don't belong to one, it would be impossible to block domains that were problematic, unless one wanted to block huge parts of google. It would also be extremely difficult to argue for the removal of these documents. "Please Google could you remove this nasty document because it is a denial of service attack on my parser" does not sound like it will have much impact. (Well perhaps parsers don't use hash maps, and so I am just wrong)

Henry

>
> On Jan 6, 6:09 am, tcn wrote:
>> I agree that performance is very important.
>>
>> This actually caught my attention when watching a 28c3 talk on
>> attacking web container this way. It is very easy to effectively
>> shutting down a server just by dispatching a ton of colliding strings
>> (which end up being stored in a hash map).
>>
>> String.hashCode collisions are very easy to generate. Maybe there's
>> some algorithm that's still fast but at least harder to generate
>> collisions for?
>>
>> On Jan 5, 7:36 pm, Rex Kerr wrote:
>>
>>
>>
>>
>>
>>
>>
>>> Scala uses whatever hash is appropriate for the class in question. So yes,
>>> it uses Java hashes on strings. What I meant is that the collections and
>>> tuples use MurmurHash to create a new hash value based on their contents,
>>> so you don't get colliding collections. But if you put colliding objects
>>> (i.e. the objects' .hashCode methods return the same values) into a hash
>>> map, yes, they'll collide.
>>
>>> One could remedy this problem by creating a wrapper object that returned
>>> the MurmurHash of a string instead of its default hash, and use those in
>>> maps instead of strings; with implicit conversions it would be almost
>>> painless.
>>
>>> I'm not sure whether Scala ought to address the string hashing issue, as it
>>> did with the number hashing issue. Strings are hashed a lot, and Java's is
>>> a lot faster, but you need boxing to store a hash...so there are no great
>>> solutions.
>>
>>> --Rex
>>
>>> On Thu, Jan 5, 2012 at 12:46 PM, tcn wrote:
>>>> Hi!
>>
>>>> I used to think that scala's collections were using MurmurHash:
>>
>>>> "This is the hash used by collections and case classes (including
>>>> tuples)."
>>
>>>> https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/s...
>>
>>>> However when testing...
>>
>>>> scala> val r = new BufferedReader( new FileReader( "/tmp/
>>>> colliding_strings" ) )
>>>> scala> val m = scala.collection.mutable.Map[String,String]()
>>>> scala> Iterator.continually(r.readLine).takeWhile(_ != null).foreach(l
>>>> => m += l -> l)
>>
>>>> ...took forever.
>>
>>>> So, java.lang.String.hashCode() is used? That's a bad idea, isn't it?
>>
>>>> -tcn

Social Web Architect
http://bblfish.net/

Sun, 2012-01-08, 13:41

#5

tcn 2

Joined: 2011-11-06,

Re: Maps not using MurmurHash.stringHash()?

Starting minute 34:

"As for Java itself, it does not seem like there is anything
that would require a change in Java hashmap
implementation." – Chandan, Oracle Security Alerts

http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effect...

On Jan 7, 10:20 am, Henry Story wrote:
> On 7 Jan 2012, at 09:49, tcn wrote:
>
> > [SECURITY] Apache Tomcat and the hashtable collision DoS vulnerability
>
> >http://mail-archives.apache.org/mod_mbox/www-announce/201112.mbox/<4EFB9800.5010...@apache.org>
>
> > The existence of a wrapper would at least make people aware that is a
> > real issue and not only a theoretical one.
>
> I suppose it would be useful to know where these bug reports are placed now (they used to be on bugs.sun.com) because then we could at least vote for the issue. In the 28c3 video the hackers mentioned someone at Oracle who stated that this was not an issue to be fixed. I'd have to look through the video to find the name again.http://www.youtube.com/watch?v=R2Cq3CLI6H8
>
> But the issue is larger I suppose than forms. In a LinkedData web where each web site is crawling other sites for information (from RSS feeds to Friend of a Friend (foaf) files, to DOAP statements, to shopping information (http://purl.org/goodrelations/) each site crawling the web needs to parse XML, JSON, or other formats found on the web. It would be interesting to check if these parsers are not also using hashtables in their parsing work. Because if they are then this can also be used as a denial of service attack. Of course one would know the name of the URL that was evil, but because it so easy to place documents on the web that in name spaces that don't belong to one, it would be impossible to block domains that were problematic, unless one wanted to block huge parts of google. It would also be extremely difficult to argue for the removal of these documents. "Please Google could you remove this nasty document because it is a denial of service attack on my parser" does not sound like it will have much impact. (Well perhaps parsers don't use hash maps, and so I am just wrong)
>
> Henry
>
>
>
>
>
>
>
>
>
>
>
> > On Jan 6, 6:09 am, tcn wrote:
> >> I agree that performance is very important.
>
> >> This actually caught my attention when watching a 28c3 talk on
> >> attacking web container this way. It is very easy to effectively
> >> shutting down a server just by dispatching a ton of colliding strings
> >> (which end up being stored in a hash map).
>
> >> String.hashCode collisions are very easy to generate. Maybe there's
> >> some algorithm that's still fast but at least harder to generate
> >> collisions for?
>
> >> On Jan 5, 7:36 pm, Rex Kerr wrote:
>
> >>> Scala uses whatever hash is appropriate for the class in question. So yes,
> >>> it uses Java hashes on strings. What I meant is that the collections and
> >>> tuples use MurmurHash to create a new hash value based on their contents,
> >>> so you don't get colliding collections. But if you put colliding objects
> >>> (i.e. the objects' .hashCode methods return the same values) into a hash
> >>> map, yes, they'll collide.
>
> >>> One could remedy this problem by creating a wrapper object that returned
> >>> the MurmurHash of a string instead of its default hash, and use those in
> >>> maps instead of strings; with implicit conversions it would be almost
> >>> painless.
>
> >>> I'm not sure whether Scala ought to address the string hashing issue, as it
> >>> did with the number hashing issue. Strings are hashed a lot, and Java's is
> >>> a lot faster, but you need boxing to store a hash...so there are no great
> >>> solutions.
>
> >>> --Rex
>
> >>> On Thu, Jan 5, 2012 at 12:46 PM, tcn wrote:
> >>>> Hi!
>
> >>>> I used to think that scala's collections were using MurmurHash:
>
> >>>> "This is the hash used by collections and case classes (including
> >>>> tuples)."
>
> >>>>https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/s...
>
> >>>> However when testing...
>
> >>>> scala> val r = new BufferedReader( new FileReader( "/tmp/
> >>>> colliding_strings" ) )
> >>>> scala> val m = scala.collection.mutable.Map[String,String]()
> >>>> scala> Iterator.continually(r.readLine).takeWhile(_ != null).foreach(l
> >>>> => m += l -> l)
>
> >>>> ...took forever.
>
> >>>> So, java.lang.String.hashCode() is used? That's a bad idea, isn't it?
>
> >>>> -tcn
>
> Social Web Architecthttp://bblfish.net/
>
> smime.p7s
> 6KViewDownload

Sat, 2012-01-28, 15:51

#6

tcn 2

Joined: 2011-11-06,

Re: Maps not using MurmurHash.stringHash()?

So, nobody seems to care... By accident I just discovered:

We are most excited by the performance of CityHash64() and its variants on
short strings, but long strings are interesting as well.

CityHash is intended to be fast, under the constraint that it hash very
well. For CPUs with the CRC32 instruction, CRC is speedy, but CRC wasn't
designed as a hash function and shouldn't be used as one. CityHashCrc128()
is not a CRC, but it uses the CRC32 machinery. http://code.google.com/p/cityhash/source/browse/trunk/README

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland