
a simple scala hadoop example

Miles Egan
Joined: 2010-07-05

It took me a bit of trial and error to get the WordCount example from
the Hadoop tutorials working in Scala, so I bundled up a working
example in case it saves somebody else some trouble:
http://github.com/cageface/scala-hadoop-example

I'm still pretty new to Scala so there might be some unpleasantly
non-idiomatic code there.

Miles Egan
Joined: 2010-07-05
Re: a simple scala hadoop example

I decided to update this example to the newer Hadoop API and found it
somewhat painful. Translating the Java code that uses the older API was
very straightforward, but the newer API was much trickier.

The new API requires you to implement a Mapper and a Reducer by
subclassing the API classes and overriding a method. The signatures of
the Java classes are as follows:

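public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context
      extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { /* ... */ }

  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {}
}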

and

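public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context
      extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { /* ... */ }

  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
                        Context context)
      throws IOException, InterruptedException {}
}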

After a lot of trial and error and searching I finally got Scala
classes with these signatures to work:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

class TokenizerMapper extends Mapper[Object,Text,Text,IntWritable] {
  override
  def map(key:Object, value:Text,
          context:Mapper[Object,Text,Text,IntWritable]#Context): Unit = {
  }
}

class IntSumReducer extends Reducer[Text,IntWritable,Text,IntWritable] {
  override
  def reduce(key:Text, values:java.lang.Iterable[IntWritable],
             context:Reducer[Text,IntWritable,Text,IntWritable]#Context): Unit = {
  }
}

This seems ugly. Repeating the type parameters in both the class
declaration and the method signature is non-DRY, the #Context type
projection is a bit obscure, and it was pure guesswork to discover that
I had to specify java.lang.Iterable. Using just "Iterable" doesn't
work. Is there some wrapper/mapping reason?

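A likely explanation for the Iterable problem is name resolution rather than any wrapping: an unqualified Iterable in Scala resolves to scala.collection.Iterable (through the alias in the scala package), which is unrelated to java.lang.Iterable, so a method declared with it has a different signature and overrides nothing. Here is a minimal sketch, assuming Scala 2.13 or later for scala.jdk.CollectionConverters. The IterableDemo object and its sum method are made up for illustration and only stand in for the body of a reduce method:

```scala
import scala.jdk.CollectionConverters._

// Hypothetical helper, not part of Hadoop: it takes the Java interface
// explicitly, the same way the reduce override has to.
object IterableDemo {
  def sum(values: java.lang.Iterable[java.lang.Integer]): Int =
    // asScala wraps the Java Iterable as a scala.collection.Iterable,
    // so the usual collection methods become available.
    values.asScala.map(_.intValue()).sum
}
```

Inside a real reducer the same conversion should let the body use ordinary Scala collection methods on values, while the signature keeps the java.lang.Iterable type that the override needs.
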
Is there a way I can improve this, or is this just the overhead of
dealing with some of the uglier Java APIs?
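
One way to cut the repetition might be a type alias for the fully applied Mapper type, reused in both the extends clause and the #Context projection. The sketch below does not use Hadoop at all: Mapper here is a hypothetical Scala stand-in shaped like the Java class (an inner Context plus a map method taking the projected type), and Long/String/Int replace Object/Text/IntWritable so that the block runs on its own. The names ContextAliasDemo, WordMapper, and out are invented for illustration.

```scala
import scala.collection.mutable.ListBuffer

object ContextAliasDemo {
  // Hypothetical stand-in for org.apache.hadoop.mapreduce.Mapper; NOT the real class.
  class Mapper[KI, VI, KO, VO] {
    class Context {
      // Records every pair written, so the result can be inspected.
      val out = ListBuffer.empty[(KO, VO)]
      def write(key: KO, value: VO): Unit = out += ((key, value))
    }
    // A Java inner class shows up in Scala signatures as a type projection like this one.
    def map(key: KI, value: VI, context: Mapper[KI, VI, KO, VO]#Context): Unit = ()
  }

  // State the type arguments once...
  type WordMapper = Mapper[Long, String, String, Int]

  // ...and reuse the alias for both the parent class and the Context type.
  class TokenizerMapper extends WordMapper {
    override def map(key: Long, value: String, context: WordMapper#Context): Unit =
      value.split("\\s+").filter(_.nonEmpty).foreach(word => context.write(word, 1))
  }
}
```

Since a type alias is transparent to the compiler, the same trick should carry over to the real Hadoop classes (for example an alias for Mapper[Object,Text,Text,IntWritable]), though that is an assumption and not something verified here.
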
