really, nothing here

software geek

28.9.08

Hadoop Reduce Value-Iterators are Flyweights

Hadoop reduce iterators behave in a way that looks pretty broken: they hand you back the same object on every call. Realize that before you waste tons of time like I did. The following won't work (never mind the crap job I'm doing protecting my heap):

// Broken: every element ends up being the same recycled instance.
List<Entry> entries = new ArrayList<Entry>();
while (values.hasNext()) {
    entries.add(values.next());
}

It fails because the Entry returned by the iterator is a flyweight instance that gets recycled on each iteration. A nice optimization if you need it, but a rather heavy-handed assumption that should have been documented, and in this case it leaves you with a List populated with references to the same object instance. This could take you a while to track down.
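To see the failure without a cluster, here is a minimal sketch that fakes the recycling. Entry and RecyclingIterator are stand-ins I made up for illustration, not Hadoop classes; the only assumption is that the iterator overwrites and returns one shared object, as described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FlyweightDemo {
    // Hypothetical stand-in for the value type; not a Hadoop class.
    static class Entry {
        int value;
    }

    // Hypothetical iterator that mimics the recycling: it owns a single
    // Entry and overwrites its field on every call to next().
    static class RecyclingIterator implements Iterator<Entry> {
        private final int[] data;
        private final Entry shared = new Entry();
        private int pos = 0;

        RecyclingIterator(int[] data) {
            this.data = data;
        }

        public boolean hasNext() {
            return pos < data.length;
        }

        public Entry next() {
            shared.value = data[pos++];
            return shared; // same instance every time
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    // The naive collection loop from the post.
    static List<Entry> collectNaively(Iterator<Entry> values) {
        List<Entry> entries = new ArrayList<Entry>();
        while (values.hasNext()) {
            entries.add(values.next());
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries =
                collectNaively(new RecyclingIterator(new int[] {1, 2, 3}));
        // Every slot refers to the one shared object, which now holds
        // the last value the iterator produced.
        for (Entry e : entries) {
            System.out.println(e.value);
        }
        System.out.println(entries.get(0) == entries.get(2));
    }
}
```

The list has the right size, which is what makes this nasty: only the contents are wrong, since each slot reports the last value read.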

If you, like me, have a multi-pass algorithm that needs the values cached, and you can afford the memory to cache them in your reduce step, do something like this:

// Copy each value before caching it, so the list holds distinct objects.
List<Entry> entries = new ArrayList<Entry>();
while (values.hasNext()) {
    entries.add(new Entry(values.next()));
}

where Entry(entry) is a copy-constructor (you could use clone() if you've done nothing sexy to your class and for some reason hate portable design standards).
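For what it's worth, here is a rough, self-contained sketch of what such a copy constructor and a two-pass use of the cached list might look like. The Entry fields (word, count) and the recycling() helper are invented for the example; your real value class will differ, and anything mutable inside it needs a deep copy rather than a reference copy.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CopyOnCacheSketch {
    // Hypothetical value type; the fields are made up for illustration.
    static class Entry {
        String word;
        int count;

        Entry(String word, int count) {
            this.word = word;
            this.count = count;
        }

        // Copy constructor. String is immutable and int is a primitive,
        // so a field-by-field copy is independent of the source object.
        Entry(Entry other) {
            this(other.word, other.count);
        }
    }

    // Hypothetical helper that simulates an iterator reusing one Entry.
    static Iterator<Entry> recycling(final String[] words, final int[] counts) {
        return new Iterator<Entry>() {
            private final Entry shared = new Entry(null, 0);
            private int pos = 0;

            public boolean hasNext() {
                return pos < words.length;
            }

            public Entry next() {
                shared.word = words[pos];
                shared.count = counts[pos];
                pos++;
                return shared;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    // Cache a private copy of each value instead of the recycled instance.
    static List<Entry> cacheCopies(Iterator<Entry> values) {
        List<Entry> entries = new ArrayList<Entry>();
        while (values.hasNext()) {
            entries.add(new Entry(values.next()));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = cacheCopies(
                recycling(new String[] {"a", "b", "c"}, new int[] {1, 2, 3}));

        // Pass 1: compute the total.
        int total = 0;
        for (Entry e : entries) {
            total += e.count;
        }
        // Pass 2: report each count against the total.
        for (Entry e : entries) {
            System.out.println(e.word + " " + e.count + "/" + total);
        }
    }
}
```

The price is one extra allocation per value, which is exactly the cost the flyweight was avoiding, so only do this when the group of values for a key fits comfortably in memory.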

Hadoop is a great project, but they really drop the ball on documenting oddities like this.

1 Comment:

Blogger bernie said...

good. thanks!

28/5/09 00:35  
