Hadoop Reduce Value-Iterators are Flyweights
Hadoop reduce iterators are pretty broken, so realize that before you waste tons of time like I did. The following won't work (never mind the crap job I'm doing protecting my heap):
List<Entry> entries = new ArrayList<Entry>();
while (values.hasNext()) {
    // Bug: next() hands back the same recycled Entry instance every time
    entries.add(values.next());
}
It fails because Entry is a flyweight instance that gets recycled on each iteration. A nice optimization if you need it, but a rather heavy-handed assumption, and one that should have been documented. In this case it leaves you with a List populated with references to the same object instance, which could take you a while to track down.

If you, like me, can afford to cache your values in your reduce step and have a multi-pass algorithm that requires caching, do something like this:
List<Entry> entries = new ArrayList<Entry>();
while (values.hasNext()) {
    // Copy each value so the list does not hold the recycled instance
    entries.add(new Entry(values.next()));
}
where Entry(entry) is a copy constructor (you could use clone() if you've done nothing sexy to your class and for some reason hate portable design standards).

Hadoop is a great project, but they really drop the ball on documenting oddities like this.
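
If you want to see the recycling effect without spinning up a cluster, here is a minimal sketch that simulates it. Everything in it is an assumption made for illustration: the Entry class, recyclingIterator, and the two collect methods are hypothetical stand-ins, not Hadoop APIs. The iterator just mimics the behavior described above by overwriting and returning one shared instance.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in classes for illustration; nothing here is a Hadoop API.
class FlyweightIteratorDemo {

    /** Hypothetical mutable value type with a copy constructor. */
    static final class Entry {
        private String value;

        Entry() { }

        /** Copy constructor: takes a snapshot of the other entry's state. */
        Entry(Entry other) { this.value = other.value; }

        void set(String value) { this.value = value; }

        String get() { return value; }
    }

    /** Simulates a recycling iterator: every next() returns the same instance. */
    static Iterator<Entry> recyclingIterator(final String... raw) {
        return new Iterator<Entry>() {
            private final Entry shared = new Entry();
            private int index = 0;

            @Override public boolean hasNext() { return index < raw.length; }

            @Override public Entry next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(raw[index++]); // overwrite the shared instance in place
                return shared;
            }

            @Override public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    /** Broken approach: stores references to the one recycled instance. */
    static List<String> collectByReference(Iterator<Entry> values) {
        List<Entry> entries = new ArrayList<Entry>();
        while (values.hasNext()) {
            entries.add(values.next());
        }
        return contents(entries);
    }

    /** Working approach: copies each value before caching it. */
    static List<String> collectByCopy(Iterator<Entry> values) {
        List<Entry> entries = new ArrayList<Entry>();
        while (values.hasNext()) {
            entries.add(new Entry(values.next()));
        }
        return contents(entries);
    }

    /** Reads the cached entries back after iteration has finished. */
    static List<String> contents(List<Entry> entries) {
        List<String> out = new ArrayList<String>();
        for (Entry e : entries) {
            out.add(e.get());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference(recyclingIterator("a", "b", "c"))); // prints [c, c, c]
        System.out.println(collectByCopy(recyclingIterator("a", "b", "c")));      // prints [a, b, c]
    }
}
```

The by-reference list ends up holding three references to one object, so every element shows the last value written. The copy costs an allocation per value, which is the price of caching them for a multi-pass algorithm.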
