really, nothing here

software geek

17.6.07

SVD is dead, long live SVD

Simon Funk (a pen name) pretty much single-handedly raised the bar on what can be considered "barely competent" collaborative filtering when he published a speedy, stable and more or less correct SVD solution to the Netflix Prize. It's a huge contribution to the community and, although it hasn't begotten a succession of disclosures regarding incremental improvements to the algorithm (presumably made by mucking around in Simon's postulated non-linear "G" function), it's clear from the incredible volume of scores centered on Simon's original RMSE of .90 just how important this contribution is.

I wish I had something as fundamentally interesting to contribute to the discussion, but at this time I'm only just getting back to working on the prize after a pretty rough semester at fake grad school. What I can say is that as nice a solution as the SVD with non-observable regression variables is, it's incomplete, and probably not extendable into the winning solution space. This isn't conclusive, of course, and there's a lot of work that could be done to prove me right or wrong: backing out probability intervals on the discovered regression variables, and working out what introducing observable regression variables alongside the unobservable ones does (I'm guessing that's how Simon moved up to .89, or maybe he mixed his results with yet another team).
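For anyone who hasn't played with it, here's a rough sketch of the kind of model I mean by "SVD with non-observable regression variables": every user and every movie gets a short vector of latent factors, and a rating is predicted as a clipped linear combination of them. This is not Simon's code. The global-mean offset, the hyperparameters, and training all factors at once (rather than one feature at a time) are my own simplifying assumptions, and the function names are made up for illustration.

```python
import random
from math import sqrt


def train_factors(ratings, n_users, n_movies, n_factors=2,
                  lr=0.02, reg=0.02, epochs=300, seed=0):
    """Fit rating ~ mean + dot(U[u], M[m]) by stochastic gradient descent.

    ratings is a list of (user_index, movie_index, rating) triples.
    All hyperparameter defaults are illustrative guesses, not tuned values.
    """
    rng = random.Random(seed)
    # Small random starting values for the unobserved (latent) variables.
    U = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_users)]
    M = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _ in range(n_movies)]
    mean = sum(r for _, _, r in ratings) / len(ratings)

    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - (mean + sum(a * b for a, b in zip(U[u], M[m])))
            for k in range(n_factors):
                uk, mk = U[u][k], M[m][k]
                # Gradient step on squared error with L2 shrinkage.
                U[u][k] += lr * (err * mk - reg * uk)
                M[m][k] += lr * (err * uk - reg * mk)
    return mean, U, M


def predict(mean, U, M, u, m, lo=1.0, hi=5.0):
    """Linear combination of factors, clipped to the rating scale."""
    raw = mean + sum(a * b for a, b in zip(U[u], M[m]))
    return max(lo, min(hi, raw))


def training_rmse(mean, U, M, ratings):
    """Root mean squared error of the clipped predictions."""
    sq = sum((r - predict(mean, U, M, u, m)) ** 2 for u, m, r in ratings)
    return sqrt(sq / len(ratings))
```

Note that the only non-linearity anywhere in this sketch is the clipping in `predict`, which is exactly the assumption I'm grumbling about.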

The real problem, however, is the assumption that components add up linearly to compose an ultimate value. There are a couple of takes on this, but the combination of uniform density, clipping at max and min values, and linear combinations seems destined only to adequately represent the average user's reaction to mediocre movies. There's a ton of work on how responsive individuals really are to "quality" and none of it suggests that the reaction is in any way linear. Furthermore, that isn't really the interesting bit of the question: "double averaging", or averaging the average score of a user with the average score of the movie, yields results that really aren't that much more wrong than the SVD results (.97ish vs. .90ish).
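To be concrete about what "double averaging" means, here's a minimal sketch. The function names and the fallback to the global mean for unseen users or movies are my own assumptions for illustration; the tiny data set in the usage note is made up and says nothing about the .97ish figure.

```python
from collections import defaultdict
from math import sqrt


def double_average_model(ratings):
    """Build a predictor from (user, movie, rating) triples.

    The prediction is the plain average of the user's mean rating and the
    movie's mean rating. Unseen users or movies fall back to the global
    mean (an assumption of this sketch).
    """
    user_totals = defaultdict(lambda: [0.0, 0])
    movie_totals = defaultdict(lambda: [0.0, 0])
    total = 0.0
    for user, movie, rating in ratings:
        user_totals[user][0] += rating
        user_totals[user][1] += 1
        movie_totals[movie][0] += rating
        movie_totals[movie][1] += 1
        total += rating

    global_mean = total / len(ratings)
    user_mean = {u: s / n for u, (s, n) in user_totals.items()}
    movie_mean = {m: s / n for m, (s, n) in movie_totals.items()}

    def predict(user, movie):
        um = user_mean.get(user, global_mean)
        mm = movie_mean.get(movie, global_mean)
        return (um + mm) / 2.0

    return predict


def rmse(predict, ratings):
    """Root mean squared error of a predictor over rating triples."""
    sq = sum((r - predict(u, m)) ** 2 for u, m, r in ratings)
    return sqrt(sq / len(ratings))
```

With ratings `[("a", "x", 4), ("a", "y", 2), ("b", "x", 5)]`, user "a" averages 3 and movie "x" averages 4.5, so `predict("a", "x")` comes out at 3.75.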

So let's pretend for a second that you figured out which of the seven or so decent distribution choices to use to represent user responsiveness to quality. Now you've got to figure out how you're going to handle the issue of self-selection. This arises because, well, people can only rate what they've seen (well, honest people) and people only see things they have a reasonable shot at liking (well, normal people). Salon's Machinist, Farhad Manjoo, writes this up better than I can. So now you have two interesting distribution issues that I can't wait to get cracking on: what's the distribution of a user's knowledge of the general responsiveness they will have to the movie's quality, and what's the distribution of the user's probability of even trying this movie out? But wait, don't start plugging in your hierarchical models just yet, because you only care about these distributions during the training (regression) of your model. During the prediction phase users don't get a chance to self-select, so you should just spit out the results.
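If the self-selection point seems abstract, a toy simulation makes it visible. Everything here is an assumption made up for illustration: true reactions are drawn from a normal distribution clipped to the 1-5 scale, each user has a noisy guess at how much they'll like a movie, and they only watch (and rate) when that guess clears a threshold. None of these choices come from the Netflix data.

```python
import random


def simulate_self_selection(n=20000, threshold=3.5, seed=0):
    """Compare the mean of all true reactions with the mean of rated ones.

    Returns (population_mean, observed_mean, fraction_rated).
    """
    rng = random.Random(seed)
    all_reactions = []
    rated = []
    for _ in range(n):
        # What the user would actually think of the movie if they saw it.
        true_rating = min(5.0, max(1.0, rng.gauss(3.0, 1.0)))
        # The user's imperfect advance knowledge of that reaction.
        guess = true_rating + rng.gauss(0.0, 1.0)
        all_reactions.append(true_rating)
        # Self-selection: only movies that look promising get watched.
        if guess > threshold:
            rated.append(true_rating)
    population_mean = sum(all_reactions) / len(all_reactions)
    observed_mean = sum(rated) / len(rated)
    return population_mean, observed_mean, len(rated) / n
```

Under these made-up assumptions the ratings you get to train on are skewed upward relative to what the full population would have said, which is the bias the two distributions above are meant to capture.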

(Well, maybe. The Netflix corpus kinda sucks with regard to little things like just who made what recommendations under what conditions and frankly, though what I said holds true for a real production system, it's entirely likely that all of the inputs in the Netflix corpus, including the holdout set, are self-selected reviews subject to the previously mentioned fun and games with hierarchical modelling.)

Like other posts, this one keeps growing; come back for results and more thoughts.
