Connected Attribute Lists for Improved Collaborative Filtering

The following idea applies to sites that use collaborative filtering, such as Netflix.

I propose that collaborative filtering sites quiz users on their beliefs, feelings, and opinions about many random things, much like a dating site or a “What X are YOU?” meme.  After quizzing users, the site can use their answers to build secondary attributes through feature extraction on the questionnaire across all users.  It then finds which secondary attributes correlate with primary attributes, through a feature extraction of users’ primary attributes that uses the secondary attributes as the training set.
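As a rough sketch of the first step (building a secondary attribute from the questionnaire), here is one possible form of feature extraction: the leading principal component of the answer matrix, found by power iteration.  The proposal above doesn’t name a specific technique, so this choice, the function name leading_component, and the sample answers are all assumptions made for illustration.

```python
from math import sqrt

def leading_component(rows, iterations=200):
    """Approximate the first principal direction of the answer matrix.

    rows[u][q] is user u's numeric answer to question q.
    Returns (loading per question, score per user).
    """
    n_users, n_questions = len(rows), len(rows[0])
    # Center each question so the component captures variation, not the mean.
    means = [sum(r[q] for r in rows) / n_users for q in range(n_questions)]
    centered = [[r[q] - means[q] for q in range(n_questions)] for r in rows]

    def project(vec):
        # Score of every user along the direction `vec`.
        return [sum(x * w for x, w in zip(row, vec)) for row in centered]

    loadings = [1.0] * n_questions  # arbitrary starting direction
    for _ in range(iterations):
        scores = project(loadings)
        # Multiply by X^T X, then renormalize (one power-iteration step).
        loadings = [sum(s * row[q] for s, row in zip(scores, centered))
                    for q in range(n_questions)]
        norm = sqrt(sum(w * w for w in loadings))
        if norm == 0:
            break
        loadings = [w / norm for w in loadings]
    return loadings, project(loadings)

# Made-up answers on a -2..2 scale: 4 users x 3 questions (illustrative only).
answers = [
    [2, 2, 1],
    [1, 1, -1],
    [-2, -1, 1],
    [-1, -2, -1],
]
loadings, secondary_attribute = leading_component(answers)
print(loadings)             # how much each question contributes
print(secondary_attribute)  # one extracted attribute value per user
```

Each user’s score along that component is a candidate secondary attribute, which could then be compared against primary attributes such as voting history.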

In other words, suppose a question asks “I think Ubuntu is better than OSX” and the options are “Strongly Agree?  Somewhat Agree?  No Opinion?  Somewhat Disagree?  Strongly Disagree?”  If I answer “Somewhat Agree”, then we make a comparison of the attributes.  Say we find an attribute we name the “Apple News Story Preference attribute”.  (Note that these attributes aren’t actually defined in the primary dataset; they are generated through feature extraction.)  My result may then show that I dislike Apple news stories, given my answer.  Or it may show I don’t hate them, because by providing an opinion (and not strongly agreeing), I showed that I actually care about technology in general.  (Therefore the answer “No Opinion” should be on a different dimension, and not part of the 1-5 scale that applies to preference.)
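To make the “different dimension” point concrete, here is a minimal sketch of one way to encode an answer as two numbers: whether the user holds an opinion at all, and which way it leans.  The signed -2..2 values and the name encode_answer are assumptions for illustration, not a fixed scheme.

```python
# Hypothetical mapping from answer text to a signed strength of agreement.
LEAN = {
    "strongly agree": 2,
    "somewhat agree": 1,
    "somewhat disagree": -1,
    "strongly disagree": -2,
}

def encode_answer(answer):
    """Return (has_opinion, lean) for one quiz answer.

    has_opinion is 1 if the user picked a side and 0 for "No Opinion".
    lean is the signed strength of agreement, or 0 when there is no opinion.
    """
    key = answer.strip().lower()
    if key == "no opinion":
        return (0, 0)
    return (1, LEAN[key])

print(encode_answer("Somewhat Agree"))
print(encode_answer("No Opinion"))
```

With this split, “cares about the topic” and “prefers one side” can each correlate with different primary attributes.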

In contrast, some quiz answers will NOT have correspondences, such as whether or not you agree with the statement “I like the letter 352.”  It is arbitrary and random, and few attributes in the primary attribute set will relate to it, so stating any opinion on this quiz question provides no adjustment to the recommendation algorithm.  Note that the correspondence of answers to primary set preferences will change as culture changes, so we shouldn’t compare 2008 answers to 2010 answers, etc.
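One way to decide whether a question corresponds to anything is to correlate its answers with a primary attribute and ignore the question when the correlation is weak.  The sketch below uses Pearson correlation with a cutoff of 0.3; both the measure and the cutoff are assumptions, and the sample data is invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def question_weight(answers, attribute, threshold=0.3):
    """Weight a question by |correlation|, or 0.0 if it falls below the cutoff."""
    strength = abs(pearson(answers, attribute))
    return strength if strength >= threshold else 0.0

# Invented data for six users (illustrative only).
apple_story_votes = [-1, -1, 1, 1, 0, 1]    # a primary attribute
ubuntu_answers    = [2, 1, -1, -2, 1, -2]   # "Ubuntu is better than OSX"
letter_answers    = [1, -1, -1, 1, 0, 0]    # "I like the letter 352"

print(question_weight(ubuntu_answers, apple_story_votes))
print(question_weight(letter_answers, apple_story_votes))
```

Because correspondences drift as culture changes, the weights would need to be recomputed over a recent window of answers rather than once.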

In summary, I want the recommendation algorithm to pair me with other people who have similar primary dataset attributes, and I believe we can use a secondary dataset to boost correspondences in a sparse matrix.
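Here is a small sketch of what that boost could look like: when two users share few items in the primary dataset, lean on their quiz similarity instead.  The cosine measure, the blending rule, and the full_trust parameter are all assumptions chosen to keep the example short.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity over the keys both dicts share; 0.0 if none overlap."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(a[k] ** 2 for k in shared))
    norm_b = sqrt(sum(b[k] ** 2 for k in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def blended_similarity(votes_a, votes_b, quiz_a, quiz_b, full_trust=5):
    """Blend primary and secondary similarity.

    The more items two users have both voted on (up to `full_trust`),
    the more the primary similarity counts; otherwise the quiz fills in.
    """
    overlap = len(votes_a.keys() & votes_b.keys())
    trust = min(overlap / full_trust, 1.0)
    return trust * cosine(votes_a, votes_b) + (1 - trust) * cosine(quiz_a, quiz_b)

# Two users with no stories in common but similar quiz answers (invented data).
alice_votes, bob_votes = {"story1": 1, "story2": -1}, {"story3": 1}
alice_quiz, bob_quiz = {"q1": 1, "q2": -2}, {"q1": 2, "q2": -1}

print(blended_similarity(alice_votes, bob_votes, alice_quiz, bob_quiz))
```

Without the quiz, these two users would have a similarity of zero simply because their votes never overlap.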


From a user interface standpoint, one random question could appear in the sidebar every day for users to vote up or down (either a random new one, or a random one from a set of 50).  This would give the learner a nice, steadily growing, less sparse dataset, and if you wanted to decay the strength of “old opinions” in the dataset, this’d be a great method.  Basically: “Which half of the reddit users online today am I like?  (And does that correspond to any attribute in the primary dataset?)”
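Decaying old opinions could be as simple as weighting each daily vote by its age.  This sketch assumes exponential decay with a 180-day half-life and up/down votes stored as +1/-1 alongside the day they were cast; none of those specifics come from the idea above.

```python
def decayed_weight(age_days, half_life_days=180.0):
    """Weight of an answer that is `age_days` old; halves every half-life."""
    return 0.5 ** (age_days / half_life_days)

def decayed_agreement(a, b, today, half_life_days=180.0):
    """Age-weighted agreement between two users, in the range -1..1.

    a and b map question id -> (vote, day_answered), with vote = +1 or -1.
    Each shared question is weighted by the age of its older answer.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for question in a.keys() & b.keys():
        vote_a, day_a = a[question]
        vote_b, day_b = b[question]
        weight = decayed_weight(today - min(day_a, day_b), half_life_days)
        weighted_sum += weight * vote_a * vote_b
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# Two users who disagreed a year ago but agree today (invented data).
me    = {"q1": (1, 0),  "q2": (1, 360)}
other = {"q1": (-1, 0), "q2": (1, 360)}

print(decayed_agreement(me, other, today=360))
```

The old disagreement still counts, but at a quarter of the weight of today’s agreement.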
