Looking for tips on a recommendation algorithm

jon724665 · Jan 22, 2013

Hi everyone,

Long time lurker, first time poster! I'm developing a website where users can read and participate in discussions, articles and questions. I'm currently developing an algorithm that will recommend to each user discussions, articles and questions that they would be interested in. I've written up what I've come up with so far. I'm not looking for anyone to spend too much time on this, just mostly looking for pointers. It'd be a great result to have someone say "ha! What are you doing? That thing will fall apart with more than 200 users!" etc. Please poke holes in it and make any suggestions you like. If you think the entire approach is wrong, please let me know if there's a particular area I should be researching to re-work it.

On a final note, I'm not a computer scientist and I have no experience with these kind of algorithms. Thanks in advance!

Aim:
To successfully recommend to a user discussions, articles and questions that he or she would be interested in reading. This algorithm is not intended to be a perfect solution. Rather, it is intended to be a solution until we can employ computer scientists to re-write it. Whilst functional, it is far from efficient in its current state (read: it may be slow and most likely won’t scale well).

Required definitions:
A category is a broad field - e.g. Mathematics, Physics, Biology, etc.
A sub-category is the next level - e.g. Pure Mathematics, Applied mathematics, etc.
A sub-sub-category is the next level again - e.g. Number Theory, Real analysis, etc.
A thread is any question, article or discussion - e.g. "is religion incompatible with science?"

The order is then category > sub-category > sub-sub-category > thread

Overview:
Each user is connected to a unique recommendation 'cluster' that feeds them the following recommendations: discussions, articles and questions to read.

The cluster a user is connected to is determined by their top 3 most viewed categories (e.g. Biology, Physics and Maths). That is, there is a cluster for Biology, Physics and Maths, a cluster for Biology, Physics and Sport, etc. Using our current number of categories, 22, this amounts to 1,540 clusters (22 choose 3).

Each cluster then uses information from the connected users to determine recommendations for those users. It will do so by recommending the most viewed threads (discussions, articles, questions, etc.) from all of its users combined. More specifically, to reduce computation it will identify the total most viewed sub-sub-categories from those users combined, then simply take the most viewed thread from each of those sub-sub-categories. Connections are therefore based on categories, but recommendations are based on sub-sub-categories.
Figure 1: http://imgur.com/An9cawZ

Consider the cluster above, C1 (cluster 1). Users A, B and C are connected to the Maths, Bio, Physics cluster because they each have these three as their top three most viewed categories. If both B and C have the sub-sub-category, Number Theory, then A will most likely receive a recommendation of the most viewed thread in Number Theory.

The particulars:
- Each cluster must have at least 40 users connected before it becomes operational - this will prevent recommendations that are unequally weighted (e.g. user A gets recommended everything user B reads).
- If a user’s primary (read: best) cluster is not operational, they will be connected to the next best cluster. They will remain here until their primary cluster becomes operational.
- A user’s cluster is determined by their most viewed categories. As this changes, their cluster changes.
- Every user connected to a particular cluster receives the same recommendations from that cluster. The assumption is that users with the same categories in their most viewed will be interested in similar things. The recommendation list will be filtered based on the users top sub-subs so that no redundant recommendations occur. For instance, a user which has a particular sub-sub in their most viewed list that is also a top sub-sub in the cluster will not be recommended anything from this sub-sub. This is based on the assumption there’s a good chance they have viewed the most viewed material in that sub-sub.
- We only show the top 20 recommendations at any point in time, then we load more when they scroll
- If we assume that there are 22 categories, an average of 20 sub-categories per category and 15 sub-sub-categories per sub-category, we have approx. 6,600 sub-sub-categories. This should be an overestimate.
- If we assume that there are 22 categories, then there are 1,540 clusters (22 choose 3).

The method:
Each user has a data profile detailing the following: http://imgur.com/td6ApWF

From this data, the user’s ‘cluster profile’ can be determined. To determine the cluster, we simply sort the views column and extract the top three most viewed sub-sub-categories; these constitute the cluster profile. This table will consist of a maximum of 26,400 (6,600 * 4) cells per user. If we don’t create cells until they are required and we make the assumption that in reality most users won’t view threads from more than 150 sub-sub-categories, we can use a working assumption of a maximum of 600 (150 * 4) cells per user. This will be even lower until they reach this working maximum.

In addition to the above, another site-wide table will be kept to determine what each cluster should recommend. It will appear as follows: http://imgur.com/rKHdjpB

Each time a user views a thread in a sub-sub-category, the cell corresponding to that sub-sub-category and that user’s cluster is incremented. That is, if a user that’s connected to cluster 1 views a thread in particle physics, the first cell is incremented by 1.

Cluster 1’s recommendations are then simply the highest viewed threads from the most viewed of these sub-sub-categories. Each user from cluster 1 will receive say the top 20 of these as recommendations.

Alternatively, we can do away with this table entirely. Instead of performing a cluster-check each time a thread is viewed and incrementing the site-wide table, we can utilise the data we already have. As we have already stored how many views a particular user has for each sub-sub-category, we can calculate the total number of views in a particular cluster for each sub-sub-category on the fly. That is, when cluster 1 needs to make a recommendation, it looks at each user connected to it, totals their views for each sub-sub-category, calculates the total views per sub-sub-category for the entire cluster and makes the recommendation based on that. Clearly, the trade-off here is saving storage in exchange for computation. The computation appears minimal in comparison to the storage however, so the trade-off appears worthwhile at first glance.

Thanks again for taking the time to read this. Any assistance would be greatly appreciated.

mfb · Jan 23, 2013

Hmmhmm... I see some problems with your approach, and I think they are hard to fix with that framework:

- If you base recommendations (leading to views) on views, you always get problems with positive feedback: It does not matter if a thread is interesting, if it gets recommended at some point it will keep getting recommended for this cluster unless the topic looks really boring. On the other hand, many interesting threads might not appear there, do not get views and continue to not appear there.
- Why those clusters? Is it relevant for recommendations in particle physics if the user is interested in sports?
- At the same point, why don't you use sub-(sub-)category data for users? A user interested in particle physics will be interested in different threads than a user interested in solid state physics - even if they share some general interest in sports.
- Why do you want to recommend just one thread per sub-sub-category? Why not all, or nearly all? A user interested in one thread in particle physics is usually interested in other threads in particle physics as well - probably more than in the most popular thread in solid state physics.
- at the same time, this can be done with simple subscriptions to subforums (or thread tags): "Get new threads in all subscribed forums" might be an interesting feature, and it is easy to implement.

- you can improve any algorithm with feedback from the user: "Was this recommendation good? Yes/No" - and a clever way to include this in future recommendations.With perfect knowledge what users visited and where they posted in the past: For a specific user A, for every category the user is interested in, look for other users B,C, ... with a similar profile within that category, check what they visited and where they posted, recomment those threads to A. If A visits that thread based on the recommendation, count this as interest (to recomment for other users with a similar profile), but with a low weight to avoid the feedback loop issue.

Looking for tips on a recommendation algorithm

FAQ: Looking for tips on a recommendation algorithm

What is a recommendation algorithm?

How do recommendation algorithms work?

What are the benefits of using recommendation algorithms?

What are some common types of recommendation algorithms?

How can a recommendation algorithm be evaluated?

Similar threads

Hot Threads

Recent Insights