Implemented in Apache Spark or Hadoop MrJob.
One interesting example to practice is the classical question: “The people you may know”. On apps similar to Facebook, you got users and their friends list as the followed. How to suggest user A and E to make friend to each other but not to User K? (based on User A and E get several common friends B, C, & D, and we assume only mutual friendship existed.)
A B, C, D, G #User A has user B/C/D/G as friends E B, C, D K L, J, G .....
Let’s take a look at the input again, since we assume only mutual friendship existed, the real input would extend to the following, how would you solve it?
A B, C, D, G B A, E #Because user B is user A and user E's friend. C A, E D A, E E B, C, D G A, K J K K L, J, G L K
The most straightforward approach is to iterate all users. For each user, compare his/her friends to all other users for suggestion. This approach is feasible if applied asynchronously because people adding friends at different time. However, if there’s ton’s of users/friends and you wanna to calculate all suggestions at once, a more scalable approach is required.
MapReduce is one of Google’s approaches for processing big data, and currently there are many implementations based on the idea, such as Apache Hadoop or Spark, etc. If we can find a way to count common friends, independently, we can split such big job to many workers and make it parallel.
What is the “independently” mean? During the processing of each line in the input, the straightforward approach needs to compare a line to another line. That is not independent. There is actually a way to count common friends of one line, without reading all other lines. In such way huge inputs can be split to different workers for parallel processing without inter communication in the beginning of the common friend counting. How is it possible? It’s based on one fact: A certain user’s friends all have one common friend, which is that certain user.
A B, C, D, G B & C have one common friend (A) B & D have one common friend (A) B & G have one common friend (A) C & D have one common friend (A) .....so on so forth....
Thus when reading line for user A, I first make all permutation pair within that line. There are two kinds of pair here:
[connected] Those who are already connected friends, and I assign their ‘common friends count’ as a very negative number -9999999999 as my personal label for whoever are already friends. Thus I won’t suggest friendship request for users who are already friends, since 9999999999 is bigger than the world population.
[commons] Those who have user A as the common friend, and I assigned these pair’s “common friends count” as one, since they have one common friend user A. Here is the example:
You may think, wow, why generate or map to so many ((pair), count) from one single line? Is it necessary? The point is such mapping process is independently line to line, getting their counts (-9999999999 or 1) doesn’t depend on reading any other lines. Thus the mapping process of huge input can split into many workers for parallel processing.
The next step is to use the pair as the key, sum up or reduce their counts. So after reduce, the sum of these pair from all three lines would like the following: Totally there are two pairs of (A, B), and the sum of the two pairs would be -19999999998. There would be 3 pairs of (A, E), and the sum of the three pairs would be 3. Please note here the summing/reduce process can be executed on different workers, the final reduced result won’t be changed.
Now we see the pattern: The reduced results, which is the counts, are the number of common friends. Don’t forget to filter the positive counts and the pair will be the suggestions with the common friend count. We just need to group the same leading users and suggest the most suggested pairs based on their counts.
Based on the same algorithm described above, here are my two implementations, by using two different frameworks. The first one is Python API to the cluster computing framework Apache Spark. It comes with many handy functions such as groupByKey() or takeOrdered() for grouping pair-counts or choosing the top pair-counts. You can check my PySpark code here. In addition, by using Ruby-Spark gem, I also coded a Ruby-Spark solution for you to compare.
My second implementation used MRJob, the Yelp’s Python API controlling Hadoop for MapReduce tasks. There are not too many built-in functions so I have hard time to implement the groupByKey() or top counts. After struggling here is my working code of groupByKey() in Yelp MrJob: