COLLABORATIVE FILTERING (CF) APPROACHES
The CF technique can be classified into two categories, as shown in Figure 8.2 (Su and Khoshgoftaar, 2009):
i. Model-Based CF: It uses machine learning (ML) algorithms, such as Bayesian networks, clustering, and rule-based approaches, to build a model from the user-item rating dataset and then recommends items to the user.
ii. Neighborhood/Memory-Based CF: Similarity computation and prediction computation are the two major steps used in this category of CF.
FIGURE 8.2 Collaborative filtering techniques.
WORKING PRINCIPLE OF NEIGHBORHOOD-BASED COLLABORATIVE FILTERING (CF)
Figure 8.3 shows the conceptual framework of neighborhood-based CF (Yang et al., 2016). Neighborhood CF defines the closest neighbors using the following two algorithms:
- User-Based CF Algorithm: A user similarity metric is used to find the nearest neighbors. The rating values of these neighbors and their similarity values are then used to predict the user's unrated items and to form the Top-n recommendation list.
- Item-Based CF Algorithm: The nearest neighbors are determined using the similarity values of items. These similarity values and the rating values of the neighbors are then used to form the recommendation list for the user.
FIGURE 8.3 A conceptual framework for neighborhood-based collaborative filtering.
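The item-based procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the toy `ratings` data, the function names (`item_similarity`, `predict`, `top_n`), the choice of cosine similarity over co-rating users, the weighted-sum prediction rule, and the neighborhood size `k` are all introduced here for illustration.

```python
from math import sqrt

# Toy explicit ratings on a 1-5 scale: ratings[user][item] (hypothetical data).
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
    "dave":  {"B": 4, "C": 2, "D": 1},
}


def item_similarity(i, j):
    """Cosine similarity between items i and j over users who rated both."""
    common = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    norm_i = sqrt(sum(ratings[u][i] ** 2 for u in common))
    norm_j = sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (norm_i * norm_j)


def predict(user, item, k=2):
    """Assumed weighted-sum rule: average the user's ratings of the k items
    most similar to `item`, weighted by similarity."""
    neighbors = sorted(
        ((item_similarity(item, j), r) for j, r in ratings[user].items()),
        reverse=True,
    )[:k]
    weight = sum(abs(s) for s, _ in neighbors)
    if weight == 0:
        return 0.0
    return sum(s * r for s, r in neighbors) / weight


def top_n(user, n=2):
    """Rank the items the user has not rated by predicted rating."""
    items = {i for r in ratings.values() for i in r}
    unrated = items - set(ratings[user])
    scored = [(predict(user, i), i) for i in sorted(unrated)]
    return [i for _, i in sorted(scored, reverse=True)[:n]]
```

Calling `top_n("alice")` ranks the only item Alice has not rated, `"D"`, using her ratings of the two items most similar to it. The user-based variant follows the same pattern with the roles of users and items swapped.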
Table 8.2 shows the descriptions of the notations used in this chapter.
DATA EXTRACTION METHODS USED IN COLLABORATIVE FILTERING (CF)
CF uses ratings in the recommendation process. Two types of ratings are used in CF for recommendation: explicit ratings and implicit ratings (Li et al., 2018).
TABLE 8.2 Notations and Their Descriptions
Notation | Description
$Sim(i,j)$ | Similarity between two items i and j
$R_{u,i}$ | Rating value of user u on item i
$\bar{R}_{u}$ | Average or mean rating value of user u
$\lvert U \rvert$ | Number of users u who have rated both items i and j
$P_{u,i}$ | Predicted rating value of user u on item i
$\bar{R}_{i}$ | Average or mean rating value of item i
1. Explicit: These are the specific ratings that a user gives to a product (for example, a user rates a book 3 on a scale of 1 to 5). Explicit ratings are used directly to extract users' interests for future recommendations. The disadvantage of explicit data is that it makes users responsible for data collection, and thus for future rating prediction, yet users rarely take the trouble to rate a particular item.
2. Implicit: These ratings are collected by logging the data that users generate while browsing the website. Implicit data are easier to collect because they put no pressure on users to rate the products on the site. However, implicit ratings are difficult to work with because it is hard to infer users' preferences from the collected browsing data.
Using these collected ratings (explicit or implicit), RSs predict the unknown ratings of the user based on different similarity metrics, and these predicted ratings are then used in the recommendation process.
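As a rough illustration of how logged browsing data might be turned into rating-like values, the sketch below aggregates events per user and item. Everything specific in it is hypothetical and not taken from the chapter: the event names, the weights in `EVENT_WEIGHTS`, the sample `log`, and the rule of summing weights and capping them at the top of a 1-to-5 scale.

```python
from collections import defaultdict

# Hypothetical event weights: stronger actions are assumed to signal
# stronger interest. These values are illustrative, not from the source.
EVENT_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

# Hypothetical browsing log: (user, item, event) tuples.
log = [
    ("u1", "book", "view"),
    ("u1", "book", "view"),
    ("u1", "book", "purchase"),
    ("u1", "lamp", "view"),
    ("u2", "lamp", "add_to_cart"),
]


def implicit_ratings(events, max_rating=5.0):
    """Sum event weights per (user, item) pair and cap the total at max_rating."""
    scores = defaultdict(float)
    for user, item, event in events:
        # Unknown event types contribute nothing.
        scores[(user, item)] += EVENT_WEIGHTS.get(event, 0.0)
    return {key: min(score, max_rating) for key, score in scores.items()}


pseudo_ratings = implicit_ratings(log)
```

The resulting `pseudo_ratings` dictionary can then be fed to the same similarity and prediction steps as explicit ratings, although any such mapping is a modelling choice that should be validated against real user behavior.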
SIMILARITY METRICS USED IN COLLABORATIVE FILTERING (CF) ALGORITHMS
Various similarity metrics are used in CF to find the nearest neighbors and similarity values (Sarwar et al., 2001; Bilge and Kaleli, 2014; Bobadilla et al., 2012). The metrics used in item-based CF are:
1. Cosine Similarity (CS): It quantifies the similarity between two samples as the cosine of the angle between their rating vectors. The similarity values lie in the range [-1, 1], where 1 indicates maximum similarity and -1 indicates the lowest similarity. CS between two items i and j is calculated using:
$$Sim(i,j) = \cos(\vec{i}, \vec{j}) = \frac{\vec{i} \cdot \vec{j}}{\lVert \vec{i} \rVert \, \lVert \vec{j} \rVert}$$
Here, $\vec{i} \cdot \vec{j}$ denotes the dot product of the rating vectors of the two items.
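2. Adjusted Cosine Similarity (ACS): It is similar to cosine similarity but also accounts for each user's individual rating behavior. To achieve this, it subtracts the user's average rating from each of that user's ratings, which puts the ratings of different users on a uniform scale. It is computed by:
$$Sim(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_{u})(R_{u,j} - \bar{R}_{u})}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_{u})^{2}} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_{u})^{2}}}$$
Here, $U$ is the set of users who have rated both items, $R_{u,i}$ and $R_{u,j}$ are the rating values of user $u$ on the two items $i$ and $j$, respectively, and $\bar{R}_{u}$ is the average rating value of user $u$.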
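3. Pearson Correlation (PC): It is the most popular similarity metric and is widely used in various experiments. Its similarity values lie in the range [-1, 1], where 1 indicates maximum similarity and -1 indicates the lowest similarity. The similarity using PC in the item-based CF algorithm is:
$$Sim(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_{i})(R_{u,j} - \bar{R}_{j})}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_{i})^{2}} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_{j})^{2}}}$$
Here, $U$ is the set of users who have rated both items, and $\bar{R}_{i}$ and $\bar{R}_{j}$ are the mean rating values of the two items $i$ and $j$, respectively.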
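4. Jaccard Similarity (JS): It considers only the number of ratings that the items have in common, regardless of the absolute rating values. It is calculated by [1]:
$$Sim(i,j) = \frac{\lvert U_{i} \cap U_{j} \rvert}{\lvert U_{i} \cup U_{j} \rvert}$$
Here, $U_{i}$ and $U_{j}$ are the sets of users who have rated items $i$ and $j$, respectively.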
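5. Spearman Correlation (SC): It is calculated just like PC, but it uses the ranks of the rating values instead of the rating values themselves. The equation for calculating the similarity value by SC is as follows:
$$Sim(i,j) = \frac{\sum_{u \in U} (k_{u,i} - \bar{k}_{i})(k_{u,j} - \bar{k}_{j})}{\sqrt{\sum_{u \in U} (k_{u,i} - \bar{k}_{i})^{2}} \, \sqrt{\sum_{u \in U} (k_{u,j} - \bar{k}_{j})^{2}}}$$
Here, $U$ is the set of users who have rated both items, $k_{u,i}$ and $k_{u,j}$ are the ranks of the rating values of user $u$ on items $i$ and $j$, respectively, and $\bar{k}_{i}$ and $\bar{k}_{j}$ denote the average ranks of items $i$ and $j$, respectively.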
6. Euclidean Distance (ED): It is the square root of the sum of the squared differences between the individual ratings of the two samples being compared. The distance gives an insight into how different the rating patterns are:
$$ED(i,j) = \sqrt{\sum_{u \in U} (R_{u,i} - R_{u,j})^{2}}$$
Here, $U$ is the set of users who have rated both items $i$ and $j$.
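7. Manhattan Distance (MD): It is the sum of the absolute differences between the individual ratings of the two items. The equation to find similarity using MD is given below:
$$MD(i,j) = \sum_{u \in U} \lvert R_{u,i} - R_{u,j} \rvert$$
Here, $U$ is the set of users who have rated both items $i$ and $j$.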
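8. Mean Squared Distance (MSD): It is similar to ED; the only difference is that the Euclidean distance is squared, which removes the square root and makes the calculation easier. The equation of MSD for calculating the similarity value is shown by:
$$MSD(i,j) = \frac{1}{\lvert U \rvert} \sum_{u \in U} (R_{u,i} - R_{u,j})^{2}$$
Here, $U$ is the set of users who have rated both items $i$ and $j$, and $\lvert U \rvert$ is their number, so the squared distance is averaged over the common ratings.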
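The eight metrics listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the toy `ratings` data and all function names are hypothetical, every metric except Jaccard is computed only over users who rated both items, the item means in `pearson` are taken over those co-rating users, and `mean_squared` averages the squared differences over the number of co-rating users. Note that the last three functions return distances, so smaller values mean more similar items.

```python
from math import sqrt

# Hypothetical ratings[user][item] on a 1-5 scale (toy data for illustration).
ratings = {
    "u1": {"i": 5, "j": 4, "k": 1},
    "u2": {"i": 3, "j": 3, "k": 5},
    "u3": {"i": 4, "j": 5},
    "u4": {"i": 1, "j": 2, "k": 4},
    "u5": {"k": 2},
}


def co_ratings(i, j):
    """Return (R_ui, R_uj, mean rating of u) for each user who rated both items."""
    return [
        (r[i], r[j], sum(r.values()) / len(r))
        for r in ratings.values()
        if i in r and j in r
    ]


def normalized_dot(xs, ys):
    """Dot product divided by the product of the vector norms (0.0 if undefined)."""
    den = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return sum(x * y for x, y in zip(xs, ys)) / den if den else 0.0


def centered(xs):
    """Subtract the mean of xs from every element."""
    mean = sum(xs) / len(xs) if xs else 0.0
    return [x - mean for x in xs]


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda n: xs[n])
    out = [0.0] * len(xs)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and xs[order[end + 1]] == xs[order[pos]]:
            end += 1
        for n in order[pos:end + 1]:
            out[n] = (pos + end) / 2 + 1
        pos = end + 1
    return out


def cosine(i, j):
    rows = co_ratings(i, j)
    return normalized_dot([a for a, _, _ in rows], [b for _, b, _ in rows])


def adjusted_cosine(i, j):
    # Subtract each user's own mean rating before comparing.
    rows = co_ratings(i, j)
    return normalized_dot([a - m for a, _, m in rows], [b - m for _, b, m in rows])


def pearson(i, j):
    # Subtract each item's mean rating (over co-rating users) before comparing.
    rows = co_ratings(i, j)
    return normalized_dot(centered([a for a, _, _ in rows]),
                          centered([b for _, b, _ in rows]))


def spearman(i, j):
    # Pearson correlation applied to the ranks of the ratings.
    rows = co_ratings(i, j)
    return normalized_dot(centered(ranks([a for a, _, _ in rows])),
                          centered(ranks([b for _, b, _ in rows])))


def jaccard(i, j):
    users_i = {u for u, r in ratings.items() if i in r}
    users_j = {u for u, r in ratings.items() if j in r}
    union = users_i | users_j
    return len(users_i & users_j) / len(union) if union else 0.0


def euclidean(i, j):
    return sqrt(sum((a - b) ** 2 for a, b, _ in co_ratings(i, j)))


def manhattan(i, j):
    return sum(abs(a - b) for a, b, _ in co_ratings(i, j))


def mean_squared(i, j):
    rows = co_ratings(i, j)
    return sum((a - b) ** 2 for a, b, _ in rows) / len(rows) if rows else 0.0
```

For example, `jaccard("i", "k")` compares only which users rated each item, while `pearson("i", "j")` and `spearman("i", "j")` compare how the co-rating users' ratings and rating ranks vary together.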