This is the second installment of a sequence on recommendation systems. The first is on smoothed (dampened) average ratings and can be found here.
How do we compare a movie with a single 5-star rating to one with many 4 and 5 ratings?
Say we want to take user ratings of movies and rank them. A simple approach would be to just look at the average rating of each movie, but that doesn’t feel right: different movies have different numbers of ratings, and we don’t want small samples to skew our estimate of the “true” rating that would arise if the whole population rated that movie. As suggested by Evan Miller, ranking by a lower confidence bound on the rating is one way to handle this. This is another non-personalized recommendation system. Let’s start by looking at the highest-rated movies from the MovieLens dataset on Kaggle.
Code
```python
import pandas as pd

# Import the data. ratings doesn't include movie names,
# so merge with ids to get names.
ratings = pd.read_csv("archive/rating.csv", parse_dates=['timestamp'])
ids = pd.read_csv("archive/movie.csv")
ratings = pd.merge(ratings, ids, on='movieId', how='left')

# Find each movie's mean rating
avg_ratings = ratings.groupby(['movieId', 'title'])['rating'].agg(
    avg_rating='mean', rating_count='count')
avg_ratings.sort_values(by='avg_rating', ascending=False, inplace=True)
avg_ratings.head()
```
| movieId | Title | Mean Rating | Number of Ratings |
|---|---|---|---|
| 117314 | Neurons to Nirvana (2013) | 5.0 | 1 |
| 117418 | Victor and the Secret of Crocodile Mansion (2012) | 5.0 | 1 |
| 117061 | The Green (2011) | 5.0 | 1 |
| 109571 | Into the Middle of Nowhere (2010) | 5.0 | 1 |
| 109715 | Inquire Within (2012) | 5.0 | 1 |
Not quite the blockbusters or classics we were expecting. We’ll soon see that there are a bunch of movies that received just one rating of 5 stars, so they top the naive ranking. We already did a touch of EDA in the previous post, so let’s dive right into the rating method.
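To see the failure mode in miniature, here’s a toy example (made-up titles and ratings, not from the MovieLens data) showing how a single perfect rating beats a consistently well-rated movie under a naive mean sort:

```python
import pandas as pd

# Toy data illustrating the problem: under a naive mean sort, one movie
# with a single 5-star rating outranks a widely loved movie.
toy = pd.DataFrame({
    'title': ['Obscure Gem', 'Classic', 'Classic', 'Classic', 'Classic'],
    'rating': [5.0, 4.5, 5.0, 4.5, 4.0],
})
toy_avg = (toy.groupby('title')['rating']
           .agg(avg_rating='mean', rating_count='count')
           .sort_values('avg_rating', ascending=False))
print(toy_avg)  # 'Obscure Gem' (n=1, mean 5.0) ranks above 'Classic' (n=4, mean 4.5)
```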
The Wilson score interval (which we’ll apply later) is meant for binary data, but we can adapt the technique by converting our ratings (which range from 0.5 to 5) into fractional amounts of positive and negative votes. Specifically, a 0.5-star rating counts as 1 negative vote and no positive votes, a 1-star rating counts as 1/9 of a positive vote and 8/9 of a negative vote, and so on.
| Rating (Stars) | Positive votes | Negative votes |
|---|---|---|
| 0.5 | 0 | 1 |
| 1 | 0.111 | 0.889 |
| 1.5 | 0.222 | 0.778 |
| 2 | 0.333 | 0.667 |
| 2.5 | 0.444 | 0.556 |
| 3 | 0.556 | 0.444 |
| 3.5 | 0.667 | 0.333 |
| 4 | 0.778 | 0.222 |
| 4.5 | 0.889 | 0.111 |
| 5 | 1 | 0 |
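The table above is just a linear rescaling of the star rating onto [0, 1]. A minimal sketch (the function name `to_votes` is mine, not from the post’s code):

```python
# A rating r in [0.5, 5] maps to (r - 0.5) / 4.5 positive votes,
# with the remainder counted as negative votes.
def to_votes(rating):
    pos = (rating - 0.5) / 4.5
    return pos, 1 - pos

for r in [0.5, 1.0, 3.0, 5.0]:
    pos, neg = to_votes(r)
    print(f"{r} stars -> {pos:.3f} positive, {neg:.3f} negative")
```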
Bernoulli Confidence Interval
The idea here is that instead of taking the rating to be the mean over the \(N_j\) ratings of movie \(j\), (i.e. \(r_j = \sum_i X_i/N_j\)), we can look at a confidence interval of a certain level of confidence \(1-\alpha\) and take its lower bound to be our “rating.” Let
\(N_j\) be the number of ratings of movie \(j\),
\(\bar{X} = \frac{1}{N_j} \sum_i X_i\), and
\(t_{\alpha/2}\) be the \(1-\alpha/2\) quantile of the Student’s t-distribution with \(N_j - 1\) degrees of freedom, and \(s\) the sample standard deviation of the ratings.
The lower-bound rating is then \(\bar{X} - t_{\alpha/2}\, s / \sqrt{N_j}\).
This approach mitigates the issue of overrated movies being recommended. The lower bound is a biased estimator of the true underlying rating; it underrates all movies, with the bias less severe for movies with more reviews. Movies with few reviews have wider confidence intervals, so their ratings tend to be pulled lower. As a movie accumulates ratings, its interval narrows and its lower bound approaches the true rating. We’ll apply the method with \(\alpha = 0.05\), though this is a hyperparameter that can be tuned.
Code
```python
from scipy import stats
import pandas as pd

# Step 1: Define a function to compute the CI
def compute_confidence_interval(group, confidence=0.95):
    n = group.count()
    mean = group.mean()
    std_err = stats.sem(group, nan_policy='omit')
    h = std_err * stats.t.ppf((1 + confidence) / 2., n - 1) if n > 1 else 0
    lower = mean - h
    upper = mean + h
    return pd.Series({'avg_rating': mean,
                      'rating_count': n,
                      'ci_lower': lower,
                      'ci_upper': upper})
```
Code
```python
# Apply the function after grouping
CI_ratings = (ratings.groupby(['movieId', 'title'])['rating']
              .apply(compute_confidence_interval)
              .unstack()
              .reset_index())
```
Code
```python
# Sort by the lower confidence bound and take a look at the top-ranked movies
CI_ratings.sort_values(by='ci_lower', ascending=False, inplace=True)
CI_ratings.head()
```
| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Black Girl (La noire de…) (1966) | 5.0 | 1.0 | 5.0 | 5.0 |
| Hatful of Rain, A (1957) | 5.0 | 1.0 | 5.0 | 5.0 |
| Perfect Plan, A (Plan parfait, Un) (2012) | 5.0 | 1.0 | 5.0 | 5.0 |
| Rebels of the Neon God (Qing shao nian nuo zha) (1992) | 5.0 | 1.0 | 5.0 | 5.0 |
| Examined Life (2008) | 5.0 | 1.0 | 5.0 | 5.0 |
Hmm… still not what we might expect from the “top” movies. Keen observers will notice that when \(N_j = 1\), the confidence interval reduces to a point, so movies with a single 5-star rating have a lower confidence bound of exactly 5. We can work around this by only considering movies with at least a certain number of ratings, or movies with at least two distinct rating values. When we require at least two ratings, we still have the same issue as before:
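The minimum-ratings filter itself is a one-liner. A sketch using a small made-up frame shaped like `CI_ratings` above (the titles and numbers are illustrative, not from the dataset):

```python
import pandas as pd

# Frame mimicking CI_ratings (columns: title, avg_rating, rating_count, ci_lower)
ci = pd.DataFrame({
    'title': ['One perfect rating', 'Popular favorite', 'Niche film'],
    'avg_rating': [5.0, 4.4, 4.6],
    'rating_count': [1, 500, 12],
    'ci_lower': [5.0, 4.35, 4.1],
})
min_ratings = 15  # the threshold is a tunable choice
ranked = (ci[ci['rating_count'] >= min_ratings]
          .sort_values('ci_lower', ascending=False))
print(ranked[['title', 'ci_lower']])  # only 'Popular favorite' survives the filter
```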
| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Dream of Light (a.k.a. Quince Tree Sun, The) (Sol del membrillo, El) (1992) | 5.00 | 2.0 | 5.000 | 5.000 |
| Little Mermaid: Ariel’s Beginning, The (2008) | 5.00 | 2.0 | 5.000 | 5.000 |
| Absolute Giganten (1999) | 5.00 | 2.0 | 5.000 | 5.000 |
| Repentance (Monanieba) (1984) | 4.75 | 2.0 | 1.573 | 7.927 |
| Year of the Hare, The (Jäniksen vuosi) (1977) | 4.75 | 4.0 | 4.291 | 5.209 |
The issue persists even as we raise the minimum number of reviews to 15:
| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.450 | 7837.0 | 4.434 | 4.466 |
| Heart of a Dog (Sobachye serdtse) (1988) | 4.367 | 15.0 | 4.045 | 4.689 |
| Cowboy Bebop (1998) | 4.364 | 22.0 | 4.064 | 4.663 |
| Godfather, The (1972) | 4.357 | 5162.0 | 4.334 | 4.380 |
| Usual Suspects, The (1995) | 4.354 | 5873.0 | 4.335 | 4.373 |
If we raise the bar to at least 62 reviews, I finally recognize the top 5:
| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.450 | 7837.0 | 4.434 | 4.466 |
| Godfather, The (1972) | 4.357 | 5162.0 | 4.334 | 4.380 |
| Usual Suspects, The (1995) | 4.354 | 5873.0 | 4.335 | 4.373 |
| Schindler’s List (1993) | 4.302 | 6309.0 | 4.281 | 4.322 |
| Band of Brothers (2001) | 4.285 | 550.0 | 4.212 | 4.357 |
Wilson Score Interval
Using the t-distribution has other technical issues (our votes are bounded proportions, not normally distributed samples), and a more appropriate way of computing the interval is the Wilson score interval. Again, the idea is that instead of taking the rating to be the mean over the \(N_j\) ratings of movie \(j\) (i.e. \(r_j = \sum_i X_i/N_j\)), we look at a confidence interval at confidence level \(1-\alpha\) and take its lower bound as our “rating.” Let
\(N_j\) be the number of ratings of movie \(j\),
\(\bar{X} = \frac{1}{N_j} \sum_i X_i\), and
\(z_{\alpha/2}\) be the \(1-\alpha/2\) quantile of the standard normal distribution, and \(\hat{p} = \bar{X}\) the fraction of positive votes.
The Wilson lower bound is then \(\left(\hat{p} + \frac{z_{\alpha/2}^2}{2N_j} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{N_j} + \frac{z_{\alpha/2}^2}{4N_j^2}}\right) \Big/ \left(1 + \frac{z_{\alpha/2}^2}{N_j}\right)\).
As before, ranking by the lower bound mitigates the recommendation of overrated movies: the estimator is biased low, with the bias less severe for movies with more reviews, and the interval narrows toward the true rating as ratings accumulate. We’ll again use \(\alpha = 0.05\).
Code
```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Convert ratings to fractional positive (0 to 1 scale)
ratings['fractional_positive'] = (ratings['rating'] - 0.5) / 4.5

def wilson_score_interval(pos, n, confidence=0.95):
    if n == 0:
        return (0.0, 0.0)
    z = norm.ppf(1 - (1 - confidence) / 2)
    phat = pos / n
    denominator = 1 + z**2 / n
    centre = phat + z**2 / (2 * n)
    margin = z * np.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)
    lower = (centre - margin) / denominator
    upper = (centre + margin) / denominator
    return lower, upper

def compute_fractional_wilson(group):
    n = group['fractional_positive'].count()
    pos = group['fractional_positive'].sum()
    lower, upper = wilson_score_interval(pos, n)
    return pd.Series({'fractional_positive_mean': pos / n if n > 0 else np.nan,
                      'rating_count': n,
                      'wilson_lower': lower,
                      'wilson_upper': upper})

# Group by movieId and name
wilson_CI_ratings = (ratings.groupby(['movieId', 'title'])
                     .apply(compute_fractional_wilson)
                     .reset_index())
```
Code
wilson_CI_ratings.head()
Code
```python
# Our top movies
wilson_CI_ratings.sort_values(by='wilson_lower', ascending=False, inplace=True)
wilson_CI_ratings.head(10)
```
| Title | LCB Rating | Rating Count | Fractional Positive Mean | Wilson Lower | Wilson Upper |
|---|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.418 | 4461.0 | 0.880 | 0.871 | 0.890 |
| Godfather, The (1972) | 4.336 | 2988.0 | 0.865 | 0.852 | 0.877 |
| Usual Suspects, The (1995) | 4.313 | 3391.0 | 0.859 | 0.847 | 0.871 |
| Schindler’s List (1993) | 4.252 | 3608.0 | 0.846 | 0.834 | 0.857 |
| Godfather: Part II, The (1974) | 4.221 | 2005.0 | 0.843 | 0.827 | 0.859 |
| Casablanca (1942) | 4.207 | 1784.0 | 0.842 | 0.824 | 0.858 |
| One Flew Over the Cuckoo’s Nest (1975) | 4.171 | 2119.0 | 0.832 | 0.816 | 0.848 |
| Rear Window (1954) | 4.164 | 1277.0 | 0.836 | 0.814 | 0.855 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 4.158 | 3187.0 | 0.826 | 0.813 | 0.839 |
| Fight Club (1999) | 4.147 | 2850.0 | 0.825 | 0.811 | 0.838 |
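To see why the Wilson lower bound handles small samples gracefully without any minimum-ratings filter, here’s a quick sanity check. It reimplements the same formula as `wilson_score_interval` above as a standalone function (hardcoding \(z \approx 1.96\) for a 95% interval instead of calling scipy):

```python
import math

# Standalone Wilson lower bound for pos positive votes out of n.
def wilson_lower(pos, n, z=1.96):
    if n == 0:
        return 0.0
    phat = pos / n
    centre = phat + z**2 / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z**2 / (4 * n)) / n)
    return (centre - margin) / (1 + z**2 / n)

# One perfect vote vs. 90 positives out of 100:
small = wilson_lower(1, 1)      # a single perfect vote is heavily discounted
large = wilson_lower(90, 100)   # many mostly-positive votes barely are
print(small, large)
```

A single perfect vote gets a lower bound near 0.21, while 90/100 positives stay above 0.82, so the well-sampled movie ranks far higher.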
The top 5 using this method are the same as the top 5 we saw from the dampened-mean approach with higher values of \(\lambda\), but the next 5 differ more, with Casablanca being the only movie on both lists. The dampened-mean approach’s top 10 is the table below:
| Title | Dampened Mean | Ratings Sum | Number of Ratings |
|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.447 | 281788.0 | 63366 |
| Godfather, The (1972) | 4.365 | 180503.5 | 41355 |
| Usual Suspects, The (1995) | 4.334 | 203741.5 | 47006 |
| Schindler’s List (1993) | 4.310 | 215741.5 | 50054 |
| Godfather: Part II, The (1974) | 4.275 | 117144.0 | 27398 |
| Seven Samurai (Shichinin no samurai) (1954) | 4.274 | 49627.5 | 11611 |
| Rear Window (1954) | 4.271 | 74530.5 | 17449 |
| Band of Brothers (2001) | 4.262 | 18353.0 | 4305 |
| Casablanca (1942) | 4.258 | 103686.0 | 24349 |
| Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.256 | 27776.5 | 6525 |
Finally, here’s a visualization that shows both the original rating (blue) and the lower confidence bound (LCB) rating (orange) with \(\alpha = .05\), as well as the difference between the two (gray line). The visualization covers only a sample of 50 movies so it doesn’t get too cluttered. You can see that movies with fewer ratings are generally pulled further to the left, while those with more ratings budge less, but this pattern isn’t guaranteed by the method: a movie with few but consistent ratings can have a narrower confidence interval than a movie rated many times with inconsistent ratings.
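A plot along those lines might be produced like this. This is a sketch with synthetic data standing in for the 50-movie sample; the variable names and output filename are my own, not from the post’s code:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Synthetic means and interval half-widths standing in for real movies
rng = np.random.default_rng(0)
n = 50
means = rng.uniform(2.5, 4.8, n)
widths = rng.uniform(0.05, 1.2, n)
lcb = means - widths
order = np.argsort(lcb)

fig, ax = plt.subplots(figsize=(6, 10))
y = np.arange(n)
ax.hlines(y, lcb[order], means[order], colors='gray', alpha=0.5)  # the shift
ax.scatter(means[order], y, color='tab:blue', label='Mean rating')
ax.scatter(lcb[order], y, color='tab:orange', label='LCB rating')
ax.set_xlabel('Rating')
ax.legend()
fig.savefig('lcb_comparison.png')
```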
Further approaches and questions
Stay tuned for future posts on other recommender systems, where I’ll answer questions such as:
How does one select the hyperparameter \(\alpha\)?