RecFlix: Lower Confidence Bound Rating

Motivation

This is the second installment of a sequence on recommendation systems. The first is on smoothed (dampened) average ratings and can be found here.

How do we compare a movie with a single 5-star rating to one with many 4 and 5 ratings?

Say we want to take user ratings of movies and rank them. A simple approach would be to just look at the average rating of each movie, but that doesn’t feel right: different movies have different numbers of ratings, and we don’t want small samples to skew our estimate of the “true” rating that would arise if the whole population rated that movie. As suggested by Evan Miller, using the lower confidence bound of the rating is one way to handle this. This is another non-personalized recommendation system. Let’s start by looking at the highest-rated movies from the MovieLens dataset on Kaggle.

Code
# import the data
import pandas as pd

# ratings doesn't include movie names, so merge with movie.csv to get titles
ratings = pd.read_csv("archive/rating.csv", parse_dates=['timestamp'])
ids = pd.read_csv("archive/movie.csv")
ratings = pd.merge(ratings, ids, on='movieId', how='left')

# Find each movie's mean rating
avg_ratings = ratings.groupby(['movieId', 'title'])['rating'].agg(
    avg_rating='mean',
    rating_count='count'
)

avg_ratings.sort_values(by='avg_rating', ascending=False, inplace=True)
avg_ratings.head()
| movieId | Title | Mean Rating | Number of Ratings |
|---|---|---|---|
| 117314 | Neurons to Nirvana (2013) | 5.0 | 1 |
| 117418 | Victor and the Secret of Crocodile Mansion (2012) | 5.0 | 1 |
| 117061 | The Green (2011) | 5.0 | 1 |
| 109571 | Into the Middle of Nowhere (2010) | 5.0 | 1 |
| 109715 | Inquire Within (2012) | 5.0 | 1 |

Not quite the blockbusters or classics we were expecting. It turns out there are a bunch of movies with exactly one 5-star rating, and they crowd the top of the list. We already did a touch of EDA in the previous post, so let’s dive right into the rating method.

Lower confidence bound rating

This system takes the lower bound of, say, a 95% confidence interval of the true rating and recommends the movies with the highest lower confidence bound. We can use both the popular yet problematic Bernoulli confidence interval, which relies on a version of the Central Limit Theorem, and the more accurate Wilson score interval.

The latter is meant for binary data, but we can adapt the technique by converting our ratings (which range from 0.5 to 5) into fractional amounts of positive and negative votes. Specifically, a 0.5 rating becomes 1 negative vote and no positive votes, a 1-star rating becomes 1/9 of a positive vote and 8/9 of a negative vote, and so on.

| Rating (Stars) | Positive votes | Negative votes |
|---|---|---|
| 0.5 | 0 | 1 |
| 1 | 1/9 | 8/9 |
| 1.5 | 2/9 | 7/9 |
| 2 | 3/9 | 6/9 |
| 2.5 | 4/9 | 5/9 |
| 3 | 5/9 | 4/9 |
| 3.5 | 6/9 | 3/9 |
| 4 | 7/9 | 2/9 |
| 4.5 | 8/9 | 1/9 |
| 5 | 1 | 0 |
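
As a quick sanity check, here’s a tiny sketch of this conversion in decimals, using the same linear map, (stars − 0.5)/4.5, that the code later in this post uses (the helper name is just for illustration):

Code
# Reproduce the conversion table above in decimals
def to_positive_fraction(stars):
    # Map a star rating in [0.5, 5] onto [0, 1]
    return (stars - 0.5) / 4.5

for stars in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]:
    pos = to_positive_fraction(stars)
    print(f"{stars} stars -> {pos:.3f} positive, {1 - pos:.3f} negative")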

Bernoulli Confidence Interval

The idea here is that instead of taking the rating to be the mean over the \(N_j\) ratings of movie \(j\) (i.e. \(r_j = \sum_i X_i/N_j\)), we can look at a confidence interval at confidence level \(1-\alpha\) and take its lower bound to be our “rating”:

\[r_j = \bar{X} - t_{\alpha/2} \frac{s}{\sqrt{N_j}} \]

where \(\bar{X}\) is the sample mean, \(s\) is the sample standard deviation, and \(t_{\alpha/2}\) is the \(1-\alpha/2\) quantile of Student’s t-distribution with \(N_j - 1\) degrees of freedom.

This approach mitigates the issue of overrated movies being recommended. The equation above is a biased estimator of the true underlying rating; it underrates all movies, with the bias being less severe for movies with more reviews. Movies with few reviews have a wider confidence interval, so their rating tends to be pulled lower. As a movie gets more ratings, its confidence interval narrows and the lower bound approaches the true rating. We’ll apply the method with \(\alpha = 0.05\), though this is a hyperparameter that can be tuned.
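
For example, a hypothetical movie with \(N_j = 4\) ratings averaging \(\bar{X} = 4.75\) with sample standard deviation \(s = 0.5\) would get \(r_j = 4.75 - 3.18 \cdot 0.5/\sqrt{4} \approx 3.95\), since \(t_{0.025} \approx 3.18\) with 3 degrees of freedom; the small sample pulls the rating well below its raw mean.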

Code
# Step 1: Define a function to compute the t-based CI for one movie's ratings
from scipy import stats

def compute_confidence_interval(group, confidence=0.95):
    n = group.count()
    mean = group.mean()
    std_err = stats.sem(group, nan_policy='omit')
    # Half-width of the interval; a single rating yields a degenerate (point) interval
    h = std_err * stats.t.ppf((1 + confidence) / 2., n - 1) if n > 1 else 0
    lower = mean - h
    upper = mean + h
    return pd.Series({
        'avg_rating': mean,
        'rating_count': n,
        'ci_lower': lower,
        'ci_upper': upper
    })
Code
# Apply the function after grouping
CI_ratings = (
    ratings.groupby(['movieId', 'title'])['rating']
    .apply(compute_confidence_interval)
    .unstack()
    .reset_index()
)
Code
# Sort by the mean rating and take a look at the top-ranked movies
CI_ratings.sort_values(by='avg_rating', ascending=False, inplace=True)
CI_ratings.head()
| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Black Girl (La noire de…) (1966) | 5.0 | 1.0 | 5.0 | 5.0 |
| Hatful of Rain, A (1957) | 5.0 | 1.0 | 5.0 | 5.0 |
| Perfect Plan, A (Plan parfait, Un) (2012) | 5.0 | 1.0 | 5.0 | 5.0 |
| Rebels of the Neon God (Qing shao nian nuo zha) (1992) | 5.0 | 1.0 | 5.0 | 5.0 |
| Examined Life (2008) | 5.0 | 1.0 | 5.0 | 5.0 |

Hmm, still not what we might expect from the “top” movies. Keen observers will notice that when \(N_j = 1\), the confidence interval reduces to a point, so movies with only a single 5-star rating have a lower confidence bound of exactly 5. We can try to work around this by only looking at movies with at least a certain number of ratings, or at movies with at least two distinct rating values, as sketched below.
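
Here’s a minimal sketch of that filter (the threshold of 2 is just an illustrative starting point):

Code
# Keep only movies with a minimum number of ratings before ranking
min_ratings = 2
CI_ratings[CI_ratings['rating_count'] >= min_ratings].head()

When we require at least two ratings, we still have the same issue as before: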

| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Dream of Light (a.k.a. Quince Tree Sun, The) (Sol del membrillo, El) (1992) | 5.00 | 2.0 | 5.000 | 5.000 |
| Little Mermaid: Ariel’s Beginning, The (2008) | 5.00 | 2.0 | 5.000 | 5.000 |
| Absolute Giganten (1999) | 5.00 | 2.0 | 5.000 | 5.000 |
| Repentance (Monanieba) (1984) | 4.75 | 2.0 | 1.573 | 7.927 |
| Year of the Hare, The (Jäniksen vuosi) (1977) | 4.75 | 4.0 | 4.291 | 5.209 |

The issue persists even as we raise the minimum number of reviews to 15:

| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.450 | 7837.0 | 4.434 | 4.466 |
| Heart of a Dog (Sobachye serdtse) (1988) | 4.367 | 15.0 | 4.045 | 4.689 |
| Cowboy Bebop (1998) | 4.364 | 22.0 | 4.064 | 4.663 |
| Godfather, The (1972) | 4.357 | 5162.0 | 4.334 | 4.380 |
| Usual Suspects, The (1995) | 4.354 | 5873.0 | 4.335 | 4.373 |

If we raise the bar to at least 62 reviews, I finally recognize the top 5:

| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.450 | 7837.0 | 4.434 | 4.466 |
| Godfather, The (1972) | 4.357 | 5162.0 | 4.334 | 4.380 |
| Usual Suspects, The (1995) | 4.354 | 5873.0 | 4.335 | 4.373 |
| Schindler’s List (1993) | 4.302 | 6309.0 | 4.281 | 4.322 |
| Band of Brothers (2001) | 4.285 | 550.0 | 4.212 | 4.357 |

Wilson Score Interval

Using the t-distribution has other technical issues, and the “correct” way of computing the interval is with the Wilson score interval. Again, the idea is that instead of taking the rating to be the mean over the \(N_j\) ratings of movie \(j\) (i.e. \(r_j = \sum_i X_i/N_j\)), we look at a confidence interval at confidence level \(1-\alpha\) and take its lower bound to be our “rating.” Let

  • \(N_j\) be the number of ratings of movie \(j\),
  • \(\hat{p} = \frac{1}{N_j} \sum_i X_i\) be the mean fraction of positive votes, and
  • \(z_{\alpha/2}\) be the \(1-\alpha/2\) quantile of the standard normal distribution.

\[r_j = \left ( \hat{p} + \frac{z_{\alpha/2}^2}{2N_j} - z_{\alpha/2} \sqrt{(\hat{p}(1-\hat{p}) + z_{\alpha/2}^2/4N_j )/N_j}\right ) / (1 + z_{\alpha/2}^2/N_j)\]
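
To see why this fixes the single-rating problem, consider a movie with one 5-star rating: \(\hat{p} = 1\) and \(N_j = 1\), so with \(z_{\alpha/2} \approx 1.96\) the numerator is \(1 + 1.92 - 1.96 \cdot 0.98 = 1\) and the lower bound is \(1/(1 + 1.96^2) \approx 0.207\), which is about 1.43 stars after converting back, rather than a perfect 5.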

As with the t-based interval, this lower bound is a biased estimator of the true underlying rating: it underrates all movies, with the bias shrinking as the number of reviews grows, so movies with few reviews get pulled down more. We’ll again apply the method with \(\alpha = 0.05\), though this is a hyperparameter that can be tuned.

Code
# Convert ratings to fractional positive votes (0 to 1 scale)
import numpy as np
from scipy.stats import norm

ratings['fractional_positive'] = (ratings['rating'] - 0.5) / 4.5

def wilson_score_interval(pos, n, confidence=0.95):
    if n == 0:
        return (0.0, 0.0)
    z = norm.ppf(1 - (1 - confidence) / 2)
    phat = pos / n
    denominator = 1 + z**2 / n
    centre = phat + z**2 / (2*n)
    margin = z * np.sqrt((phat * (1 - phat) + z**2 / (4*n)) / n)
    lower = (centre - margin) / denominator
    upper = (centre + margin) / denominator
    return lower, upper

def compute_fractional_wilson(group):
    n = group['fractional_positive'].count()
    pos = group['fractional_positive'].sum()
    lower, upper = wilson_score_interval(pos, n)
    return pd.Series({
        'fractional_positive_mean': pos / n if n > 0 else np.nan,
        'rating_count': n,
        'wilson_lower': lower,
        'wilson_upper': upper
    })

# Group by movieId and title
wilson_CI_ratings = (
    ratings.groupby(['movieId', 'title'])
    .apply(compute_fractional_wilson)
    .reset_index()
)
Code
wilson_CI_ratings.head()
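
The final ranking below also reports an “LCB Rating” back on the original star scale; here’s a one-line sketch of that back-conversion, inverting the linear map from earlier (the column name is an illustrative choice):

Code
# Map the Wilson lower bound (0-1 scale) back to the 0.5-5 star scale
wilson_CI_ratings['lcb_rating'] = wilson_CI_ratings['wilson_lower'] * 4.5 + 0.5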
Code
# Our top movies
wilson_CI_ratings.sort_values(by='wilson_lower', ascending=False, inplace=True)
wilson_CI_ratings.head(10)
| Title | LCB Rating | Rating Count | Fractional Positive Mean | Wilson Lower | Wilson Upper |
|---|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.418 | 4461.0 | 0.880 | 0.871 | 0.890 |
| Godfather, The (1972) | 4.336 | 2988.0 | 0.865 | 0.852 | 0.877 |
| Usual Suspects, The (1995) | 4.313 | 3391.0 | 0.859 | 0.847 | 0.871 |
| Schindler’s List (1993) | 4.252 | 3608.0 | 0.846 | 0.834 | 0.857 |
| Godfather: Part II, The (1974) | 4.221 | 2005.0 | 0.843 | 0.827 | 0.859 |
| Casablanca (1942) | 4.207 | 1784.0 | 0.842 | 0.824 | 0.858 |
| One Flew Over the Cuckoo’s Nest (1975) | 4.171 | 2119.0 | 0.832 | 0.816 | 0.848 |
| Rear Window (1954) | 4.164 | 1277.0 | 0.836 | 0.814 | 0.855 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 4.158 | 3187.0 | 0.826 | 0.813 | 0.839 |
| Fight Club (1999) | 4.147 | 2850.0 | 0.825 | 0.811 | 0.838 |

The top 5 using this method are the same as the top 5 we saw from the dampened mean approach with higher values of \(\lambda\), but the next 5 differ more, with Casablanca and Rear Window being the only movies on both lists. The dampened mean approach’s top 10 is the table below:

| Title | Dampened Mean | Ratings Sum | Number of Ratings |
|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.447 | 281788.0 | 63366 |
| Godfather, The (1972) | 4.365 | 180503.5 | 41355 |
| Usual Suspects, The (1995) | 4.334 | 203741.5 | 47006 |
| Schindler’s List (1993) | 4.310 | 215741.5 | 50054 |
| Godfather: Part II, The (1974) | 4.275 | 117144.0 | 27398 |
| Seven Samurai (Shichinin no samurai) (1954) | 4.274 | 49627.5 | 11611 |
| Rear Window (1954) | 4.271 | 74530.5 | 17449 |
| Band of Brothers (2001) | 4.262 | 18353.0 | 4305 |
| Casablanca (1942) | 4.258 | 103686.0 | 24349 |
| Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.256 | 27776.5 | 6525 |

Finally, here’s a visualization that shows both the original rating (blue) and the lower confidence bound (LCB) rating (orange) with \(\alpha = .05\), as well as the difference between the two (gray line). The visualization uses a sample of only 50 movies so it doesn’t get too cluttered. You can see that movies with fewer ratings are generally pulled further to the left, while those with more ratings generally budge less, but this pattern isn’t guaranteed by the method: a movie with a few consistent ratings can have a narrower confidence interval than a movie with many inconsistent ratings.
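
Here’s a minimal matplotlib sketch of that figure, assuming the CI_ratings frame computed above (the sample size, seed, and styling are arbitrary choices):

Code
import numpy as np
import matplotlib.pyplot as plt

# Take a small random sample so the chart stays readable
sample = CI_ratings.sample(50, random_state=0).sort_values('avg_rating')
y = np.arange(len(sample))

fig, ax = plt.subplots(figsize=(8, 10))
# Gray segments show how far the LCB pulls each movie's rating down
ax.hlines(y, sample['ci_lower'], sample['avg_rating'], color='gray', alpha=0.6)
ax.scatter(sample['avg_rating'], y, color='tab:blue', label='Mean rating', zorder=3)
ax.scatter(sample['ci_lower'], y, color='tab:orange', label='LCB rating', zorder=3)
ax.set_yticks(y)
ax.set_yticklabels(sample['title'], fontsize=7)
ax.set_xlabel('Rating')
ax.legend()
fig.tight_layout()
plt.show()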

Further approaches and questions

Stay tuned for future posts on other recommender systems, where I’ll answer questions such as:

  • How does one select the hyperparameter \(\alpha\)?
  • How can we measure the performance of the model?
  • What about personalized recommender systems?