RecFlix: Lower Confidence Bound Rating

Motivation

This is the second installment of a sequence on recommendation systems. The first is on smoothed (dampened) average ratings and can be found here.

How do we compare a movie with a single 5-star rating to one with many 4 and 5 ratings?

Say we want to take user ratings of movies and rank them. A simple approach would be to just look at the average rating of each movie, but that doesn’t feel right: different movies have different numbers of ratings, and we don’t want small samples to skew our estimate of the “true” rating that would arise if the whole population rated that movie. As suggested by Evan Miller, using the lower confidence bound of the rating is one way to handle this. This is another non-personalized recommendation system. Let’s start by looking at the highest-rated movies from the MovieLens dataset on Kaggle.

Code
# import the data
import pandas as pd

# ratings doesn't include movie names, so merge with movie.csv to get titles
ratings = pd.read_csv("archive/rating.csv", parse_dates=['timestamp'])
ids = pd.read_csv("archive/movie.csv")
ratings = pd.merge(ratings, ids, on='movieId', how='left')

# Find each movie's mean rating
avg_ratings = ratings.groupby(['movieId', 'title'])['rating'].agg(
    avg_rating='mean',
    rating_count='count'
)

avg_ratings.sort_values(by='avg_rating', ascending=False, inplace=True)
avg_ratings.head()
| movieId | Title | Mean Rating | Number of Ratings |
|---|---|---|---|
| 117314 | Neurons to Nirvana (2013) | 5.0 | 1 |
| 117418 | Victor and the Secret of Crocodile Mansion (2012) | 5.0 | 1 |
| 117061 | The Green (2011) | 5.0 | 1 |
| 109571 | Into the Middle of Nowhere (2010) | 5.0 | 1 |
| 109715 | Inquire Within (2012) | 5.0 | 1 |

Not quite the blockbusters or classics we were expecting. It turns out there are a bunch of movies with exactly one 5-star rating, and they crowd the top of the list. We already did a touch of EDA in the previous post, so let’s dive right into the rating method.

Lower confidence bound rating

This system takes the lower bound of, say, a 95% confidence interval of the true rating and recommends the movies with the highest lower confidence bound. We can use both the popular yet problematic Bernoulli confidence interval, which relies on a version of the Central Limit Theorem, and the more accurate Wilson score interval.

The latter is meant for binary data, but we can adapt the technique by converting our ratings (which range from 0.5 to 5) into fractional amounts of positive and negative votes. Specifically, a 0.5 rating becomes 1 negative vote and no positive votes, a 1-star rating becomes 1/9 of a positive vote and 8/9 of a negative vote, and so on.

| Rating (Stars) | Positive votes | Negative votes |
|---|---|---|
| 0.5 | 0 | 1 |
| 1 | 1/9 | 8/9 |
| 1.5 | 2/9 | 7/9 |
| 2 | 3/9 | 6/9 |
| 2.5 | 4/9 | 5/9 |
| 3 | 5/9 | 4/9 |
| 3.5 | 6/9 | 3/9 |
| 4 | 7/9 | 2/9 |
| 4.5 | 8/9 | 1/9 |
| 5 | 1 | 0 |
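
As a quick sanity check, here’s a tiny sketch of this conversion in decimals, using the same linear map, (stars − 0.5)/4.5, that the code later in this post uses (the helper name is just for illustration):

Code
# Reproduce the conversion table above in decimals
def to_positive_fraction(stars):
    # Map a star rating in [0.5, 5] onto [0, 1]
    return (stars - 0.5) / 4.5

for stars in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]:
    pos = to_positive_fraction(stars)
    print(f"{stars} stars -> {pos:.3f} positive, {1 - pos:.3f} negative")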

Bernoulli Confidence Interval

The idea here is that instead of taking the rating to be the mean over the \(N_j\) ratings of movie \(j\) (i.e. \(r_j = \sum_i X_i/N_j\)), we can look at a confidence interval at confidence level \(1-\alpha\) and take its lower bound to be our “rating”:

\[r_j = \bar{X} - t_{\alpha/2} \frac{s}{\sqrt{N_j}} \]

where \(\bar{X}\) is the sample mean, \(s\) is the sample standard deviation, and \(t_{\alpha/2}\) is the \(1-\alpha/2\) quantile of Student’s t-distribution with \(N_j - 1\) degrees of freedom.

This approach mitigates the issue of overrated movies being recommended. The equation above is a biased estimator of the true underlying rating; it underrates all movies, with the bias being less severe for movies with more reviews. Movies with few reviews have a wider confidence interval, so their rating tends to be pulled lower. As a movie gets more ratings, its confidence interval narrows and the lower bound approaches the true rating. We’ll apply the method with \(\alpha = 0.05\), though this is a hyperparameter that can be tuned.
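
For example, a hypothetical movie with \(N_j = 4\) ratings averaging \(\bar{X} = 4.75\) with sample standard deviation \(s = 0.5\) would get \(r_j = 4.75 - 3.18 \cdot 0.5/\sqrt{4} \approx 3.95\), since \(t_{0.025} \approx 3.18\) with 3 degrees of freedom; the small sample pulls the rating well below its raw mean.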

Code
# Step 1: Define a function to compute the t-based CI for one movie's ratings
from scipy import stats

def compute_confidence_interval(group, confidence=0.95):
    n = group.count()
    mean = group.mean()
    std_err = stats.sem(group, nan_policy='omit')
    # Half-width of the interval; a single rating yields a degenerate (point) interval
    h = std_err * stats.t.ppf((1 + confidence) / 2., n - 1) if n > 1 else 0
    lower = mean - h
    upper = mean + h
    return pd.Series({
        'avg_rating': mean,
        'rating_count': n,
        'ci_lower': lower,
        'ci_upper': upper
    })
Code
# Apply the function after grouping
CI_ratings = (
    ratings.groupby(['movieId', 'title'])['rating']
    .apply(compute_confidence_interval)
    .unstack()
    .reset_index()
)
Code
# Sort by the mean rating and take a look at the top-ranked movies
CI_ratings.sort_values(by='avg_rating', ascending=False, inplace=True)
CI_ratings.head()
| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Black Girl (La noire de…) (1966) | 5.0 | 1.0 | 5.0 | 5.0 |
| Hatful of Rain, A (1957) | 5.0 | 1.0 | 5.0 | 5.0 |
| Perfect Plan, A (Plan parfait, Un) (2012) | 5.0 | 1.0 | 5.0 | 5.0 |
| Rebels of the Neon God (Qing shao nian nuo zha) (1992) | 5.0 | 1.0 | 5.0 | 5.0 |
| Examined Life (2008) | 5.0 | 1.0 | 5.0 | 5.0 |

Hmm, still not what we might expect from the “top” movies. Keen observers will notice that when \(N_j = 1\), the confidence interval reduces to a point, so movies with only a single 5-star rating have a lower confidence bound of exactly 5. We can try to work around this by only looking at movies with at least a certain number of ratings, or at movies with at least two distinct rating values, as sketched below.
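
Here’s a minimal sketch of that filter (the threshold of 2 is just an illustrative starting point):

Code
# Keep only movies with a minimum number of ratings before ranking
min_ratings = 2
CI_ratings[CI_ratings['rating_count'] >= min_ratings].head()

When we require at least two ratings, we still have the same issue as before: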

| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Dream of Light (a.k.a. Quince Tree Sun, The) (Sol del membrillo, El) (1992) | 5.00 | 2.0 | 5.000 | 5.000 |
| Little Mermaid: Ariel’s Beginning, The (2008) | 5.00 | 2.0 | 5.000 | 5.000 |
| Absolute Giganten (1999) | 5.00 | 2.0 | 5.000 | 5.000 |
| Repentance (Monanieba) (1984) | 4.75 | 2.0 | 1.573 | 7.927 |
| Year of the Hare, The (Jäniksen vuosi) (1977) | 4.75 | 4.0 | 4.291 | 5.209 |

The issue persists even as we raise the minimum number of reviews to 15:

| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.450 | 7837.0 | 4.434 | 4.466 |
| Heart of a Dog (Sobachye serdtse) (1988) | 4.367 | 15.0 | 4.045 | 4.689 |
| Cowboy Bebop (1998) | 4.364 | 22.0 | 4.064 | 4.663 |
| Godfather, The (1972) | 4.357 | 5162.0 | 4.334 | 4.380 |
| Usual Suspects, The (1995) | 4.354 | 5873.0 | 4.335 | 4.373 |

If we raise the bar to at least 62 reviews, I finally recognize the top 5:

| Title | Mean Rating | Number of Ratings | CI Lower Bound | CI Upper Bound |
|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.450 | 7837.0 | 4.434 | 4.466 |
| Godfather, The (1972) | 4.357 | 5162.0 | 4.334 | 4.380 |
| Usual Suspects, The (1995) | 4.354 | 5873.0 | 4.335 | 4.373 |
| Schindler’s List (1993) | 4.302 | 6309.0 | 4.281 | 4.322 |
| Band of Brothers (2001) | 4.285 | 550.0 | 4.212 | 4.357 |

Wilson Score Interval

Using the t-distribution has other technical issues, and the “correct” way of computing the interval is with the Wilson score interval. Again, the idea is that instead of taking the rating to be the mean over the \(N_j\) ratings of movie \(j\) (i.e. \(r_j = \sum_i X_i/N_j\)), we look at a confidence interval at confidence level \(1-\alpha\) and take its lower bound to be our “rating.” Let

  • \(N_j\) be the number of ratings of movie \(j\),
  • \(\hat{p} = \frac{1}{N_j} \sum_i X_i\) be the mean fraction of positive votes, and
  • \(z_{\alpha/2}\) be the \(1-\alpha/2\) quantile of the standard normal distribution.

\[r_j = \left ( \hat{p} + \frac{z_{\alpha/2}^2}{2N_j} - z_{\alpha/2} \sqrt{(\hat{p}(1-\hat{p}) + z_{\alpha/2}^2/4N_j )/N_j}\right ) / (1 + z_{\alpha/2}^2/N_j)\]
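
To see why this fixes the single-rating problem, consider a movie with one 5-star rating: \(\hat{p} = 1\) and \(N_j = 1\), so with \(z_{\alpha/2} \approx 1.96\) the numerator is \(1 + 1.92 - 1.96 \cdot 0.98 = 1\) and the lower bound is \(1/(1 + 1.96^2) \approx 0.207\), which is about 1.43 stars after converting back, rather than a perfect 5.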

As with the t-based interval, this lower bound is a biased estimator of the true underlying rating: it underrates all movies, with the bias shrinking as the number of reviews grows, so movies with few reviews get pulled down more. We’ll again apply the method with \(\alpha = 0.05\), though this is a hyperparameter that can be tuned.

Code
# Convert ratings to fractional positive votes (0 to 1 scale)
import numpy as np
from scipy.stats import norm

ratings['fractional_positive'] = (ratings['rating'] - 0.5) / 4.5

def wilson_score_interval(pos, n, confidence=0.95):
    if n == 0:
        return (0.0, 0.0)
    z = norm.ppf(1 - (1 - confidence) / 2)
    phat = pos / n
    denominator = 1 + z**2 / n
    centre = phat + z**2 / (2*n)
    margin = z * np.sqrt((phat * (1 - phat) + z**2 / (4*n)) / n)
    lower = (centre - margin) / denominator
    upper = (centre + margin) / denominator
    return lower, upper

def compute_fractional_wilson(group):
    n = group['fractional_positive'].count()
    pos = group['fractional_positive'].sum()
    lower, upper = wilson_score_interval(pos, n)
    return pd.Series({
        'fractional_positive_mean': pos / n if n > 0 else np.nan,
        'rating_count': n,
        'wilson_lower': lower,
        'wilson_upper': upper
    })

# Group by movieId and title
wilson_CI_ratings = (
    ratings.groupby(['movieId', 'title'])
    .apply(compute_fractional_wilson)
    .reset_index()
)
Code
wilson_CI_ratings.head()
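
The final ranking below also reports an “LCB Rating” back on the original star scale; here’s a one-line sketch of that back-conversion, inverting the linear map from earlier (the column name is an illustrative choice):

Code
# Map the Wilson lower bound (0-1 scale) back to the 0.5-5 star scale
wilson_CI_ratings['lcb_rating'] = wilson_CI_ratings['wilson_lower'] * 4.5 + 0.5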
Code
# Our top movies
wilson_CI_ratings.sort_values(by='wilson_lower', ascending=False, inplace=True)
wilson_CI_ratings.head(10)
| Title | LCB Rating | Rating Count | Fractional Positive Mean | Wilson Lower | Wilson Upper |
|---|---|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.418 | 4461.0 | 0.880 | 0.871 | 0.890 |
| Godfather, The (1972) | 4.336 | 2988.0 | 0.865 | 0.852 | 0.877 |
| Usual Suspects, The (1995) | 4.313 | 3391.0 | 0.859 | 0.847 | 0.871 |
| Schindler’s List (1993) | 4.252 | 3608.0 | 0.846 | 0.834 | 0.857 |
| Godfather: Part II, The (1974) | 4.221 | 2005.0 | 0.843 | 0.827 | 0.859 |
| Casablanca (1942) | 4.207 | 1784.0 | 0.842 | 0.824 | 0.858 |
| One Flew Over the Cuckoo’s Nest (1975) | 4.171 | 2119.0 | 0.832 | 0.816 | 0.848 |
| Rear Window (1954) | 4.164 | 1277.0 | 0.836 | 0.814 | 0.855 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 4.158 | 3187.0 | 0.826 | 0.813 | 0.839 |
| Fight Club (1999) | 4.147 | 2850.0 | 0.825 | 0.811 | 0.838 |

The top 5 using this method are the same as the top 5 we saw from the dampened mean approach with higher values of \(\lambda\), but the next 5 differ more, with Casablanca and Rear Window being the only movies on both lists. The dampened mean approach’s top 10 is the table below:

| Title | Dampened Mean | Ratings Sum | Number of Ratings |
|---|---|---|---|
| Shawshank Redemption, The (1994) | 4.447 | 281788.0 | 63366 |
| Godfather, The (1972) | 4.365 | 180503.5 | 41355 |
| Usual Suspects, The (1995) | 4.334 | 203741.5 | 47006 |
| Schindler’s List (1993) | 4.310 | 215741.5 | 50054 |
| Godfather: Part II, The (1974) | 4.275 | 117144.0 | 27398 |
| Seven Samurai (Shichinin no samurai) (1954) | 4.274 | 49627.5 | 11611 |
| Rear Window (1954) | 4.271 | 74530.5 | 17449 |
| Band of Brothers (2001) | 4.262 | 18353.0 | 4305 |
| Casablanca (1942) | 4.258 | 103686.0 | 24349 |
| Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.256 | 27776.5 | 6525 |

Finally, here’s a visualization that shows both the original rating (blue) and the lower confidence bound (LCB) rating (orange) with \(\alpha = .05\), as well as the difference between the two (gray line). The visualization uses a sample of only 50 movies so it doesn’t get too cluttered. You can see that movies with fewer ratings are generally pulled further to the left, while those with more ratings generally budge less, but this pattern isn’t guaranteed by the method: a movie with a few consistent ratings can have a narrower confidence interval than a movie with many inconsistent ratings.
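
Here’s a minimal matplotlib sketch of that figure, assuming the CI_ratings frame computed above (the sample size, seed, and styling are arbitrary choices):

Code
import numpy as np
import matplotlib.pyplot as plt

# Take a small random sample so the chart stays readable
sample = CI_ratings.sample(50, random_state=0).sort_values('avg_rating')
y = np.arange(len(sample))

fig, ax = plt.subplots(figsize=(8, 10))
# Gray segments show how far the LCB pulls each movie's rating down
ax.hlines(y, sample['ci_lower'], sample['avg_rating'], color='gray', alpha=0.6)
ax.scatter(sample['avg_rating'], y, color='tab:blue', label='Mean rating', zorder=3)
ax.scatter(sample['ci_lower'], y, color='tab:orange', label='LCB rating', zorder=3)
ax.set_yticks(y)
ax.set_yticklabels(sample['title'], fontsize=7)
ax.set_xlabel('Rating')
ax.legend()
fig.tight_layout()
plt.show()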

Further approaches and questions

Stay tuned for future posts on other recommender systems, where I’ll answer questions such as:

  • How does one select the hyperparameter \(\alpha\)?
  • How can we measure the performance of the model?
  • What about personalized recommender systems?