RecFlix: Linear Model

Motivation

This is the third installment in a series on recommendation systems.

How do we account for raters’ biases?

Say we want to take user ratings of movies and rank them. Previous approaches ranked movies by accounting for sample sizes and confidence intervals, but they didn’t consider the biases of the users doing the rating. Indeed, if User A just happens to indiscriminately love movies and rate them all highly, and also happens to be the only person who rated Movie 1, then that movie might currently be overrated. We can apply a linear model to account for raters’ biases and uncover what a movie’s rating would be after adjusting for who rated it. This is an example of a non-personalized recommendation system. Let’s start by looking at the highest-rated movies from the MovieLens dataset on Kaggle.

Code
# import the data
import pandas as pd

# ratings doesn't include movie names, so merge with ids to get names
ratings = pd.read_csv("archive/rating.csv", parse_dates=['timestamp'])
ids = pd.read_csv("archive/movie.csv")
ratings = pd.merge(ratings, ids, on='movieId', how='left')

# Find each movie's mean rating and rating count;
# reset_index so movieId is a column we can merge on later
avg_ratings = ratings.groupby(['movieId', 'title'])['rating'].agg(
    avg_rating='mean',
    rating_count='count'
).reset_index()

avg_ratings.sort_values(by='avg_rating', ascending=False, inplace=True)
avg_ratings.head()
| movieId | Title | Mean Rating | Number of Ratings |
|---|---|---|---|
| 117314 | Neurons to Nirvana (2013) | 5.0 | 1 |
| 117418 | Victor and the Secret of Crocodile Mansion (2012) | 5.0 | 1 |
| 117061 | The Green (2011) | 5.0 | 1 |
| 109571 | Into the Middle of Nowhere (2010) | 5.0 | 1 |
| 109715 | Inquire Within (2012) | 5.0 | 1 |

After accounting for who rated these films, we’ll have a better idea of what their ratings should be. There are several directions we can take the linear model, such as filtering to only include movies with a certain number of ratings, as sketched below.
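A quick sketch of that filtering idea (the names `min_ratings` and `popular` are mine, and the threshold of 50 matches the cutoff used later in the post):

Code
# keep only movies with a minimum number of ratings before ranking
# raising the threshold trades coverage for stability
min_ratings = 50
popular = avg_ratings[avg_ratings['rating_count'] >= min_ratings]
popular.sort_values(by='avg_rating', ascending=False).head(10)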

We can define our variables as

  • \(r_{ij}\) is the rating that person \(i\) gave (or would give) movie \(j\)

Our statistical model is then

\[r_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}\]

where \(\varepsilon_{ij}\) is a mean-zero noise term, and we can use regularized least squares to estimate the \(\alpha_i\) and \(\beta_j\). Here,

  • \(\mu\) is the overall (global) mean,
  • \(\alpha_i\) is the “effect” of user \(i\), which measures how far above (or below) the mean user \(i\) tends to rate movies after accounting for movie effects, and
  • \(\beta_j\) is the “effect” of movie \(j\), which measures how far above (or below) the mean movie \(j\) tends to score after accounting for user effects.
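To make the decomposition concrete with illustrative numbers: if the global mean is \(\mu = 3.5\), user \(i\) is generous with \(\alpha_i = 0.4\), and movie \(j\) is well liked with \(\beta_j = 0.6\), the model predicts

\[r_{ij} = 3.5 + 0.4 + 0.6 = 4.5\]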

Once the model is fit, we can look at \(\beta_j\) to determine which movies are scoring the highest after accounting for the users.

Mathematically, we’re finding the optimal parameters \(\alpha_i\) and \(\beta_j\) that minimize the sum of squared errors plus a regularization term:

\[L(\vec{\alpha}, \vec{\beta}) = \sum_{i, j} \left (r_{ij} - (\mu + \alpha_i + \beta_j) \right )^2 + \lambda \left ( \sum_i \alpha_i^2 + \sum_j \beta_j^2 \right ) \]

The penalty here is an \(L^2\) (ridge) penalty; alternatively, one could use an \(L^1\) (LASSO) penalty. The hyperparameter \(\lambda\) can be fine-tuned, which we’ll discuss in another post; here it’s taken to be \(\lambda = 1\).
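Since the model is linear in the one-hot encoded user and movie indicators, this penalized problem has the standard ridge closed-form solution. Schematically, ignoring the unpenalized intercept and writing \(X\) for the design matrix, \(\vec{r}\) for the ratings, and \(\theta\) for the stacked vector of \(\alpha\)s and \(\beta\)s,

\[\hat{\theta} = \left( X^\top X + \lambda I \right)^{-1} X^\top \vec{r}\]

With tens of thousands of parameters, solvers don’t actually form this inverse; scikit-learn, which we use below, automatically picks a solver suited to sparse input.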

Code
# set up and fit the model
# this cell will likely take a while to run; there are > 20,000 parameters to fit
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=True)

model = make_pipeline(
    ohe,
    Ridge(alpha=1.0)       # or Lasso. alpha is the lambda in the equation above
)

model.fit(ratings[['userId', 'movieId']], ratings['rating'])

encoder = model.named_steps['onehotencoder']
feature_names = encoder.get_feature_names_out(['userId', 'movieId'])

coef = model.named_steps['ridge'].coef_
intercept = model.named_steps['ridge'].intercept_
Code
# Extract the coefficients alpha_i and beta_j
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coef
}).sort_values(by='coefficient', ascending=False)
Code
# Extract only the users' coefficients, alpha_i
user_df = coef_df[coef_df['feature'].str.startswith('userId')]
user_df.head()
Code
# Extract only the movies' coefficients, beta_j
movie_df = coef_df[coef_df['feature'].str.startswith('movieId')].copy()
movie_df.head()
Code
# prepare the df for merging to get movie names
movie_df['movieId'] = movie_df['feature'].str.split('_', expand=True)[1].astype(int)
movie_df.head()
Code
# merge to get movie names
merged = movie_df.merge(avg_ratings, on='movieId', how='left')
Code
# Get adjusted ratings by dropping the user effects: just mu + beta_j
global_mean = ratings['rating'].mean()
merged['Adjusted Rating'] = global_mean + merged['coefficient']
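As a quick sanity check, the fitted intercept should land near the global mean, since the penalized user and movie effects are shrunk toward zero; the two won’t match exactly because of the regularization.

Code
# the intercept absorbs the overall level mu; compare it to the raw global mean
print(f"global mean:      {global_mean:.4f}")
print(f"fitted intercept: {intercept:.4f}")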
Code
# find which movies have the largest coefficients
merged.sort_values(by='coefficient', ascending=False, inplace=True)
merged[['title', 'Adjusted Rating', 'coefficient', 'avg_rating', 'rating_count']].head(10)
| | Title | Adjusted Rating | Coefficient | Mean Rating | Number of Ratings |
|---|---|---|---|---|---|
| 0 | Marihuana (1936) | 5.547 | 2.021 | 5.000 | 1 |
| 1 | Always for Pleasure (1978) | 5.070 | 1.544 | 5.000 | 1 |
| 2 | Supermarket Woman (Sûpâ no onna) (1996) | 4.956 | 1.431 | 4.750 | 2 |
| 3 | Great White Silence, The (1924) | 4.861 | 1.336 | 4.500 | 4 |
| 4 | Welfare (1975) | 4.835 | 1.309 | 4.417 | 6 |
| 5 | I Belong (Som du ser meg) (2012) | 4.815 | 1.289 | 4.750 | 2 |
| 6 | No Distance Left to Run (2010) | 4.812 | 1.287 | 5.000 | 1 |
| 7 | Hot Pepper (1973) | 4.808 | 1.282 | 4.250 | 2 |
| 8 | For Neda (2010) | 4.798 | 1.272 | 4.500 | 5 |
| 9 | That Day, on the Beach (Hai tan de yi tian) (1983) | 4.786 | 1.261 | 4.500 | 2 |

It looks like we have the same issue as before, where movies with very few ratings dominate the rankings. If we filter to only include movies with at least 50 ratings, we get a list that closely resembles the top 10 movies by raw mean rating.
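The filter itself is a one-liner on the merged table; a minimal sketch (`at_least_50` is an ad hoc name):

Code
# restrict to movies with at least 50 ratings, then rank by adjusted rating
at_least_50 = merged[merged['rating_count'] >= 50]
at_least_50.sort_values(by='Adjusted Rating', ascending=False).head(10)

First, our list of top 10 adjusted ratings: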

| | Title | Adjusted Rating | Coefficient | Mean Rating | Number of Ratings |
|---|---|---|---|---|---|
| 45 | Shawshank Redemption, The (1994) | 4.609 | 1.083 | 4.447 | 63366 |
| 61 | Godfather, The (1972) | 4.558 | 1.032 | 4.365 | 41355 |
| 69 | Paths of Glory (1957) | 4.535 | 1.009 | 4.233 | 3568 |
| 79 | Usual Suspects, The (1995) | 4.525 | 0.999 | 4.334 | 47006 |
| 85 | Seven Samurai (Shichinin no samurai) (1954) | 4.520 | 0.994 | 4.274 | 11611 |
| 87 | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.519 | 0.993 | 4.257 | 6525 |
| 89 | Third Man, The (1949) | 4.518 | 0.992 | 4.246 | 6565 |
| 90 | Decalogue, The (Dekalog) (1989) | 4.516 | 0.991 | 4.174 | 402 |
| 92 | Lives of Others, The (Das leben der Anderen) (2006) | 4.514 | 0.989 | 4.235 | 5720 |
| 95 | Fawlty Towers (1975-1979) | 4.512 | 0.986 | 4.128 | 230 |

Now compare this with the list of the top 10 movies with at least 50 reviews, sorted by mean rating:

| movieId | Title | Mean Rating | Number of Ratings |
|---|---|---|---|
| 318 | Shawshank Redemption, The (1994) | 4.447 | 63366 |
| 858 | Godfather, The (1972) | 4.365 | 41355 |
| 50 | Usual Suspects, The (1995) | 4.334 | 47006 |
| 527 | Schindler’s List (1993) | 4.310 | 50054 |
| 1221 | Godfather: Part II, The (1974) | 4.276 | 27398 |
| 2019 | Seven Samurai (Shichinin no samurai) (1954) | 4.274 | 11611 |
| 904 | Rear Window (1954) | 4.271 | 17449 |
| 7502 | Band of Brothers (2001) | 4.263 | 4305 |
| 912 | Casablanca (1942) | 4.258 | 24349 |
| 922 | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.257 | 6525 |

Half the movies appear on both lists, but a handful differ. It’s interesting to see some international films rise in the adjusted rankings; I wonder if this is partially due to cultural norms around ratings, which is well summarized in the meme at the top (taken from the lectures of my Bayesian stats book club). In other words, are those movies/shows being rated by people who tend to assign lower ratings, leading them to be underrated? There are anecdotal examples of people experiencing cultural differences in ratings, and others have explored the subject quantitatively, noticing differences across genders as well.

One more thing I’d like to look at is which movies have the largest gaps between their unadjusted and adjusted ratings, i.e. the largest residuals. We can think of these as the most underrated (or overrated) movies, consequences of being rated by many folks who consistently rate much lower (or higher) than the average rater. Let’s start with some underrated ones:

Code
merged['Delta'] = merged['Adjusted Rating'] - merged['avg_rating']
merged.sort_values(by='Delta', ascending=False, inplace=True)
merged.head(10)
| | Title | Average Rating | Adjusted Rating | Delta | Number of Ratings | Movie Coefficient |
|---|---|---|---|---|---|---|
| 23061 | Alone for Christmas (2013) | 0.5 | 2.968 | 2.468 | 1 | -0.557 |
| 23062 | Tom and Jerry: The Lost Dragon (2014) | 0.5 | 2.968 | 2.468 | 1 | -0.557 |
| 23063 | Andrew Dice Clay: Indestructible (2012) | 0.5 | 2.968 | 2.468 | 1 | -0.557 |
| 23238 | Rimini, Rimini: A Year Later (1988) | 0.5 | 2.947 | 2.447 | 1 | -0.579 |
| 23272 | Age of Dinosaurs (2013) | 0.5 | 2.941 | 2.441 | 1 | -0.585 |
| 23271 | Captive Women (1000 Years from Now) (3000 A.D.) (1952) | 0.5 | 2.941 | 2.441 | 1 | -0.585 |
| 23466 | Beethoven’s 5th (2003) | 0.5 | 2.918 | 2.418 | 1 | -0.607 |
| 23461 | Beethoven’s Treasure Tail (2014) | 0.5 | 2.918 | 2.418 | 1 | -0.607 |
| 23455 | Alpha and Omega 2: A Howl-iday Adventure (Alpha & Omega 2) (2013) | 0.5 | 2.918 | 2.418 | 1 | -0.607 |
| 23465 | Mr. Troop Mom (2009) | 0.5 | 2.918 | 2.418 | 1 | -0.607 |

Again, movies with only one rating comprise this list. These movies were all rated half a star by a single user, and their movie coefficients (\(\beta_j\)) aren’t large in magnitude, so the gap between raw and adjusted ratings is large. Even if we restrict the same analysis to movies with at least 50 ratings, the titles aren’t very recognizable, but I’d guess these are somewhat underrated due to cultural rating norms.
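A minimal sketch of that filtered view, reusing the `Delta` column computed above (`nlargest` sorts by `Delta` and keeps the top rows):

Code
# most underrated among movies with at least 50 ratings
merged[merged['rating_count'] >= 50].nlargest(10, 'Delta')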

| | Title | Average Rating | Adjusted Rating | Delta | Number of Ratings | Movie Coefficient |
|---|---|---|---|---|---|---|
| 8622 | As Tears Go By (Wong gok ka moon) (1988) | 3.192 | 3.806 | 0.614 | 52 | 0.281 |
| 4389 | Merchant of Four Seasons, The (Händler der vier Jahreszeiten) (1972) | 3.406 | 4.009 | 0.603 | 64 | 0.483 |
| 15770 | Pellet (Bola, El) (2000) | 2.900 | 3.473 | 0.573 | 50 | -0.053 |
| 11091 | Separation, The (Séparation, La) (1994) | 3.127 | 3.690 | 0.563 | 51 | 0.165 |
| 1299 | Autumn Afternoon, An (Sanma no aji) (1962) | 3.669 | 4.228 | 0.559 | 71 | 0.702 |
| 12790 | See the Sea (1997) | 3.054 | 3.611 | 0.557 | 56 | 0.086 |
| 6203 | Dames du Bois de Boulogne, Les (Ladies of the Bois de Boulogne, The) (Ladies of the Park) (1945) | 3.357 | 3.914 | 0.557 | 70 | 0.389 |
| 1728 | Eureka (Yurîka) (2000) | 3.630 | 4.185 | 0.555 | 96 | 0.660 |
| 8676 | Lili Marleen (1981) | 3.250 | 3.804 | 0.554 | 56 | 0.278 |
| 15663 | Devils on the Doorstep (Guizi lai le) (2000) | 2.926 | 3.479 | 0.553 | 61 | -0.046 |

We see similar results when we look at the most overrated:

| | Title | Average Rating | Adjusted Rating | Delta | Number of Ratings | Movie Coefficient |
|---|---|---|---|---|---|---|
| 7696 | Boys Diving, Honolulu (1901) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7698 | Ella Lola, a la Trilby (1898) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7697 | Barchester Chronicles, The (1982) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7691 | Keeping the Promise (Sign of the Beaver, The) (1997) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7695 | Boy Meets Boy (2008) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7692 | Oranges (2004) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7693 | Best of Ernie and Bert, The (1988) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7694 | Prom Queen: The Marc Hall Story (2004) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7699 | Junior Prom (1946) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |
| 7700 | Turkish Dance, Ella Lola (1898) | 5.0 | 3.849 | -1.151 | 1 | 0.324 |

Looking at the most overrated movies with at least 50 reviews, we notice something interesting:

| | Title | Average Rating | Adjusted Rating | Delta | Number of Ratings | Movie Coefficient |
|---|---|---|---|---|---|---|
| 23938 | Big Green, The (1995) | 2.855 | 2.858 | 0.003 | 956 | -0.668 |
| 16408 | Stefano Quantestorie (1993) | 3.439 | 3.441 | 0.002 | 57 | -0.085 |
| 15076 | Captives (1994) | 3.508 | 3.507 | -0.001 | 181 | -0.018 |
| 25637 | Gordy (1995) | 2.531 | 2.528 | -0.003 | 439 | -0.997 |
| 23952 | Man of the House (1995) | 2.871 | 2.856 | -0.015 | 1181 | -0.669 |
| 22917 | Homeward Bound II: Lost in San Francisco (1996) | 3.002 | 2.984 | -0.018 | 2430 | -0.542 |
| 21642 | Sunset Park (1996) | 3.135 | 3.105 | -0.031 | 251 | -0.421 |
| 13566 | Talking About Sex (1994) | 3.608 | 3.578 | -0.031 | 106 | 0.052 |
| 25276 | Santa with Muscles (1996) | 2.666 | 2.625 | -0.041 | 148 | -0.901 |
| 19234 | Lotto Land (1995) | 3.328 | 3.279 | -0.049 | 58 | -0.247 |

It’s interesting that there are only eight movies with at least 50 ratings that have a negative “Delta” between their average and adjusted ratings. I’m curious what’s going on here: whether it’s truly a facet of the distribution of movie ratings, or an artifact of the analysis and regularization. On the latter point, the mean rating across all movies is about \(3.53\), so there’s much more room for a movie to be rated below average than above it. The \(L^2\) regularization prevents the Movie and User Coefficients from becoming too large, so the estimated ratings are pulled toward the mean, leading to larger deltas for poorly rated movies than for highly rated films. Looking at the distributions of the \(\alpha\)s, \(\beta\)s, and the Deltas, we see this is the case.

While the User Coefficients (\(\alpha_i\)) appear normally distributed, the Movie Coefficients (\(\beta_j\)) are left-skewed, with more large negative values than large positive values. I expect things to be more balanced (but perhaps not to generalize as well) if we don’t use regularization.
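For reference, a minimal matplotlib sketch of how these histograms can be drawn from the `user_df` and `movie_df` coefficient tables built above:

Code
import matplotlib.pyplot as plt

# side-by-side histograms of the fitted user and movie effects
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(user_df['coefficient'], bins=100)
axes[0].set_title(r'User effects $\alpha_i$')
axes[1].hist(movie_df['coefficient'], bins=100)
axes[1].set_title(r'Movie effects $\beta_j$')
plt.tight_layout()
plt.show()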

Code
# set up and fit the model without regularization
# this cell will likely take a while to run; there are > 20,000 parameters to fit
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=True)

model = make_pipeline(
    ohe,
    Ridge(alpha=0.0)       # alpha = 0 removes the regularization (equivalent to OLS)
)

model.fit(ratings[['userId', 'movieId']], ratings['rating'])

encoder = model.named_steps['onehotencoder']
feature_names = encoder.get_feature_names_out(['userId', 'movieId'])

coef = model.named_steps['ridge'].coef_
intercept = model.named_steps['ridge'].intercept_

Movies with few total ratings still dominate the leaderboard, so we still want to filter. Here are the top underrated and overrated movies with at least 50 ratings according to this non-regularized approach. For the overrated list, we see the same phenomenon, with very few movies qualifying.

Underrated:

| | Title | Average Rating | Adjusted Rating | Delta | Number of Ratings | Movie Coefficient |
|---|---|---|---|---|---|---|
| 9332 | As Tears Go By (Wong gok ka moon) (1988) | 3.192 | 3.826 | 0.634 | 52 | 0.301 |
| 5466 | Merchant of Four Seasons, The (Händler der vier Jahreszeiten) (1972) | 3.406 | 4.031 | 0.624 | 64 | 0.505 |
| 15647 | Pellet (Bola, El) (2000) | 2.900 | 3.485 | 0.585 | 50 | -0.041 |
| 2312 | Autumn Afternoon, An (Sanma no aji) (1962) | 3.669 | 4.253 | 0.584 | 71 | 0.728 |
| 11573 | Separation, The (Séparation, La) (1994) | 3.127 | 3.708 | 0.581 | 51 | 0.183 |
| 3930 | I Hired a Contract Killer (1990) | 3.548 | 4.127 | 0.579 | 52 | 0.602 |
| 7291 | Dames du Bois de Boulogne, Les (Ladies of the Bois de Boulogne, The) (Ladies of the Park) (1945) | 3.357 | 3.935 | 0.578 | 70 | 0.409 |
| 2813 | Eureka (Yurîka) (2000) | 3.630 | 4.206 | 0.575 | 96 | 0.680 |
| 13082 | See the Sea (1997) | 3.054 | 3.626 | 0.572 | 56 | 0.100 |
| 9436 | Lili Marleen (1981) | 3.250 | 3.821 | 0.571 | 56 | 0.296 |

Overrated:

| | Title | Average Rating | Adjusted Rating | Delta | Number of Ratings | Movie Coefficient |
|---|---|---|---|---|---|---|
| 23011 | Big Green, The (1995) | 2.855 | 2.865 | 0.010 | 956 | -0.661 |
| 16292 | Stefano Quantestorie (1993) | 3.439 | 3.445 | 0.006 | 57 | -0.081 |
| 15071 | Captives (1994) | 3.508 | 3.513 | 0.004 | 181 | -0.013 |
| 24806 | Gordy (1995) | 2.531 | 2.532 | 0.002 | 439 | -0.993 |
| 23031 | Man of the House (1995) | 2.871 | 2.863 | -0.009 | 1181 | -0.663 |
| 21995 | Homeward Bound II: Lost in San Francisco (1996) | 3.002 | 2.989 | -0.013 | 2430 | -0.536 |
| 13786 | Talking About Sex (1994) | 3.608 | 3.585 | -0.024 | 106 | 0.059 |
| 20839 | Sunset Park (1996) | 3.135 | 3.109 | -0.026 | 251 | -0.416 |
| 24440 | Santa with Muscles (1996) | 2.666 | 2.624 | -0.041 | 148 | -0.901 |
| 18757 | Lotto Land (1995) | 3.328 | 3.279 | -0.048 | 58 | -0.246 |

Finally, here’s the distribution of the Movie Coefficients (\(\beta_j\)) under the non-regularized approach. We can see it’s more spread out and balanced, as expected.

We could conduct many hypothesis tests and pare the model down to solely those parameters that appear to be statistically significant, but I’ll save that analysis for another time.

Further approaches and questions

Stay tuned for future posts on other recommender systems where I’ll answer questions such as

  • How does one select the hyperparameter \(\lambda\)?
  • How can we measure the performance of the model?
  • What about personalized recommender systems?