Recommender System using Content Filtering
A simple movie recommender system using pandas and sklearn
Introduction
What are Recommendation Systems? Strictly speaking, Recommendation Systems (usually abbreviated as RecSys) are a class of information filtering systems. Their goal is to improve the quality of the results delivered to the end user by taking into account different heuristics related to that user.
Recommendation Systems power nearly every kind of content on the Internet, from search results on your favorite search engine to the "Customers Who Bought This Item Also Bought" window on Amazon.com. Netflix uses recommender systems to help you find the next show or movie to watch. They're everywhere, and highly effective, because of their ability to narrow down content for a user, be it a movie to watch or an item to buy.
There are primarily three different kinds of RecSys, each of which is formulated differently and has different operating characteristics.
- Demographic
Demographic (or popularity) based RecSys offer generalized, non-personalized recommendations: they focus on what's universally popular (within the problem domain) and assume that generality holds for all users from the target demographic. They are simple, easy to get started with, and don't require any user data to be effective. That said, they're not as effective as the other types of RecSys that we'll look at.
- Content Based Filtering
Content-based RecSys try to group similar items together. If a user has seen or bought an item, they're more likely to be interested in other items from the same bucket. Instead of relying on user information, these systems rely on information about the item itself to place similar items in the same bucket. More information can be found here.
- Collaborative Filtering
Collaborative Filtering builds on the idea of Content Based Filtering by using similarities between users as well as items to make recommendations. To put it simply, consider the following scenario: person A has seen movie X and person B has seen movie Y. Now, it turns out A and B have similar tastes, so the system will recommend movie Y to A on account of that similarity, as the toy sketch below illustrates. More information can be found on the Google Developers blog.
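To make the collaborative idea concrete, here's a toy sketch with entirely made-up ratings (the people, movies, and scores are hypothetical, and 0 stands in for "not seen"):
#toy illustration of user-based collaborative filtering on hypothetical data
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame(
    {'movie_X': [5, 5, 1], 'movie_Y': [0, 4, 2], 'movie_Z': [1, 1, 5]},
    index = ['person_A', 'person_B', 'person_C'])

#user-user similarity: A is far closer to B (~0.79) than to C (~0.36), so
#movie_Y, which B rated highly and A hasn't seen, is a good candidate for A
sim = pd.DataFrame(cosine_similarity(ratings),
                   index = ratings.index, columns = ratings.index)
print(sim)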
TMDB5000 Dataset
TMDB5000 is a Kaggle-hosted dataset derived from the TMDB API. You can learn more about TMDB on their website. It consists of roughly 5000 entries, each of which has 20 features that we can work with. The data has likely gone through some pre-processing already, though there's still a lot to be done, as we'll see below.
Let's dive into some of the code, so we can get started.
#collapse-hide
#imports
from utils import build_word_cloud, clean_num, get_month, get_day, get_director, get_list, clean_list, create_feature
from pathlib import Path
import warnings
warnings.simplefilter('ignore')
import ast
import math
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
dataPath = Path('data/')
movies = pd.read_csv(dataPath/'tmdb_5000_movies.csv')
credits = pd.read_csv(dataPath/'tmdb_5000_credits.csv')
movies.shape
movies.info()
We can see that we have data on 4803 different movies. The available features include each movie's budget and revenue, its cast and crew, as well as descriptive information about its genres and keywords.
Let's look at a few samples of the data, so we can get a sense of the work that we have to do before we can start extracting meaningful information.
movies.head(2)
Data Wrangling
As can be seen above, several features are in a stringified format, so we'll need to convert those back to their original representations. The homepage feature has no information for a majority of the data points, so removing it shouldn't lead to any major data loss. Almost all of the features are represented as generic object dtypes rather than native data types, so we'll look into that as well. We already have certain features, namely vote_count and vote_average, that we can use to gain some insight and build a baseline recommendation system, as shown below.
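For instance, a minimal popularity-based baseline (a sketch using the well-known IMDB weighted-rating formula; the 90th-percentile vote-count cutoff is an arbitrary choice on my part) could look like this:
#popularity baseline sketch: IMDB weighted rating
#WR = v/(v+m) * R + m/(v+m) * C, where R = vote_average, v = vote_count,
#C = mean rating over all movies, m = minimum votes required to qualify
C = movies['vote_average'].mean()
m = movies['vote_count'].quantile(0.90)  #arbitrary cutoff
qualified = movies[movies['vote_count'] >= m].copy()
v, R = qualified['vote_count'], qualified['vote_average']
qualified['score'] = v / (v + m) * R + m / (v + m) * C
qualified[['title', 'score']].sort_values('score', ascending = False).head(10)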
Removing the original_title feature
We have two features related to the movie's title, namely title and original_title. There are, however, a small number of cases (~6% of the total dataset) where those two features aren't exactly the same. A visual inspection tells us that even in those cases, we can get the name of the movie from the title and original_language features.
Considering we're trying to build a recommender system, the name of the movie is going to be one of the less significant features, so we can safely drop original_title from consideration.
cols = ['title', 'original_title', 'original_language']
movies[movies['original_title'] != movies['title']][cols].head()
movies = movies.drop('original_title', axis = 1)
movies['revenue'].describe()
print(movies[movies['revenue'] == 0].shape[0]/(movies.shape[0]))
We can see above that roughly 30% of the entries in our data have no revenue, i.e. revenue is 0. While that may make it seem like a not-so-useful feature, we can still put it to work later. For now, let's replace the zero values with NaN so the missing entries don't skew our statistics.
movies['revenue'] = movies['revenue'].replace(0, math.nan)
movies['budget'].describe()
print(movies[movies['budget'] == 0].shape[0]/(movies.shape[0]))
We can see above that roughly 22% of the entries in our data have no budget, i.e. budget is 0. Let's replace the zero values with NaN here too, so we can use the feature later.
movies['budget'] = movies['budget'].replace(0, math.nan)
Let's extract the year of release. release_date is currently a generic object dtype, so we'll parse it into a datetime representation first.
movies['year'] = pd.to_datetime(movies['release_date'], errors='coerce').apply(
    #pd.notnull catches missing dates (NaT); comparing against math.nan
    #would always be True, since NaN compares unequal to everything
    lambda x: str(x).split('-')[0] if pd.notnull(x) else math.nan)
movies['return'] = movies['revenue'] / movies['budget']
movies['return'].describe()
There's a lot more cleaning and aggregation to be done. But without exploring the data first and building some intuition for it, we don't really know what's worth doing. We'll explore the data first, and do any further cleaning and feature generation later, as the need arises.
movies['title'] = movies['title'].astype('str')
movies['overview'] = movies['overview'].astype('str')
#word clouds of the most frequent terms in titles and overviews
build_word_cloud(movies, 'title')
build_word_cloud(movies, 'overview')
movies['original_language'].drop_duplicates().shape
There are 37 different languages, which is a wide range. Let's plot the number of occurrences of each language.
languages = pd.DataFrame(movies['original_language'].value_counts())
languages['language'] = languages.index
columns = ['count', 'language']
languages.columns = columns
languages.head(7)
plt.figure(figsize=(10,5))
sns.barplot(x = 'language', y = 'count', data = languages)
plt.show()
As expected, an overwhelmingly large number of the movies are in English. Let's look at a graph that makes it a bit easier for us to visualize the different languages present in the dataset.
plt.figure(figsize=(12,6))
sns.barplot(x = 'language', y = 'count', data = languages.iloc[1:])
plt.show()
movies['popularity'] = movies['popularity'].apply(clean_num).astype('float')
movies['vote_count'] = movies['vote_count'].apply(clean_num).astype('float')
movies['vote_average'] = movies['vote_average'].apply(clean_num).astype('float')
#list movies by popularity
movies[['title', 'popularity']].sort_values('popularity', ascending = False).head(5)
movies[['title', 'vote_count']].sort_values('vote_count', ascending = False).head(5)
movies[['title', 'vote_average']].sort_values('vote_average', ascending = False).head(5)
Release Dates
Amongst all the features we could be looking at or creating, perhaps few are as important as the time of release. When a movie is released tends to have a strong correlation with how well it does: major franchise movies tend to be released around the holidays and the summer months, and movies released in those windows go on to do a lot better than those released during the rest of the year. Let's try to plot this distribution.
movies['day'] = movies['release_date'].apply(get_day)
movies['month'] = movies['release_date'].apply(get_month)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sns.countplot(x = 'month', data = movies, order = months)
print(movies['revenue'].mean())
The mean revenue for movies in our dataset is roughly $82.26 million. Let's now look at how average revenue varies by release month among the movies that earn more than this mean.
means = pd.DataFrame(movies[movies['revenue'] > movies['revenue'].mean()].groupby('month')['revenue'].mean())
means['month'] = means.index
plt.figure(figsize=(16,9))
plt.title('Average Revenue by Month')
sns.barplot(x = 'month', y = 'revenue', data = means, order = months)
From the chart above, we can see that the summer months (April to June) have the highest average revenue among the movies that fare better than the numerically average movie in the dataset.
How similar or dissimilar is this chart from one where we compute the same monthly averages over all movies with recorded revenue?
means = pd.DataFrame(movies[movies['revenue'] > 1].groupby('month')['revenue'].mean())
means['month'] = means.index
plt.figure(figsize=(16,9))
plt.title('Average Revenue by Month')
sns.barplot(x = 'month', y = 'revenue', data = means, order = months)
plt.figure(figsize=(12,9))
sns.distplot(movies[movies['budget'].notnull()]['budget'])
As we can see above, the budgets of the movies in the dataset are highly skewed, suggesting that a rather large number of movies have extremely small budgets.
sns.jointplot(x = 'budget', y = 'revenue', data = movies)
The graph above suggests that there's a strong positive correlation between the budget of a movie and its revenue.
movies['revenue'].describe()
best_revenue = movies[['title', 'budget', 'revenue']].sort_values('revenue', ascending = False)
best_revenue
A useful follow-up analysis here would be to take inflation into account. The movies at the top are all from recent times, so adjusting revenue for inflation may be necessary to get meaningful information out of this feature.
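As a rough sketch of what that adjustment could look like (the cpi_multiplier values below are made up purely for illustration; real multipliers would come from published CPI data):
#hypothetical inflation adjustment: scale each movie's revenue to a common base year
cpi_multiplier = {'2015': 1.00, '2010': 1.09, '2000': 1.38, '1990': 1.82}  #made-up values

def adjust_for_inflation(row):
    factor = cpi_multiplier.get(row['year'])  #'year' is a string here
    return row['revenue'] * factor if factor else math.nan

movies['revenue_adj'] = movies.apply(adjust_for_inflation, axis = 1)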
#convert string-ified lists to list representation
movies['genres'] = movies['genres'].fillna('[]').apply(
ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
#explode the genres lists so each (movie, genre) pair gets its own row
series = movies.apply(lambda x: pd.Series(x['genres']),axis = 1).stack().reset_index(level = 1, drop = True)
series.name = 'genres'
genres = movies.drop('genres', axis = 1).join(series)
genres['genres'].value_counts().shape[0]
There are 20 different genres, which we look at below.
popular = pd.DataFrame(genres['genres'].value_counts()).reset_index()
popular.columns = ['genre', 'count']
popular.head()
plt.figure(figsize = (18,9))
sns.barplot(x = 'genre', y = 'count', data = popular)
plt.show()
Let's look at the trends over time for a specific set of genres.
genre = ['Drama', 'Action', 'Comedy', 'Thriller', 'Romance', 'Adventure', 'Horror', 'Family']
#missing years are already NaN
genres['year'] = genres['year'].apply(clean_num)
trends = genres[(genres['genres'].isin(genre)) & (genres['year'] >= 2000) & (genres['year'] <= 2016)]
#normalize each year's counts so rows sum to 1, i.e. genre share per year
ctab = pd.crosstab([trends['year']], trends['genres']).apply(lambda x: x/x.sum(), axis = 1)
ctab[genre].plot(kind = 'line', stacked = False, colormap = 'jet', figsize = (18,9))
plt.legend(bbox_to_anchor=(1, 1))
There seems to be a sharp decline in the share of drama movies from 2014-2016, while the shares of horror and action have gone up around the same time.
Modeling
There are a lot of interesting possibilities in terms of visualizations we could have explored. Ideally, we would clean and analyze every possible feature and try to identify its importance for the final task. However, for the purposes of keeping this clean and simple, I'll go ahead and build what we're actually here for: a recommendation system.
We'll be building a Content Filtering based RecSys. In a way, given the dataset, we are limited in the approach we can take: we don't have user information available in the data, so collaborative filtering is unfortunately not an option. We do have metadata about the movies themselves, and that fits perfectly within the operating characteristics of a Content Filtering based recommendation system.
movies = pd.read_csv(dataPath/'tmdb_5000_movies.csv')
credits = pd.read_csv(dataPath/'tmdb_5000_credits.csv')
credits.head(2)
movies.head(2)
Join both the DataFrames on the id attribute, so we can incorporate features from both datasets.
credits.columns = ['id', 'title', 'cast', 'crew']
data = movies.merge(credits, on = 'id')
data.info()
data['overview'] = data['overview'].fillna('')
overview = data['overview']
We'll convert each overview into a vector of word scores, where a word's weight grows with how often it appears in that overview and shrinks with how common it is across all overviews. This can be done using the TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer. For more information, check out this blog.
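Roughly, the classic TF-IDF weight of a term $t$ in a document (here, an overview) $d$ is

$$\textrm{tfidf}(t, d) = \textrm{tf}(t, d) \times \log\frac{N}{\textrm{df}(t)}$$

where $\textrm{tf}(t, d)$ counts how often $t$ occurs in $d$, $N$ is the number of documents, and $\textrm{df}(t)$ is the number of documents containing $t$. sklearn's implementation uses a smoothed variant of the idf term and L2-normalizes each row, but the intuition is the same.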
tfidf = TfidfVectorizer(stop_words = 'english')
mat = tfidf.fit_transform(overview)
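We can sanity-check the dimensions of the result (fit_transform returns a scipy sparse matrix):
mat.shape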
Here, mat is the matrix obtained from the TfidfVectorizer. The shape of the matrix indicates that there are 4803 movies, with a total of 20978 distinct words used to describe them. For intuition, the matrix obtained is sparse, and its entries print in the following format:
(document_id, token_id) score
cosine_sim = cosine_similarity(mat, mat)
Now that we've obtained our matrix, we need to determine, using the TF-IDF scores, how alike any two movies are. There are several ways to measure this similarity; one of them is cosine similarity.
Simply put, cosine similarity measures how close 2 vectors are in direction: it is the cosine of the angle between them. The smaller the angle, the more similar the 2 vectors. It follows from trigonometry that cos(0°) = 1 and cos(90°) = 0, i.e. vectors pointing the same way are maximally similar, while orthogonal vectors are dissimilar.
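Formally, for two vectors $A$ and $B$:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$

Since TF-IDF scores are non-negative, the similarity values here fall between 0 and 1.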
Next, we'll build a reverse index, so that given a string representing a movie title, we can look up its position in our dataset.
title2id = pd.Series(data.index, index = data['title_x'])
title2id.shape
def recommend(title, measure, npreds):
    idx = title2id[title]
    score = list(enumerate(measure[idx]))
    score = sorted(score, key = lambda x: x[1], reverse = True)
    #skip the first entry, which is the queried movie itself (similarity 1)
    score = score[1:npreds + 1]
    idxs = [i[0] for i in score]
    return data['title_x'].iloc[idxs]
recommend("Pirates of the Caribbean: At World's End", cosine_sim, 10)
recommend("El Mariachi", cosine_sim, 10)
recommend('The Dark Knight Rises', cosine_sim, 10)
#parse stringified objects
pd.set_option('display.max_colwidth', 100)
features = ['keywords', 'genres', 'cast', 'crew']
for feature in features:
data[feature] = data[feature].apply(ast.literal_eval)
data['director'] = data['crew'].apply(get_director)
data[['director', 'title_x']].head(5)
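The helpers imported from utils aren't shown in this post. A plausible sketch of get_director (my guess at the implementation, not necessarily the exact code) scans the parsed crew list for the entry whose job is Director:
#sketch of get_director: pull the director's name from a parsed crew list
def get_director(crew):
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return math.nan  #no director credited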
features = ['cast', 'keywords', 'genres']
for feature in features:
data[feature] = data[feature].apply(get_list)
cols = ['title_x', 'director', 'cast', 'keywords', 'genres']
data[cols].head(3)
for feature in features:
data[feature] = data[feature].apply(clean_list)
data[cols].head(3)
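For reference, get_list and clean_list might look something like this (again sketches of the unshown utils helpers): get_list keeps the first few names from each parsed list, and clean_list lowercases them and strips spaces, so 'Johnny Depp' becomes 'johnnydepp' and can't be partially matched against other Johnnys:
#sketch of get_list: keep the top 3 names from a parsed list of dicts
def get_list(x):
    if isinstance(x, list):
        return [i['name'] for i in x][:3]
    return []

#sketch of clean_list: lowercase and strip spaces so each name is one token
def clean_list(x):
    if isinstance(x, list):
        return [i.replace(' ', '').lower() for i in x]
    return str(x).replace(' ', '').lower()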
data['feature'] = data.apply(create_feature, axis = 1)
cols = ['title_x', 'director', 'cast', 'keywords', 'genres', 'feature']
data[cols].head(3)
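Similarly, create_feature presumably concatenates the cleaned metadata into a single string (a "soup") per movie, which is what we'll vectorize next. A sketch:
#sketch of create_feature: join keywords, cast, director, and genres into one text blob
def create_feature(row):
    director = row['director'] if isinstance(row['director'], str) else ''
    return ' '.join(row['keywords']) + ' ' + ' '.join(row['cast']) + ' ' + \
           director + ' ' + ' '.join(row['genres'])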
count = CountVectorizer(stop_words = 'english')
count_mat = count.fit_transform(data['feature'])
cosine_sim1 = cosine_similarity(count_mat, count_mat)
data = data.reset_index()
#rebuild the title -> index mapping so recommend() picks up the new index
title2id = pd.Series(data.index, index = data['title_x'])
recommend("Pirates of the Caribbean: At World's End", cosine_sim1, 10)
recommend("El Mariachi", cosine_sim1, 10)
recommend('The Dark Knight Rises', cosine_sim1, 10)