Book Recommendation Engine
Intro
This is the third project in the freeCodeCamp Machine Learning with Python Certification. For this project we have to create a book recommendation engine using K-Nearest Neighbors. We can use TensorFlow and scikit-learn to build our model, and we will start from the boilerplate code provided by freeCodeCamp. Read more about it in Book Recommendation Engine using KNN.
Check out the full code for this project at https://colab.research.google.com/drive/1yYJ6QVESBLFJrX-zkK4huPX9zjM6Gtgo?usp=sharing
Planning
We will use a Pandas DataFrame to import and filter the data according to the project requirements. Then we will use NearestNeighbors from scikit-learn. The fit method of NearestNeighbors needs a numeric matrix with no missing values, so we will pivot the DataFrame with our data and fill in zeros in place of NaN. We can then train our model. Finally, we will create a function that takes a book title as an argument and returns the top five suggestions using our model.
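The finished recommendation function will have roughly this shape (just a sketch of the interface; the full implementation is developed step by step below)
def get_recommends(book=""):
    # look up the book's row in the pivoted ratings matrix,
    # query the trained NearestNeighbors model for the six closest rows,
    # and return [book, [[title, distance], ...]] with five suggestions
    ...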
Code
I will put a note before the code blocks provided by freeCodeCamp and briefly go over them, so we can follow what is happening in the program.
Setup
Import the libraries we need
NOTE: This code block is provided by freeCodeCamp
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
Get the zip containing the dataset and unzip it. To run these shell commands in a Colab notebook, add an exclamation mark (!) before them
NOTE: This code block is provided by freeCodeCamp
wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
unzip book-crossings.zip
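In a Colab cell, the same commands with the exclamation mark look like this
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
!unzip book-crossings.zip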
Create variable to store filenames
NOTE: This code block is provided by freeCodeCamp
books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'
Import CSV data from files above to Pandas DataFrame
NOTE: This code block is provided by freeCodeCamp
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})
Let’s check what our books data looks like. We can see the first five rows with
df_books.head()
 | isbn | title | author |
---|---|---|---|
0 | 0195153448 | Classical Mythology | Mark P. O. Morford |
1 | 0002005018 | Clara Callan | Richard Bruce Wright |
2 | 0060973129 | Decision in Normandy | Carlo D’Este |
3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic… | Gina Bari Kolata |
4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber |
We can get some stats about the books DataFrame
df_books.describe()
index | isbn | title | author |
---|---|---|---|
unique | 271379 | 242154 | 102042 |
count | 271379 | 271379 | 271378 |
freq | 1 | 27 | 632 |
top | 0195153448 | Selected Poems | Agatha Christie |
All the ISBNs are unique but the titles are not. There might be some books that have been added under multiple ISBNs. We might need to clean this data later.
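As a quick check (not required by the project), we could list the titles that appear under more than one ISBN, using only the df_books DataFrame defined above (title_isbn_counts is just a helper name)
# count how many distinct ISBNs each title has; anything above 1 is a duplicate title
title_isbn_counts = df_books.groupby('title')['isbn'].nunique()
title_isbn_counts[title_isbn_counts > 1].sort_values(ascending=False).head()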
Let’s check what our rating data looks like. We can see the first five rows with
df_ratings.head()
 | user | isbn | rating |
---|---|---|---|
0 | 276725 | 034545104X | 0.0 |
1 | 276726 | 0155061224 | 5.0 |
2 | 276727 | 0446520802 | 0.0 |
3 | 276729 | 052165615X | 3.0 |
4 | 276729 | 0521795028 | 6.0 |
Data Filtering
We need to filter the data as per the project requirements.
The first requirement is to remove the users with less than 200 ratings. We can group the ratings DataFrame by user. The number of rows in each group will give us the number of ratings for that user
df_user_count = df_ratings.groupby('user').size().reset_index(name='counts')
We can look at the count table we just built
df_user_count
 | user | counts |
---|---|---|
0 | 2 | 1 |
1 | 7 | 1 |
2 | 8 | 18 |
3 | 9 | 3 |
4 | 10 | 2 |
… | … | … |
105278 | 278846 | 2 |
105279 | 278849 | 4 |
105280 | 278851 | 23 |
105281 | 278852 | 1 |
105282 | 278854 | 8 |
We can get the number of users by counting the unique users in the original ratings DataFrame or using the length or shape of the grouped DataFrame
len(df_ratings.user.unique())
105283
len(df_user_count)
105283
df_user_count.shape
(105283, 2)
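As a side note, the same per-user counts could also be produced in one line with value_counts; this is just an equivalent alternative to the groupby above, not what the rest of the walkthrough uses
# a Series with users as the index and their rating counts as values
df_ratings['user'].value_counts()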
Create a DataFrame with only the users with more than 200 ratings
df_user_filter = df_user_count[df_user_count['counts']>200]
Check what the filter looks like
df_user_filter.head()
 | user | counts |
---|---|---|
95 | 254 | 314 |
791 | 2276 | 498 |
981 | 2766 | 274 |
1049 | 2977 | 232 |
1177 | 3363 | 901 |
We can do the same for the books, this time grouping the ratings by ISBN.
Group by ISBN
df_isbn_count = df_ratings.groupby('isbn').size().reset_index(name='counts')
Create a DataFrame with only the books that have more than 100 ratings
df_isbn_filter = df_isbn_count[df_isbn_count['counts']>100]
We can use the DataFrames we created as filters to keep only the ratings we want.
Keep only the ratings with the users we kept
df_ratings_filtered = df_ratings[df_ratings.user.isin(df_user_filter['user'])]
Keep only the ratings with the ISBNs we kept
df_ratings_filtered = df_ratings_filtered[df_ratings_filtered.isbn.isin(df_isbn_filter['isbn'])]
We can check the shape of the filtered data and what the data looks like now
df_ratings_filtered.shape
(49254, 3)
df_ratings_filtered.head()
 | user | isbn | rating |
---|---|---|---|
1456 | 277427 | 002542730X | 10.0 |
1469 | 277427 | 0060930535 | 0.0 |
1471 | 277427 | 0060934417 | 0.0 |
1474 | 277427 | 0061009059 | 9.0 |
1484 | 277427 | 0140067477 | 0.0 |
Let’s merge the filtered rating data with the books data to get the title and author in the same DataFrame
df_merged = pd.merge(left=df_ratings_filtered, right=df_books, on='isbn')
Check what the merged data looks like
df_merged.shape
(48990, 5)
df_merged.head()
 | user | isbn | rating | title | author |
---|---|---|---|---|---|
0 | 277427 | 002542730X | 10.0 | Politically Correct Bedtime Stories: Modern Ta… | James Finn Garner |
1 | 3363 | 002542730X | 0.0 | Politically Correct Bedtime Stories: Modern Ta… | James Finn Garner |
2 | 11676 | 002542730X | 6.0 | Politically Correct Bedtime Stories: Modern Ta… | James Finn Garner |
3 | 12538 | 002542730X | 10.0 | Politically Correct Bedtime Stories: Modern Ta… | James Finn Garner |
4 | 13552 | 002542730X | 0.0 | Politically Correct Bedtime Stories: Modern Ta… | James Finn Garner |
Some users might have reviewed the same book listed under different ISBNs. Let’s get rid of the duplicate ratings based on user and title of the book
df_merged = df_merged.drop_duplicates(subset=['user', 'title'])
Check the shape after dropping duplicates
df_merged.shape
(48615, 5)
We got rid of over 300 duplicate ratings.
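The exact number (48990 - 48615 = 375) could have been confirmed before dropping; for example, running this line before the drop_duplicates call above would count the duplicated user/title pairs
# count duplicated user/title pairs (must run before drop_duplicates, since df_merged is reassigned)
df_merged.duplicated(subset=['user', 'title']).sum()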
Training
As mentioned earlier, we need a matrix with no missing values to train the scikit-learn NearestNeighbors model.
Pivot the merged DataFrame
df_pivoted = pd.pivot(df_merged, index='title', columns=['user'], values='rating')
df_pivoted.head()
user | 254 | 2276 | 2766 | 2977 | 3363 | 4017 | 4385 | 6242 | 6251 | 6323 | … | 274004 | 274061 | 274301 | 274308 | 274808 | 275970 | 277427 | 277478 | 277639 | 278418 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
title | |||||||||||||||||||||
1984 | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN |
1st to Die: A Novel | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2nd Chance | NaN | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN |
4 Blondes | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We see a lot of NaN values, indicating missing ratings. This is fine, since every user only rates a few books. We can replace them with zeros
df_final = df_pivoted.fillna(0)
df_final.head()
user | 254 | 2276 | 2766 | 2977 | 3363 | 4017 | 4385 | 6242 | 6251 | 6323 | … | 274004 | 274061 | 274301 | 274308 | 274808 | 275970 | 277427 | 277478 | 277639 | 278418 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
title | |||||||||||||||||||||
1984 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1st to Die: A Novel | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2nd Chance | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 Blondes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Let’s create our model. We will use cosine as the metric
nn = NearestNeighbors(metric='cosine')
Train the model with the fit method
nn.fit(df_final)
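As a side note, the setup block imports csr_matrix but we never used it. Since the pivoted table is mostly zeros, it could optionally be converted to a real sparse matrix before fitting; with metric='cosine', scikit-learn uses a brute-force search that accepts sparse input. A minimal sketch (nn_sparse is just a hypothetical name for this variant)
from scipy.sparse import csr_matrix

sparse_ratings = csr_matrix(df_final.values)  # stores only the non-zero ratings
nn_sparse = NearestNeighbors(metric='cosine')
nn_sparse.fit(sparse_ratings)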
Testing
Let’s test our model before we start writing the recommendation function. We can ask our model for the six nearest neighbors of a book
nn.kneighbors([df_final.loc["The Weight of Water"]], 6)
(array([[0. , 0.6642004 , 0.68558145, 0.7087431 , 0.7105186 ,
0.71307 ]], dtype=float32),
array([[606, 204, 473, 140, 321, 406]]))
The first array gives the distances, zero being the closest. The second array gives the position of each book in the df_final DataFrame. The first result is the book matching itself, hence the zero distance.
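To see which titles those positions correspond to, the second array can be mapped back through the index of df_final; this is essentially what the recommendation function below does
distance, index = nn.kneighbors([df_final.loc["The Weight of Water"]], 6)
# translate row positions into book titles
[df_final.index[i] for i in index[0]]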
This is what the empty function looks like
def get_recommends(book = ""):
    return recommended_books
We will need to add some more code inside the function. First, look up the book’s row (a 1D vector) in the df_final DataFrame using the book name
input = df_final.loc[book]
Ask our model for the six closest entries to the 1D vector we give it. We use six because one of the results it returns will be the book we passed in
distance, index = nn.kneighbors([input], 6)
Create a list, add the original book as the first element, and then append an empty list that will hold the recommendations
recommended_books = []
recommended_books.append(book)
recommended_books.append([])
The freeCodeCamp test expects the recommendations ordered from farthest to closest (most similar last), so we will iterate over the results in reverse
for i in range(5, 0, -1):
    recommended_books[1].append([df_final.index[index[0][i]], distance[0][i]])
The complete function now looks like
def get_recommends(book = ""):
    input = df_final.loc[book]
    distance, index = nn.kneighbors([input], 6)
    recommended_books = []
    recommended_books.append(book)
    recommended_books.append([])
    for i in range(5, 0, -1):
        recommended_books[1].append([df_final.index[index[0][i]], distance[0][i]])
    return recommended_books
Let’s see if our model passes the freeCodeCamp test
NOTE: This code block is provided by freeCodeCamp
def test_book_recommendation():
    test_pass = True
    recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
    if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
        test_pass = False
    recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
    recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
    for i in range(2):
        if recommends[1][i][0] not in recommended_books:
            test_pass = False
        if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
            test_pass = False
    if test_pass:
        print("You passed the challenge! 🎉🎉🎉🎉🎉")
    else:
        print("You haven't passed yet. Keep trying!")
Print the recommendations for “Where the Heart Is (Oprah’s Book Club (Paperback))”
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)
["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7230184]]]
Running the freeCodeCamp test function
test_book_recommendation()
You passed the challenge! 🎉🎉🎉🎉🎉
Thank you for reading. You can also check out my other projects for this series below.