Book Recommendation Engine

Intro

This is the third project in the freeCodeCamp Machine Learning with Python Certification. For this project we have to create a book recommendation engine using K-Nearest Neighbors. We can use TensorFlow and scikit-learn to build our model, and we will start from the boilerplate code provided by freeCodeCamp. Read more about it in Book Recommendation Engine using KNN.

Check out the full code for this project at https://colab.research.google.com/drive/1yYJ6QVESBLFJrX-zkK4huPX9zjM6Gtgo?usp=sharing

Planning

We will use a Pandas DataFrame to import and filter the data according to the project requirements. Then we will use NearestNeighbors from scikit-learn. The fit method of NearestNeighbors needs a numeric matrix without missing values as input, so we will pivot the DataFrame with our data and fill in zeros in place of NaN. We can then train our model. Finally, we will create a function that takes a book title as an argument and returns the top five suggestions from our model.
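The whole pipeline can be sketched end-to-end on a toy ratings table. Everything below (the data, and names like `matrix` and `nn_toy`) is invented for illustration and is not part of the project:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy ratings table standing in for the real Book-Crossings data
ratings = pd.DataFrame({
    'user':   [1, 1, 2, 2, 3, 3],
    'title':  ['A', 'B', 'A', 'C', 'B', 'C'],
    'rating': [5.0, 3.0, 4.0, 1.0, 2.0, 5.0],
})

# Pivot to a title-by-user matrix and replace missing ratings with zeros
matrix = ratings.pivot(index='title', columns='user', values='rating').fillna(0)

# Fit a cosine-distance nearest-neighbour model on the rows (one row per title)
nn_toy = NearestNeighbors(metric='cosine').fit(matrix)

# The two nearest titles to 'A'; the nearest is 'A' itself at distance 0
distances, indices = nn_toy.kneighbors([matrix.loc['A']], 2)
print(matrix.index[indices[0]].tolist())  # → ['A', 'B']
```

The real notebook follows exactly this shape, only with much more data and with filtering steps in between.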

Code

I will put a note before the code blocks provided by freeCodeCamp and briefly go over them, so we can follow what is happening in the program.

Setup

Import the libraries we need

NOTE: This code block is provided by freeCodeCamp

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

Get the zip containing the data and unzip it. To use shell commands in Colaboratory notebooks, add an exclamation mark (!) before them

NOTE: This code block is provided by freeCodeCamp

wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
unzip book-crossings.zip

Create variables to store the filenames

NOTE: This code block is provided by freeCodeCamp

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

Import the CSV data from the files above into Pandas DataFrames

NOTE: This code block is provided by freeCodeCamp

df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

Let’s check what our books data looks like. We can see the first five rows with

df_books.head()

   isbn        title                                            author
0  0195153448  Classical Mythology                              Mark P. O. Morford
1  0002005018  Clara Callan                                     Richard Bruce Wright
2  0060973129  Decision in Normandy                             Carlo D'Este
3  0374157065  Flu: The Story of the Great Influenza Pandemic…  Gina Bari Kolata
4  0393045218  The Mummies of Urumchi                           E. J. W. Barber

We can get some stats about the books DataFrame

df_books.describe()

        isbn        title           author
unique  271379      242154          102042
count   271379      271379          271378
freq    1           27              632
top     0195153448  Selected Poems  Agatha Christie

All the ISBNs are unique, but the titles are not. Some books might have been added under multiple ISBNs. We might need to clean this data later.
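Titles listed under multiple ISBNs can be spotted with pandas' duplicated method. A quick sketch on an invented books table (none of these rows are from the real dataset):

```python
import pandas as pd

# Invented books table: 'Dune' is listed under two different ISBNs
books = pd.DataFrame({
    'isbn':   ['111', '222', '333'],
    'title':  ['Dune', 'Dune', 'Emma'],
    'author': ['Frank Herbert', 'Frank Herbert', 'Jane Austen'],
})

# keep=False marks every row whose title occurs more than once
dupes = books[books.duplicated(subset='title', keep=False)]
print(dupes['isbn'].tolist())  # → ['111', '222']
```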

Let’s check what our rating data looks like. We can see the first five rows with

df_ratings.head()

   user    isbn        rating
0  276725  034545104X  0.0
1  276726  0155061224  5.0
2  276727  0446520802  0.0
3  276729  052165615X  3.0
4  276729  0521795028  6.0

Data Filtering

We need to filter the data as per the project requirements.

The first requirement is to remove users with fewer than 200 ratings. We can group the ratings DataFrame by user; the number of rows in each group gives us the number of ratings for that user

df_user_count = df_ratings.groupby('user').size().reset_index(name='counts')

We can look at the count table we just built

df_user_count

        user    counts
0       2       1
1       7       1
2       8       18
3       9       3
4       10      2
...     ...     ...
105278  278846  2
105279  278849  4
105280  278851  23
105281  278852  1
105282  278854  8

We can get the number of users by counting the unique users in the original ratings DataFrame or using the length or shape of the grouped DataFrame

len(df_ratings.user.unique())
105283

len(df_user_count)
105283

df_user_count.shape
(105283, 2)

Create a DataFrame with only the users with more than 200 ratings

df_user_filter = df_user_count[df_user_count['counts']>200]

Check what the filter looks like

df_user_filter.head()

      user  counts
95    254   314
791   2276  498
981   2766  274
1049  2977  232
1177  3363  901

We can do the same for the books.

Group by ISBN

df_isbn_count = df_ratings.groupby('isbn').size().reset_index(name='counts')

Create a DataFrame with only the books that have more than 100 ratings

df_isbn_filter = df_isbn_count[df_isbn_count['counts']>100]

We can use the DataFrames we created as filters to keep only the ratings we want.

Keep only the ratings with the users we kept

df_ratings_filtered = df_ratings[df_ratings.user.isin(df_user_filter['user'])]

Keep only the ratings with the ISBNs we kept

df_ratings_filtered = df_ratings_filtered[df_ratings_filtered.isbn.isin(df_isbn_filter['isbn'])]

We can check the shape of the filtered data and what the data looks like now

df_ratings_filtered.shape
(49254, 3)

df_ratings_filtered.head()

      user    isbn        rating
1456  277427  002542730X  10.0
1469  277427  0060930535  0.0
1471  277427  0060934417  0.0
1474  277427  0061009059  9.0
1484  277427  0140067477  0.0

Let’s merge the filtered rating data with the books data to get the title and author in the same DataFrame

df_merged = pd.merge(left=df_ratings_filtered, right=df_books, on='isbn')

Check what the merged data looks like

df_merged.shape
(48990, 5)

df_merged.head()

   user    isbn        rating  title                                            author
0  277427  002542730X  10.0    Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
1  3363    002542730X  0.0     Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
2  11676   002542730X  6.0     Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
3  12538   002542730X  10.0    Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
4  13552   002542730X  0.0     Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner

Some users might have reviewed the same book listed under different ISBNs. Let’s get rid of the duplicate ratings based on user and title of the book

df_merged = df_merged.drop_duplicates(subset=['user', 'title'])

Check the shape after dropping duplicates

df_merged.shape
(48615, 5)

We got rid of over 300 duplicate ratings.

Training

As mentioned earlier, we need a numeric matrix without missing values to train the scikit-learn NearestNeighbors model; its fit method accepts either a dense array or a scipy sparse matrix.
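If memory ever became a problem, the csr_matrix imported at the top could compress the mostly-zero matrix before fitting. A small sketch with toy values (not project data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero ratings matrix; CSR stores only the non-zero entries
dense = np.array([[0.0, 9.0, 0.0],
                  [10.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

print(sparse.nnz)  # → 2 stored values instead of 6
print(np.array_equal(sparse.toarray(), dense))  # → True
```

Our filtered matrix is small enough that fitting on the dense DataFrame works fine, so we will not bother with the conversion here.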

Pivot the merged DataFrame

df_pivoted = pd.pivot(df_merged, index='title', columns=['user'], values='rating')
df_pivoted.head()

user                                                                            254   2276  2766  2977  3363  4017  4385  6242  6251  6323  ...  274004  274061  274301  274308  274808  275970  277427  277478  277639  278418
title
1984                                                                            9.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN     NaN     NaN     NaN     NaN     0.0     NaN     NaN     NaN     NaN
1st to Die: A Novel                                                             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2nd Chance                                                                      NaN   10.0  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN     NaN     NaN     0.0     NaN     NaN     NaN     NaN     0.0     NaN
4 Blondes                                                                       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   ...  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   ...  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

We see a lot of NaN values, indicating missing ratings. This is expected, since every user rates only a few books. We can replace them with zeros

df_final = df_pivoted.fillna(0)
df_final.head()

user                                                                            254   2276  2766  2977  3363  4017  4385  6242  6251  6323  ...  274004  274061  274301  274308  274808  275970  277427  277478  277639  278418
title
1984                                                                            9.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
1st to Die: A Novel                                                             0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2nd Chance                                                                      0.0   10.0  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
4 Blondes                                                                       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0

Let’s create our model. We will use cosine distance as the metric

nn = NearestNeighbors(metric='cosine')
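Cosine distance is 1 minus the cosine similarity of two rating vectors, so it measures the angle between them rather than how large the ratings are. A quick NumPy check on two toy vectors (these numbers are made up, not from the dataset):

```python
import numpy as np

a = np.array([5.0, 0.0, 3.0])
b = np.array([4.0, 0.0, 4.0])

# cosine distance = 1 - (a . b) / (|a| * |b|)
cos_dist = 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_dist, 4))  # → 0.0299, a very similar pair
```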

Train the model with the fit method

nn.fit(df_final)

Testing

Let’s test our model before we write the recommendation function. We can ask the model for the six nearest entries to a book

nn.kneighbors([df_final.loc["The Weight of Water"]], 6)
(array([[0.        , 0.6642004 , 0.68558145, 0.7087431 , 0.7105186 ,
         0.71307   ]], dtype=float32),
 array([[606, 204, 473, 140, 321, 406]]))

The first array gives the distances, with zero being the closest. The second array gives the positions of the books in the df_final DataFrame. The first result is the book matching itself, hence the zero distance.
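Those positions can be mapped back to titles through the DataFrame's index. A tiny stand-in for df_final (three invented titles rated by two users) shows the pattern:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Tiny stand-in for df_final: 3 titles rated by 2 users
toy = pd.DataFrame([[5.0, 0.0],
                    [4.0, 1.0],
                    [0.0, 5.0]],
                   index=pd.Index(['1984', '2nd Chance', '4 Blondes'], name='title'))

model = NearestNeighbors(metric='cosine').fit(toy)
dist, idx = model.kneighbors([toy.loc['1984']], 3)

# idx holds row positions; the index translates them back to titles
print(toy.index[idx[0]].tolist())  # → ['1984', '2nd Chance', '4 Blondes']
```

This `df.index[position]` lookup is exactly what the recommendation function below does with df_final.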

This is what the empty function looks like

def get_recommends(book = ""):
  return recommended_books

We will need to add some more code inside the function. Search for the 1D vector in the df_final DataFrame using the book name

input = df_final.loc[book]

Ask our model for the six closest entries to the 1D vector we pass it. We use six because one of the results it returns will be the book itself

distance, index = nn.kneighbors([input], 6)

Create a list, add the original book as the first element, and add an empty list as the second element to hold the recommendations

recommended_books = []
recommended_books.append(book)
recommended_books.append([])

The freeCodeCamp test expects the books ordered from least to most similar, so we will just iterate over the results in reverse

for i in range(5, 0, -1):
  recommended_books[1].append([df_final.index[index[0][i]], distance[0][i]])

The complete function now looks like

def get_recommends(book = ""):
  input = df_final.loc[book]

  distance, index = nn.kneighbors([input], 6)

  recommended_books = []
  recommended_books.append(book)
  recommended_books.append([])

  for i in range(5, 0, -1):
    recommended_books[1].append([df_final.index[index[0][i]], distance[0][i]])

  return recommended_books

Let’s see if our model passes the freeCodeCamp test

NOTE: This code block is provided by freeCodeCamp

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2):
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

Print the recommendations for “Where the Heart Is (Oprah’s Book Club (Paperback))”

books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7230184]]]

Running the freeCodeCamp test function

test_book_recommendation()

You passed the challenge! 🎉🎉🎉🎉🎉

Thank you for reading. You can also check out my other projects for this series below.