Book Recommendation Engine

Intro

This is the third project in the freeCodeCamp Machine Learning with Python Certification. For this project we have to create a book recommendation engine using K-Nearest Neighbors. We can use TensorFlow and scikit-learn to build our model, and we will start from the boilerplate code provided by freeCodeCamp. Read more about it in Book Recommendation Engine using KNN.

Check out the full code for this project at https://colab.research.google.com/drive/1yYJ6QVESBLFJrX-zkK4huPX9zjM6Gtgo?usp=sharing

Planning

We will use a Pandas DataFrame to import and filter the data according to the project requirements. Then we will use NearestNeighbors from scikit-learn. The fit method of NearestNeighbors needs a numeric matrix without missing values as input, so we will pivot the DataFrame with our data and fill in zeros in place of NaN. We can then train our model. Finally, we will create a function that takes a book title as an argument and returns the top five suggestions from our model.
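The whole pipeline can be sketched end-to-end on a toy ratings table. Everything below (the data, and names like `matrix` and `nn_toy`) is invented for illustration and is not part of the project:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy ratings table standing in for the real Book-Crossings data
ratings = pd.DataFrame({
    'user':   [1, 1, 2, 2, 3, 3],
    'title':  ['A', 'B', 'A', 'C', 'B', 'C'],
    'rating': [5.0, 3.0, 4.0, 1.0, 2.0, 5.0],
})

# Pivot to a title-by-user matrix and replace missing ratings with zeros
matrix = ratings.pivot(index='title', columns='user', values='rating').fillna(0)

# Fit a cosine-distance nearest-neighbour model on the rows (one row per title)
nn_toy = NearestNeighbors(metric='cosine').fit(matrix)

# The two nearest titles to 'A'; the nearest is 'A' itself at distance 0
distances, indices = nn_toy.kneighbors([matrix.loc['A']], 2)
print(matrix.index[indices[0]].tolist())  # → ['A', 'B']
```

The real notebook follows exactly this shape, only with much more data and with filtering steps in between.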

Code

I will put a note before the code blocks provided by freeCodeCamp and briefly go over them, so we can follow what is happening in the program.

Setup

Import the libraries we need

NOTE: This code block is provided by freeCodeCamp

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

Get the zip containing the data and unzip it. To use shell commands in Colaboratory notebooks, add an exclamation mark (!) before them

NOTE: This code block is provided by freeCodeCamp

wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
unzip book-crossings.zip

Create variables to store the filenames

NOTE: This code block is provided by freeCodeCamp

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

Import the CSV data from the files above into Pandas DataFrames

NOTE: This code block is provided by freeCodeCamp

df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

Let’s check what our books data looks like. We can see the first five rows with

df_books.head()

   isbn        title                                            author
0  0195153448  Classical Mythology                              Mark P. O. Morford
1  0002005018  Clara Callan                                     Richard Bruce Wright
2  0060973129  Decision in Normandy                             Carlo D'Este
3  0374157065  Flu: The Story of the Great Influenza Pandemic…  Gina Bari Kolata
4  0393045218  The Mummies of Urumchi                           E. J. W. Barber

We can get some stats about the books DataFrame

df_books.describe()

        isbn        title           author
unique  271379      242154          102042
count   271379      271379          271378
freq    1           27              632
top     0195153448  Selected Poems  Agatha Christie

All the ISBNs are unique, but the titles are not. Some books might have been added under multiple ISBNs. We might need to clean this data later.
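Titles listed under multiple ISBNs can be spotted with pandas' duplicated method. A quick sketch on an invented books table (none of these rows are from the real dataset):

```python
import pandas as pd

# Invented books table: 'Dune' is listed under two different ISBNs
books = pd.DataFrame({
    'isbn':   ['111', '222', '333'],
    'title':  ['Dune', 'Dune', 'Emma'],
    'author': ['Frank Herbert', 'Frank Herbert', 'Jane Austen'],
})

# keep=False marks every row whose title occurs more than once
dupes = books[books.duplicated(subset='title', keep=False)]
print(dupes['isbn'].tolist())  # → ['111', '222']
```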

Let’s check what our rating data looks like. We can see the first five rows with

df_ratings.head()

   user    isbn        rating
0  276725  034545104X  0.0
1  276726  0155061224  5.0
2  276727  0446520802  0.0
3  276729  052165615X  3.0
4  276729  0521795028  6.0

Data Filtering

We need to filter the data as per the project requirements.

The first requirement is to remove users with fewer than 200 ratings. We can group the ratings DataFrame by user; the number of rows in each group gives us the number of ratings for that user

df_user_count = df_ratings.groupby('user').size().reset_index(name='counts')

We can look at the count table we just built

df_user_count

        user    counts
0       2       1
1       7       1
2       8       18
3       9       3
4       10      2
...     ...     ...
105278  278846  2
105279  278849  4
105280  278851  23
105281  278852  1
105282  278854  8

We can get the number of users by counting the unique users in the original ratings DataFrame or using the length or shape of the grouped DataFrame

len(df_ratings.user.unique())
105283

len(df_user_count)
105283

df_user_count.shape
(105283, 2)

Create a DataFrame with only the users with more than 200 ratings

df_user_filter = df_user_count[df_user_count['counts']>200]

Check what the filter looks like

df_user_filter.head()

      user  counts
95    254   314
791   2276  498
981   2766  274
1049  2977  232
1177  3363  901

We can do the same for the books.

Group by ISBN

df_isbn_count = df_ratings.groupby('isbn').size().reset_index(name='counts')

Create a DataFrame with only the books that have more than 100 ratings

df_isbn_filter = df_isbn_count[df_isbn_count['counts']>100]

We can use the DataFrames we created as filters to keep only the ratings we want.

Keep only the ratings with the users we kept

df_ratings_filtered = df_ratings[df_ratings.user.isin(df_user_filter['user'])]

Keep only the ratings with the ISBNs we kept

df_ratings_filtered = df_ratings_filtered[df_ratings_filtered.isbn.isin(df_isbn_filter['isbn'])]

We can check the shape of the filtered data and what the data looks like now

df_ratings_filtered.shape
(49254, 3)

df_ratings_filtered.head()

      user    isbn        rating
1456  277427  002542730X  10.0
1469  277427  0060930535  0.0
1471  277427  0060934417  0.0
1474  277427  0061009059  9.0
1484  277427  0140067477  0.0

Let’s merge the filtered rating data with the books data to get the title and author in the same DataFrame

df_merged = pd.merge(left=df_ratings_filtered, right=df_books, on='isbn')

Check what the merged data looks like

df_merged.shape
(48990, 5)

df_merged.head()

   user    isbn        rating  title                                            author
0  277427  002542730X  10.0    Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
1  3363    002542730X  0.0     Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
2  11676   002542730X  6.0     Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
3  12538   002542730X  10.0    Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner
4  13552   002542730X  0.0     Politically Correct Bedtime Stories: Modern Ta…  James Finn Garner

Some users might have reviewed the same book listed under different ISBNs. Let’s get rid of the duplicate ratings based on user and title of the book

df_merged = df_merged.drop_duplicates(subset=['user', 'title'])

Check the shape after dropping duplicates

df_merged.shape
(48615, 5)

We got rid of over 300 duplicate ratings.

Training

As mentioned earlier, we need a numeric matrix without missing values to train the scikit-learn NearestNeighbors model; its fit method accepts either a dense array or a scipy sparse matrix.
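If memory ever became a problem, the csr_matrix imported at the top could compress the mostly-zero matrix before fitting. A small sketch with toy values (not project data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero ratings matrix; CSR stores only the non-zero entries
dense = np.array([[0.0, 9.0, 0.0],
                  [10.0, 0.0, 0.0]])
sparse = csr_matrix(dense)

print(sparse.nnz)  # → 2 stored values instead of 6
print(np.array_equal(sparse.toarray(), dense))  # → True
```

Our filtered matrix is small enough that fitting on the dense DataFrame works fine, so we will not bother with the conversion here.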

Pivot the merged DataFrame

df_pivoted = pd.pivot(df_merged, index='title', columns=['user'], values='rating')
df_pivoted.head()

user                                                                            254   2276  2766  2977  3363  4017  4385  6242  6251  6323  ...  274004  274061  274301  274308  274808  275970  277427  277478  277639  278418
title
1984                                                                            9.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN     NaN     NaN     NaN     NaN     0.0     NaN     NaN     NaN     NaN
1st to Die: A Novel                                                             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2nd Chance                                                                      NaN   10.0  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   ...  NaN     NaN     NaN     0.0     NaN     NaN     NaN     NaN     0.0     NaN
4 Blondes                                                                       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   ...  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash  NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   0.0   NaN   ...  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

We see a lot of NaN values, indicating missing ratings. This is expected, since every user rates only a few books. We can replace them with zeros

df_final = df_pivoted.fillna(0)
df_final.head()

user                                                                            254   2276  2766  2977  3363  4017  4385  6242  6251  6323  ...  274004  274061  274301  274308  274808  275970  277427  277478  277639  278418
title
1984                                                                            9.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
1st to Die: A Novel                                                             0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2nd Chance                                                                      0.0   10.0  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
4 Blondes                                                                       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   ...  0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0

Let’s create our model. We will use cosine distance as the metric

nn = NearestNeighbors(metric='cosine')
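Cosine distance is 1 minus the cosine similarity of two rating vectors, so it measures the angle between them rather than how large the ratings are. A quick NumPy check on two toy vectors (these numbers are made up, not from the dataset):

```python
import numpy as np

a = np.array([5.0, 0.0, 3.0])
b = np.array([4.0, 0.0, 4.0])

# cosine distance = 1 - (a . b) / (|a| * |b|)
cos_dist = 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_dist, 4))  # → 0.0299, a very similar pair
```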

Train the model with the fit method

nn.fit(df_final)

Testing

Let’s test our model before we write the recommendation function. We can ask the model for the six nearest entries to a book

nn.kneighbors([df_final.loc["The Weight of Water"]], 6)
(array([[0.        , 0.6642004 , 0.68558145, 0.7087431 , 0.7105186 ,
         0.71307   ]], dtype=float32),
 array([[606, 204, 473, 140, 321, 406]]))

The first array gives the distances, with zero being the closest. The second array gives the positions of the books in the df_final DataFrame. The first result is the book matching itself, hence the zero distance.
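Those positions can be mapped back to titles through the DataFrame's index. A tiny stand-in for df_final (three invented titles rated by two users) shows the pattern:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Tiny stand-in for df_final: 3 titles rated by 2 users
toy = pd.DataFrame([[5.0, 0.0],
                    [4.0, 1.0],
                    [0.0, 5.0]],
                   index=pd.Index(['1984', '2nd Chance', '4 Blondes'], name='title'))

model = NearestNeighbors(metric='cosine').fit(toy)
dist, idx = model.kneighbors([toy.loc['1984']], 3)

# idx holds row positions; the index translates them back to titles
print(toy.index[idx[0]].tolist())  # → ['1984', '2nd Chance', '4 Blondes']
```

This `df.index[position]` lookup is exactly what the recommendation function below does with df_final.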

This is what the empty function looks like

def get_recommends(book = ""):
  return recommended_books

We will need to add some more code inside the function. Search for the 1D vector in the df_final DataFrame using the book name

input = df_final.loc[book]

Ask our model for the six closest entries to the 1D vector we pass it. We use six because one of the results it returns will be the book itself

distance, index = nn.kneighbors([input], 6)

Create a list, add the original book as the first element, and add an empty list as the second element to hold the recommendations

recommended_books = []
recommended_books.append(book)
recommended_books.append([])

The freeCodeCamp test expects the books ordered from least to most similar, so we will just iterate over the results in reverse

for i in range(5, 0, -1):
  recommended_books[1].append([df_final.index[index[0][i]], distance[0][i]])

The complete function now looks like

def get_recommends(book = ""):
  input = df_final.loc[book]

  distance, index = nn.kneighbors([input], 6)

  recommended_books = []
  recommended_books.append(book)
  recommended_books.append([])

  for i in range(5, 0, -1):
    recommended_books[1].append([df_final.index[index[0][i]], distance[0][i]])

  return recommended_books

Let’s see if our model passes the freeCodeCamp test

NOTE: This code block is provided by freeCodeCamp

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2):
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

Print the recommendations for “Where the Heart Is (Oprah’s Book Club (Paperback))”

books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7230184]]]

Running the freeCodeCamp test function

test_book_recommendation()

You passed the challenge! 🎉🎉🎉🎉🎉

Thank you for reading. You can also check out my other projects for this series below.