SMS Text Classifier
Intro
This is the fifth project in the freeCodeCamp Machine Learning with Python Certification. For this project, we have to create a machine learning model that classifies a text message as ham (a normal message) or spam. We have to build a neural network with TensorFlow. We will use the boilerplate code provided by freeCodeCamp. Read more about it in Neural Network SMS Text Classifier.
Check out the full code for this project at https://colab.research.google.com/drive/1suzdQ37KgFOZ71l_yo3bpW-N-8T8f3Bu?usp=sharing
Planning
We are going to build a neural network for this project. Our data is in textual format, so before we can use it for training, we will need to convert it into the right format. We can clean the text data by making it all lowercase and removing punctuation, formatting, numbers, etc. We can then create a “bag of words” or “vocabulary” from the training dataset and use it to map the text data to integer vectors. Since the text messages have different lengths, the integer vectors will be of different sizes, and the order of words will differ too. We will need to create embeddings of the same size for all the vectors.
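As a toy sketch of what vectorization does (the mini-corpus here is made up, just for illustration):

```python
import tensorflow as tf

# Hypothetical mini-corpus, only to illustrate vectorization
corpus = ["free prize call now", "see you at lunch"]

vectorizer = tf.keras.layers.TextVectorization()
vectorizer.adapt(corpus)  # build the vocabulary from the corpus

# Each message becomes a vector of token indices;
# unknown words map to the out-of-vocabulary index 1
print(vectorizer(["call me at lunch"]).numpy())
```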
We also don’t want to manually do these steps every time we want to classify a new text message. We can offload these tasks to layers in our model. The first two layers in our model can be for vectorizing and embedding the data.
The next few layers will be regular neural network layers, and the output layer will give us the probability of the message being ham or spam.
Code
Fetch the dataset by running the following in a shell (prefix the commands with ! if running in Colab)
wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv
Import some libraries in our Python code
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
Data
Load the training and validation datasets into DataFrames
train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"
train_dataset = pd.read_csv(train_file_path, delimiter='\t', header=None, names=['label', 'message'])
valid_dataset = pd.read_csv(test_file_path, delimiter='\t', header=None, names=['label', 'message'])
Check that our data was imported correctly
train_dataset.head()
| | label | message |
|---|---|---|
| 0 | ham | ahhhh…just woken up!had a bad dream about u … |
| 1 | ham | you can never do nothing |
| 2 | ham | now u sound like manky scouse boy steve,like! … |
| 3 | ham | mum say we wan to go then go… then she can s… |
| 4 | ham | never y lei… i v lazy… got wat? dat day ü … |
valid_dataset.head()
| | label | message |
|---|---|---|
| 0 | ham | i am in hospital da. . i will return home in e… |
| 1 | ham | not much, just some textin’. how bout you? |
| 2 | ham | i probably won’t eat at all today. i think i’m… |
| 3 | ham | don‘t give a flying monkeys wot they think and… |
| 4 | ham | who are you seeing? |
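Before going further, it is worth a quick look at the class balance, since SMS spam datasets are usually skewed heavily toward ham:

```python
# Count ham vs. spam messages in each split
print(train_dataset.label.value_counts())
print(valid_dataset.label.value_counts())
```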
Pop off the `label` column to generate labels for training and validation
train_labels = train_dataset.pop('label')
valid_labels = valid_dataset.pop('label')
The labels are “ham” and “spam”. We will need numerical values instead of strings to train our model. We can factorize the training labels
train_labels, label_index = train_labels.factorize()
train_labels
array([0, 0, 0, ..., 1, 1, 0])
Check our label-to-int mapping
label_dict = {label:idx for idx, label in enumerate(label_index)}
label_dict
{'ham': 0, 'spam': 1}
We will also need to map the validation labels to the same indices
valid_labels = np.array(valid_labels.apply(lambda label: label_dict[label]))
valid_labels
array([0, 0, 0, ..., 0, 1, 1])
Model
We will create our neural network using the Keras Sequential model
model = keras.models.Sequential()
Our first layer should take in the SMS text as input and convert it into a list of token indices (vectorization). This layer is also going to standardize the text input.
Let’s create a standardization function to pass to our first layer. It will convert everything to lowercase and remove everything other than letters
def sms_standardize(input_text):
    # Lowercase the text, then replace anything that is not a lowercase letter with a space
    ret_val = tf.strings.lower(input_text)
    ret_val = tf.strings.regex_replace(ret_val, '[^a-z]', ' ')
    return ret_val
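As a quick sanity check, we can run the standardizer on a made-up message:

```python
sample = tf.constant(["Sale today! Call 98912460324"])
# Lowercased; the '!' and the digits all become spaces
print(sms_standardize(sample).numpy())
```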
Create the vectorize layer and pass in the function
vectorize_layer = tf.keras.layers.TextVectorization(standardize=sms_standardize)
Call the `adapt` method on the layer with the training messages to compute the vocabulary
vectorize_layer.adapt(train_dataset.message.to_numpy())
vocab_size = vectorize_layer.vocabulary_size()
vocab_size
6684
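We can also peek at the learned vocabulary. The first two entries are reserved, '' for padding and '[UNK]' for out-of-vocabulary tokens; the rest are sorted by descending frequency:

```python
# Most frequent tokens first, after the padding and OOV entries
vectorize_layer.get_vocabulary()[:10]
```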
Add this layer to the model
model.add(vectorize_layer)
Add an Embedding layer to convert the token indices from the previous layer into fixed-size embedding vectors
embedding_dim = 32
model.add(keras.layers.Embedding(vocab_size, embedding_dim))
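To see what these two layers do together, we can push a made-up message through them; each token index becomes a 32-dimensional vector:

```python
# Token indices for a made-up message, then their 32-dim embeddings
tokens = vectorize_layer(tf.constant(["free entry call now"]))
print(tokens.shape)                   # (1, 4)
print(model.layers[1](tokens).shape)  # (1, 4, 32) -- one vector per token
```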
Now we can add some generic layers: pooling to collapse the variable-length sequence of embeddings into a single vector, plus Dense and Dropout
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(2))
model.add(keras.layers.Dropout(0.2))
For the final layer, add a Dense layer with one unit and sigmoid activation. The sigmoid activation gives us an output in the range [0, 1]. This is the probability of a text message being “spam”, since the labels were 0 for ham and 1 for spam
model.add(keras.layers.Dense(1, activation="sigmoid"))
Check the model summary
model.summary()
Model: "sequential_17"
_________________________________________________________________
 Layer (type)                                Output Shape      Param #
=================================================================
 text_vectorization (TextVectorization)     (None, None)            0
 embedding_17 (Embedding)                   (None, None, 32)   213888
 global_average_pooling1d_17
   (GlobalAveragePooling1D)                 (None, 32)              0
 dropout_18 (Dropout)                       (None, 32)              0
 dense_34 (Dense)                           (None, 2)              66
 dropout_19 (Dropout)                       (None, 2)               0
 dense_35 (Dense)                           (None, 1)               3
=================================================================
Total params: 213957 (835.77 KB)
Trainable params: 213957 (835.77 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Compile the model. We will use binary crossentropy as the loss function since we only have two labels and this is a classification model
model.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy']
)
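Optionally, we could also guard against overfitting with an EarlyStopping callback. This is just a possible refinement and is not used in the run below:

```python
# Stop training once validation loss stops improving, keeping the best weights
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True
)
# Would be passed to fit as callbacks=[early_stop]
```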
Training
Train the model for 25 epochs by calling the `fit` method
history = model.fit(
x=train_dataset,
y=train_labels,
validation_data=(valid_dataset, valid_labels),
epochs=25
)
Epoch 1/25
131/131 [==============================] - 2s 8ms/step - loss: 0.4912 - accuracy: 0.8397 - val_loss: 0.3271 - val_accuracy: 0.8657
Epoch 2/25
131/131 [==============================] - 1s 6ms/step - loss: 0.3126 - accuracy: 0.8744 - val_loss: 0.2568 - val_accuracy: 0.8858
Epoch 3/25
131/131 [==============================] - 1s 7ms/step - loss: 0.2583 - accuracy: 0.9069 - val_loss: 0.1951 - val_accuracy: 0.9167
...
Epoch 23/25
131/131 [==============================] - 1s 9ms/step - loss: 0.0341 - accuracy: 0.9921 - val_loss: 0.0383 - val_accuracy: 0.9878
Epoch 24/25
131/131 [==============================] - 1s 9ms/step - loss: 0.0329 - accuracy: 0.9904 - val_loss: 0.0377 - val_accuracy: 0.9907
Epoch 25/25
131/131 [==============================] - 1s 7ms/step - loss: 0.0317 - accuracy: 0.9907 - val_loss: 0.0381 - val_accuracy: 0.9885
Take a look at the training loss history
fig, axes = plt.subplots(figsize=(5,3))
axes.plot(history.history['loss'])
axes.plot(history.history['val_loss'])
axes.set_xlabel('Epochs')
axes.set_ylabel('Loss')
axes.legend(['Loss', 'Validation Loss'])
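We can plot the accuracy history the same way:

```python
fig, axes = plt.subplots(figsize=(5,3))
axes.plot(history.history['accuracy'])
axes.plot(history.history['val_accuracy'])
axes.set_xlabel('Epochs')
axes.set_ylabel('Accuracy')
axes.legend(['Accuracy', 'Validation Accuracy'])
```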
Testing
Create a function to get the predictions
def predict_message(pred_text):
    # The model outputs the probability that the message is spam
    result = model.predict([pred_text])
    return [result, 'ham' if result < 0.5 else 'spam']
Test the prediction function
pred_text = "how are you doing today?"
prediction = predict_message(pred_text)
print(prediction)
1/1 [==============================] - 0s 133ms/step
[array([[6.8955176e-12]], dtype=float32), 'ham']
This message was correctly identified. Let’s run the freeCodeCamp test to check if the model passes
# Run this cell to test your function and model. Do not modify contents.
def test_predictions():
    test_messages = ["how are you doing today",
                     "sale today! to stop texts call 98912460324",
                     "i dont want to go. can we try it a different day? available sat",
                     "our new mobile video service is live. just install on your phone to start watching.",
                     "you have won £1000 cash! call to claim your prize.",
                     "i'll bring it tomorrow. don't forget the milk.",
                     "wow, is your arm alright. that happened to me one time too"
                    ]
    test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
    passed = True
    for msg, ans in zip(test_messages, test_answers):
        prediction = predict_message(msg)
        if prediction[1] != ans:
            passed = False
    if passed:
        print("You passed the challenge. Great job!")
    else:
        print("You haven't passed yet. Keep trying.")

test_predictions()
1/1 [==============================] - 0s 39ms/step
1/1 [==============================] - 0s 42ms/step
1/1 [==============================] - 0s 40ms/step
1/1 [==============================] - 0s 42ms/step
1/1 [==============================] - 0s 33ms/step
1/1 [==============================] - 0s 33ms/step
1/1 [==============================] - 0s 42ms/step
You passed the challenge. Great job!
The model correctly classifies the messages as ham or spam!
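As one last check, we can also evaluate the trained model on the whole validation set:

```python
# Returns the loss and accuracy on the validation split
loss, accuracy = model.evaluate(valid_dataset, valid_labels)
print(f"Validation accuracy: {accuracy:.4f}")
```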
Thank you for reading. You can also check out my other projects in this series below.