Health Costs Calculator

Intro

This is the fourth project in the freeCodeCamp Machine Learning with Python Certification. For this project we have to create a health costs calculator using linear regression. We can use TensorFlow and Keras to build our model. We will use the boilerplate code provided by freeCodeCamp. Read more about it in the Linear Regression Health Costs Calculator challenge.

Check out the full code for this project at https://colab.research.google.com/drive/1gO9UJpHcYH04fEK4wMZ9R9N-6c9PLX4g?usp=sharing

Planning

For this project we will use the linear regression algorithm. We can implement it as a Keras Sequential model with a single Dense layer with an output size of one. Before training the model, we have to get our data ready for it.

We will use a Pandas DataFrame to import the health costs data. We will check the data to make sure there are no missing or incorrect values, convert the categorical data to numeric data, and filter the data if needed. We will format the data so it can be used with a linear regression model. We will then do an 80-20 train-test split. We will also normalize our numerical data by adding a normalization layer to our model.
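In outline, the model will be as simple as this (a sketch only; the full build, including adapting the normalization layer to the training data, comes in the Code section below):

model = keras.Sequential([
    layers.Normalization(axis=-1),  # scale each feature to zero mean and unit variance
    layers.Dense(units=1),          # linear regression: one weight per feature plus a bias
])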

Code

Let’s start by downloading the data we need. Run in a shell

wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv

Now in our python project, import the libraries we need

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

Data

Read the data to a Pandas DataFrame and check what it looks like

df = pd.read_csv('insurance.csv')
df.tail()
      age     sex   bmi  children smoker     region  expenses
1333   50    male  31.0         3     no  northwest  10600.55
1334   18  female  31.9         0     no  northeast   2205.98
1335   18  female  36.9         0     no  southeast   1629.83
1336   21  female  25.8         0     no  southwest   2007.95
1337   61  female  29.1         0    yes  northwest  29141.36

df.describe()
               age          bmi     children      expenses
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.665471     1.094918  13270.422414
std      14.049960     6.098382     1.205493  12110.011240
min      18.000000    16.000000     0.000000   1121.870000
25%      27.000000    26.300000     0.000000   4740.287500
50%      39.000000    30.400000     1.000000   9382.030000
75%      51.000000    34.700000     2.000000  16639.915000
max      64.000000    53.100000     5.000000  63770.430000

df.shape

(1338, 7)

Check if there are any incorrect values

df.sex.unique()

array(['female', 'male'], dtype=object)

df.smoker.unique()

array(['yes', 'no'], dtype=object)

df.region.unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

Check if there are any missing values

df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
expenses    0
dtype: int64

Our data looks clean. There are no missing values or typos in categorical values. Let’s format the data for use with our model.

For categorical data, we should use one-hot encoding instead of converting to an enumerated type, because we are using regression and the enumerated values might not have a linear relationship with the health costs.
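To illustrate the difference, here is a standalone sketch comparing an enumerated encoding of region with a one-hot encoding (the integer codes pd.factorize assigns depend on the order the values appear in):

import pandas as pd

regions = pd.Series(['southwest', 'southeast', 'northwest', 'northeast'])

# Enumerated encoding assigns arbitrary integers (here 0..3), which makes a
# linear model treat the regions as evenly spaced values on a single axis
codes, uniques = pd.factorize(regions)
print(codes)  # [0 1 2 3]

# One-hot encoding gives each region its own 0/1 column, so the model can
# learn an independent weight per category
print(pd.get_dummies(regions))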

We will one-hot encode the children column too, because we are not sure there is a linear relationship between the number of children and health costs. We will also bin age and BMI into categories for the same reason.

One-hot encode the sex column and drop it afterwards. This will add two columns, female and male.

df = df.join(pd.get_dummies(df.sex))
df = df.drop('sex', axis=1)
df
      age   bmi  children smoker     region  expenses  female  male
0      19  27.9         0    yes  southwest  16884.92       1     0
1      18  33.8         1     no  southeast   1725.55       0     1
2      28  33.0         3     no  southeast   4449.46       0     1
3      33  22.7         0     no  northwest  21984.47       0     1
4      32  28.9         0     no  northwest   3866.86       0     1
...   ...   ...       ...    ...        ...       ...     ...   ...
1333   50  31.0         3     no  northwest  10600.55       0     1
1334   18  31.9         0     no  northeast   2205.98       1     0
1335   18  36.9         0     no  southeast   1629.83       1     0
1336   21  25.8         0     no  southwest   2007.95       1     0
1337   61  29.1         0    yes  northwest  29141.36       1     0

Similarly, one-hot encode the children, smoker and region columns

df = df.join(pd.get_dummies(df.smoker, prefix='smoker'))
df = df.drop('smoker', axis=1)

df = df.join(pd.get_dummies(df.region, prefix='region'))
df = df.drop('region', axis=1)

df = df.join(pd.get_dummies(df.children, prefix='children'))
df = df.drop('children', axis=1)

We can’t one-hot encode age and bmi directly since they are continuous numeric values. We will use bins and labels to convert them into categories first

# Flag BMI above 30 and one-hot encode the flag
df['overweight'] = df.bmi > 30
df = df.join(pd.get_dummies(df.overweight, prefix='overweight'))
df = df.drop('overweight', axis=1)

# Bin bmi and age into categories, then one-hot encode the bins
df = df.join(pd.get_dummies(pd.cut(df['bmi'], bins=[0, 18, 25, 30, 100], labels=['w_low', 'w_normal', 'w_high', 'w_over'])))
df = df.join(pd.get_dummies(pd.cut(df['age'], bins=[10, 25, 35, 45, 55, 60, 100], labels=['yt25', 'yt35', 'yt45', 'yt55', 'yt60', 'o60'])))
df.head()
   age   bmi  expenses  female  male  smoker_no  smoker_yes  region_northeast  region_northwest  region_southeast  region_southwest
0   19  27.9  16884.92       1     0          0           1                 0                 0                 0                 1
1   18  33.8   1725.55       0     1          1           0                 0                 0                 1                 0
2   28  33.0   4449.46       0     1          1           0                 0                 0                 1                 0
3   33  22.7  21984.47       0     1          1           0                 0                 1                 0                 0
4   32  28.9   3866.86       0     1          1           0                 0                 1                 0                 0

   overweight_False  overweight_True  w_low  w_normal  w_high  w_over  children_0  children_1  children_2
0                 1                0      0         0       1       0           1           0           0
1                 0                1      0         0       0       1           0           1           0
2                 0                1      0         0       0       1           0           0           0
3                 1                0      0         1       0       0           1           0           0
4                 1                0      0         0       1       0           1           0           0

(display truncated; the children_3 to children_5 and age-bin columns are not shown)

Training

Create the training and testing DataFrames

train_dataset = df.sample(frac=0.8)
test_dataset = df.drop(train_dataset.index)
train_labels = train_dataset.pop('expenses')
test_labels = test_dataset.pop('expenses')
train_dataset.shape
(1070, 28)

test_dataset.shape
(268, 28)
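Note that DataFrame.sample shuffles at random, so the exact split, and therefore the final metrics, will differ between runs. If we want reproducible results we can pass a random_state (an optional tweak, not part of the original notebook):

# Optional: fix the seed so the 80-20 split is the same on every run
train_dataset = df.sample(frac=0.8, random_state=0)
test_dataset = df.drop(train_dataset.index)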

We can start creating our model now

model = keras.Sequential()

Create a normalization layer and let it compute the mean and variance of each feature from the training data using its adapt method. Then add it to our model

normalizer = layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_dataset))

model.add(normalizer)
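We can sanity-check what adapt computed. After adapting, the layer exposes the per-feature mean, and applying it to a training example should return values standardized to roughly the [-3, 3] range (a quick check in the spirit of the TensorFlow regression tutorial):

# The adapted layer stores per-feature statistics
print(normalizer.mean.numpy())

# Normalizing the first training example
first = np.array(train_dataset[:1], dtype='float32')
print(normalizer(first).numpy())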

Add one Dense layer with one unit

model.add(layers.Dense(1))

Compile the model. We will use the Adam optimizer with mean absolute error as the loss function, and track both mean absolute error and mean squared error as metrics. Note the unusually large learning rate of 5.0: since the inputs are normalized but the expenses are in the thousands, the weights have to grow large, and a small learning rate would converge very slowly here

model.compile(optimizer=keras.optimizers.Adam(learning_rate=5.0),
              loss='mae', metrics=['mae', 'mse'])

model.build()
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 normalization (Normalizati  (None, 28)                57        
 on)                                                             
                                                                 
 dense (Dense)               (None, 1)                 29        
                                                                 
=================================================================
Total params: 86 (348.00 Byte)
Trainable params: 29 (116.00 Byte)
Non-trainable params: 57 (232.00 Byte)
_________________________________________________________________

Train the model for 400 epochs, holding out 20% of the training data for validation

history = model.fit(
    train_dataset, train_labels,
    epochs=400,
    validation_split = 0.2,
    )

Epoch 1/400
27/27 [==============================] - 2s 24ms/step - loss: 12980.1807 - mae: 12980.1807 - mse: 310381984.0000 - val_loss: 13748.0254 - val_mae: 13748.0254 - val_mse: 346313440.0000
Epoch 2/400
27/27 [==============================] - 0s 8ms/step - loss: 12835.0674 - mae: 12835.0674 - mse: 307191456.0000 - val_loss: 13636.8164 - val_mae: 13636.8164 - val_mse: 343705664.0000
Epoch 3/400
27/27 [==============================] - 0s 5ms/step - loss: 12694.8037 - mae: 12694.8037 - mse: 303253088.0000 - val_loss: 13522.0586 - val_mae: 13522.0586 - val_mse: 340197760.0000
Epoch 4/400
27/27 [==============================] - 0s 10ms/step - loss: 12555.8711 - mae: 12555.8711 - mse: 299622240.0000 - val_loss: 13407.2188 - val_mae: 13407.2188 - val_mse: 337578304.0000
...
27/27 [==============================] - 0s 3ms/step - loss: 3059.3418 - mae: 3059.3418 - mse: 46594180.0000 - val_loss: 3746.8494 - val_mae: 3746.8494 - val_mse: 58849572.0000
Epoch 398/400
27/27 [==============================] - 0s 4ms/step - loss: 3059.8738 - mae: 3059.8738 - mse: 46530496.0000 - val_loss: 3750.7312 - val_mae: 3750.7312 - val_mse: 58936692.0000
Epoch 399/400
27/27 [==============================] - 0s 4ms/step - loss: 3058.9065 - mae: 3058.9065 - mse: 46630444.0000 - val_loss: 3750.1858 - val_mae: 3750.1858 - val_mse: 59066800.0000
Epoch 400/400
27/27 [==============================] - 0s 5ms/step - loss: 3061.2319 - mae: 3061.2319 - mse: 46695008.0000 - val_loss: 3757.2839 - val_mae: 3757.2839 - val_mse: 58933712.0000
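The loss barely changes over the last epochs, so instead of hand-picking 400 epochs we could let Keras stop once the validation loss stops improving (a sketch using the standard EarlyStopping callback; the patience value here is an arbitrary choice):

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch the validation loss (MAE)
    patience=20,                 # stop after 20 epochs without improvement
    restore_best_weights=True)

history = model.fit(
    train_dataset, train_labels,
    epochs=400,
    validation_split=0.2,
    callbacks=[early_stop])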

Plot the loss and validation loss

fig, axes = plt.subplots(figsize=(5, 3))
axes.plot(history.history['loss'])
axes.plot(history.history['val_loss'])
axes.set_xlabel('Epochs')
axes.set_ylabel('Loss (MAE)')
axes.legend(['Loss', 'Validation Loss'])

[Figure: training and validation loss over 400 epochs]

Testing

Time to test our model and see if it passes the freeCodeCamp requirements

NOTE: This code block is provided by freeCodeCamp

loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)

9/9 - 0s - loss: 3199.8831 - mae: 3199.8831 - mse: 49219736.0000 - 53ms/epoch - 6ms/step
Testing set Mean Abs Error: 3199.88 expenses
You passed the challenge. Great job!
9/9 [==============================] - 0s 2ms/step

[Figure: predicted vs. true expenses on the test set]

We have a mean absolute error of less than $3200 and passed the challenge!

Thank you for reading. You can also check out my other projects for this series below.