Predicting Housing Prices

This project involved predicting housing prices in the suburbs of Boston, MA, using a decision tree regressor, with Grid Search and Cross Validation used to optimize the model. The Boston housing dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Housing) served as the training and testing data. This project was completed in Python as part of Udacity's Machine Learning Nanodegree.

In [2]:
#Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Import supplementary visualizations code visuals.py
import visuals as vs

#Display for notebooks
%matplotlib inline

#Load the housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)

print ("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))
Boston housing dataset has 489 data points with 4 variables each.

Exploring the Data

Before building the model, I explored the dataset by calculating summary statistics for the target variable, price, and reviewing the three features included in the data: the average number of rooms in the neighborhood (RM), the percentage of homeowners in the neighborhood considered "lower class" (LSTAT), and the ratio of students to teachers in primary and secondary schools (PTRATIO).

In [4]:
from IPython.display import display

#Minimum price of the data
minimum_price = np.amin(prices)

#Maximum price of the data
maximum_price = np.amax(prices)

#Mean price of the data
mean_price = np.mean(prices)

#Median price of the data
median_price = np.median(prices)

#Standard deviation of prices of the data
std_price = np.std(prices)

#Show the calculated statistics
print ("Statistics for Boston housing dataset:\n")
print ("Minimum price: ${:,.2f}".format(minimum_price))
print ("Maximum price: ${:,.2f}".format(maximum_price))
print ("Mean price: ${:,.2f}".format(mean_price))
print ("Median price ${:,.2f}".format(median_price))
print ("Standard deviation of prices: ${:,.2f}".format(std_price))
display(data.head(n=10))
Statistics for Boston housing dataset:

Minimum price: $105,000.00
Maximum price: $1,024,800.00
Mean price: $454,342.94
Median price: $438,900.00
Standard deviation of prices: $165,171.13
RM LSTAT PTRATIO MEDV
0 6.575 4.98 15.3 504000.0
1 6.421 9.14 17.8 453600.0
2 7.185 4.03 17.8 728700.0
3 6.998 2.94 18.7 701400.0
4 7.147 5.33 18.7 760200.0
5 6.430 5.21 18.7 602700.0
6 6.012 12.43 15.2 480900.0
7 6.172 19.15 15.2 569100.0
8 5.631 29.93 15.2 346500.0
9 6.004 17.10 15.2 396900.0
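
The same summary statistics are available directly from pandas with a one-liner. Note that prices.describe() reports the sample standard deviation (ddof = 1), so its std will differ slightly from the np.std value above, which uses the population formula (ddof = 0).

#Equivalent summary statistics straight from pandas
display(prices.describe())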

Developing The Model

Defining a Performance Metric

In creating the model, I first defined the performance metric. To quantify the model's performance, I used the coefficient of determination, R², a common metric for regression models. R² measures the proportion of variance in the target variable, price, that is explained by the features RM, LSTAT, and PTRATIO: a score of 1 indicates perfect prediction, while a score of 0 means the model does no better than always predicting the mean price.

In [9]:
#Import 'r2_score'
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):   
    #Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    
    #Return the score
    return score
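
As a quick sanity check, performance_metric can be verified against the definition of R², 1 - SS_res/SS_tot, on a few made-up values (illustrative only, not from the dataset):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

#R^2 from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

print(performance_metric(y_true, y_pred))  #0.9486...
print(1 - ss_res / ss_tot)                 #matches by definition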

Shuffle and Split Data

After defining my performance metric, I split the dataset into training and testing subsets using an 80/20 split. Holding out testing data is important for assessing how well the model generalizes to unseen data.

In [5]:
#Import 'train_test_split'
from sklearn.model_selection import train_test_split

#Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size = 0.20, random_state = 0)

print ("Training and testing split was successful.")
Training and testing split was successful.
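
As a quick check (the exact counts follow from splitting 489 rows 80/20), the shapes of the resulting subsets can be printed:

#Verify the sizes of the training and testing subsets
print("Training set: {} samples".format(X_train.shape[0]))
print("Testing set: {} samples".format(X_test.shape[0]))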

Analyzing Model Performance

Next, I assessed how model complexity affects the model's performance. I did this by adjusting the decision tree regressor's 'max_depth' parameter on the training set and graphing the resulting Learning Curves and Complexity Curves.

Learning Curves

The curves below show the model's performance on the training and testing data and the corresponding scores as the number of training points change.

In [7]:
# Produces learning curves for varying training set sizes and maximum depths
vs.ModelLearning(features, prices)

Based on the results, I chose a max_depth of 3. At this depth, the training curve starts at a score of 1 with only a few training points and slowly decreases, leveling off at around 300 training points with a score of 0.80; adding more training data beyond this amount will not significantly improve the results. Similarly, the testing curve starts at a score of 0 and slowly rises, leveling off at around 300 points with a score of 0.80. This is the point at which the training and testing curves converge.
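
The learning curves can also be reproduced without visuals.py using scikit-learn's learning_curve; below is a minimal sketch for a single max_depth of 3 (styling omitted), assuming features and prices are loaded as above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

#Mean train/test R^2 scores over increasing training set sizes
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(max_depth = 3), features, prices,
    cv = 10, scoring = 'r2', train_sizes = np.linspace(0.1, 1.0, 9))

plt.plot(sizes, train_scores.mean(axis = 1), label = 'Training score')
plt.plot(sizes, test_scores.mean(axis = 1), label = 'Testing score')
plt.xlabel('Number of training points')
plt.ylabel('r2 score')
plt.legend()
plt.show()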

Complexity Curves

In addition to the Learning Curves, I also graphed the Complexity Curves. At a max depth of 1, the model has high bias because both the training and validation scores are low. At a max depth of 10, the model has high variance because the training score stays high while the validation score drops, leaving a large gap between the two. A depth of around 3 or 4 appears to be the optimal point.

In [8]:
vs.ModelComplexity(X_train, y_train)
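
vs.ModelComplexity can be approximated the same way with scikit-learn's validation_curve, which scores the model across a range of max_depth values; a minimal sketch:

import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

#Score the model at max_depth values 1-10 with 10-fold cross-validation
depths = range(1, 11)
train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(), X_train, y_train,
    param_name = 'max_depth', param_range = depths, cv = 10, scoring = 'r2')

plt.plot(depths, train_scores.mean(axis = 1), label = 'Training score')
plt.plot(depths, valid_scores.mean(axis = 1), label = 'Validation score')
plt.xlabel('max_depth')
plt.ylabel('r2 score')
plt.legend()
plt.show()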

Evaluating Model Performance

After visualizing the Learning and Complexity Curves, I implemented the decision tree regressor model. I also used Grid Search and Cross Validation to select the optimal max depth, which the visualizations suggest should be 3 or 4.

In [7]:
#Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

#Fits a decision tree regressor to the data, tuning 'max_depth' with grid search over 10 cross-validation sets
def fit_model(X, y):
    #Creates cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 0)

    #Creates a decision tree regressor object
    regressor = DecisionTreeRegressor()

    #Creates a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': list(range(1,11))}

    #Transforms 'performance_metric' into a scoring function using 'make_scorer' 
    scoring_fnc = make_scorer(performance_metric)

    #Creates the grid search cv object
    grid = GridSearchCV(regressor, params, scoring = scoring_fnc, cv = cv_sets)

    #Fits the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    #Returns the optimal model after fitting the data
    return grid.best_estimator_

Optimal Model

In [10]:
#Fitting the training data to the model using grid search
reg = fit_model(X_train, y_train)

#Produces the value for 'max_depth' that provides the optimal model
print ("The optimal 'max_depth' for the model is {}.".format(reg.get_params()['max_depth']))
The optimal 'max_depth' for the model is 4.
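
Because fit_model returns only grid.best_estimator_, the per-depth scores are discarded. The sketch below re-runs the same grid search inline so the mean cross-validation score for each depth can be inspected through cv_results_ (available on model_selection's GridSearchCV):

#Mirror fit_model's grid search to inspect the score for every depth
grid = GridSearchCV(DecisionTreeRegressor(),
                    {'max_depth': list(range(1, 11))},
                    scoring = make_scorer(performance_metric),
                    cv = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 0))
grid.fit(X_train, y_train)

for depth, score in zip(grid.cv_results_['param_max_depth'],
                        grid.cv_results_['mean_test_score']):
    print("max_depth = {}: mean R^2 score = {:.3f}".format(depth, score))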

Predicting Selling Prices

After optimizing my model, I used my model to predict example home prices.

Feature                                    Client 1   Client 2   Client 3
Total number of rooms in home              5 rooms    4 rooms    8 rooms
Neighborhood poverty level (as %)          17%        32%        3%
Student-teacher ratio of nearby schools    15-to-1    22-to-1    12-to-1
In [11]:
#Produces a matrix for client data for each of the three features.
client_data = [[5, 17, 15], # Client 1
               [4, 32, 22], # Client 2
               [8, 3, 12]]  # Client 3

#Shows predictions
for i, price in enumerate(reg.predict(client_data)):
    print ("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))
Predicted selling price for Client 1's home: $391,183.33
Predicted selling price for Client 2's home: $189,123.53
Predicted selling price for Client 3's home: $942,666.67
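
As a rough sanity check (a sketch using the prices Series loaded earlier), each prediction can be placed within the distribution of observed prices:

#Compare each predicted price to the distribution of observed prices
for i, price in enumerate(reg.predict(client_data)):
    percentile = (prices < price).mean() * 100
    print("Client {}'s predicted price is higher than {:.0f}% of homes in the dataset.".format(i + 1, percentile))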