This project involved predicting housing prices in the suburbs of Boston, MA, using a decision tree regressor, with Grid Search and Cross Validation used to optimize the model. The Boston Housing dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Housing) served as the training and testing data. This project was completed as part of Udacity's Machine Learning Nanodegree, and the models were built in Python.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
# Import supplementary visualizations code visuals.py
import visuals as vs
#Display for notebooks
%matplotlib inline
#Load the housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
print ("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))
Before building the decision tree regressor, I explored the dataset by calculating summary statistics for the target variable, price, and reviewing the features included in the data. The features are the average number of rooms in homes in the neighborhood (RM), the percentage of homeowners in the neighborhood considered lower income (LSTAT), and the ratio of students to teachers in the neighborhood's primary and secondary schools (PTRATIO).
from IPython.display import display
#Minimum price of the data
minimum_price = np.amin(prices)
#Maximum price of the data
maximum_price = np.amax(prices)
#Mean price of the data
mean_price = np.mean(prices)
#Median price of the data
median_price = np.median(prices)
#Standard deviation of prices of the data
std_price = np.std(prices)
#Show the calculated statistics
print ("Statistics for Boston housing dataset:\n")
print ("Minimum price: ${:,.2f}".format(minimum_price))
print ("Maximum price: ${:,.2f}".format(maximum_price))
print ("Mean price: ${:,.2f}".format(mean_price))
print ("Median price ${:,.2f}".format(median_price))
print ("Standard deviation of prices: ${:,.2f}".format(std_price))
display(data.head(n=10))
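As a quick cross-check (not part of the original statistics above), pandas can summarize the prices in a single call. Note that pandas computes the sample standard deviation, so it can differ slightly from the NumPy value above.
# Optional cross-check of the statistics above using pandas.
# pandas' std() uses ddof=1 (sample), while np.std() defaults to ddof=0 (population),
# so the standard deviation may differ slightly.
display(prices.describe())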
In creating the model, I first defined the performance metric. Because the model is a regressor, I used the coefficient of determination, R², to quantify its performance. R² represents the proportion of the variance in the target variable, price, that can be explained by the features RM, LSTAT, and PTRATIO.
#Import 'r2_score'
from sklearn.metrics import r2_score
def performance_metric(y_true, y_predict):
    # Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    # Return the score
    return score
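As a quick sanity check of the metric, it can be run on a small set of made-up values (a hypothetical example; these numbers are not from the dataset). A prediction close to the true values should score near 1, while a poor prediction scores near or below 0.
# Sanity check on made-up values (hypothetical example, not from the dataset)
example_true = [3.0, -0.5, 2.0, 7.0, 4.2]
example_pred = [2.5, 0.0, 2.1, 7.8, 5.3]
print("Example coefficient of determination, R^2: {:.3f}".format(
    performance_metric(example_true, example_pred)))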
After defining the performance metric, I split the dataset into training and testing subsets, using an 80%/20% split. Holding out testing data is important for assessing how well the model generalizes to unseen data.
#Import 'train_test_split'
from sklearn.model_selection import train_test_split
#Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size = 0.20, random_state = 0)
print ("Training and testing split was successful.")
Next, I assessed how model complexity affects the model's performance. I did this by adjusting the decision tree regressor's 'max_depth' parameter on the training set and graphing the resulting Learning Curves and Complexity Curves.
The curves below show the model's performance on the training and testing data, and the corresponding scores, as the number of training points changes.
# Produces Learning Curves for varying training set sizes and maximum depths
vs.ModelLearning(features, prices)
Based on the results, I chose a max_depth of 3. In this graph, the training curve starts with a score of 1 for a small number of training points and slowly decreases, leveling off at a score of about 0.80 around 300 training points. This implies that adding more training data beyond this amount will not significantly improve the results. Similarly, the testing curve starts with a score of 0 and slowly levels off at a score of about 0.80 around 300 points, which is where the testing and training curves converge.
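The Learning Curves above come from the visuals.py helper. For reference, a rough equivalent for a single depth could also be generated directly with scikit-learn's learning_curve; this is only a sketch using the same R² scoring, not the code behind the graphs above.
# Sketch: learning curve for a fixed max_depth using scikit-learn directly
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(max_depth = 3), features, prices,
    train_sizes = np.linspace(0.1, 1.0, 8), cv = 10, scoring = 'r2')

# Average the R^2 scores across the cross-validation folds
for n, tr, te in zip(sizes, train_scores.mean(axis = 1), test_scores.mean(axis = 1)):
    print("{:>3} training points: train R^2 = {:.2f}, test R^2 = {:.2f}".format(n, tr, te))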
In addition to the Learning Curves, I also graphed the Complexity Curves. At a max depth of 1, the model has high bias because both the training and validation scores are low. At a depth of 10, the model has high variance because the training score stays high while the validation score drops. A depth of around 3 or 4 appears to be the optimal point.
vs.ModelComplexity(X_train, y_train)
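Similarly, the Complexity Curve from visuals.py can be approximated with scikit-learn's validation_curve, sweeping 'max_depth' from 1 to 10 (again just a sketch for reference, not the plotting code used above):
# Sketch: complexity curve over 'max_depth' using validation_curve
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

depths = range(1, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(), X_train, y_train,
    param_name = 'max_depth', param_range = depths,
    cv = 10, scoring = 'r2')

for d, tr, te in zip(depths, train_scores.mean(axis = 1), test_scores.mean(axis = 1)):
    print("max_depth = {:>2}: train R^2 = {:.2f}, validation R^2 = {:.2f}".format(d, tr, te))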
After visualizing the Learning and Complexity Curves, I implemented the decision tree regressor model, using Grid Search and Cross Validation to select the optimal max depth, which the visualizations suggest should be 3 or 4.
#Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
# Fits the model to the data by performing a grid search over 'max_depth' with 10 cross-validation sets
def fit_model(X, y):
    # Creates cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.20, random_state = 0)
    # Creates a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # Creates a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': list(range(1, 11))}
    # Transforms 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)
    # Creates the grid search cv object
    grid = GridSearchCV(regressor, params, scoring = scoring_fnc, cv = cv_sets)
    # Fits the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Returns the optimal model after fitting the data
    return grid.best_estimator_
#Fitting the training data to the model using grid search
reg = fit_model(X_train, y_train)
#Produces the value for 'max_depth' that provides the optimal model
print ("The optimal 'max_depth' for the model is {}.".format(reg.get_params()['max_depth']))
After optimizing the model, I used it to predict selling prices for three example homes.
| Feature | Client 1 | Client 2 | Client 3 |
| --- | --- | --- | --- |
| Total number of rooms in home | 5 rooms | 4 rooms | 8 rooms |
| Neighborhood poverty level (as %) | 17% | 32% | 3% |
| Student-teacher ratio of nearby schools | 15-to-1 | 22-to-1 | 12-to-1 |
#Produces a matrix for client data for each of the three features.
client_data = [[5, 17, 15], # Client 1
[4, 32, 22], # Client 2
[8, 3, 12]] # Client 3
#Shows predictions
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))