This project applies several supervised learning algorithms and then optimizes the selected model to predict whether an individual makes more than \$50K a year. To build this model, I used the dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Census+Income). I completed this project as part of Udacity's Machine Learning Nanodegree.
Before building my model, I first explored the dataset provided by the repository to understand the features and underlying data available.
#Import standard python libraries
import numpy as np
import pandas as pd
from time import time
from IPython.display import display
#Import supplementary visualization code visuals.py
import visuals as vs
#Pretty display for notebooks
%matplotlib inline
#Load the provided Census dataset
data = pd.read_csv("census.csv")
#Display the first ten records to understand what is included in the dataset
display(data.head(n=10))
Next, I investigated how many records are included in the dataset and how the records are split between individuals making more than and at most \$50,000 a year.
#Number of records
n_records = data.shape[0]
#Number of records where individual's income is more than $50,000
n_greater_50k = data[(data.income == '>50K')].shape[0]
#Number of records where individual's income is at most $50,000
n_at_most_50k = data[(data.income == '<=50K')].shape[0]
#Percentage of individuals whose income is more than $50,000
greater_percent = (n_greater_50k/n_records)*100.0
#Print the results
print ("Total number of records in the dataset: {}".format(n_records))
print ("Individuals making more than $50,000: {}".format(n_greater_50k))
print ("Individuals making at most $50,000: {}".format(n_at_most_50k))
print ("Percentage of individuals making more than $50,000: {:.2f}%".format(greater_percent))
Additionally, I looked at the data to understand the types of features included. There were continuous and categorical variables. Below is a summary of the features included in the dataset.
Features
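The continuous features are age, education-num, capital-gain, capital-loss, and hours-per-week (the same ones scaled later during preprocessing). The remaining fields in the standard UCI census data (workclass, education_level, marital-status, occupation, relationship, race, sex, and native-country) are categorical, and the target label is income, recorded as '<=50K' or '>50K'.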
Next, I began preprocessing the data so that the data fed into my final model is accurate and reliable.
A feature's values can cluster near a single value while a handful of records sit far out in the tail. Such skewed distributions can hurt a model's performance if they are not transformed. In this dataset, two features show this behavior: capital-gain and capital-loss. Below is a visual distribution of the data for these two features.
#First separate the data into features and the target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)
# Visualize skewed continuous features of original data
vs.distribution(data)
As shown above, the capital-gain and capital-loss features are highly skewed. To reduce the impact of the outliers, I applied a logarithmic transformation to these features, using log(x + 1) rather than log(x) because both features contain many zero values.
#Log-transformation of the skewed features
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data = features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))
#Visualization of the new log distributions
vs.distribution(features_log_transformed, transformed = True)
After transforming the skewed features, I also normalized the continuous variables identified during the data exploration. Normalization ensures that the model does not weight features with larger value ranges more heavily than features with smaller ranges, so every numerical feature contributes on a comparable scale.
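For reference, the min-max scaling that MinMaxScaler applies below maps each numerical feature x onto the [0, 1] range:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$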
#Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
#Set up scaler for normalization
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
#Create new dataframe and apply scaling on the numerical data
features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])
#Check to ensure normalization was applied
display(features_log_minmax_transform.head(n = 10))
In addition to scaling the numerical data, I also one-hot encoded the categorical features into 0/1 indicator columns.
#Use the pandas get_dummies() function to one-hot encode the categorical features, then encode the income target as 0/1
features_final = pd.get_dummies(features_log_minmax_transform)
income = income_raw.apply(lambda x: 0 if x == '<=50K' else 1)
#Print the number of features after one-hot encoding
encoded = list(features_final.columns)
print ("There are now {} total features after one-hot encoding.".format(len(encoded)))
Now that the dataset has been preprocessed, I split the data into training and testing sets, using 80% of the data for training and 20% for testing.
#Import train_test_split (from sklearn.model_selection in current versions of scikit-learn)
from sklearn.model_selection import train_test_split
#Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final,
income,
test_size = 0.2,
random_state = 0)
#Show the results of the split
print ("The training set has {} samples.".format(X_train.shape[0]))
print ("The testing set has {} samples.".format(X_test.shape[0]))
For this project, the organization is interested in knowing which individuals make more than \$50,000 because they are the most likely to donate. For metrics, I used the accuracy score and the F-beta score with beta = 0.5, which places more emphasis on precision, since the organization cares more about correctly identifying the individuals it predicts to make more than \$50,000.
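For reference, the F-beta score combines precision and recall, and a beta below 1 weights precision more heavily:
$$F_{\beta} = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$$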
Additionally, to obtain a baseline for comparison with my model performance, I first calculated the results of a naive predictor that simply predicts every individual makes more than \$50,000.
#True positive is the sum of the 1's in the income column.
TP = np.sum(income)
#False positives are the 0's in the income column: records predicted as positive that are actually negative
FP = income.count() - TP
#True negative and false negative are 0 as there are no predicted negatives in the naive predictor
TN = 0 # No predicted negatives in the naive case
FN = 0 # No predicted negatives in the naive case
#Calculate the accuracy, precision and recall
accuracy = (TP+TN)/(TP+FP+TN+FN)
recall = (TP)/(TP+FN)
precision = (TP)/(TP+FP)
#Calculate F-score using the formula above for beta = 0.5 and correct values for precision and recall.
beta = 0.5
fscore = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
# Print the results
print ("The Naive Predictor results are the following: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))
Now that I have a baseline, I selected several algorithms to test initial results. The three algorithms I selected were Decision Trees, Random Forest, and Gaussian Naive Bayes. I chose these algorithms because our data contains both continuous and categorical features, and the dataset involves 103 features with thousands of samples.
Below, I have defined a function that trains a given learner and reports its accuracy and F-beta score, along with training and prediction times.
#Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
results = {}
#train learner
start = time() # Get start time
learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
end = time() # Get end time
#Calculate the training time
results['train_time'] = end - start
#Gets the predictions on the test set(X_test),
#then gets predictions on the first 300 training samples(X_train) using .predict()
start = time() # Get start time
predictions_test = learner.predict(X_test)
predictions_train = learner.predict(X_train[:300])
end = time() # Get end time
#Calculates the total prediction time
results['pred_time'] = end - start
#Computes accuracy on the first 300 training samples which is y_train[:300]
results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
#Computes accuracy on test set using accuracy_score()
results['acc_test'] = accuracy_score(y_test, predictions_test)
#Computes F-score on the the first 300 training samples using fbeta_score()
results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta =0.5)
#Computes F-score on the test set which is y_test
results['f_test'] = fbeta_score(y_test, predictions_test, beta =0.5)
# Returns the results
return results
With the results function created, I then implemented the three different algorithms to understand their initial model performance.
#Import the three supervised learning models from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
#Initialize the three models
clf_A = DecisionTreeClassifier(random_state=0)
clf_B = RandomForestClassifier(random_state=0)
clf_C = GaussianNB()
#Calculate the number of samples to be used for training at 1%, 10%, and 100% of the training data
samples_100 = len(y_train)
samples_10 = int(samples_100 * 0.10)
samples_1 = int(samples_100 * 0.01)
#Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
clf_name = clf.__class__.__name__
results[clf_name] = {}
for i, samples in enumerate([samples_1, samples_10, samples_100]):
results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_test, y_test)
#Show the initial results for visualization
vs.evaluate(results, accuracy, fscore)
Looking at the initial results, the Random Forest classifier has the best performance on both the training and testing data regardless of the training-set size used, followed by the Decision Tree and Gaussian Naive Bayes classifiers. The Random Forest classifier, however, takes the longest of the three to make predictions.
To improve these results, I selected the Random Forest classifier and tuned its parameters using GridSearchCV.
#Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
#Initializes the classifier
clf = RandomForestClassifier()
#Parameters for GridSearchCV
parameters = {'n_estimators':[10, 50, 100, 200], 'min_samples_split': [2, 5, 10, 25] }
#Creates fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta = 0.5)
#Performs grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
#Fits the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train, y_train)
# Gets the estimator
best_clf = grid_fit.best_estimator_
# Makes predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
# Reports the before-and-after scores
print ("Unoptimized model\n------")
print ("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print ("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print ("\nOptimized Model\n------")
print ("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print ("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
Below are the results after optimizing the model. As we can see, the accuracy score increased to about 86% and the F-score to about 73% after optimization.
Metric | Naive Predictor | Unoptimized Model | Optimized Model |
---|---|---|---|
Accuracy Score | 0.2478 | 0.8401 | 0.8607 |
F-score | 0.2917 | 0.6776 | 0.7292 |
#Print out the parameters used for the optimized model
print (best_clf)