Enron Bonuses Feature Scaling and Prediction

Description

Deploy K-Means clustering on the Enron financial features data, with 2 clusters specified as a parameter. There is also a feature scaling section at the bottom.

Origin

This is a Python 3 version of a mini-project from Udacity's free Intro to Machine Learning course.

Steps to Prepare

none

Additional Information

none

import sys
from time import time
import pickle
import numpy
import matplotlib.pyplot as plt
from feature_format import featureFormat, targetFeatureSplit

%matplotlib inline

### Load in the dict of dicts containing all the data on each person in the dataset
### (the with-block closes the file once the pickle has been read)
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
### There's an outlier - remove it! 
data_dict.pop("TOTAL", 0)

{'salary': 26704229,
 'to_messages': 'NaN',
 'deferral_payments': 32083396,
 'total_payments': 309886585,
 'loan_advances': 83925000,
 'bonus': 97343619,
 'email_address': 'NaN',
 'restricted_stock_deferred': -7576788,
 'deferred_income': -27992891,
 'total_stock_value': 434509511,
 'expenses': 5235198,
 'from_poi_to_this_person': 'NaN',
 'exercised_stock_options': 311764000,
 'from_messages': 'NaN',
 'other': 42667589,
 'from_this_person_to_poi': 'NaN',
 'poi': False,
 'long_term_incentive': 48521928,
 'shared_receipt_with_poi': 'NaN',
 'restricted_stock': 130322299,
 'director_fees': 1398517}
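
The values returned by the pop above are far larger than any individual's pay, which suggests "TOTAL" is a summary row rather than a person. A minimal sketch of how such an entry could be spotted is shown here; the `sample` dictionary is a hypothetical miniature and not data from the pickle, and ranking by salary is just one possible check.

```python
# Hypothetical miniature of data_dict; the real one is loaded from the pickle.
sample = {
    "PERSON A": {"salary": 250000},
    "PERSON B": {"salary": "NaN"},
    "TOTAL": {"salary": 26704229},
}

# Keep only entries with a real salary, since missing values are stored as the string 'NaN'.
valid = {name: rec["salary"] for name, rec in sample.items() if rec["salary"] != "NaN"}

# The key with the largest salary is the first candidate to inspect by hand.
largest = max(valid, key=valid.get)
print(largest, valid[largest])
```

Inspecting the top entry by hand before removing it is safer than dropping the maximum automatically, because a legitimate person could also hold the largest value.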

def Draw(pred, features, poi, mark_poi=False, name="13_K-MeansClustering.png", f1_name="feature 1", f2_name="feature 2"):
    """ Some plotting code designed to help you visualize your clusters """

    ### Plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for point, label in zip(features, pred):
        ### Only the first two features are drawn, even if more were used to cluster
        plt.scatter(point[0], point[1], color=colors[label])

    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for point, is_poi in zip(features, poi):
            if is_poi:
                plt.scatter(point[0], point[1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    plt.savefig(name)
    plt.show()

### The input features we want to use 
### Can be any key in the person-level dictionary (salary, director_fees, etc.) 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
poi  = "poi"
features_list = [poi, feature_1, feature_2]
data = featureFormat(data_dict, features_list)
poi, finance_features = targetFeatureSplit(data)

### In the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to 
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2 in finance_features:
    plt.scatter(f1, f2)
plt.savefig('13_K-MeansClustering1a.png')
plt.show()

### Cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
pred = kmeans.fit_predict(finance_features)

### Rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    #Draw(pred, finance_features, poi, mark_poi=False, name="clusters.pdf", f1_name=feature_1, f2_name=feature_2)
    Draw(pred, finance_features, poi, mark_poi=False, name="13_K-MeansClustering1b.png", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("No predictions object named pred found, no clusters to plot")

# Now rerun clustering using 3 features
### The input features we want to use 
### Can be any key in the person-level dictionary (salary, director_fees, etc.) 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
poi  = "poi"
features_list = [poi, feature_1, feature_2, feature_3]
data = featureFormat(data_dict, features_list)
poi, finance_features = targetFeatureSplit(data)

Add a third feature, “total_payments”, to features_list. Now rerun clustering using 3 input features instead of 2 (we can still only visualize the original 2 dimensions). Compare the plot of the clusterings with the one you obtained with 2 input features. Do any points switch clusters? How many? This new clustering, using 3 features, couldn’t have been guessed by eye; it was the k-means algorithm that identified it.

(You'll need to change the code that makes the scatterplot to accommodate 3 features instead of 2; see the comments below for instructions on how to do this.)

### In the "clustering with 3 features" part of the mini-project,
### each row of finance_features holds three values, so the loop
### unpacks three names; the third is discarded because only the
### first two features are plotted
for f1, f2, _ in finance_features:
    plt.scatter(f1, f2)
plt.savefig('13_K-MeansClustering2a.png')
plt.show()

### Cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
pred = kmeans.fit_predict(finance_features)

### Rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="13_K-MeansClustering2b.png", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("No predictions object named pred found, no clusters to plot")
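
One way to answer "Do any points switch clusters? How many?" without relying on the plots is to keep the labels from the 2-feature run and compare them with the labels from the 3-feature run. The sketch below assumes exactly two clusters and assumes both label lists describe the same people in the same order, which should be checked rather than taken for granted. The function `count_switched` and the sample labels are hypothetical and are not results from the Enron data.

```python
def count_switched(labels_a, labels_b):
    """Count points whose cluster differs between two 2-cluster labelings.

    Cluster ids are arbitrary, so cluster 0 in one run may be cluster 1 in
    the other. With two clusters there are only two possible matchings, so
    the smaller of the two mismatch counts is the number of real switches.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    mismatches = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return min(mismatches, len(labels_a) - mismatches)


# Hypothetical labels, not results from the Enron data.
pred_2_features = [0, 0, 0, 1, 1, 0]
pred_3_features = [1, 1, 0, 0, 0, 1]
switched = count_switched(pred_2_features, pred_3_features)
print(switched)
```

In the notebook this would mean saving the first `pred` under another name before the second call to `fit_predict` overwrites it.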

In the next lesson, we’ll talk about feature scaling. It’s a type of feature preprocessing that you should perform before some classification and regression tasks. Here’s a sneak preview that should call your attention to the general outline of what feature scaling does.
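
The scaler used later in this notebook, MinMaxScaler, performs min-max rescaling: each value x becomes (x - x_min) / (x_max - x_min), so the smallest observed value maps to 0 and the largest to 1. A minimal pure-Python sketch of that formula follows. The function name `min_max_scale` and the sample values are hypothetical, and returning 0.0 for a constant feature is a choice made here for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Every value is identical, so the formula would divide by zero.
        # Returning zeros is an assumption made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical feature values, not taken from the Enron data.
weights = [115.0, 140.0, 175.0]
scaled = min_max_scale(weights)
print(scaled)
```

Because the result depends only on the minimum and maximum, a single extreme value compresses every other point toward 0, which is one reason outliers are removed before scaling.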

What are the maximum and minimum values taken by the “exercised_stock_options” feature used in this example?

(NB: if you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and just have to put in a number. So for this question, go back to data_dict and look for the maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)

stocks = []
for key, value in data_dict.items():
    if value['exercised_stock_options'] != 'NaN':
        stocks.append(value['exercised_stock_options'])

print(min(stocks), max(stocks))

3285 34348384

What are the maximum and minimum values taken by “salary”?

(NB: same caveat as in the last quiz. If you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and just have to put in a number. So for this question, go back to data_dict and look for the maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)

salaries = []
for key, value in data_dict.items():
    if value['salary'] != 'NaN':
        salaries.append(value['salary'])

print(min(salaries), max(salaries))

477 1111258

# Feature Scaling 

# Apply feature scaling to your k-means clustering code from the last lesson, 
# on the “salary” and “exercised_stock_options” features (use only these two features). 
# What would be the rescaled value of a "salary" feature that had an original value of $200,000, 
# and an "exercised_stock_options" feature of $1 million? (Be sure to represent these numbers as floats, not integers!)

from sklearn.preprocessing import MinMaxScaler

salary = []
ex_stock = []

for person in data_dict.values():
    # Check each feature independently, so that a missing salary does not
    # also discard that person's exercised stock options
    if person["salary"] != 'NaN':
        salary.append(float(person["salary"]))
    if person["exercised_stock_options"] != 'NaN':
        ex_stock.append(float(person["exercised_stock_options"]))

# Only the minimum, the value of interest, and the maximum affect min-max scaling
salary = [min(salary), 200000.0, max(salary)]
ex_stock = [min(ex_stock), 1000000.0, max(ex_stock)]

print('Salary: {}'.format(salary))
print('Exercised stock options: {}'.format(ex_stock))

# MinMaxScaler expects a 2D array with one column per feature
salary = numpy.array([[e] for e in salary])
ex_stock = numpy.array([[e] for e in ex_stock])

scaler_salary = MinMaxScaler()
scaler_stock = MinMaxScaler()

rescaled_salary = scaler_salary.fit_transform(salary)
rescaled_stock = scaler_stock.fit_transform(ex_stock)

print('Rescaled salary: {}'.format(rescaled_salary))
print('Rescaled exercised stock options: {}'.format(rescaled_stock))

Salary: [477.0, 200000.0, 1111258.0]
Exercised stock options: [3285.0, 1000000.0, 34348384.0]
Rescaled salary: [[0.        ]
 [0.17962407]
 [1.        ]]
Rescaled exercised stock options: [[0.        ]
 [0.02902059]
 [1.        ]]
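
The quiz text asks for scaling to be applied to the k-means input itself, which the cells above do not show. As a rough sketch of that step, the code below rescales each column of a list of rows shaped like `finance_features`. The helper `scale_columns` and the sample rows are hypothetical, and zero-filling a constant column is an assumption made here.

```python
def scale_columns(rows):
    """Min-max scale each column of a list of equal-length rows to [0, 1]."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no information; map it to 0.0 (assumption).
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    # Transpose back so that each inner list is one data point again.
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical (salary, exercised_stock_options) rows, not taken from the dataset.
rows = [
    [100000.0, 500000.0],
    [200000.0, 1000000.0],
    [300000.0, 30000000.0],
]
scaled_rows = scale_columns(rows)
print(scaled_rows)
```

The scaled rows could then be passed to `kmeans.fit_predict` in place of `finance_features`. Since k-means relies on Euclidean distance, putting both features on a 0 to 1 range keeps the feature with the larger raw values from dominating the clustering.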