Enron Bonuses Feature Scaling and Prediction

From: https://github.com/ksatola

Description

Deploy k-means clustering on the Enron financial features data, with 2 clusters specified as a parameter. There is also a feature scaling section at the bottom.

Origin

This is a Python 3 version of a mini-project from Udacity's free Intro to Machine Learning course.

Steps to Prepare

none

Additional Information

none

In [1]:
import sys
from time import time
import pickle
import numpy
import matplotlib.pyplot as plt
from feature_format import featureFormat, targetFeatureSplit

%matplotlib inline
In [2]:
### Load in the dict of dicts containing all the data on each person in the dataset
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)
### There's an outlier - remove it!
### (pop returns the removed entry, which is why it is echoed as the cell output)
data_dict.pop("TOTAL", 0)
Out[2]:
{'salary': 26704229,
 'to_messages': 'NaN',
 'deferral_payments': 32083396,
 'total_payments': 309886585,
 'loan_advances': 83925000,
 'bonus': 97343619,
 'email_address': 'NaN',
 'restricted_stock_deferred': -7576788,
 'deferred_income': -27992891,
 'total_stock_value': 434509511,
 'expenses': 5235198,
 'from_poi_to_this_person': 'NaN',
 'exercised_stock_options': 311764000,
 'from_messages': 'NaN',
 'other': 42667589,
 'from_this_person_to_poi': 'NaN',
 'poi': False,
 'long_term_incentive': 48521928,
 'shared_receipt_with_poi': 'NaN',
 'restricted_stock': 130322299,
 'director_fees': 1398517}
In [3]:
def Draw(pred, features, poi, mark_poi=False, name="13_K-MeansClustering.png", f1_name="feature 1", f2_name="feature 2"):
    """ Some plotting code designed to help you visualize your clusters """

    ### Plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for ii, cluster in enumerate(pred):
        plt.scatter(features[ii][0], features[ii][1], color=colors[cluster])

    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for ii, pp in enumerate(pred):
            if poi[ii]:
                plt.scatter(features[ii][0], features[ii][1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    plt.savefig(name)
    plt.show()
In [4]:
### The input features we want to use 
### Can be any key in the person-level dictionary (salary, director_fees, etc.) 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
poi  = "poi"
features_list = [poi, feature_1, feature_2]
data = featureFormat(data_dict, features_list)
poi, finance_features = targetFeatureSplit(data)
In [5]:
### In the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to 
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2 in finance_features:
    plt.scatter(f1, f2)
plt.savefig('13_K-MeansClustering1a.png')
plt.show()
In [6]:
### Cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
pred = kmeans.fit_predict(finance_features)
In [7]:
### Rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    #Draw(pred, finance_features, poi, mark_poi=False, name="clusters.pdf", f1_name=feature_1, f2_name=feature_2)
    Draw(pred, finance_features, poi, mark_poi=False, name="13_K-MeansClustering1b.png", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("No predictions object named pred found, no clusters to plot")
In [8]:
# Now rerun clustering using 3 features
### The input features we want to use 
### Can be any key in the person-level dictionary (salary, director_fees, etc.) 
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
poi  = "poi"
features_list = [poi, feature_1, feature_2, feature_3]
data = featureFormat(data_dict, features_list)
poi, finance_features = targetFeatureSplit(data)

Add a third feature to features_list, “total_payments”. Now rerun clustering, using 3 input features instead of 2 (we can still only visualize the original 2 dimensions). Compare the plot of the clusters to the one you obtained with 2 input features. Do any points switch clusters? How many? This new clustering, using 3 features, couldn’t have been guessed by eye--it was the k-means algorithm that identified it.

(You'll need to change the code that makes the scatterplot to accommodate 3 features instead of 2, see the comments below for instructions on how to do this.)
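One way to answer "how many points switch clusters" is to compare the two label arrays directly. The sketch below is illustrative only: `count_switched`, `pred_2_features`, and `pred_3_features` are hypothetical names, and the toy labels are made up. It assumes both labelings cover the same points in the same order, which may not hold if the feature formatting step drops different rows for different feature lists.

```python
import numpy as np

def count_switched(labels_a, labels_b):
    """Count points whose cluster changed between two 2-cluster labelings.

    K-means label numbers are arbitrary, so 0 and 1 may be swapped between
    runs; take the smaller count over both ways of matching the labels.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    direct = int(np.sum(a != b))        # labels compared as-is
    flipped = int(np.sum(a != 1 - b))   # labels compared after swapping 0 and 1
    return min(direct, flipped)

# Toy labels (hypothetical, not taken from the Enron data)
pred_2_features = [0, 0, 1, 1, 0]
pred_3_features = [1, 1, 0, 1, 1]
print(count_switched(pred_2_features, pred_3_features))  # 1
```

In the toy example the two runs use opposite label numbers, so only the fourth point counts as having switched.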

In [9]:
### This is the "clustering with 3 features" part of the mini-project,
### so each row of finance_features now holds 3 values; the third is
### ignored here because the scatterplot only shows 2 dimensions
for f1, f2, _ in finance_features:
    plt.scatter(f1, f2)
plt.savefig('13_K-MeansClustering2a.png')
plt.show()
In [10]:
### Cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
pred = kmeans.fit_predict(finance_features)
In [11]:
### Rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="13_K-MeansClustering2b.png", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print("No predictions object named pred found, no clusters to plot")

In the next lesson, we’ll talk about feature scaling. It’s a type of feature preprocessing that you should perform before some classification and regression tasks. Here’s a sneak preview that should call your attention to the general outline of what feature scaling does.
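As a rough preview, min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0.0 and the largest becomes 1.0. The sketch below is a plain-Python illustration of that formula; `min_max_rescale` is a hypothetical helper and the sample values are made up.

```python
def min_max_rescale(values):
    """Rescale a list of numbers to the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero
        raise ValueError("cannot rescale: all values are equal")
    return [(v - lo) / (hi - lo) for v in values]

# Made-up sample values, floats rather than integers
print(min_max_rescale([10.0, 20.0, 50.0]))  # [0.0, 0.25, 1.0]
```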

What are the maximum and minimum values taken by the “exercised_stock_options” feature used in this example?

(NB: if you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and just have to put in a number. So for this question, go back to data_dict and look for the maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)

In [12]:
stocks = []
for key, value in data_dict.items():
    if value['exercised_stock_options'] != 'NaN':
        stocks.append(value['exercised_stock_options'])

print(min(stocks), max(stocks))
3285 34348384

What are the maximum and minimum values taken by “salary”?

(NB: same caveat as in the last quiz. If you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and just have to put in a number. So for this question, go back to data_dict and look for the maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)

In [13]:
salaries = []
for key, value in data_dict.items():
    if value['salary'] != 'NaN':
        salaries.append(value['salary'])

print(min(salaries), max(salaries))
477 1111258
In [14]:
# Feature Scaling 

# Apply feature scaling to your k-means clustering code from the last lesson, 
# on the “salary” and “exercised_stock_options” features (use only these two features). 
# What would be the rescaled value of a "salary" feature that had an original value of $200,000, 
# and an "exercised_stock_options" feature of $1 million? (Be sure to represent these numbers as floats, not integers!)

from sklearn.preprocessing import MinMaxScaler

salary = []
ex_stock = []

for users in data_dict:
    # Check each feature independently so that a missing salary does not
    # also discard that person's exercised_stock_options value
    val = data_dict[users]["salary"]
    if val != 'NaN':
        salary.append(float(val))
    val = data_dict[users]["exercised_stock_options"]
    if val != 'NaN':
        ex_stock.append(float(val))

# Keep only the minimum, the value of interest, and the maximum
salary = [min(salary), 200000.0, max(salary)]
ex_stock = [min(ex_stock), 1000000.0, max(ex_stock)]

print('Salary: {}'.format(salary))
print('Exercised stock options: {}'.format(ex_stock))

# MinMaxScaler expects a 2D array: one row per sample, one column per feature
salary = numpy.array([[e] for e in salary])
ex_stock = numpy.array([[e] for e in ex_stock])

scaler_salary = MinMaxScaler()
scaler_stock = MinMaxScaler()

rescaled_salary = scaler_salary.fit_transform(salary)
rescaled_stock = scaler_stock.fit_transform(ex_stock)

print('Rescaled salary: {}'.format(rescaled_salary))
print('Rescaled exercised stock options: {}'.format(rescaled_stock))
Salary: [477.0, 200000.0, 1111258.0]
Exercised stock options: [3285.0, 1000000.0, 34348384.0]
Rescaled salary: [[0.        ]
 [0.17962407]
 [1.        ]]
Rescaled exercised stock options: [[0.        ]
 [0.02902059]
 [1.        ]]
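As a cross-check, the rescaled salary can be reproduced by hand with the min-max formula (x - min) / (max - min), using the salary minimum and maximum printed earlier. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Minimum and maximum salary as printed by the earlier cells
salary_min, salary_max = 477.0, 1111258.0
salary_value = 200000.0

# Min-max formula: (x - min) / (max - min)
rescaled = (salary_value - salary_min) / (salary_max - salary_min)
print(round(rescaled, 8))  # 0.17962407
```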