The code is in the form of a Jupyter notebook and can be accessed on GitHub.
You have two options to get the data:
a) You can download train.csv and test.csv from Kaggle, but you will need to create a free Kaggle account first.
b) You can also get similar data from my GitHub account. Although the columns are the same, the passengers in the test and train files are not the same as in the Kaggle files. These files were derived from data obtained from Vanderbilt University.
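If you go with option b), here is a minimal sketch of loading the files straight from GitHub; the raw URL below is a placeholder, not the actual repository path:
import pandas as pd
# Placeholder URL - substitute the raw path of the actual repository
base = "https://raw.githubusercontent.com/<user>/<repo>/master/"
train = pd.read_csv(base + "train.csv")
test = pd.read_csv(base + "test.csv")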
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
We are going to look at the following:
- Exploratory analysis
- Shape of the data
- Figure out missing data
- Quick and fast prediction with random forests
- Perform K-Fold validation to get a more robust score
- Feature importance
- What important data is missing?
- How can we improve the data?
- Optimize the classification parameters
- Prepare a submission file
First, let's load the data and do a little bit of exploration
train = pd.read_csv('train.csv')
print("Number of rows: %d" % (train.shape[0]))
print("Number of columns: %d" % (train.shape[1]))
train.head()
A quick look shows that Cabin and Age are often missing
# Count the number of missing Cabin and Age values
cabin_nan_count = train['Cabin'].isnull().sum()
age_nan_count = train['Age'].isnull().sum()
print("There are %d missing cabin values and %d missing age values" % (cabin_nan_count, age_nan_count))
train[train.isnull().any(axis=1)].head(20)
What is apparent at this point:
- We cannot use the Name and Ticket features as-is for classification.
- A lot of Cabin and Age entries are missing (quantified just below).
- PassengerId doesn't help for classification.
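One more check worth running before modeling is a per-column null count; on the Kaggle file this also flags two missing Embarked values, which we'll deal with later:
# Missing values per column
print(train.isnull().sum())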
Let's do a quick RandomForest classification to set a baseline
# The following two helper functions will be reused throughout:
# Drop all the columns in a list of column names
def drop_columns(df, col_names):
    for col in col_names:
        try:
            df = df.drop(col, axis=1)
        except KeyError:
            # The column is not present; nothing to drop
            pass
    return df
# Label-encode a list of categorical features
def encode_features(df, col_names):
    from sklearn import preprocessing
    labelEncoder = preprocessing.LabelEncoder()
    for col in col_names:
        try:
            # Cast to str first so that NaN values become their own
            # category instead of making the encoder fail
            df[col] = labelEncoder.fit_transform(df[col].astype(str))
        except KeyError:
            # The column is not present in this DataFrame
            pass
    return df
We'll drop the features that we cannot use and encode categorical features.
At this point we won't care much about losing some data because we are just trying to set a baseline.
train = pd.read_csv('train.csv')
train = drop_columns(train, ['PassengerId', 'Ticket', 'Name', 'Cabin'])
train = encode_features(train, ['Sex', 'Embarked'])
train = train.dropna()
# Check that we have not lost too much data:
train.shape
Let's run a RandomForest classifier on this data:
from sklearn.model_selection import train_test_split
# (in older scikit-learn versions this lived in sklearn.cross_validation)
y = train['Survived']
X = train.drop('Survived', axis= 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41)
rf = RandomForestClassifier(n_estimators=100, max_depth=3)
rf.fit(X_train, y_train)
print (rf.score(X_test, y_test), rf.score(X_train, y_train))
- This result is not bad considering that there was no feature engineering up to this point.
- But we lost some rows when we dropped the NaN values, so the classification could be better; the check below shows how many.
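A quick check of how many rows the dropna() call cost us, assuming train.csv is still in the working directory:
raw_rows = pd.read_csv('train.csv').shape[0]
print("Dropped %d of %d rows" % (raw_rows - train.shape[0], raw_rows))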
We should perform K-Fold validation to find a more reliable score
from sklearn.model_selection import cross_val_score
from scipy.stats import sem

def mean_score(scores):
    return "Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores))
scores = cross_val_score(rf, X, y, cv=10)
print (mean_score(scores))
Is age relevant to predict survival on the Titanic?
In the baseline classification I dropped a lot of rows because they had an undefined age.
I'll use RandomForests to get feature importance values.
# This function returns two lists:
# features: a list of feature names, sorted from most to least important
# f_importance: the corresponding importance values
#
def get_feature_importance(rf, df):
    importances = rf.feature_importances_
    # Indices of the features, sorted by decreasing importance
    indices = np.argsort(importances)[::-1]
    features = []
    f_importance = []
    for f in range(df.shape[1]):
        # Index both the name and the value with the same sorted index
        # so the two lists stay aligned
        features.append(df.columns[indices[f]])
        f_importance.append(importances[indices[f]])
    return [features, f_importance]
features, importances = get_feature_importance(rf, X)
# This function returns a bar plot of the feature importances
#
def get_feature_importance_plot():
    plt.figure()
    plt.title("Feature importance")
    # Set the locations and labels of the xticks
    plt.xticks(range(len(features)), features)
    plt.bar(range(len(features)), importances, color="b", align="center", alpha=0.4)
    return plt
plt = get_feature_importance_plot()
plt.show()
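If you prefer numbers to bars, this optional snippet prints the same ranking as text (it reuses the features and importances lists returned above):
for name, imp in zip(features, importances):
    print("%-10s %.3f" % (name, imp))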
Age seems to be important, so let's estimate it
Since Age is often missing and is also one of the most important features, we should estimate it.
I noticed that passengers with the title Master. are young, while those with Mr. often are not, so I will use the title to estimate a passenger's age.
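A quick, rough way to check that intuition is to extract a few common titles with a regular expression and look at the median age per title (this re-reads the raw file so the check doesn't disturb train):
raw = pd.read_csv('train.csv')
raw['Title'] = raw['Name'].str.extract(r'(Master|Mr|Miss|Mrs|Rev|Dr)\.', expand=False)
print(raw.groupby('Title')['Age'].median())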
# Let's set the title based on the name: Mr., Mrs., Master., etc.
def set_titles(df):
    # There are also rarer titles such as Sir and Ms; we'll leave them
    # at the default 'Mr.' and assume they are adults
    titles = ['Master.', 'Mr.', 'Miss.', 'Mrs.', 'Rev.', 'Dr.']
    regex_titles = [r'.*(Master\.).*', r'.*(Mr\.).*', r'.*(Miss\.).*',
                    r'.*(Mrs\.).*', r'.*(Rev\.).*', r'.*(Dr\.).*']
    df['Title'] = 'Mr.'
    for title, regex in zip(titles, regex_titles):
        df.loc[df['Name'].str.contains(regex), ['Title']] = title
    return df
# Fill missing ages with the median age of passengers sharing the title
def set_ages_based_on_title(df):
    titles = ['Master.', 'Mr.', 'Miss.', 'Mrs.', 'Rev.', 'Dr.']
    for title in titles:
        has_title = df['Title'].str.startswith(title)
        median_age = df.loc[has_title, 'Age'].median()
        df.loc[df['Age'].isnull() & has_title, 'Age'] = median_age
    return df
def preprocess_improved(df, target_included=True):
    # Drop the columns that clearly don't influence the decision
    X = df
    X = X.drop('PassengerId', axis=1)
    X = X.drop('Ticket', axis=1)
    # Fill missing Embarked values with 'S' since it is the most common value
    X["Embarked"] = X["Embarked"].fillna("S")
    X = set_titles(X)
    X = set_ages_based_on_title(X)
    X = X.drop('Name', axis=1)
    # Non-numerical values are not accepted by scikit-learn - replace them with a label encoder
    X = encode_features(X, ['Embarked', 'Sex', 'Title', 'Cabin'])
    if target_included:
        # Create the target DF and remove it from the training set
        y = X['Survived']
        X = X.drop('Survived', axis=1)
        return (X, y)
    else:
        return X
train = pd.read_csv('train.csv')
X, y = preprocess_improved(train, target_included=True)
X.head()
rf = RandomForestClassifier(n_estimators=100, max_depth=3)
scores = cross_val_score(rf, X, y, cv=10)
print (mean_score(scores))
Find the best parameters for RandomForest classification
I will try to reduce the chances of overfitting by selecting the optimal max_depth, since a deeper tree has a higher chance of overfitting. Performing the search with 20-fold cross-validation makes the chosen value more reliable.
from sklearn.model_selection import GridSearchCV
# (in older scikit-learn versions this was sklearn.grid_search.GridSearchCV)
rf = RandomForestClassifier(max_features=7)
parameters = {'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10]}
clf_grid = GridSearchCV(rf, parameters, cv=20)
clf_grid.fit(X, y)
print(clf_grid.cv_results_['mean_test_score'])
print(clf_grid.best_estimator_)
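To pull out just the winning setting and its cross-validated accuracy:
print(clf_grid.best_params_)
print(clf_grid.best_score_)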
Prepare to submit
I think this is an acceptable first take on this data, so I'll create a submission file.
train = pd.read_csv('train.csv')
X, y = preprocess_improved(train, target_included=True)
test = pd.read_csv('test.csv')
coursera_X_test = preprocess_improved(test, target_included=False)
coursera_X_test[coursera_X_test.isnull().any(axis=1)]
Apparently a couple of rows have nulls. I'll fill them with median values.
# One passenger is missing the fare - set it to the median fare.
# Select on the Fare column itself so we don't overwrite valid fares
# in rows that are only missing the age.
median_fare = coursera_X_test['Fare'].median()
coursera_X_test.loc[coursera_X_test['Fare'].isnull(), 'Fare'] = median_fare
# Likewise for any passenger still missing the age
median_age = coursera_X_test['Age'].median()
coursera_X_test.loc[coursera_X_test['Age'].isnull(), 'Age'] = median_age
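A final sanity check that no nulls remain before predicting:
# Should print 0
print(coursera_X_test.isnull().sum().sum())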
Let's create the submission file. I will use the same predictor found in the previous step ("Find the best parameters for RandomForest classification") to make the prediction for the submission.
prediction = clf_grid.predict(coursera_X_test)
pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": prediction
}).to_csv("submit.csv", index=False)
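As a last step, peek at the file we're about to upload:
print(pd.read_csv("submit.csv").head())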