Saturday, September 10, 2016

Titanic with scikit-learn and RandomForests

Titanic - RandomForests-blog This post highlights the use of scikit-learn RandomForests to classify Titanic passengers by their chances to survive.

The code is in the form of a Jupyter notebook and can be accessed at github.

You have two possibilities to get the data:


a) You can get the train.csv and test.csv data from Kaggle, but in order to get the data you will need to create a free account with Kaggle.  

b) You can also get similar data from my github account.  Although the columns are the same, the passengers in the test and train files are not the same as in the Kaggle files.  These files were derived from data obtained from Vanderbilt University.
In [140]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

We are going to look at the following:

  • Exploratory analysis
    • Shape of the data
    • Figure out missing data
  • Quick and fast prediction with random forests.
  • Perform K-Fold validation to get a more robust score.
  • Feature importance
  • What important data is missing?
  • How can we improve the data?
  • Optimize the classification parameters
  • Prepare a submit file

First let's load the data and do a little bit of exploration

In [141]:
train = pd.read_csv('train.csv')
print("Number of rows: %d" % (train.shape[0]))
print("Number of columns: %d" % (train.shape[1]))
train.head()
Number of rows: 891
Number of columns: 12
Out[141]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

A quick look shows that cabin and age are often missing

In [142]:
# Count the number of missing Cabin and age  values
cabin_nan_count = train[train['Cabin'].isnull()]['PassengerId'].count()
age_nan_count = train[train['Age'].isnull()]['PassengerId'].count()
print("There are %d missing cabin values and %d missing age values" % (cabin_nan_count, age_nan_count))

train[train.isnull().any(axis=1)].head(20)
There are 687 missing cabin values and 177 missing age values
Out[142]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C

What is apparent at this point:

  • We cannot use the name and ticket features AS IS for classification.
  • A lot of cabin and age entries are missing
  • The PassengerId doesn't help for classification

Let's do a quick RandomForest classification to set a baseline

In [143]:
# The following 2 functions are just useful:

# Drop all the columns in a list of column names 
def drop_columns(df, col_names):    
    for col in col_names:
        try:
            df = df.drop(col, axis=1)
        except:
            pass
    return df

# Encode a list of features
def encode_features(df, col_names):  
    from sklearn import preprocessing
    labelEncoder = preprocessing.LabelEncoder()
    for col in col_names:
        try:
            labelEncoder.fit(df[col])
            df[col] = labelEncoder.transform(df[col]) 
        except:
            pass
    return df

We'll drop the features that we cannot use and encode categorical features.
At this point we won't care much about losing some data because we are just trying to set a baseline.

In [144]:
train = pd.read_csv('train.csv')
train = drop_columns(train, ['PassengerId', 'Ticket', 'Name', 'Cabin'])
train = encode_features(train, ['Sex', 'Embarked'])
train = train.dropna()

# Check that we have not lost too much data:
train.shape
Out[144]:
(714, 8)

Let's run a RandomForest classifier on this data:

In [145]:
from sklearn.cross_validation import train_test_split

y = train['Survived']
X = train.drop('Survived', axis= 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41)
rf = RandomForestClassifier(n_estimators=100,  max_depth=3)
rf.fit(X_train, y_train)
print (rf.score(X_test, y_test), rf.score(X_train, y_train))
(0.82122905027932958, 0.82990654205607473)
  • This result is not bad considering that there was no feature engineering up to this point.
  • But we lost some rows when we dropped the nan values, so the classification could be better.

We should perform K-Fold validation to find a more reliable score

In [146]:
from sklearn.cross_validation import cross_val_score
from scipy.stats import sem
def mean_score(scores):
    return ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))

scores = cross_val_score(rf, X, y, cv=10)
print (mean_score(scores))
Mean score: 0.801 (+/-0.017)

Is age relevant to predict survival at the Titanic?

In the baseline classification I removed a lot of columns because they had an undefined age.

I'll use RandomForests to get feature importance values.

In [147]:
# This function will return two lists:
#   features:       A list of feature names
#   f_importances:  A list of float values that indicate how important the respective feature is.
#
def get_feature_importance(rf, df):
    importances = rf.feature_importances_
    std = np.std([tree.feature_importances_ for tree in rf.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]
    indices, importances
    features = []
    f_importance = []
    for f in range(df.shape[1]):
        #print("%s. feature %d (%f)" % (X.columns[f], indices[f], importances[indices[f]]))
        features.append(df.columns[f])
        f_importance.append(importances[indices[f]])
    return [features, f_importance]
In [148]:
features, importances = get_feature_importance(rf, X)


# This function returns a plot for the feature importances
#
def get_feature_importance_plot():
    import matplotlib.pyplot as plt
    plt.figure()
    plt.title("Feature importance")

    # set the locations and labels of the xticks
    plt.xticks( range(len(features)), features)

    plt.bar(range(len(features)), importances, color="b",  align="center", alpha = 0.4)
    return plt

plt = get_feature_importance_plot()
plt.show()

Age seems to be important, so let's estimate it

Since the age is often missing and is also one of the most important features, we should estimate it.
I noticed that names with title Master. are young, while Mr. is often not so young, so I will use the title to estimate the age of the passenger.

In [149]:
# Let's set the age based on the title: Mr. Mrs. Master, etc.
def set_titles(df):
    titles = ['Master.', 'Mr. ', 'Miss.', 'Mrs', 'Rev.', 'Dr']  #There are also Sir, Rev, Dr, and Ms, but we'll assume they are all adults
    regex_titles = [ r'.*(Master\.).*',  r'.*(Mr\.).*',  r'.*(Miss\.).*',  r'.*(Mrs\.).*',  r'.*(Rev\.).*',  r'.*(Dr\.).*']
    df['Title'] = 'Mr.'
    i=0
    for title in titles:
        regex = regex_titles[i]
        df.loc[df['Name'].str.contains(regex),['Title']] = title
        i = i + 1
    
    return df

def set_ages_based_on_title(df):
    titles = ['Master.', 'Mr. ', 'Miss.', 'Mrs', 'Rev.', 'Dr']      
    for title in titles:
        median_age = df.loc[df['Title'].str.startswith(title) & df['Title'].str.startswith(title), ['Age']].median()
        df.loc[(pd.isnull(df['Age'])==True) & (df['Title'].str.startswith(title)), ['Age']] = float(median_age)
    return df
        
In [150]:
def preprocess_improved(df, target_included=True):
    # Drop the columns that clearly don't influence the decision
    X = df
    X = X.drop('PassengerId', axis= 1)
    X = X.drop('Ticket', axis= 1)
    # Set embarked to 'S' since it is the most common value.
    X["Embarked"] = X["Embarked"].fillna("S")
    X = set_titles(X)
    X = set_ages_based_on_title(X)
    X = X.drop('Name', axis= 1)
    
    # Non-numerical values are not accepted by scikit-learn - replace them with a label encoder
    X = encode_features(X, ['Embarked', 'Sex', 'Title', 'Cabin'])
    
    if (target_included):
        # Create the target DF and remove it from the training set
        y = X['Survived']
        X = X.drop('Survived', axis= 1)
        return (X, y)
    else:
        return X

train = pd.read_csv('train.csv')
X, y = preprocess_improved(train, target_included=True)
X.head()
c:\users\owner\appdata\local\enthought\canopy\user\lib\site-packages\ipykernel\__main__.py:9: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Out[150]:
Pclass Sex Age SibSp Parch Fare Cabin Embarked Title
0 3 1 22.0 1 0 7.2500 0 2 4
1 1 0 38.0 1 0 71.2833 82 0 5
2 3 0 26.0 0 0 7.9250 0 2 2
3 1 0 35.0 1 0 53.1000 56 2 5
4 3 1 35.0 0 0 8.0500 0 2 4
In [151]:
rf = RandomForestClassifier(n_estimators=100,  max_depth=3)
scores = cross_val_score(rf, X, y, cv=10)
print (mean_score(scores))
Mean score: 0.818 (+/-0.010)

Find the best parameters for RandomForest classification

I will try to reduce the chances of overfitting by selecting the optimal max_depth. A deeper tree leads to a higher chance of overfitting. By performing a grid search with a 20-fold cross-validation I reduce the chances of overfitting.

In [152]:
from sklearn import grid_search
from sklearn.cross_validation import train_test_split
rf = RandomForestClassifier(max_features=7)
parameters = {'max_depth':[2,3,4,5,6,7,8,9,10]}

clf_grid = grid_search.GridSearchCV(rf, parameters, cv=20)
clf_grid.fit(X, y)
    
print (clf_grid.grid_scores_)
print (clf_grid.best_estimator_ )
[mean: 0.79125, std: 0.04908, params: {'max_depth': 2}, mean: 0.82267, std: 0.05320, params: {'max_depth': 3}, mean: 0.80920, std: 0.05275, params: {'max_depth': 4}, mean: 0.83502, std: 0.05917, params: {'max_depth': 5}, mean: 0.82716, std: 0.06116, params: {'max_depth': 6}, mean: 0.81481, std: 0.04635, params: {'max_depth': 7}, mean: 0.82379, std: 0.05335, params: {'max_depth': 8}, mean: 0.81481, std: 0.05518, params: {'max_depth': 9}, mean: 0.81481, std: 0.05590, params: {'max_depth': 10}]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=7, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Prepare to submit

I think this is an acceptable first take at this data, so I'll create a submit file.

In [153]:
train = pd.read_csv('train.csv')
X, y = preprocess_improved(train, target_included=True)
test = pd.read_csv('test.csv')
coursera_X_test = preprocess_improved(test, target_included=False)
coursera_X_test[coursera_X_test.isnull().any(axis=1)]
c:\users\owner\appdata\local\enthought\canopy\user\lib\site-packages\ipykernel\__main__.py:9: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Out[153]:
Pclass Sex Age SibSp Parch Fare Cabin Embarked Title
88 3 0 NaN 0 0 7.75 0 1 3
152 3 1 60.5 0 0 NaN 0 2 4

Apparently a couple of rows have nulls. I'll use median values to set them.

In [154]:
# One passenger is missing the fare - let's set it to the median
median_fare = coursera_X_test['Fare'].median()
coursera_X_test.loc[coursera_X_test.isnull().any(axis=1), ['Fare']] = median_fare

# One passenger is missing the age - let's set it to the median
median_age = coursera_X_test['Age'].median()
coursera_X_test.loc[coursera_X_test.isnull().any(axis=1), ['Age']] = median_age

Let's create a submit file. I will use the same predictor found in the previous step ("Find the best parameters for RandomForest classification") to make the prediction for the submission.

In [155]:
prediction = clf_grid.predict(coursera_X_test)

pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": prediction
        }).to_csv("submit.csv", index=False)

No comments:

Post a Comment