Image by S. Hermann & F. Richter from Pixabay

How to place top 10% in the Titanic Kaggle competition





When starting out with your Kaggle journey, you might stumble across Kaggle competitions. The place to challenge yourself. To compete for the highest accuracy. And to learn how to try every machine learning algorithm in existence.

One of these Kaggle competitions is the infamous Titanic ML competition. It’s where most beginners (like myself) start off, and also where the leaderboard is filled with undeniably fake 100% accuracy.

As a beginner in machine learning and data science, I thought it’d be a good idea to have a crack at the competition. While it didn’t offer any luxurious prize, it did provide a lot of learning points.

Like all data science initiatives, let’s start off with an exploratory data analysis of the Kaggle Titanic dataset.


Exploratory Data Analysis

The most fundamental step with all data science work is data analysis. Analysing and understanding our data will help us tremendously when building our model.

But of course, before we dive into that, let’s import all our relevant Python libraries.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import os

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

import string
import math

seed = 88888888 # eight eights for good luck

Once we’ve imported our libraries, we can load the data and have a quick look at it using the .head() function, which shows us the first 5 entries in our data.

tdf = pd.read_csv('/kaggle/input/titanic/train.csv')
tdf.head()
Titanic Kaggle – Dataset (head)

As we can see from the data, there are 12 variables:

  • PassengerId: just an ID, not very useful.
  • Survived: 0 = No, 1 = Yes.
  • Pclass: a proxy for socioeconomic status; 1 = Upper, 2 = Middle, 3 = Lower.
  • Name: the passenger’s name.
  • Sex: male or female.
  • Age: age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
  • SibSp: number of siblings/spouses on board.
  • Parch: number of parents/children on board.
  • Ticket: the ticket number.
  • Fare: the price of their ticket.
  • Cabin: the cabin the person was staying in.
  • Embarked: where the person embarked onto the Titanic; C = Cherbourg, Q = Queenstown, S = Southampton.

Besides using .head(), it’s also important to use the .tail() function, which shows us the last 5 entries in the data. This lets us check whether the data loaded correctly and that there are no formatting issues.

tdf.tail()
Titanic Kaggle – Dataset (tail)

As we can see, there don’t appear to be any formatting issues. While this is probably expected since the data comes from an official Kaggle competition, it doesn’t hurt to be careful.

Besides that, we also want to know if any of our columns contain null values. Null values could be troublesome as machine learning models tend to do poorly with them. Finding them can easily be done by using the following code:

tdf.columns[tdf.isnull().any()] # Index(['Age', 'Cabin', 'Embarked'], dtype='object')

From the code, it seems that “Age”, “Cabin”, and “Embarked” have missing values. That’s good to know, and we’ll deal with them before we start creating models.
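
To see how large these gaps actually are, an optional follow-up check is to count the missing values per column:

# Optional check: count missing values per column to see how big the gaps are
tdf.isnull().sum().sort_values(ascending=False).head()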


Variables Analysis

While we do have 12 variables to work with, they most likely differ in how useful they are for predicting who survived.

Let’s create some graphs and find relationships between the variables and the chance of someone surviving.

We can first take a look at a survived-gender matrix.

sex_survived = tdf.groupby(['Sex', 'Survived']).size().unstack('Survived', fill_value=0) # Please don't demonetize me :'(
sex_survived
Titanic Kaggle – Survived-Gender Matrix

According to our matrix, it seems that more females survived the Titanic in comparison to their male counterparts.
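
If you prefer proportions over raw counts, a small follow-up on the same sex_survived table (just a sketch, not required for the rest of the analysis) would be:

# Convert the counts into per-gender survival rates
sex_survived.div(sex_survived.sum(axis=1), axis=0)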

We can also create a bar chart if we want to analyse the data visually.

fig = plt.figure()
axi = fig.add_axes([0, 0, 1, 1])

# Died 
axi.bar(sex_survived.index, sex_survived[0], color = '#ea1730')

# Survived
axi.bar(sex_survived.index, sex_survived[1], bottom = sex_survived[0], color = '#1792ea')
axi.set_ylabel('Count')
axi.set_xlabel('Sex')
axi.set_title('Titanic - Survivorship by gender')
axi.legend(labels=['Deceased', 'Survived'])
Titanic Kaggle – Survivorship by Gender

By modifying our code a little, we can also create other bar charts for the other variables.

Titanic Kaggle – Survivorship by embarked port
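
As a sketch, the chart for the “Embarked” column only needs the grouping variable swapped; the plotting code stays essentially the same:

# Same idea, grouped by port of embarkation instead of gender
emb_survived = tdf.groupby(['Embarked', 'Survived']).size().unstack('Survived', fill_value=0)

fig = plt.figure()
axi = fig.add_axes([0, 0, 1, 1])
axi.bar(emb_survived.index, emb_survived[0], color = '#ea1730')
axi.bar(emb_survived.index, emb_survived[1], bottom = emb_survived[0], color = '#1792ea')
axi.set_ylabel('Count')
axi.set_xlabel('Embarked')
axi.set_title('Titanic - Survivorship by embarked port')
axi.legend(labels=['Deceased', 'Survived'])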

Feature Engineering

After thoroughly analysing the data, it’s time to process and select the appropriate variables for our model.

If you remember previously, we have 3 variables that had null values. These were: “Age”, “Cabin”, and “Embarked”.

For Age and Embarked, we would just replace any missing values with the mean and mode respectively.

tdf['Age'].fillna(tdf['Age'].mean(), inplace = True)
tdf['Embarked'].fillna(tdf['Embarked'].mode()[0], inplace = True)

As for the “Cabin” variable, we don’t plan to use it directly, as the cabin values are nearly unique for each passenger. Instead, we can extract the passenger’s allocated deck from the cabin prefix. This might be helpful as some decks might have had a higher chance of survival.

def substrings_in_string(big_string, substrings):
    # Return the first substring found in big_string, or NaN if none match.
    # Missing values become the string 'nan' via str(), so a 'nan' key in the
    # substring list can catch passengers with no recorded cabin.
    big_string = str(big_string)
    for substring in substrings:
        if substring in big_string:
            return substring
    return np.nan

cabin_list = {
    'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'T': 6, 'G': 7, 'nan': 8
}
tdf['Deck'] = tdf['Cabin'].apply(substrings_in_string, args=(cabin_list.keys(),))
tdf['DeckNum'] = tdf['Deck'].apply(cabin_list.get)
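
A quick, optional sanity check that the mapping behaved as expected is to look at how passengers are spread across decks (including the 'nan' bucket for missing cabins):

# Optional check: distribution of passengers across decks ('nan' = no cabin recorded)
tdf['Deck'].value_counts(dropna=False)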

Now that we’ve dealt with the missing values, we can convert these string variables to numerical ones. We do this because most machine learning models can only work with numerical inputs, not strings.

We can easily deal with this by mapping each string to a number that represents it.

def mf(arg):
    return 0 if arg == 'male' else 1

def emb(arg):
    map_attr = {
        'C': 0,
        'Q': 1,
        'S': 2
    }
    return map_attr.get(arg)

tdf['SexNum'] = tdf['Sex'].apply(mf)
tdf['EmbarkedNum'] = tdf['Embarked'].apply(emb)
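
As a side note, another common way to handle categorical columns (not the approach used in this post) is one-hot encoding with pandas, which avoids implying an order between categories:

# Alternative approach (not used here): one-hot encode the categorical columns,
# creating one 0/1 column per category value
pd.get_dummies(tdf[['Sex', 'Embarked']], prefix=['Sex', 'Embarked']).head()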

Another feature we can try extracting is the title from each passenger’s name.

If you take a look at the names in the data, the passengers have various titles like “Mr”, “Capt”, and many more. Some passengers have interesting titles like “Countess” or “Don”, which indicates that they are nobility and might have had a higher chance of survival.

We can extract the titles as follows:

title_list = ['Mrs', 'Mr', 'Master', 'Miss', 'Major', 'Rev',
                    'Dr', 'Ms', 'Mlle','Col', 'Capt', 'Mme', 'Countess',
                    'Don', 'Jonkheer']

tdf['Title'] = tdf['Name'].apply(substrings_in_string, args=(title_list,))

def similar_title(title):
    if title in ('Miss', 'Mlle', 'Mme'):
        return 'Ms'
    elif title in ('Major', 'Col', 'Capt'):
        return 'Military'
    elif title in ('Countess', 'Don', 'Jonkheer'):
        return 'Noble'
    return title

tdf['Title'] = tdf['Title'].apply(similar_title)

title_num_list = {
    'Mrs': 0,
    'Mr': 1,
    'Master': 2,
    'Ms': 3,
    'Military': 4,
    'Noble': 5,
    'Dr': 6,
    'Rev': 7
}

tdf['TitleNum'] = tdf['Title'].apply(title_num_list.get)
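
An optional check of the resulting title counts confirms which groups we ended up with and how common each one is:

# Optional check: how many passengers fall under each grouped title
tdf['Title'].value_counts()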

Finally, we can also determine an individual’s family size, and whether they were travelling alone, from two variables: “SibSp” and “Parch”.

def alone_or_not(num):
    # FamilySize counts the passenger themselves, so a size of 1 means they were alone
    if (num > 1): return 0
    else: return 1

tdf['FamilySize'] = tdf['SibSp'] + tdf['Parch'] + 1
tdf['Alone'] = tdf['FamilySize'].apply(alone_or_not)
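
As a quick, optional check that these new features carry some signal, we can compare survival rates between passengers travelling alone and those with family:

# Optional check: average survival rate for passengers travelling alone vs with family
tdf.groupby('Alone')['Survived'].mean()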

Now that we have all the features we need and have (hopefully) extracted most of the useful information from the data, we can create our own training and testing sets as follows:

x_val = [
    'Pclass', 
    'SexNum', 
    'Age', 
    'FamilySize',
    'Alone',
    'TitleNum',
    'DeckNum',
    'EmbarkedNum', 
    'Fare']
y_val = 'Survived'

X = tdf[x_val]
Y = tdf[y_val]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size= 0.05, random_state=seed)

Modelling & Prediction

Now that we have the data ready, we can use a method called stacking to create an ensemble model. Stacking aggregates the predictions of two or more machine learning algorithms and uses a final estimator to (hopefully) make a better prediction than any single model.

We can easily create an ensemble model using stacking with the following code:

estimators = [
    ('svc', make_pipeline(StandardScaler(), SVC(gamma='auto'))),
    ('tree', tree.DecisionTreeClassifier()),
    ('rf', RandomForestClassifier(criterion = "gini", 
                                  min_samples_leaf = 1, 
                                  min_samples_split = 10, 
                                  n_estimators=50, random_state = seed)),
 ]

ensemble_clf = StackingClassifier(
     estimators = estimators, final_estimator = LogisticRegression()
)

ensemble_clf.fit(x_train, y_train)

y_predict = ensemble_clf.predict(x_test)

accuracy_score(y_test, y_predict) #0.84444444

Using the stacking method, we managed to score about 84% on our own hold-out split. But how well will it do on the actual test set?
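
Since our hold-out split is only 5% of the training data, that 84% figure is fairly noisy. A more robust local estimate (a sketch we didn’t use for the original submission) is cross-validation:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation gives a less noisy estimate than a tiny hold-out split
scores = cross_val_score(ensemble_clf, X, Y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())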

Just like with our training set, all the modifications we made to the data need to be applied to the competition’s test data as well. This can be done as follows:

test_df = pd.read_csv('/kaggle/input/titanic/test.csv')

# Replace missing values first, then encode
test_df['Age'].fillna(test_df['Age'].mean(), inplace = True)
test_df['Embarked'].fillna(test_df['Embarked'].mode()[0], inplace = True)
test_df['Fare'].fillna(test_df['Fare'].median(), inplace = True)

test_df['SexNum'] = test_df['Sex'].apply(mf)
test_df['EmbarkedNum'] = test_df['Embarked'].apply(emb)

# Creating new variables
test_df['Title'] = test_df['Name'].apply(substrings_in_string, args=(title_list,))
test_df['Title'] = test_df['Title'].apply(similar_title)
test_df['TitleNum'] = test_df['Title'].apply(title_num_list.get)
test_df['Deck'] = test_df['Cabin'].apply(substrings_in_string, args=(cabin_list.keys(),))
test_df['DeckNum'] = test_df['Deck'].apply(cabin_list.get)
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1
test_df['Alone'] = test_df['FamilySize'].apply(alone_or_not)

# Prepare test data for prediction
x_comp_indx = test_df['PassengerId']
x_comp_test = test_df[x_val]

# Prediction
y_comp_test = ensemble_clf.predict(x_comp_test)

data = {'PassengerId': x_comp_indx, 'Survived': y_comp_test}
final_result = pd.DataFrame(data)

# Create submission file
result = 'results.csv'
if (os.path.exists(result)):
    os.remove(result)
    
final_result.to_csv(result, header=True, index=False)
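
Before uploading, it doesn’t hurt to take a quick look at the output and confirm it has the two columns Kaggle expects, PassengerId and Survived:

# Optional check: the submission file needs exactly the PassengerId and Survived columns
print(final_result.shape)
final_result.head()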

After submitting our results, we managed to place 2,389th out of 21,871 teams (at the time of writing), which puts us within roughly the top 10% of teams, so pretty good I would say.

Titanic Kaggle Competition leaderboard

Summary

While we did achieve a decent position in the Kaggle Titanic competition, we most likely could have done better if we had analysed the data further and explored other machine learning algorithms, such as neural networks.
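
For example, GridSearchCV was imported but never used; a minimal sketch of tuning just the random forest component might look like the following (the parameter grid is an illustrative assumption, not the settings used for the actual submission):

# Illustrative sketch: tune the random forest before adding it to the stack.
# The grid values below are assumptions for demonstration, not tested settings.
param_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=seed),
    param_grid,
    cv=5,
    scoring='accuracy',
)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)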

The competition is good in the sense that it allows users to practice and compete in a safe environment. Despite the many fake results at the top of the leaderboard, it’s a casual competition, so there’s no drama.


