Titanic: Machine Learning from Disaster (The Kaggle Challenge)


The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

What exactly do we need to do in this challenge?

In this challenge, we need to analyze what sorts of people were likely to survive. In particular, we apply machine learning tools to predict which passengers survived the tragedy.

First time on Kaggle? This will surely help you: Beginner's Guide to Kaggle

You can find all the code as a Jupyter notebook here: code


The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic. Download both data sets.

Importing the data and Libraries
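
As a minimal sketch of this step, assuming pandas is installed and the two Kaggle files sit in the working directory, the data can be loaded like this. The helper names `load_titanic` and `missing_summary` are made up for illustration and are not taken from the original notebook.

```python
import pandas as pd


def load_titanic(train_source="train.csv", test_source="test.csv"):
    """Read the Kaggle train and test CSV files into two DataFrames."""
    train = pd.read_csv(train_source)
    test = pd.read_csv(test_source)
    return train, test


def missing_summary(df):
    """Count missing values per column, largest first, skipping complete columns."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

Calling `train, test = load_titanic()` and then `missing_summary(train)` gives a quick view of which columns will need attention during cleaning.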

What does this data set mean?

The training set includes each passenger's survival status (the ground truth from the Titanic tragedy). Together with other features such as gender, fare, and ticket class (pclass), it is used to build the machine learning model.

The test set is used to see how well the model performs on unseen data. It does not include the passengers' survival status; we are going to use our model to predict it.

Let’s go through the meaning of the features in the train and test datasets.

Variable Definition Key.

  • Survival
    • 0 = No
    • 1 = Yes
  • pclass (ticket class)
    • 1 = 1st
    • 2 = 2nd
    • 3 = 3rd
  • sex
  • age
  • sibsp (# of siblings / spouses aboard the Titanic)
  • parch (# of parents / children aboard the Titanic)
  • ticket
  • fare
  • cabin
  • embarked (port of embarkation)
    • C = Cherbourg
    • Q = Queenstown
    • S = Southampton
  • pclass: A proxy for socio-economic status (SES)

    This is important to remember and will come in handy for later analysis.

    • 1st = Upper
    • 2nd = Middle
    • 3rd = Lower

The raw data set is messy: it contains many missing values, outliers, and unnecessary features. Before using it to train an algorithm, we need to clean it.

Cleaning the data

To clean the data, we go through these steps:

  • Filling the missing values
  • Mapping the data
  • Removing the outliers
  • Creating the correlation plot
  • Finding the relationship between the features
  • Creating the new features

Filling the missing values using the KNN imputation technique:
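
The original notebook cell is not shown here, so the following is only a sketch of KNN imputation, assuming scikit-learn's `KNNImputer`. The helper name `knn_impute`, the choice of columns, and the sample rows are all made up for illustration. Each missing value is replaced by the mean of that column over the k most similar rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute(df, columns, n_neighbors=5):
    """Fill missing values in numeric columns with the mean of the k nearest rows."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    filled = df.copy()
    filled[columns] = imputer.fit_transform(df[columns])
    return filled


# Tiny made-up sample (not real Titanic rows), just to show the call.
sample = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1],
    "SibSp": [0, 1, 0, 0, 1],
    "Fare": [71.3, 7.9, 8.1, 13.0, 53.1],
    "Age": [38.0, np.nan, 26.0, 35.0, np.nan],
})
filled = knn_impute(sample, ["Pclass", "SibSp", "Fare", "Age"], n_neighbors=2)
print(filled["Age"])
```

Because the neighbors are found by distance, a column with a large range such as Fare can dominate the result; scaling the columns first is a common choice, though whether the original notebook did so is not stated.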

Converting features like Fare, Age, and Name to categorical features.
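
One possible way to do this conversion is sketched below, assuming pandas. The bin edges, the band labels, the new column names (`AgeBand`, `FareBand`, `Title`), and the sample rows are assumptions made for illustration; the original notebook may have used different ones.

```python
import pandas as pd


def add_categorical_features(df):
    """Return a copy with binned Age and Fare and a Title extracted from Name."""
    out = df.copy()
    # Fixed age bands; the cut points are illustrative assumptions.
    out["AgeBand"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 35, 60, 120],
        labels=["child", "teen", "young_adult", "adult", "senior"],
    )
    # Quartile-based fare bands, so each band holds roughly the same number of rows.
    out["FareBand"] = pd.qcut(out["Fare"], q=4, labels=["low", "mid", "high", "top"])
    # Names follow the pattern "Surname, Title. Given names": keep the title part.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    return out


# Made-up rows in the same format as the Name, Age, and Fare columns.
sample = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Low, Master. Tim"],
    "Age": [22.0, 38.0, 26.0, 8.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})
features = add_categorical_features(sample)
print(features[["Title", "AgeBand", "FareBand"]])
```

Grouping continuous values into bands trades some precision for features that are less sensitive to extreme values, and the title pulled from the name can hint at both age and social status.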

Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

Correlation:

We create a correlation matrix and a heatmap to show the relationships between the different features.
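
A rough sketch of how such a matrix and heatmap can be produced is shown below, assuming pandas and matplotlib. The function names and the sample rows are made up for illustration, so the numbers printed here are not the values reported in this post.

```python
import pandas as pd


def correlation_matrix(df):
    """Pearson correlation between all numeric columns."""
    return df.corr(numeric_only=True)


def plot_heatmap(corr):
    """Draw a correlation matrix as an annotated heatmap and return the figure."""
    # Imported here so the matrix step works without a plotting library.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    # Write each coefficient inside its cell.
    for i in range(corr.shape[0]):
        for j in range(corr.shape[1]):
            ax.text(j, i, f"{corr.iat[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    return fig


# Made-up rows, only to show the calls.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Fare": [7.3, 71.3, 13.0, 8.1, 53.1, 7.9],
})
corr = correlation_matrix(sample)
print(corr.round(2))
```

Passing `corr` to `plot_heatmap` draws the figure; values close to 1 or -1 mark the strongest relationships.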

Positive Correlation Features:

  • Fare and Survived: 0.26.

There is a positive correlation between Fare and Survived. One explanation is that passengers who paid more for their tickets were more likely to survive.

Negative Correlation Features:

  • Fare and Pclass: -0.55
    • First-class passengers (1) paid more for their fare than second-class passengers (2), and second-class passengers paid more than third-class passengers (3).
  • Gender and Survived: -0.54
    • Gender records whether the passenger was male or female. The sign of this value depends on how the two categories were mapped to numbers.
  • Pclass and Survived: -0.34

After cleaning and transforming the features, it’s time to train the model.

Modeling the Data

I will train the following models on the data:

  • Logistic Regression
  • Gaussian Naive Bayes
  • Support Vector Machines
  • Decision Tree Classifier
  • K-Nearest Neighbors (KNN)
  • and many others

Classifier Comparison

By comparing the classifiers, we choose which model is best for the given data.

Accuracy plot

We plot the accuracy of the different algorithms to see which one performs best on the given data, and then choose that algorithm to make predictions on the test data.
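
The comparison can be sketched as follows, assuming scikit-learn. The feature matrix here is synthetic (generated with `make_classification`) and only stands in for the cleaned Titanic features, so the scores it prints say nothing about the real data. The helper name `compare_classifiers`, the use of 5-fold cross-validation, and the scaling step are assumptions, not necessarily what the original notebook did.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, cv=5):
    """Return the mean cross-validated accuracy per classifier, best first."""
    models = {
        "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "Gaussian Naive Bayes": GaussianNB(),
        "SVC": make_pipeline(StandardScaler(), SVC()),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    }
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }
    return pd.Series(scores, name="accuracy").sort_values(ascending=False)


# Synthetic stand-in for the cleaned features (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
accuracy = compare_classifiers(X, y)
print(accuracy)
```

With matplotlib installed, `accuracy.plot(kind="barh")` turns the result into a bar plot. Cross-validation is used here instead of a single train/test split so that one lucky split does not decide the ranking.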

From the above bar plot, we can clearly see that the SVC classifier has the best accuracy of all the classifiers.

Let’s apply this to our test data.

Prediction on test data:

Let’s use the SVC classifier to make predictions on the test data.
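
A minimal sketch of this last step is shown below, assuming scikit-learn and features that are already cleaned and numeric. The helper name `make_submission`, the default SVC settings, the scaling step, the `Sex` encoding, and the sample rows are all made up for illustration. The output has the two columns Kaggle expects for this competition, PassengerId and Survived.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_submission(train, test, feature_columns):
    """Fit an SVC on the training rows and predict Survived for the test rows."""
    model = make_pipeline(StandardScaler(), SVC())
    model.fit(train[feature_columns], train["Survived"])
    return pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": model.predict(test[feature_columns]).astype(int),
    })


# Made-up, already-cleaned rows (Sex uses a hypothetical encoding: male=1, female=0).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": [1, 0, 0, 0, 1, 1],
    "Fare": [7.3, 71.3, 7.9, 53.1, 8.1, 13.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})
test = pd.DataFrame({
    "PassengerId": [7, 8],
    "Pclass": [3, 1],
    "Sex": [1, 0],
    "Fare": [7.8, 60.0],
})
submission = make_submission(train, test, ["Pclass", "Sex", "Fare"])
print(submission)
# To write the file for upload (the file name is an assumption):
# submission.to_csv("submission.csv", index=False)
```

The test rows must go through exactly the same cleaning and feature steps as the training rows, otherwise the model sees columns it was not trained on.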

We have now created predictions for the Titanic test data. You can submit them to the Kaggle competition, and you will probably get a decent rank.


About the author

Vikram singh

Founder of Ai Venture,
An artificial intelligence specialist who teaches developers how to get results with modern AI methods via hands-on tutorials.
GANs are my favorite.
