Titanic data analysis and prediction

1. Data Cleaning

We import the packages we will use for the data analysis part

Let's open the dataset. For the analysis part, we will work only on the training data, which contains values for the "Survived" variable.

We can start by creating a missing data matrix to see what kind of data we are missing
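As a rough sketch of this step, the missing values can be summarised per column as a count and a share. The small `train` frame is a toy stand-in with illustrative values, not the real Kaggle file, and a tabular summary is used here in place of a graphical missing-data matrix.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are illustrative only).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", None],
})

# Number and share of missing values per column, largest first.
missing_count = train.isna().sum()
missing_share = train.isna().mean()
missing = pd.DataFrame({"count": missing_count, "share": missing_share})
missing = missing.sort_values("count", ascending=False)
print(missing)
```

On the real data, the same summary should make the gaps in Age and Cabin stand out immediately.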

We will start by cleaning the data and creating variables that may be useful for the analysis and the prediction later. Let's work on both the training and test data so we don't have to go through the process twice. For the analysis, we will work only on the training data, as it contains values for the "Survived" variable.

We can go through the variables to see what we can use, drop, or create for the prediction and analysis parts. Our intuition may be wrong, which is why we need the analysis part to confirm or reject these assumptions, or to discover new relations.

PassengerId: probably brings no information, as the Id is most likely assigned randomly or in order of registration. The registration order could carry some information, such as socio-economic status, if tickets were sold to people with higher status first. In our case, we have other variables that give the same information, and we don't know if this assumption holds, so we will most likely not use the Id in our model.

Survived: The target variable that we want to predict.

Pclass: Categorical feature that we will most likely use. It is very likely correlated with Fare, but we will see that the latter will most likely be transformed. Some passengers paid a higher fare than others while being in a lower class. That is because the fare is the price paid for one purchase, meaning that 5 people in 2nd class may pay more in total than 2 people in 1st class, but that doesn't mean they are richer or that their ticket is worth more. That is why we will transform the Fare variable.

Name: This variable is useless as it is, but extracting the titles from people's names and turning them into categories may be helpful. Titles are a sign of wealth and status and will probably correlate with Pclass and Fare (transformed).

Sex: An obvious candidate given the "women and children first" policy.

Age: Same reason as above. We also have a lot of missing values and some outliers. We could just drop those rows, but they account for nearly 1/5 of our dataset. Instead, we will use online encyclopedic sources, where researchers have been able to find the ages of some of those passengers, and we will most likely retrieve those ages by web scraping.

SibSp & Parch: We can use these variables to create a family size variable. We can assume that trying to save a family may lead to lower chances of surviving. We could also use Ticket to capture groups that are not blood-related. We will see later which variable leads to better results.

Ticket: Can be used to create groups, and also to split the Fare into the real value per person.

Fare: The fare once split will show the price per passenger. So we will see what is paid by each passenger and the real value of their ticket.

Cabin: The cabin location can be the reason someone lives or dies. After some research, we learned that the cabin number gives information about the deck. Most of the passengers in 2nd and 3rd class don't have a cabin number.

Embarked: Should not have any effect.

Let's start with the Name variable and extract the title from it. The name itself is not interesting, but the title may give information about someone's social standing. There is no rule saying that important people must be saved first, but who knows? Maybe their prestige played a role in their survival, or the variable is related to another factor that we are missing.
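A minimal sketch of the title extraction, assuming the names follow the "Surname, Title. Given names" layout used in the Kaggle file. The regular expression is one possible choice, not necessarily the one used in the original notebook.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# The title sits between the comma after the surname and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Rare titles could then be merged into a handful of broader categories before being used as a feature.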

Next in line is the Age variable, where we have a lot of missing data. One way to solve this would be to replace missing values with the mean or median, but missing values account for 20% of our data, and ages range from 0.17 to 80 with a fairly spread-out distribution, so a single value would be a poor estimate. Another option would be to use the mean age for each title, because titles separate people by age.

In our case, we will use web scraping to get the values we are missing from https://en.wikipedia.org/wiki/Passengers_of_the_Titanic . We could use www.encyclopedia-titanica.org, but the way the information is displayed on Wikipedia makes it easier to scrape. The only downside is that we do not have the ticket number to confirm the identity of a passenger, but given how well studied this dataset is, it is unlikely that the wiki contains wrong data.

Let's import what we need for the web scraping.

Let's fill our dataframe with the data we found on the web. Fortunately, the names on the wiki page seem to follow the same template.
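The filling step can be sketched as a lookup by normalised name. Everything here is hypothetical: the passenger names and ages are invented, `scraped_ages` stands in for whatever the scraper returns, and `normalize` is only one simple way to make the two spellings comparable.

```python
import numpy as np
import pandas as pd

# Invented passengers; two of them have a missing age.
df = pd.DataFrame({
    "Name": ["Doe, Mr.  John", "Roe, Miss. Jane", "Poe, Mr. Adam"],
    "Age": [np.nan, np.nan, 41.0],
})

# Hypothetical scraper output: normalised name -> age found on the web.
scraped_ages = {"doe, mr. john": 22.0}

def normalize(name: str) -> str:
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

# Look up each name; names without a match stay NaN.
looked_up = df["Name"].map(normalize).map(scraped_ages)

# Only fill ages that are missing, never overwrite known ones.
df["Age"] = df["Age"].fillna(looked_up)
print(df)
```

Names that do not match exactly are simply left missing, which is why a second filling method is still needed afterwards.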

We are able to retrieve the ages of 148 people out of 263. We could retrieve more by changing the way the names are written in our dataframe, but this is good enough. We will use the mean by title to fill the rest.

In retrospect, using the encyclopedia would have been better, as we could match on the ticket number and avoid having to match on the name. Maybe next time I will try this method.

We reduced our missing Ages to 8.7% using real values, which is not bad. Let's fill the rest with the mean per title method.
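The mean-per-title fill can be sketched with a grouped transform. The column names `Title` and `Age` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Mean age of each title, broadcast back to every row of that title.
title_mean_age = df.groupby("Title")["Age"].transform("mean")

# Keep known ages and fill only the missing ones.
df["Age"] = df["Age"].fillna(title_mean_age)
print(df)
```

A title whose ages are all missing would remain unfilled with this approach and would need a fallback such as the overall mean.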

Now let's take care of our SibSp & Parch variables. We can use them to create the "family-size" variable. As we said earlier, having family on board may lead people to risk their lives to save them, increasing their chances of dying, or the opposite, as they may get help.
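One way to build this variable is to add the two counts. This sketch assumes the family size counts relatives only, so that 0 means travelling alone; the column names `Family` and `IsAlone` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 0],
})

# Number of relatives on board (the passenger is not counted).
df["Family"] = df["SibSp"] + df["Parch"]

# Simple flag for passengers without any relative on board.
df["IsAlone"] = df["Family"] == 0
print(df)
```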

We will also try another version based on the Ticket variable, as people may travel in the same group without being blood-related, and the will to save each other may be the same.
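The ticket-based version can be sketched by counting how many passengers share each ticket number. The ticket strings are invented, and the convention that 0 means "no companion" is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2", "C3", "C3"],
})

# Passengers per ticket, broadcast to each row, minus the passenger themselves.
df["Group"] = df.groupby("Ticket")["Ticket"].transform("count") - 1
print(df)
```

Computing this on the combined training and test data matters, since companions on the same ticket can be split across the two files.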

We will create the "FareCap" column in order to have the real value of a ticket. The Fare variable turns out to be more complicated than that: prices vary depending on where the ticket was bought, and there were discounts for children, but we do not know the value of the discount. Some passengers have a fare of 0, but it seems that those are not mistakes but people who got free tickets. The mean fare of their Pclass can be used to estimate the value of their Fare, because if we keep a Fare of 0, the model may underestimate the "value" of those people.
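A possible sketch of the zero-fare replacement: zeros are treated as unknown and filled with the mean paid fare of the same class. The toy values and the column name `FareFilled` are assumptions, and excluding the zeros from the class mean is a design choice made here, not something stated above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3],
    "Fare": [80.0, 0.0, 100.0, 8.0, 0.0],
})

# Treat free tickets as "value unknown" so they do not drag the mean down.
paid = df["Fare"].replace(0, np.nan)

# Mean paid fare within each class, aligned with the original rows.
class_mean = paid.groupby(df["Pclass"]).transform("mean")

df["FareFilled"] = paid.fillna(class_mean)
print(df)
```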

Let's work on our "Cabin" variable. We will split it into decks to capture the effect of being located lower or higher in the Titanic.
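The deck can be read from the first letter of the cabin number. In this sketch the letter-to-number mapping is an assumption; only the use of 8 for unknown cabins follows the convention used later in the analysis.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", np.nan, "E46", "B57 B59 B63", "G6"])

# Assumed encoding: decks ordered from top (A) to bottom (G), plus T.
deck_codes = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6, "T": 7}
UNKNOWN_DECK = 8  # bucket for passengers without a cabin number

# First character of the cabin string; missing cabins stay NaN.
deck_letter = cabins.str[0]

deck = deck_letter.map(deck_codes).fillna(UNKNOWN_DECK).astype(int)
print(deck.tolist())
```

When several cabins are listed for one passenger, this takes the deck of the first one.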

Fare per capita will be a variable where we divide the Fare by the number of passengers sharing the same ticket, so that we get the real price per person.
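The split can be sketched as a division by the number of rows sharing a ticket. The ticket strings and fares are invented, and `FarePerCapita` is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({
    "Ticket": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "Fare": [30.0, 30.0, 30.0, 7.5, 20.0, 20.0],
})

# How many passengers were booked on each ticket.
ticket_count = df.groupby("Ticket")["Ticket"].transform("count")

# Price per person: the recorded fare is the total for the whole ticket.
df["FarePerCapita"] = df["Fare"] / ticket_count
print(df)
```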

We can now split our data and analyse the part where we have our "Survived" variable. During or after the analysis we may need to create new variables, so these datasets are not the final ones.
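Assuming the training and test rows were stacked into a single frame, with "Survived" missing for the test rows, the split can be sketched as follows. The frame name `combined` and its values are illustrative.

```python
import numpy as np
import pandas as pd

combined = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0.0, 1.0, np.nan, np.nan],
    "Pclass": [3, 1, 2, 3],
})

# Rows with a known outcome form the training set.
train = combined[combined["Survived"].notna()].copy()
train["Survived"] = train["Survived"].astype(int)

# Rows without an outcome form the test set; the empty column is dropped.
test = combined[combined["Survived"].isna()].drop(columns="Survived")
print(len(train), len(test))
```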

2. Data Analysis

Gender

We can clearly see the difference in survival rates between women and men

Women in 1st class have the highest survival rate while men in 3rd class have the lowest. That being a woman makes you more likely to survive was fairly obvious, but having better chances to survive because you have more money is less logical. The effect may be indirect and go through another variable, such as the deck variable. We don't really need to know the exact reason to predict later, but we can look at a correlation matrix to see how variables interact with each other.
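A table like the one discussed above can be obtained by averaging "Survived" within each Sex and Pclass combination. The rows below are invented toy data, so the resulting rates do not reflect the real dataset.

```python
import pandas as pd

# Invented toy rows, only meant to show the mechanics.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate of the group.
rate = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(rate)
```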

Looking at this dataframe, it may seem that the survival rate goes down as people get older, but we will split by sex and other variables to see if that is always the case.
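Splitting the age effect by sex can be sketched by binning Age and grouping on both variables. The bin edges, the labels and the toy rows are all assumptions chosen for illustration.

```python
import pandas as pd

# Invented toy rows, only meant to show the mechanics.
df = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 4,
    "Age": [10, 20, 50, 60, 5, 25, 40, 70],
    "Survived": [1, 0, 1, 1, 1, 0, 0, 0],
})

# Five equal-width age bands covering the 0 to 80 range.
bins = [0, 16, 32, 48, 64, 80]
labels = ["0-16", "16-32", "32-48", "48-64", "64-80"]
df["AgeBand"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Survival rate per sex and age band, keeping only bands that occur.
rate = df.groupby(["Sex", "AgeBand"], observed=True)["Survived"].mean()
print(rate)
```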

Here we see that our previous insight was not completely true: the survival rate goes down with age only for men, but because they account for roughly 65% of the dataset, they drive the overall trend. If we look at the graph on the right, we see that for women, being older correlates with higher chances of survival. But does it make sense? Why would an older woman have better chances to survive? Younger women are more likely to be in shape and physically stronger.

But we also saw that survival rate is linked to Pclass. Let's see if there is a relation between age and Pclass. I would guess that older women are generally in higher classes and that this gives them better chances to survive (maybe being in 1st class gives you the best location; we will check that with our deck variable).

We see that people in more expensive classes are on average older, and the share of people in 1st class goes up with age. So the reason the survival rate goes up for older women is linked to the fact that they are mainly in 1st class. The same should hold for men, but it looks like the effect of the "women and children first" policy outweighs the effect of class. This can make sense: the advantage of the best classes is linked to the deck your cabin is on, and being in the best location only gives an advantage at the start of the incident (when people go from their cabin to the lifeboats). After that, and thus for men, physical ability matters more.

To see if that hypothesis makes sense, let's look at the effect of being on different decks on survival rate per gender. We expect that for women, being on higher decks leads to higher survival rates, and that for men the effect should also be positive but smaller.

Unfortunately we don't have enough data about the distribution across decks, and most of our data falls in deck 8, which is where we put all of our unknown values. But according to the website below, we know that upper decks were allocated to 1st class and lower decks to the other classes. And unless there was a "rich people first" policy, the only reason to have better chances of survival in 1st class has to be the cabin location.

https://www.encyclopedia-titanica.org/cabins.html

Let's look at the effect of group size on survival rate. We will also check whether the variable explains survival rate better if we split it into 'has a group' vs 'has no group'. Again we will separate by gender, or else men will pull down the survival rate of any category where they are the majority. For example, most of the people with Group = 0 (travelling alone) are men. We might think that travelling alone is a huge disadvantage, but the low survival rate may simply be because people travelling alone are mostly men, who are subject to the "women and children first" policy. Alternatively, men might only be at a disadvantage because they mostly travel alone, but this hypothesis doesn't hold, as the survival rate of men who are not alone is far below that of women in the same situation.

Being alone is certainly not the best-case scenario, and it seems that groups bigger than 3 lead to lower survival rates. We also know from our correlation matrix that there is a weak negative relationship between group size and age, and between group size and sex. It means that bigger groups are more often made up of women and younger people.

People who embarked at Cherbourg seem to have better chances of survival. If we look at our correlation matrix, we see that embarking at Cherbourg is correlated with being in the upper classes, and embarking at Queenstown with the opposite. We can also look at the average fare and see that people from Cherbourg are much richer. So the port of embarkation most likely affects survival rate through its correlation with the Pclass variable.

We could look at many more variables and relationships, but in my opinion we have enough knowledge to build a predictive model, and this dataset is too well known for me to find insights that have never been seen before.

3. Prediction

We will try multiple models and use GridSearch to try multiple parameter combinations
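As a minimal sketch of the grid search, here is scikit-learn's GridSearchCV wrapped around a decision tree. The data is synthetic and the parameter grid is an arbitrary example; neither comes from the original notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and labels standing in for the prepared Titanic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Every combination of these values is evaluated with 5-fold cross-validation.
param_grid = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other models by swapping the estimator and the grid, and `search.best_estimator_` gives the model refitted on all the data with the best parameters.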

Decision tree

It looks like those 5 variables are the best to predict our outcome, which makes sense after the analysis. The surprising part is that the family variable performs better than the group variable. The only difference between the two is that family only takes relatives into account, while group looks at people who bought their ticket together.

Linear regression

Neural Network

Random Forest

SVC

Let's predict with our best model

We create our best model with the best parameters

Here we add the values we predicted for the survived variable

We export it in a CSV to upload the results on Kaggle and get a score
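The export can be sketched as below. The predictions are invented, and the sketch writes to an in-memory buffer so that it runs anywhere; in practice a file path such as "submission.csv" (a hypothetical name) would be passed to `to_csv` instead.

```python
import io

import pandas as pd

# Invented predictions for three test passengers.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False keeps the row index out of the file, leaving only the two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```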

I got a score of 0.79425 (ranked 850th, top 1.69%) with this model on Kaggle. I could try more models and change some of the data, but this is pretty good. The goal of this exercise is to train myself, not to get the best score. In a real scenario you don't get to resubmit your work multiple times, so being able to get a good model after a good analysis is key.

Thinking back on the way I did it, the web scraping of ages may be considered cheating, as I used external data, and people who get a score of 1 find the answers on the web and create a perfect submission file. Nonetheless, I wanted to add a web scraping part to train myself, as it can be really useful in some cases to gather data.