I. Case Background

In 1912, the Titanic struck an iceberg and sank on her maiden voyage, and most of the passengers and crew lost their lives. Titanic survival prediction is the classic Kaggle starter project: given part of the passenger manifest, the goal is to work out which features best predict whether a person survived.

The Titanic was about 269.06 m (882.75 ft) long and 28.19 m (92.5 ft) wide. Her passenger decks were labeled with letters starting from A, and accommodation was divided into first, second and third class. She sailed from Southampton, England for New York, calling at Cherbourg-Octeville in France and Queenstown in Ireland. After the collision with the iceberg, the lifeboats were loaded roughly in order from first class down to third class, with women and children given priority.

The information above comes from background reading and should help in making sense of the features in the dataset.

From this background we can form a few expectations that are likely to matter for survival prediction:
1. Women and children should have a higher survival rate.
2. In Pclass, 1 stands for first class, 2 for second class and 3 for third class.
3. Deck A housed first class; decks B and C may correspond to second class, and decks D and E to third class.
4. In Embarked, C stands for Cherbourg-Octeville (France), Q for Queenstown (Ireland) and S for Southampton (England).


II. Data Loading

#numpy,pandas
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

#plot
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# load the data; concatenate train and test for joint preprocessing
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
full_df=pd.concat([train_df,test_df],ignore_index=True)
full_df.head()

[Output: first five rows of full_df]

Survived: survival (0 = no, 1 = yes)
Pclass: ticket class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation


III. Data Overview

Start with an overall look at the dataset.

# check the shape of each DataFrame
train_df.shape    #(891, 12)

test_df.shape     #(418, 11)

full_df.shape     #(1309, 12)

After merging the training and test sets there are 1309 observations and 12 variables.

# number of non-null observations and dtype of each variable
full_df.info()

[Output: full_df.info()]
The dataset has 3 float variables and 4 integer variables, 7 numeric variables in total, plus 5 string variables.

# check for outliers and extreme values (look at the minimum and maximum)
full_df.describe().T

[Output: full_df.describe().T]

# count missing values per variable
full_df.isnull().sum()

[Output: full_df.isnull().sum()]
The variables with missing values are Age, Cabin, Embarked and Fare; Cabin is missing for most passengers.
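To quantify how much is missing, a quick sketch of checking the missing-value ratios:

# share of missing values per column, from most to least missing
missing_ratio = full_df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head())   # Cabin is roughly 0.77 missing, Age roughly 0.20; Embarked and Fare are tiny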


IV. Feature Engineering

Since the goal is survival prediction, the analysis below mainly visualizes Survived against each of the other variables, which gives a rough sense of whether they are related.

1. Pclass

fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,4))

train_df.Pclass.value_counts().plot(kind="bar",color='steelblue',ax=axis1)
axis1.set_xticklabels([3,1,2],rotation=0)
axis1.set_ylabel(u'人数',fontproperties='SimHei')
axis1.set_xlabel('Pclass')

pd.crosstab(train_df.Pclass,train_df.Survived).plot.bar(stacked=True,color=['#FA2479','steelblue'],ax=axis2)
axis2.set_xticklabels([1,2,3], rotation=0)
axis2.set_ylabel(u'人数',fontproperties='SimHei')


pd.pivot_table(train_df,index=['Pclass'],values=['Survived']).plot.bar(ax=axis3,color='steelblue')
axis3.set_xticklabels([1,2,3], rotation=0)

axis1.set_title(u"乘客等级分布",fontproperties='SimHei')
axis2.set_title(u"各乘客等级的获救情况",fontproperties='SimHei')
axis3.set_title(u"生存概率",fontproperties='SimHei')

[Figure: passenger counts, survival counts and survival rate by Pclass]
Survival clearly differs across ticket classes: the higher the class, the higher the survival rate.


2. Name

The names in this dataset follow a pattern: the title sits in the middle of the name and is separated from the rest by punctuation ("Last, Title. First").
In real life a title says something about a person's occupation and age, so the title is extracted from each name.

# split each name on the separators and extract the title
full_df['Title']=full_df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())  
full_df.Title.value_counts()

Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Major             2
Ms                2
Mlle              2
Capt              1
the Countess      1
Jonkheer          1
Sir               1
Don               1
Lady              1
Mme               1
Dona              1
Name: Title, dtype: int64

The extracted titles mix English, French and Spanish forms, so they are grouped together,
leaving 5 categories.

full_df['Title'] = full_df['Title'].replace(['Capt', 'Don', 'Major', 'Col', 'Sir','Rev'], 'Mr')
full_df['Title'] = full_df['Title'].replace(['Ms','Mlle'], 'Miss')
full_df['Title'] = full_df['Title'].replace('Jonkheer', 'Master')
full_df['Title'] = full_df['Title'].replace(['Mme','Dona', 'Lady', 'the Countess'], 'Mrs')
full_df.Title.value_counts()  

Mr        774
Miss      264
Mrs       201
Master     62
Dr          8
Name: Title, dtype: int64
#plot
full_df.boxplot(column='Age', by='Title')
plt.title(u"各称谓的年龄分布情况",fontproperties='SimHei')
plt.ylabel(u"年龄",fontproperties='SimHei') 

[Figure: age distribution by title (boxplot)]
The boxplot shows that Dr skews older, Miss tends to be younger than Mrs, and Master corresponds to young boys; Mr absorbs almost all adult males, so little can be read from that group.

full_df[:891].groupby('Title')['Survived'].mean().plot(kind='bar',figsize=(6,3),color='#FA2479')
plt.xticks(rotation=0)

[Figure: survival rate by title]
As expected, women and the younger titles have a noticeably higher survival rate.

# encode the titles as integers
title_map = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5}
full_df['Title'] = full_df['Title'].map(title_map)

# confirm all titles were mapped
full_df['Title'].value_counts()  

1    774
2    264
3    201
4     62
5      8
Name: Title, dtype: int64

3. Sex

fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,4))

train_df.Sex.value_counts().plot(kind="bar",color='steelblue',ax=axis1)
axis1.set_ylabel(u'人数',fontproperties='SimHei')
axis1.set_xlabel('Sex')


pd.crosstab(train_df.Sex,train_df.Survived).plot.bar(stacked=True,color=['#FA2479','steelblue'],ax=axis2)
axis2.set_ylabel(u'人数',fontproperties='SimHei')


pd.pivot_table(train_df,index=['Sex'],values=['Survived']).plot.bar(ax=axis3,color='steelblue')

axis1.set_title(u"男女人数分布",fontproperties='SimHei')
axis2.set_title(u"男女的获救情况",fontproperties='SimHei')
axis3.set_title(u"生存概率",fontproperties='SimHei')

[Figure: counts, survival counts and survival rate by sex]
Women have a much higher survival rate, consistent with the earlier expectation.

# encode sex as an integer
full_df['Sex'] = full_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

4. Age

Age has a fair number of missing values, so it has to be imputed; here random filling is used (random integers within one standard deviation of the mean).

fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')

average_age_full   = full_df["Age"].mean()
std_age_full       = full_df["Age"].std()
count_nan_age_full = full_df["Age"].isnull().sum()

# random imputation: draw integers from [mean - std, mean + std]
rand = np.random.randint(average_age_full - std_age_full, average_age_full + std_age_full, size = count_nan_age_full)

full_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)

full_df.loc[full_df["Age"].isnull(), "Age"] = rand

full_df['Age'] = full_df['Age'].astype(int)

full_df['Age'].hist(bins=70, ax=axis2)

[Figure: Age histograms before and after imputation]
These two histograms show the age distribution before and after imputation; the shapes are very similar.

full_df.Age[full_df.Pclass == 1].plot(kind='kde')   
full_df.Age[full_df.Pclass == 2].plot(kind='kde')
full_df.Age[full_df.Pclass == 3].plot(kind='kde')
plt.xlabel(u"年龄",fontproperties='SimHei')
plt.ylabel(u"密度",fontproperties='SimHei') 
plt.title(u"各等级的乘客年龄分布",fontproperties='SimHei')
plt.legend(('Level1', 'Level2','Level3'),loc='best')

[Figures: age density by passenger class, before and after imputation]
These are the age density curves by passenger class; again the plots before and after imputation differ little.
They also show that the higher the class, the older the passengers tend to be.

# discretize age into bands
full_df.loc[ full_df['Age'] <= 14, 'Age'] = 0
full_df.loc[(full_df['Age'] > 14) & (full_df['Age'] <= 28), 'Age'] = 1
full_df.loc[(full_df['Age'] > 28) & (full_df['Age'] <= 42), 'Age'] = 2
full_df.loc[(full_df['Age'] > 42) & (full_df['Age'] <= 56), 'Age'] = 3
full_df.loc[ full_df['Age'] > 56, 'Age'] = 4
full_df[:891][['Age','Survived']].groupby(['Age']).mean().plot.bar(figsize=(8,5))

[Figure: survival rate by age band]
Children (the youngest band) have a clearly higher survival rate; the other bands are fairly similar.
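The thresholds 14/28/42/56 above are simply equal-width 14-year bands; as a sketch, pd.cut can build equivalent bands automatically, assuming it is applied to the raw ages (age_raw below is rebuilt from the untouched train_df/test_df, since full_df['Age'] has already been replaced in place):

# equal-width age bands built with pd.cut, applied to the raw (pre-discretization) ages
age_raw = pd.concat([train_df, test_df], ignore_index=True)['Age']
age_band = pd.cut(age_raw, bins=[0, 14, 28, 42, 56, np.inf], labels=[0, 1, 2, 3, 4], include_lowest=True)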

5. SibSp and Parch

Both of these variables count family members on board, so they are combined into a single variable, Family.

# combine into one family-size variable
full_df['Family'] = full_df['SibSp'] + full_df["Parch"]
full_df['Family'].value_counts()

0     790
1     235
2     159
3      43
5      25
4      22
6      16
10     11
7       8
Name: Family, dtype: int64

# binarize: traveling alone vs. with family
full_df.loc[full_df['Family'] == 0, 'Family'] = 0
full_df.loc[full_df['Family'] > 0, 'Family'] = 1
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

sns.countplot(x='Family', data=full_df[:891], ax=axis1,color='blue')

sns.countplot(x='Family', hue="Survived", data=full_df[:891], ax=axis2)

embark_perc = full_df[:891][["Family","Survived"]].groupby(['Family'],as_index=False).mean()
sns.barplot(x='Family', y='Survived', data=embark_perc,ax=axis3,color='blue')

#axis1.set_xticklabels(["Alone","With Family"], rotation=0)
#axis2.set_xticklabels(["Alone","With Family"], rotation=0)
#axis3.set_xticklabels(["Alone","With Family"], rotation=0)

[Figure: counts and survival rate, alone vs. with family]
Passengers traveling with family have a higher survival rate than those traveling alone.

6. Fare

From the earlier overview, Fare has one missing value.

# fill the missing fare with the mode
mod=full_df.Fare.mode()
mod
0    8.05
dtype: float64

full_df["Fare"] = full_df["Fare"].fillna(8.05)
# discretize fare into bands
full_df.loc[ full_df['Fare'] <= 7.854, 'Fare'] = 0
full_df.loc[(full_df['Fare'] > 7.854) & (full_df['Fare'] <= 10.5), 'Fare'] = 1
full_df.loc[(full_df['Fare'] > 10.5) & (full_df['Fare'] <= 21.558), 'Fare'] = 2
full_df.loc[(full_df['Fare'] > 21.558) & (full_df['Fare'] <= 41.579), 'Fare'] = 3
full_df.loc[ full_df['Fare'] > 41.579, 'Fare'] = 4
full_df[['Fare','Survived']].groupby(['Fare']).mean().plot.bar(figsize=(6,4),color='steelblue')

[Figure: survival rate by fare band]
The more expensive the ticket, the higher the survival rate. Fare mostly reflects cabin class: a higher fare means a higher class.
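The cut points above (7.854, 10.5, 21.558, 41.579) look like quantile boundaries; as a sketch of how similar bins could be derived (an assumption about where the numbers came from), pd.qcut splits the raw fares into five equal-frequency bins:

# five equal-frequency fare bins on the raw (unmodified) training fares
fare_bins = pd.qcut(train_df['Fare'], 5)
print(fare_bins.value_counts().sort_index())   # the interval edges come out close to the thresholds used above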

7. Cabin

As seen in the data overview, more than half of this variable is missing, which makes it hard to impute. Normally, when the missing data cannot be collected, such a variable would simply be dropped.
Here it is instead reduced to a binary feature, cabin number recorded vs. not, to check whether that is related to Survived.

full_df.loc[full_df.Cabin.notnull(),'Cabin']=1
full_df.loc[full_df.Cabin.isnull(),'Cabin']=0
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,4))

full_df[:891].Cabin.value_counts().plot(kind="bar",color='steelblue',ax=axis1)
axis1.set_ylabel(u'人数',fontproperties='SimHei')
axis1.set_xlabel('Cabin')

pd.crosstab(full_df[:891].Cabin,full_df[:891].Survived).plot.bar(stacked=True,color=['#FA2479','steelblue'],ax=axis2)
axis2.set_ylabel(u'人数',fontproperties='SimHei')

pd.pivot_table(full_df[:891],index=['Cabin'],values=['Survived']).plot.bar(ax=axis3,color='steelblue')

axis1.set_title(u"有无Cabin的分布",fontproperties='SimHei')
axis2.set_title(u"有无Cabin的获救情况",fontproperties='SimHei')
axis3.set_title(u"生存概率",fontproperties='SimHei')

axis1.set_xticklabels(["no Cabin","With Cabin"], rotation=0)
axis2.set_xticklabels(["no Cabin","With Cabin"], rotation=0)
axis3.set_xticklabels(["no Cabin","With Cabin"], rotation=0)

[Figure: counts and survival rate, with vs. without a recorded cabin]
Passengers with a recorded cabin number have a noticeably higher survival rate.
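The background section guessed at a mapping between decks and classes; as a quick sketch (cabin_raw below is rebuilt from the untouched train_df/test_df, since full_df['Cabin'] has already been overwritten with 0/1 above), the deck letter can be read off the raw Cabin values and compared against Pclass:

# first character of the raw Cabin value is the deck letter; NaN stays NaN
cabin_raw = pd.concat([train_df, test_df], ignore_index=True)['Cabin']
deck = cabin_raw.str[0]
print(pd.crosstab(deck, full_df['Pclass']))   # compare the observed decks against the deck/class guess in the background section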

8. Embarked

full_df.Embarked[full_df.Embarked.isnull()]

61     NaN
829    NaN
mod=full_df.Embarked.mode()
mod
0    S
dtype: object

full_df['Embarked'].fillna('S',inplace=True)
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

sns.countplot(x='Embarked', data=train_df,order=['S','C','Q'], ax=axis1)

sns.countplot(x='Embarked', hue="Survived", data=train_df, order=['S','C','Q'], ax=axis2)

embark_perc = train_df[["Embarked","Survived"]].groupby(['Embarked'],as_index=False).mean()
sns.barplot(x='Embarked', y='Survived', data=embark_perc,order=['S','C','Q'],ax=axis3)

[Figure: counts and survival rate by port of embarkation]
Passengers who boarded at the intermediate ports, especially Cherbourg (C), have a higher survival rate.
(I don't really know why the port should be related to survival; perhaps those who boarded mid-route were mostly the wealthier passengers!)

# encode the port as an integer
full_df['Embarked'] = full_df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

9. Variable Correlations

colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
vari=full_df[['Cabin','Embarked','Sex','Title','Pclass','Age','Family','Fare','Survived']]

sns.heatmap(vari.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

[Figure: Pearson correlation heatmap of the selected features]
(This heatmap looks a little odd: the pairs I expected to be related all show strong negative correlation. That is most likely an artifact of how the variables were encoded as integers, so only the magnitude is considered here, not the sign.)
The figure shows that Pclass and Cabin, Pclass and Fare, and Title and Sex are highly correlated.
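Since only the magnitude matters here, the features can also be ranked by their absolute correlation with Survived (a small sketch using the vari frame defined above):

# absolute Pearson correlation with Survived, sorted; the sign is ignored
corr_with_target = vari.astype(float).corr()['Survived'].drop('Survived').abs().sort_values(ascending=False)
print(corr_with_target)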


V. Modeling & Evaluation

1. Feature Selection

test = full_df[891:]
feature=['Cabin','Embarked','Sex','Title','Pclass','Age','Family','Fare'] 
new_full_df=full_df[feature]
new_full_df.head()
dummies_Family = pd.get_dummies(new_full_df['Family'], prefix= 'Family')
dummies_Age = pd.get_dummies(new_full_df['Age'], prefix= 'Age')
dummies_Sex = pd.get_dummies(new_full_df['Sex'], prefix= 'Sex')
dummies_Title = pd.get_dummies(new_full_df['Title'], prefix= 'Title')
dummies_Fare = pd.get_dummies(new_full_df['Fare'], prefix= 'Fare')
dummies_Embarked = pd.get_dummies(new_full_df['Embarked'], prefix= 'Embarked')
dummies_Pclass = pd.get_dummies(new_full_df['Pclass'], prefix= 'Pclass')
dummies_Cabin = pd.get_dummies(new_full_df['Cabin'], prefix= 'Cabin')

new_full_df = pd.concat([new_full_df, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass,dummies_Fare,dummies_Title,dummies_Age,dummies_Family], axis=1)
new_full_df.drop(['Cabin','Embarked','Sex','Title','Pclass','Age','Family','Fare'], axis=1, inplace=True)
new_full_df
train_x_factor=new_full_df[:891]
train_y=full_df.Survived[:891]
test_x_factor=new_full_df[891:]

2. Model Evaluation

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
models=[LogisticRegression(),RandomForestClassifier(),GradientBoostingClassifier(),SVC()]
names=['LR','RF','GB','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,train_x_factor,train_y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))

LR:0.8048126684746046,[ 0.79329609  0.77653631  0.80898876  0.79213483  0.85310734]
RF:0.8148683216143631,[ 0.83240223  0.77094972  0.8258427   0.80898876  0.83615819]
GB:0.8250188400755093,[ 0.82122905  0.7877095   0.83707865  0.80337079  0.87570621]
SVM:0.7967834449907032,[ 0.82681564  0.82122905  0.79775281  0.75280899  0.78531073]

3. Hyperparameter Tuning

(The grids shown below are the last ones tried; some of the reported best parameters fall outside them, presumably left over from earlier, wider searches.)

from sklearn.model_selection import GridSearchCV
#Logistic Regression
param_grid={'C':[0.07,0.08,0.09,0.10,0.11,]}
grid_search=GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_

({'C': 0.31}, 0.8103254769921436)
from sklearn.model_selection import GridSearchCV
#Support Vector Machine
param_grid={'C':[2,3,4,5,6],'gamma':[0.013,0.014,0.015,0.016,0.017]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_
({'C': 4, 'gamma': 0.2}, 0.8316498316498316)
from sklearn.model_selection import GridSearchCV
#Gradient Boosting Decision Tree
param_grid={'n_estimators':[110,120,125,130],'learning_rate':[0.1,0.11,0.12,0.14,0.15],'max_depth':[4,5,6,7]}
grid_search=GridSearchCV(GradientBoostingClassifier(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_
({'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 140},
 0.8282828282828283)
from sklearn.model_selection import GridSearchCV
#Random Forest 
param_grid={'n_estimators':[480,481],'max_depth':[8,9,10]}
grid_search=GridSearchCV(RandomForestClassifier(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_

({'max_depth': 7, 'n_estimators': 470}, 0.8282828282828283)

4. Model Ensembling

from sklearn.ensemble import VotingClassifier

clf1=LogisticRegression(C=0.1)
clf2=RandomForestClassifier(n_estimators=481,max_depth=9)
clf3=GradientBoostingClassifier(n_estimators=125,learning_rate=0.11,max_depth=7)
clf4=SVC(C=3,gamma=0.016,probability=True)

vc_clf=VotingClassifier(estimators=[('LR',clf1),('RF',clf2),('GB',clf3),('SVM',clf4)])

score=cross_val_score(vc_clf,train_x_factor,train_y,cv=5)
print("{},{}".format(score.mean(),score))

0.8216353575642416,[ 0.82122905  0.7877095   0.83146067  0.80337079  0.86440678]

The ensemble does not show much improvement over the individual models.

pred=vc_clf.fit(train_x_factor,train_y).predict(test_x_factor)
df=pd.DataFrame({'PassengerId':test.PassengerId,'Survived':pred})
df.to_csv('titanic_pred.csv',index=False)

Other ideas for improvement

1. For Age, which has many missing values, other imputation strategies could be tried, such as model-based imputation or filling with the mean/mode of each Title group (see the sketch after this list).
2. The preprocessing here bins the variables and then one-hot encodes them for modeling; modeling on standardized continuous features could be tried instead.
3. On the modeling side, more models could be added to the cross-validation comparison, and other ensembling methods such as bagging or stacking could be tried (this post used a VotingClassifier).
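A minimal sketch of idea 1, median-age imputation within each Title group (age_raw below is rebuilt from the untouched train_df/test_df, because the Age column in full_df has already been randomly imputed and binned above):

# fill missing ages with the median age of the passenger's Title group
age_raw = pd.concat([train_df, test_df], ignore_index=True)['Age']
age_by_title = age_raw.groupby(full_df['Title']).transform('median')   # per-group median, aligned row by row
age_filled = age_raw.fillna(age_by_title)
print(age_filled.isnull().sum())   # expect 0: every Title group has at least one known age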
