I. Case Background

In 1912, the Titanic struck an iceberg and sank on her maiden voyage, and most of the passengers and crew lost their lives. Titanic survival prediction is the classic Kaggle starter project: given part of the passenger manifest, the goal is to work out which features best predict whether a person survived.

The Titanic was about 269.06 m (882.75 ft) long and 28.19 m (92.5 ft) wide. Her passenger decks were labeled with letters starting from A, and accommodation was divided into first, second and third class. She sailed from Southampton, England for New York, calling at Cherbourg-Octeville in France and Queenstown in Ireland. After the collision with the iceberg, the lifeboats were loaded roughly in order from first class down to third class, with women and children given priority.

The information above comes from background reading and should help in making sense of the features in the dataset.

From this background we can form a few expectations that are likely to matter for survival prediction:
1. Women and children should have a higher survival rate.
2. In Pclass, 1 stands for first class, 2 for second class and 3 for third class.
3. Deck A housed first class; decks B and C may correspond to second class, and decks D and E to third class.
4. In Embarked, C stands for Cherbourg-Octeville (France), Q for Queenstown (Ireland) and S for Southampton (England).


II. Data Loading

#numpy,pandas
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

#plot
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# load the data; concatenate train and test for joint preprocessing
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
full_df=pd.concat([train_df,test_df],ignore_index=True)
full_df.head()

[Output: first five rows of full_df]

Survived: survival (0 = no, 1 = yes)
Pclass: ticket class
Name: passenger name
Sex: passenger sex
Age: passenger age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation


III. Data Overview

Start with an overall look at the dataset.

# check the shape of each DataFrame
train_df.shape    #(891, 12)

test_df.shape     #(418, 11)

full_df.shape     #(1309, 12)

After merging the training and test sets there are 1309 observations and 12 variables.

# number of non-null observations and dtype of each variable
full_df.info()

[Output: full_df.info()]
The dataset has 3 float variables and 4 integer variables, 7 numeric variables in total, plus 5 string variables.

# check for outliers and extreme values (look at the minimum and maximum)
full_df.describe().T

[Output: full_df.describe().T]

# count missing values per variable
full_df.isnull().sum()

[Output: full_df.isnull().sum()]
The variables with missing values are Age, Cabin, Embarked and Fare; Cabin is missing for most passengers.
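To quantify how much is missing, a quick sketch of checking the missing-value ratios:

# share of missing values per column, from most to least missing
missing_ratio = full_df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head())   # Cabin is roughly 0.77 missing, Age roughly 0.20; Embarked and Fare are tiny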


IV. Feature Engineering

Since the goal is survival prediction, the analysis below mainly visualizes Survived against each of the other variables, which gives a rough sense of whether they are related.

1. Pclass

fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,4))

train_df.Pclass.value_counts().plot(kind="bar",color='steelblue',ax=axis1)
axis1.set_xticklabels([3,1,2],rotation=0)
axis1.set_ylabel(u'人数',fontproperties='SimHei')
axis1.set_xlabel('Pclass')

pd.crosstab(train_df.Pclass,train_df.Survived).plot.bar(stacked=True,color=['#FA2479','steelblue'],ax=axis2)
axis2.set_xticklabels([1,2,3], rotation=0)
axis2.set_ylabel(u'人数',fontproperties='SimHei')


pd.pivot_table(train_df,index=['Pclass'],values=['Survived']).plot.bar(ax=axis3,color='steelblue')
axis3.set_xticklabels([1,2,3], rotation=0)

axis1.set_title(u"乘客等级分布",fontproperties='SimHei')
axis2.set_title(u"各乘客等级的获救情况",fontproperties='SimHei')
axis3.set_title(u"生存概率",fontproperties='SimHei')

[Figure: passenger counts, survival counts and survival rate by Pclass]
Survival clearly differs across ticket classes: the higher the class, the higher the survival rate.


2. Name

The names in this dataset follow a pattern: the title sits in the middle of the name and is separated from the rest by punctuation ("Last, Title. First").
In real life a title says something about a person's occupation and age, so the title is extracted from each name.

# split each name on the separators and extract the title
full_df['Title']=full_df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())  
full_df.Title.value_counts()

Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Major             2
Ms                2
Mlle              2
Capt              1
the Countess      1
Jonkheer          1
Sir               1
Don               1
Lady              1
Mme               1
Dona              1
Name: Title, dtype: int64

The extracted titles mix English, French and Spanish forms, so they are grouped together,
leaving 5 categories.

full_df['Title'] = full_df['Title'].replace(['Capt', 'Don', 'Major', 'Col', 'Sir','Rev'], 'Mr')
full_df['Title'] = full_df['Title'].replace(['Ms','Mlle'], 'Miss')
full_df['Title'] = full_df['Title'].replace('Jonkheer', 'Master')
full_df['Title'] = full_df['Title'].replace(['Mme','Dona', 'Lady', 'the Countess'], 'Mrs')
full_df.Title.value_counts()  

Mr        774
Miss      264
Mrs       201
Master     62
Dr          8
Name: Title, dtype: int64
#plot
full_df.boxplot(column='Age', by='Title')
plt.title(u"各称谓的年龄分布情况",fontproperties='SimHei')
plt.ylabel(u"年龄",fontproperties='SimHei') 

[Figure: age distribution by title (boxplot)]
The boxplot shows that Dr skews older, Miss tends to be younger than Mrs, and Master corresponds to young boys; Mr absorbs almost all adult males, so little can be read from that group.

full_df[:891].groupby('Title')['Survived'].mean().plot(kind='bar',figsize=(6,3),color='#FA2479')
plt.xticks(rotation=0)

[Figure: survival rate by title]
As expected, women and the younger titles have a noticeably higher survival rate.

# encode the titles as integers
title_map = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5}
full_df['Title'] = full_df['Title'].map(title_map)

# confirm all titles were mapped
full_df['Title'].value_counts()  

1    774
2    264
3    201
4     62
5      8
Name: Title, dtype: int64

3. Sex

fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,4))

train_df.Sex.value_counts().plot(kind="bar",color='steelblue',ax=axis1)
axis1.set_ylabel(u'人数',fontproperties='SimHei')
axis1.set_xlabel('Sex')


pd.crosstab(train_df.Sex,train_df.Survived).plot.bar(stacked=True,color=['#FA2479','steelblue'],ax=axis2)
axis2.set_ylabel(u'人数',fontproperties='SimHei')


pd.pivot_table(train_df,index=['Sex'],values=['Survived']).plot.bar(ax=axis3,color='steelblue')

axis1.set_title(u"男女人数分布",fontproperties='SimHei')
axis2.set_title(u"男女的获救情况",fontproperties='SimHei')
axis3.set_title(u"生存概率",fontproperties='SimHei')

[Figure: counts, survival counts and survival rate by sex]
Women have a much higher survival rate, consistent with the earlier expectation.

# encode sex as an integer
full_df['Sex'] = full_df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

4. Age

Age has a fair number of missing values, so it has to be imputed; here random filling is used (random integers within one standard deviation of the mean).

fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')

average_age_full   = full_df["Age"].mean()
std_age_full       = full_df["Age"].std()
count_nan_age_full = full_df["Age"].isnull().sum()

# random imputation: draw integers from [mean - std, mean + std]
rand = np.random.randint(average_age_full - std_age_full, average_age_full + std_age_full, size = count_nan_age_full)

full_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)

full_df.loc[full_df["Age"].isnull(), "Age"] = rand

full_df['Age'] = full_df['Age'].astype(int)

full_df['Age'].hist(bins=70, ax=axis2)

[Figure: Age histograms before and after imputation]
These two histograms show the age distribution before and after imputation; the shapes are very similar.

full_df.Age[full_df.Pclass == 1].plot(kind='kde')   
full_df.Age[full_df.Pclass == 2].plot(kind='kde')
full_df.Age[full_df.Pclass == 3].plot(kind='kde')
plt.xlabel(u"年龄",fontproperties='SimHei')
plt.ylabel(u"密度",fontproperties='SimHei') 
plt.title(u"各等级的乘客年龄分布",fontproperties='SimHei')
plt.legend(('Level1', 'Level2','Level3'),loc='best')

[Figures: age density by passenger class, before and after imputation]
These are the age density curves by passenger class; again the plots before and after imputation differ little.
They also show that the higher the class, the older the passengers tend to be.

# discretize age into bands
full_df.loc[ full_df['Age'] <= 14, 'Age'] = 0
full_df.loc[(full_df['Age'] > 14) & (full_df['Age'] <= 28), 'Age'] = 1
full_df.loc[(full_df['Age'] > 28) & (full_df['Age'] <= 42), 'Age'] = 2
full_df.loc[(full_df['Age'] > 42) & (full_df['Age'] <= 56), 'Age'] = 3
full_df.loc[ full_df['Age'] > 56, 'Age'] = 4
full_df[:891][['Age','Survived']].groupby(['Age']).mean().plot.bar(figsize=(8,5))

[Figure: survival rate by age band]
Children (the youngest band) have a clearly higher survival rate; the other bands are fairly similar.
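The thresholds 14/28/42/56 above are simply equal-width 14-year bands; as a sketch, pd.cut can build equivalent bands automatically, assuming it is applied to the raw ages (age_raw below is rebuilt from the untouched train_df/test_df, since full_df['Age'] has already been replaced in place):

# equal-width age bands built with pd.cut, applied to the raw (pre-discretization) ages
age_raw = pd.concat([train_df, test_df], ignore_index=True)['Age']
age_band = pd.cut(age_raw, bins=[0, 14, 28, 42, 56, np.inf], labels=[0, 1, 2, 3, 4], include_lowest=True)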

5. SibSp and Parch

Both of these variables count family members on board, so they are combined into a single variable, Family.

# combine into one family-size variable
full_df['Family'] = full_df['SibSp'] + full_df["Parch"]
full_df['Family'].value_counts()

0     790
1     235
2     159
3      43
5      25
4      22
6      16
10     11
7       8
Name: Family, dtype: int64

# binarize: traveling alone vs. with family
full_df.loc[full_df['Family'] == 0, 'Family'] = 0
full_df.loc[full_df['Family'] > 0, 'Family'] = 1
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

sns.countplot(x='Family', data=full_df[:891], ax=axis1,color='blue')

sns.countplot(x='Family', hue="Survived", data=full_df[:891], ax=axis2)

embark_perc = full_df[:891][["Family","Survived"]].groupby(['Family'],as_index=False).mean()
sns.barplot(x='Family', y='Survived', data=embark_perc,ax=axis3,color='blue')

#axis1.set_xticklabels(["Alone","With Family"], rotation=0)
#axis2.set_xticklabels(["Alone","With Family"], rotation=0)
#axis3.set_xticklabels(["Alone","With Family"], rotation=0)

[Figure: counts and survival rate, alone vs. with family]
Passengers traveling with family have a higher survival rate than those traveling alone.

6. Fare

From the earlier overview, Fare has one missing value.

# fill the missing fare with the mode
mod=full_df.Fare.mode()
mod
0    8.05
dtype: float64

full_df["Fare"] = full_df["Fare"].fillna(8.05)
# discretize fare into bands
full_df.loc[ full_df['Fare'] <= 7.854, 'Fare'] = 0
full_df.loc[(full_df['Fare'] > 7.854) & (full_df['Fare'] <= 10.5), 'Fare'] = 1
full_df.loc[(full_df['Fare'] > 10.5) & (full_df['Fare'] <= 21.558), 'Fare'] = 2
full_df.loc[(full_df['Fare'] > 21.558) & (full_df['Fare'] <= 41.579), 'Fare'] = 3
full_df.loc[ full_df['Fare'] > 41.579, 'Fare'] = 4
full_df[['Fare','Survived']].groupby(['Fare']).mean().plot.bar(figsize=(6,4),color='steelblue')

[Figure: survival rate by fare band]
The more expensive the ticket, the higher the survival rate. Fare mostly reflects cabin class: a higher fare means a higher class.
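The cut points above (7.854, 10.5, 21.558, 41.579) look like quantile boundaries; as a sketch of how similar bins could be derived (an assumption about where the numbers came from), pd.qcut splits the raw fares into five equal-frequency bins:

# five equal-frequency fare bins on the raw (unmodified) training fares
fare_bins = pd.qcut(train_df['Fare'], 5)
print(fare_bins.value_counts().sort_index())   # the interval edges come out close to the thresholds used above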

7. Cabin

As seen in the data overview, more than half of this variable is missing, which makes it hard to impute. Normally, when the missing data cannot be collected, such a variable would simply be dropped.
Here it is instead reduced to a binary feature, cabin number recorded vs. not, to check whether that is related to Survived.

full_df.loc[full_df.Cabin.notnull(),'Cabin']=1
full_df.loc[full_df.Cabin.isnull(),'Cabin']=0
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,4))

full_df[:891].Cabin.value_counts().plot(kind="bar",color='steelblue',ax=axis1)
axis1.set_ylabel(u'人数',fontproperties='SimHei')
axis1.set_xlabel('Cabin')

pd.crosstab(full_df[:891].Cabin,full_df[:891].Survived).plot.bar(stacked=True,color=['#FA2479','steelblue'],ax=axis2)
axis2.set_ylabel(u'人数',fontproperties='SimHei')

pd.pivot_table(full_df[:891],index=['Cabin'],values=['Survived']).plot.bar(ax=axis3,color='steelblue')

axis1.set_title(u"有无Cabin的分布",fontproperties='SimHei')
axis2.set_title(u"有无Cabin的获救情况",fontproperties='SimHei')
axis3.set_title(u"生存概率",fontproperties='SimHei')

axis1.set_xticklabels(["no Cabin","With Cabin"], rotation=0)
axis2.set_xticklabels(["no Cabin","With Cabin"], rotation=0)
axis3.set_xticklabels(["no Cabin","With Cabin"], rotation=0)

[Figure: counts and survival rate, with vs. without a recorded cabin]
Passengers with a recorded cabin number have a noticeably higher survival rate.
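The background section guessed at a mapping between decks and classes; as a quick sketch (cabin_raw below is rebuilt from the untouched train_df/test_df, since full_df['Cabin'] has already been overwritten with 0/1 above), the deck letter can be read off the raw Cabin values and compared against Pclass:

# first character of the raw Cabin value is the deck letter; NaN stays NaN
cabin_raw = pd.concat([train_df, test_df], ignore_index=True)['Cabin']
deck = cabin_raw.str[0]
print(pd.crosstab(deck, full_df['Pclass']))   # compare the observed decks against the deck/class guess in the background section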

8. Embarked

full_df.Embarked[full_df.Embarked.isnull()]

61     NaN
829    NaN
mod=full_df.Embarked.mode()
mod
0    S
dtype: object

full_df['Embarked'].fillna('S',inplace=True)
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

sns.countplot(x='Embarked', data=train_df,order=['S','C','Q'], ax=axis1)

sns.countplot(x='Embarked', hue="Survived", data=train_df, order=['S','C','Q'], ax=axis2)

embark_perc = train_df[["Embarked","Survived"]].groupby(['Embarked'],as_index=False).mean()
sns.barplot(x='Embarked', y='Survived', data=embark_perc,order=['S','C','Q'],ax=axis3)

[Figure: counts and survival rate by port of embarkation]
Passengers who boarded at the intermediate ports, especially Cherbourg (C), have a higher survival rate.
(I don't really know why the port should be related to survival; perhaps those who boarded mid-route were mostly the wealthier passengers!)

# encode the port as an integer
full_df['Embarked'] = full_df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

9. Variable Correlations

colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
vari=full_df[['Cabin','Embarked','Sex','Title','Pclass','Age','Family','Fare','Survived']]

sns.heatmap(vari.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

[Figure: Pearson correlation heatmap of the selected features]
(This heatmap looks a little odd: the pairs I expected to be related all show strong negative correlation. That is most likely an artifact of how the variables were encoded as integers, so only the magnitude is considered here, not the sign.)
The figure shows that Pclass and Cabin, Pclass and Fare, and Title and Sex are highly correlated.
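Since only the magnitude matters here, the features can also be ranked by their absolute correlation with Survived (a small sketch using the vari frame defined above):

# absolute Pearson correlation with Survived, sorted; the sign is ignored
corr_with_target = vari.astype(float).corr()['Survived'].drop('Survived').abs().sort_values(ascending=False)
print(corr_with_target)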


V. Modeling & Evaluation

1. Feature Selection

test = full_df[891:]
feature=['Cabin','Embarked','Sex','Title','Pclass','Age','Family','Fare'] 
new_full_df=full_df[feature]
new_full_df.head()
dummies_Family = pd.get_dummies(new_full_df['Family'], prefix= 'Family')
dummies_Age = pd.get_dummies(new_full_df['Age'], prefix= 'Age')
dummies_Sex = pd.get_dummies(new_full_df['Sex'], prefix= 'Sex')
dummies_Title = pd.get_dummies(new_full_df['Title'], prefix= 'Title')
dummies_Fare = pd.get_dummies(new_full_df['Fare'], prefix= 'Fare')
dummies_Embarked = pd.get_dummies(new_full_df['Embarked'], prefix= 'Embarked')
dummies_Pclass = pd.get_dummies(new_full_df['Pclass'], prefix= 'Pclass')
dummies_Cabin = pd.get_dummies(new_full_df['Cabin'], prefix= 'Cabin')

new_full_df = pd.concat([new_full_df, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass,dummies_Fare,dummies_Title,dummies_Age,dummies_Family], axis=1)
new_full_df.drop(['Cabin','Embarked','Sex','Title','Pclass','Age','Family','Fare'], axis=1, inplace=True)
new_full_df
train_x_factor=new_full_df[:891]
train_y=full_df.Survived[:891]
test_x_factor=new_full_df[891:]

2. Model Evaluation

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
models=[LogisticRegression(),RandomForestClassifier(),GradientBoostingClassifier(),SVC()]
names=['LR','RF','GB','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,train_x_factor,train_y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))

LR:0.8048126684746046,[ 0.79329609  0.77653631  0.80898876  0.79213483  0.85310734]
RF:0.8148683216143631,[ 0.83240223  0.77094972  0.8258427   0.80898876  0.83615819]
GB:0.8250188400755093,[ 0.82122905  0.7877095   0.83707865  0.80337079  0.87570621]
SVM:0.7967834449907032,[ 0.82681564  0.82122905  0.79775281  0.75280899  0.78531073]

3. Hyperparameter Tuning

(The grids shown below are the last ones tried; some of the reported best parameters fall outside them, presumably left over from earlier, wider searches.)

from sklearn.model_selection import GridSearchCV
#Logistic Regression
param_grid={'C':[0.07,0.08,0.09,0.10,0.11,]}
grid_search=GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_

({'C': 0.31}, 0.8103254769921436)
from sklearn.model_selection import GridSearchCV
#Support Vector Machine
param_grid={'C':[2,3,4,5,6],'gamma':[0.013,0.014,0.015,0.016,0.017]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_
({'C': 4, 'gamma': 0.2}, 0.8316498316498316)
from sklearn.model_selection import GridSearchCV
#Gradient Boosting Decision Tree
param_grid={'n_estimators':[110,120,125,130],'learning_rate':[0.1,0.11,0.12,0.14,0.15],'max_depth':[4,5,6,7]}
grid_search=GridSearchCV(GradientBoostingClassifier(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_
({'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 140},
 0.8282828282828283)
from sklearn.model_selection import GridSearchCV
#Random Forest 
param_grid={'n_estimators':[480,481],'max_depth':[8,9,10]}
grid_search=GridSearchCV(RandomForestClassifier(),param_grid,cv=5)
grid_search.fit(train_x_factor,train_y)
grid_search.best_params_,grid_search.best_score_

({'max_depth': 7, 'n_estimators': 470}, 0.8282828282828283)

4. Model Ensembling

from sklearn.ensemble import VotingClassifier

clf1=LogisticRegression(C=0.1)
clf2=RandomForestClassifier(n_estimators=481,max_depth=9)
clf3=GradientBoostingClassifier(n_estimators=125,learning_rate=0.11,max_depth=7)
clf4=SVC(C=3,gamma=0.016,probability=True)

vc_clf=VotingClassifier(estimators=[('LR',clf1),('RF',clf2),('GB',clf3),('SVM',clf4)])

score=cross_val_score(vc_clf,train_x_factor,train_y,cv=5)
print("{},{}".format(score.mean(),score))

0.8216353575642416,[ 0.82122905  0.7877095   0.83146067  0.80337079  0.86440678]

The ensemble does not show much improvement over the individual models.

pred=vc_clf.fit(train_x_factor,train_y).predict(test_x_factor)
df=pd.DataFrame({'PassengerId':test.PassengerId,'Survived':pred})
df.to_csv('titanic_pred.csv',index=False)

Other ideas for improvement

1. For Age, which has many missing values, other imputation strategies could be tried, such as model-based imputation or filling with the mean/mode of each Title group (see the sketch after this list).
2. The preprocessing here bins the variables and then one-hot encodes them for modeling; modeling on standardized continuous features could be tried instead.
3. On the modeling side, more models could be added to the cross-validation comparison, and other ensembling methods such as bagging or stacking could be tried (this post used a VotingClassifier).
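A minimal sketch of idea 1, median-age imputation within each Title group (age_raw below is rebuilt from the untouched train_df/test_df, because the Age column in full_df has already been randomly imputed and binned above):

# fill missing ages with the median age of the passenger's Title group
age_raw = pd.concat([train_df, test_df], ignore_index=True)['Age']
age_by_title = age_raw.groupby(full_df['Title']).transform('median')   # per-group median, aligned row by row
age_filled = age_raw.fillna(age_by_title)
print(age_filled.isnull().sum())   # expect 0: every Title group has at least one known age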
