spark之客户流失预测
Sparkify是一个类似于Spotify或Pandora的数字化音乐服务(类似于网易云音乐和QQ音乐的音乐平台)。使用Spark分析探索某数字音乐服务平台Sparkify(类似于网易云音乐和QQ音乐的音乐平台)2016年10月1日-2016年12月1日期间用户在该平台上的行为数据。通过对用户行为和用户信息的分析,提取可能对预测用户是否流失有帮助的相关特征,从而建立流失用户预测模型。
·
一.数据
这是一个Udacity纳米学位项目(数据科学的顶点)。这个项目使用来自Sparkify的用户事件数据来建立一个模型来预测用户的流失。Sparkify是一个类似于Spotify或Pandora的数字化音乐服务(类似于网易云音乐和QQ音乐的音乐平台)。使用Spark分析探索某数字音乐服务平台Sparkify(类似于网易云音乐和QQ音乐的音乐平台)2016年10月1日-2016年12月1日期间用户在该平台上的行为数据。通过对用户行为和用户信息的分析,提取可能对预测用户是否流失有帮助的相关特征,从而建立流失用户预测模型。
以下是APP的日志数据,包含226个不同用户的信息,从为一首歌的点赞到更改账户设置,都有详细的操作。
数据集名称:mini_sparkify_event_data.json,原始数据12G,这里使用子集123M,可自行网上搜索下载。
数据描述:
artist 音乐信息 歌手名称
auth 网页信息 用户进入平台的方式
firstName 用户信息 用户的名
gender 用户信息 用户性别:F为女,M为男
itemInSession 网页信息 会话顺序
lastName 用户信息 用户的姓氏
length 音乐信息 音乐时长(秒)
level Event 用户等级:free为免费用户,paid为付费用户
location 会话信息 用户在会话期间所属位置
method 网页信息 HTTP method ,GET 或者 PUT
page 网页信息 用户行为类型
registration 用户信息 用户注册时间
sessionId 会话信息 会话编号
song 音乐信息 歌曲名称
status 网页信息 HTTP状态编码. 2xx=Successful, 3xx=Redirection, 4xx=Client Error.
ts 网页信息 用户行为发生的时间
userAgent 会话信息 网络环境,所属浏览器
userId 用户信息 用户编码,具有唯一性
查看属性:
+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
| artist| auth|firstName|gender|itemInSession|lastName| length|level| location|method| page| registration|sessionId| song|status| ts| userAgent|userId|
+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
| Martha Tilston|Logged In| Colin| M| 50| Freeman|277.89016| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Rockpools| 200|1538352117000|Mozilla/5.0 (Wind...| 30|
|Five Iron Frenzy|Logged In| Micah| M| 79| Long|236.09424| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8| Canada| 200|1538352180000|"Mozilla/5.0 (Win...| 9|
| Adam Lambert|Logged In| Colin| M| 51| Freeman| 282.8273| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Time For Miracles| 200|1538352394000|Mozilla/5.0 (Wind...| 30|
| Enigma|Logged In| Micah| M| 80| Long|262.71302| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8|Knocking On Forbi...| 200|1538352416000|"Mozilla/5.0 (Win...| 9|
| Daft Punk|Logged In| Colin| M| 52| Freeman|223.60771| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29|Harder Better Fas...| 200|1538352676000|Mozilla/5.0 (Wind...| 30|
+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
二.代码
详细代码见:RF用户流失预测(https://github.com/jiangnanboy/spark_tutorial)
public static void churnAnalysis(SparkSession session) {
/**
* 加载数据
* 这是日志数据,包含226个不同用户的信息,从为一首歌点赞到更改账户设置,都有详细的操作。
*
* 数据描述
* artist: Artist name (ex. Daft Punk)
* auth: User authentication status (ex. Logged)
* firstName: User first name (ex. Colin)
* gender: Gender (ex. F or M)
* itemInSession: Item count in a session (ex. 52)
* lastName: User last name (ex. Freeman)
* length: Length of song (ex. 223.60771)
* level: User plan (ex. paid)
* location: User's location (ex. Bakersfield)
* method: HTTP method (ex. PUT)
* page: Page name (ex. NextSong)
* registration: Registration timestamp (unix timestamp) (ex. 1538173362000)
* sessionId: Session ID (ex. 29)
* song: Song (ex. Harder Better Faster Stronger)
* status: HTTP status (ex. 200)
* ts: Event timestamp(unix timestamp) (ex. 1538352676000)
* userAgent: User's browswer agent (ex. Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0)
* userId: User ID (ex. 30)
*
* +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
* | artist| auth|firstName|gender|itemInSession|lastName| length|level| location|method| page| registration|sessionId| song|status| ts| userAgent|userId|
* +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
* | Martha Tilston|Logged In| Colin| M| 50| Freeman|277.89016| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Rockpools| 200|1538352117000|Mozilla/5.0 (Wind...| 30|
* |Five Iron Frenzy|Logged In| Micah| M| 79| Long|236.09424| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8| Canada| 200|1538352180000|"Mozilla/5.0 (Win...| 9|
* | Adam Lambert|Logged In| Colin| M| 51| Freeman| 282.8273| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29| Time For Miracles| 200|1538352394000|Mozilla/5.0 (Wind...| 30|
* | Enigma|Logged In| Micah| M| 80| Long|262.71302| free|Boston-Cambridge-...| PUT|NextSong|1538331630000| 8|Knocking On Forbi...| 200|1538352416000|"Mozilla/5.0 (Win...| 9|
* | Daft Punk|Logged In| Colin| M| 52| Freeman|223.60771| paid| Bakersfield, CA| PUT|NextSong|1538173362000| 29|Harder Better Fas...| 200|1538352676000|Mozilla/5.0 (Wind...| 30|
* +----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
*/
String path = PropertiesReader.get("mini_sparkify_event_data"); //数据集自行下载
Dataset<Row> dataset = session.read().json(path);
dataset.persist(StorageLevel.MEMORY_AND_DISK());
dataset.show(5);
/**
*
*/
//创建视图
dataset.createOrReplaceTempView("userlogs");
String[] columnsName = dataset.columns();
Column[] columns = new Column[columnsName.length];
for(int index = 0;index < columnsName.length; index++) {
columns[index] = functions.count(functions.when(functions.isnull(col(columnsName[index])), columnsName[index])).as(columnsName[index]);
}
/**
* 查看每个列为null的数量
*
* +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
* |artist|auth|firstName|gender|itemInSession|lastName|length|level|location|method|page|registration|sessionId| song|status| ts|userAgent|userId|
* +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
* | 58392| 0| 8346| 8346| 0| 8346| 58392| 0| 8346| 0| 0| 8346| 0|58392| 0| 0| 8346| 0|
* +------+----+---------+------+-------------+--------+------+-----+--------+------+----+------------+---------+-----+------+---+---------+------+
*/
dataset.select(columns);
/**
* 列“firstName”中的缺失值
* +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
* |artist| auth|firstName|gender|itemInSession|lastName|length|level|location|method| page|registration|sessionId|song|status| ts|userAgent|userId|
* +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
* | null|Logged Out| null| null| 100| null| null| free| null| GET| Home| null| 8|null| 200|1538355745000| null| |
* | null|Logged Out| null| null| 101| null| null| free| null| GET| Help| null| 8|null| 200|1538355807000| null| |
* | null|Logged Out| null| null| 102| null| null| free| null| GET| Home| null| 8|null| 200|1538355841000| null| |
* | null|Logged Out| null| null| 103| null| null| free| null| PUT|Login| null| 8|null| 307|1538355842000| null| |
* | null|Logged Out| null| null| 2| null| null| free| null| GET| Home| null| 240|null| 200|1538356678000| null| |
* | null|Logged Out| null| null| 3| null| null| free| null| PUT|Login| null| 240|null| 307|1538356679000| null| |
* | null|Logged Out| null| null| 0| null| null| free| null| PUT|Login| null| 100|null| 307|1538358102000| null| |
* | null|Logged Out| null| null| 0| null| null| free| null| PUT|Login| null| 241|null| 307|1538360117000| null| |
* | null|Logged Out| null| null| 14| null| null| free| null| GET| Home| null| 187|null| 200|1538361527000| null| |
* | null|Logged Out| null| null| 15| null| null| free| null| PUT|Login| null| 187|null| 307|1538361528000| null| |
* +------+----------+---------+------+-------------+--------+------+-----+--------+------+-----+------------+---------+----+------+-------------+---------+------+
*/
dataset.where(col("firstName").isNaN()).show(10);
/**
* 列"artist"中的缺失值
* +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
* |artist| auth|firstName|gender|itemInSession|lastName|length|level| location|method| page| registration|sessionId|song|status| ts| userAgent|userId|
* +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
* | null| Logged In| Colin| M| 54| Freeman| null| paid| Bakersfield, CA| PUT|Add to Playlist|1538173362000| 29|null| 200|1538352905000|Mozilla/5.0 (Wind...| 30|
* | null| Logged In| Micah| M| 84| Long| null| free|Boston-Cambridge-...| GET| Roll Advert|1538331630000| 8|null| 200|1538353150000|"Mozilla/5.0 (Win...| 9|
* | null| Logged In| Micah| M| 86| Long| null| free|Boston-Cambridge-...| PUT| Thumbs Up|1538331630000| 8|null| 307|1538353376000|"Mozilla/5.0 (Win...| 9|
* | null| Logged In| Alexi| F| 4| Warren| null| paid|Spokane-Spokane V...| GET| Downgrade|1532482662000| 53|null| 200|1538354749000|Mozilla/5.0 (Wind...| 54|
* | null| Logged In| Alexi| F| 7| Warren| null| paid|Spokane-Spokane V...| PUT| Thumbs Up|1532482662000| 53|null| 307|1538355255000|Mozilla/5.0 (Wind...| 54|
* | null| Logged In| Micah| M| 95| Long| null| free|Boston-Cambridge-...| PUT| Thumbs Down|1538331630000| 8|null| 307|1538355306000|"Mozilla/5.0 (Win...| 9|
* | null| Logged In| Micah| M| 97| Long| null| free|Boston-Cambridge-...| GET| Home|1538331630000| 8|null| 200|1538355504000|"Mozilla/5.0 (Win...| 9|
* | null| Logged In| Micah| M| 99| Long| null| free|Boston-Cambridge-...| PUT| Logout|1538331630000| 8|null| 307|1538355687000|"Mozilla/5.0 (Win...| 9|
* | null| Logged In| Ashlynn| F| 9|Williams| null| free| Tallahassee, FL| PUT| Thumbs Up|1537365219000| 217|null| 307|1538355711000|"Mozilla/5.0 (Mac...| 74|
* | null|Logged Out| null| null| 100| null| null| free| null| GET| Home| null| 8|null| 200|1538355745000| null| |
* +------+----------+---------+------+-------------+--------+------+-----+--------------------+------+---------------+-------------+---------+----+------+-------------+--------------------+------+
*/
dataset.where(col("artist").isNaN()).show(10);
/**
* 过滤注销的用户
*/
dataset = dataset.where(col("auth").notEqual("Logged Out"));
/**通过以上分析可知:
* 虽然在较高的级别上userId或sessionId列中没有丢失的值,但进一步查看firstName列中丢失的值,可以发现注销的用户拥有空(但不是null)的用户ID。排除这些用户。
* 列artist中也有缺失的值,但这些值对应于与音乐无关的动作的日志(如“Add to Playlist”、“Roll Advert”等)。只要这些用户是登录的,我们就希望保留这个activity,因为他们可能是分类的重要行为标记。
*/
/**
* |-- artist: string (nullable = true)
* |-- auth: string (nullable = true)
* |-- firstName: string (nullable = true)
* |-- gender: string (nullable = true)
* |-- itemInSession: long (nullable = true)
* |-- lastName: string (nullable = true)
* |-- length: double (nullable = true)
* |-- level: string (nullable = true)
* |-- location: string (nullable = true)
* |-- method: string (nullable = true)
* |-- page: string (nullable = true)
* |-- registration: long (nullable = true)
* |-- sessionId: long (nullable = true)
* |-- song: string (nullable = true)
* |-- status: long (nullable = true)
* |-- ts: long (nullable = true)
* |-- userAgent: string (nullable = true)
* |-- userId: string (nullable = true)
*/
dataset.printSchema();
/**
* 对用户事件动作group,count
*
* +--------------------+------+
* | Page| count|
* +--------------------+------+
* | Cancel| 52|
* | Submit Downgrade| 63|
* | Thumbs Down| 2546|
* | Home| 10118|
* | Downgrade| 2055|
* | Roll Advert| 3933|
* | Logout| 3226|
* | Save Settings| 310|
* |Cancellation Conf...| 52|
* | About| 509|
* | Submit Registration| 5|
* | Settings| 1514|
* | Register| 18|
* | Add to Playlist| 6526|
* | Add Friend| 4277|
* | NextSong|228108|
* | Thumbs Up| 12551|
* | Help| 1477|
* | Upgrade| 499|
* | Error| 253|
* +--------------------+------+
*/
dataset.groupBy("Page").count().show();
/**
* 不同的用户数量
*+--------+
* |nb_users|
* +--------+
* | 226|
* +--------+
*/
session.sql("select count(distinct userId) as nb_users from userlogs").show();
/**
* 创建数据集,包括用户id和标签(是否流失)
*
* 定义:列Page中值为"Cancellation Confirmation"为流失,其它非流失
*/
Dataset<Row> churnDataset = session.sql("select distinct userId, 1 as churn from userlogs where Page='Cancellation Confirmation'");
Dataset<Row> noChurnDataset = session.sql("select distinct userId, 0 as churn from userlogs where userId not in (select distinct userId from userlogs where Page='Cancellation Confirmation')");
/**
* union churnDataset noChurnDataset,shuffling the rows
*/
Dataset<Row> unionDataset = churnDataset.union(noChurnDataset);
unionDataset.createOrReplaceTempView("churn");
unionDataset = session.sql("select * from churn order by rand()");
unionDataset.createOrReplaceTempView("churn");
/**
* churn userId
* 0 174
* 1 52
*/
unionDataset.groupBy(col("churn")).count().show();
/**
* 以上通过定义page为“Cancellation Confirmation”为可以流失用户。
* 这种方法使我们能够在用户流失之前研究其行为,尝试建立预测模型并提取表明将来有流失风险的行为。
* 从以上可知建立的数据集不平衡,在建立模型前可进行采样缓解这种不平衡带来的不准确问题。
*/ ......更多推荐
所有评论(0)