Floating Car Trajectory Data Analysis

若木524 · Published 2025-04-23 10:02:44

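I'll start by looking at the dataset as a whole to decide on a direction for the analysis. First, I load the dataset and display its basic information and its first few rows.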

import pandas as pd

# Load the dataset
df = pd.read_csv('/mnt/oddata.csv')

print('Basic dataset information:')
df.info()

# Get the number of rows and columns
rows, columns = df.shape

if rows < 100 and columns < 20:
    # Small table (fewer than 100 rows and fewer than 20 columns): show everything
    print('Full dataset contents:')
    print(df.to_csv(sep='\t', na_rep='nan'))
else:
    # Large table: show only the first few rows
    print('First few rows of the dataset:')
    print(df.head().to_csv(sep='\t', na_rep='nan'))

Basic dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24845 entries, 0 to 24844
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   VehicleNum  24845 non-null  int64  
 1   stime       24845 non-null  object 
 2   slon        24845 non-null  float64
 3   slat        24845 non-null  float64
 4   etime       24845 non-null  object 
 5   elon        24845 non-null  float64
 6   elat        24845 non-null  float64
 7   ID          24845 non-null  int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 1.5+ MB
First few rows of the dataset:
	VehicleNum	stime	slon	slat	etime	elon	elat	ID
0	22223	2025-04-16 00:03:39	114.167732	22.56255	2025-04-16 00:11:04	114.225487	22.552933	0
1	22223	2025-04-16 00:11:49	114.227264	22.554234	2025-04-16 00:15:35	114.229202	22.56015	1
2	22223	2025-04-16 00:17:29	114.231598	22.562349	2025-04-16 00:29:21	114.255898	22.591	2
3	22223	2025-04-16 00:37:01	114.2397	22.56345	2025-04-16 00:54:58	114.119637	22.565916	3
4	22223	2025-04-16 01:01:29	114.136467	22.575583	2025-04-16 01:08:33	114.166786	22.608232	4

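The dataset has 24,845 rows and 8 columns. Judging by the field names, each row describes one trip by its start and end points. The fields are:

VehicleNum: vehicle number, int64
stime: start time, object (timestamp stored as a string)
slon: start longitude, float64
slat: start latitude, float64
etime: end time, object (timestamp stored as a string)
elon: end longitude, float64
elat: end latitude, float64
ID: record identifier, int64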
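Because stime and etime load as object (string) columns, they have to be converted before any hour-based grouping. As a minimal sketch, the conversion can also be done at load time with the `parse_dates` argument of `pd.read_csv`. The in-memory sample below reuses the first two rows of the preview; the comma separator is an assumption, since the real file's delimiter is not shown.

```python
import io

import pandas as pd

# Two rows copied from the preview above; the comma-separated layout is assumed.
sample = io.StringIO(
    "VehicleNum,stime,slon,slat,etime,elon,elat,ID\n"
    "22223,2025-04-16 00:03:39,114.167732,22.56255,2025-04-16 00:11:04,114.225487,22.552933,0\n"
    "22223,2025-04-16 00:11:49,114.227264,22.554234,2025-04-16 00:15:35,114.229202,22.56015,1\n"
)

# parse_dates converts the listed columns to datetime while reading
sample_df = pd.read_csv(sample, parse_dates=["stime", "etime"])

print(sample_df.dtypes)
start_hours = sample_df["stime"].dt.hour.tolist()
print(start_hours)
```

Once the columns are datetime typed, the `.dt` accessor (hour, day, weekday and so on) is available without a separate `pd.to_datetime` step.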

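I will analyze the data following this outline:

Time distribution
Trip distance distribution
Trip distance compared across time periods

The detailed report follows.

Vehicle Trip Data Analysis Report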

# I. Time distribution
import matplotlib.pyplot as plt
import seaborn as sns

# Convert stime and etime to datetime
df['stime'] = pd.to_datetime(df['stime'])
df['etime'] = pd.to_datetime(df['etime'])

# Count trips by the hour in which they start
start_hour_counts = df['stime'].dt.hour.value_counts().sort_index()

# Count trips by the hour in which they end
end_hour_counts = df['etime'].dt.hour.value_counts().sort_index()

# Set the figure resolution
plt.rcParams['figure.dpi'] = 300

# Font with CJK glyphs; only needed when the plot text contains Chinese characters
plt.rcParams['font.sans-serif'] = ['WenQuanYi Zen Hei']

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart of start hours
sns.barplot(x=start_hour_counts.index, y=start_hour_counts.values, ax=axes[0])
axes[0].set_title('Start time distribution')
axes[0].set_xlabel('Hour')
axes[0].set_ylabel('Count')

# Add data labels to the start-hour bars
for i, v in enumerate(start_hour_counts.values):
    axes[0].text(i, v + 5, str(v), ha='center')

# Bar chart of end hours
sns.barplot(x=end_hour_counts.index, y=end_hour_counts.values, ax=axes[1])
axes[1].set_title('End time distribution')
axes[1].set_xlabel('Hour')
axes[1].set_ylabel('Count')

# Add data labels to the end-hour bars
for i, v in enumerate(end_hour_counts.values):
    axes[1].text(i, v + 5, str(v), ha='center')

plt.tight_layout()
plt.show()

print('Start time distribution:')
print(start_hour_counts)
print('End time distribution:')
print(end_hour_counts)

Start time distribution:
stime
0     1076
1      757
2      525
3      379
4      273
5      257
6      452
7      852
8     1190
9     1296
10    1290
11    1196
12    1090
13    1286
14    1376
15    1386
16    1141
17    1119
18    1027
19    1185
20    1499
21    1500
22    1512
23    1181
Name: count, dtype: int64
End time distribution:
etime
0      896
1      819
2      570
3      378
4      289
5      231
6      407
7      722
8     1150
9     1286
10    1320
11    1210
12    1136
13    1217
14    1326
15    1410
16    1152
17    1094
18    1023
19    1209
20    1469
21    1481
22    1562
23    1488
Name: count, dtype: int64

I. Time Distribution

1. Start time distribution

Hour	Count
0	1076
1	757
2	525
3	379
4	273
5	257
6	452
7	852
8	1190
9	1296
10	1290
11	1196
12	1090
13	1286
14	1376
15	1386
16	1141
17	1119
18	1027
19	1185
20	1499
21	1500
22	1512
23	1181

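Trip starts are fewest in the early hours, falling from 1,076 at hour 0 to a low of 257 at hour 5. Counts climb through the morning and stay above 1,000 per hour from 8:00 through 23:00, with the highest volumes at 20:00-22:00 (about 1,500 per hour). This is consistent with daily routines: commuting, daytime travel, and evening leisure all fall in these hours.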

2. End time distribution

Hour	Count
0	896
1	819
2	570
3	378
4	289
5	231
6	407
7	722
8	1150
9	1286
10	1320
11	1210
12	1136
13	1217
14	1326
15	1410
16	1152
17	1094
18	1023
19	1209
20	1469
21	1481
22	1562
23	1488

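Trip ends follow the same pattern: few in the early hours and many during the day and evening. Compared with the start distribution, the clearest difference is at hour 23, where ends (1,488) exceed starts (1,181) by 307; hour 22 is also slightly higher (1,562 vs 1,512). This suggests that a number of trips that begin late in the evening run into the last hour of the day.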
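The comparison between the two distributions can be made explicit by subtracting the hourly start counts from the hourly end counts. The sketch below rebuilds both series from the tables above rather than from the CSV; the variable names are illustrative.

```python
import pandas as pd

hours = range(24)

# Hourly counts copied from the start and end time tables above
start_counts = pd.Series(
    [1076, 757, 525, 379, 273, 257, 452, 852, 1190, 1296, 1290, 1196,
     1090, 1286, 1376, 1386, 1141, 1119, 1027, 1185, 1499, 1500, 1512, 1181],
    index=hours, name="start")
end_counts = pd.Series(
    [896, 819, 570, 378, 289, 231, 407, 722, 1150, 1286, 1320, 1210,
     1136, 1217, 1326, 1410, 1152, 1094, 1023, 1209, 1469, 1481, 1562, 1488],
    index=hours, name="end")

# Positive values: more trips end than start in that hour
net = (end_counts - start_counts).rename("end_minus_start")

print(net)
print("largest surplus of ends:", net.idxmax(), net.max())
print("largest deficit of ends:", net.idxmin(), net.min())
```

With these counts the largest surplus of ends over starts is at hour 23 (+307) and the largest deficit is at hour 0 (-180). Both series sum to 24,845, so the differences cancel out overall.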

3. Time distribution charts

[Figure unavailable: side-by-side bar charts of trip counts by start hour and by end hour]

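# II. Trip distance distribution
from math import radians, sin, cos, sqrt, atan2

# Great-circle distance between two latitude/longitude points, in kilometers
def haversine(lat1, lon1, lat2, lon2):
    # Convert latitude and longitude from degrees to radians
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    radius = 6371  # mean Earth radius in kilometers
    return radius * c

# Distance of each trip: the straight-line (great-circle) distance between
# its start and end points, not the distance actually driven along roads
df['distance'] = df.apply(lambda row: haversine(row['slat'], row['slon'], row['elat'], row['elon']), axis=1)

# Descriptive statistics of the distance (rounded to two decimals)
distance_stats = df['distance'].describe().round(2)

# Histogram of distances.
# Note: with a maximum of about 16,626 km, each of the 30 bins is roughly 554 km wide,
# so at least three quarters of the trips (75th percentile: 5.58 km) fall in the first bin.
sns.histplot(df['distance'], bins=30, kde=False)
plt.title('Histogram of trip distances')
plt.xlabel('Distance (km)')
plt.xticks(rotation=45)
plt.ylabel('Frequency')

# Add a count label above each bar
for rect in plt.gca().patches:
    height = rect.get_height()
    plt.gca().text(rect.get_x() + rect.get_width()/2, height + 0.5, int(height), ha='center', va='bottom')

plt.show()

print('Descriptive statistics of trip distance:')
print(distance_stats)

Descriptive statistics of trip distance: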
count    24845.00
mean         5.26
std        105.84
min          0.00
25%          1.35
50%          2.71
75%          5.58
max      16626.01
Name: distance, dtype: float64
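Applying a Python function row by row works, but on larger tables the same haversine formula is usually computed on whole columns at once. The following is a minimal sketch of a NumPy version, not the code that produced the statistics above; `haversine_km` and `EARTH_RADIUS_KM` are names introduced here for illustration, and the sample coordinates are the first two trips from the data preview.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value as in the function above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars, NumPy arrays, or pandas Series."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return EARTH_RADIUS_KM * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# First two trips from the data preview
trips = pd.DataFrame({
    "slon": [114.167732, 114.227264],
    "slat": [22.56255, 22.554234],
    "elon": [114.225487, 114.229202],
    "elat": [22.552933, 22.56015],
})

# One call computes the distance for every row
trips["distance"] = haversine_km(trips["slat"], trips["slon"], trips["elat"], trips["elon"])
print(trips["distance"].round(2))
```

On the full table the call would take the four coordinate columns of `df` in the same order (start latitude, start longitude, end latitude, end longitude).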

II. Trip Distance Distribution

1. Descriptive statistics of trip distance

Statistic	Value (distances in km)
count	24845.00
mean	5.26
std (standard deviation)	105.84
min	0.00
25th percentile	1.35
50th percentile (median)	2.71
75th percentile	5.58
max	16626.01

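Count: there are 24,845 trip records, a sample large enough for the summary statistics to be reasonably reliable.
Mean: the mean distance is 5.26 km, so trips are short overall. However, the mean is nearly twice the median, and with such a large standard deviation it is a poor indicator of the typical trip.
Standard deviation: at 105.84 km the standard deviation is about 20 times the mean. The spread is therefore dominated by a small number of extreme values rather than by ordinary variation between trips.
Minimum and maximum: the minimum of 0.00 km means the start and end points coincide, which may indicate a vehicle that did not move or a recording error. The maximum of 16,626.01 km is not a plausible straight-line distance for a single car trip and almost certainly reflects erroneous coordinates rather than long-haul transport. These extremes strongly affect the mean and the standard deviation.
Percentiles: the 25th percentile is 1.35 km, the median is 2.71 km, and the 75th percentile is 5.58 km. Half of all trips therefore cover 2.71 km or less, and three quarters cover 5.58 km or less. All distances are straight-line distances between start and end points, so they understate the distance actually driven.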
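The effect of a single extreme value on the mean, and the relative stability of the median, can be seen on a tiny synthetic series. The values below are illustrative only (they borrow the quartiles and the maximum from the table above), and `MAX_PLAUSIBLE_KM` is a hypothetical cutoff chosen for this sketch, not a threshold used in the original analysis.

```python
import pandas as pd

# Synthetic distances in km: four ordinary trips plus one implausible outlier
distances = pd.Series([1.35, 2.71, 3.00, 5.58, 16626.01], name="distance")

# Hypothetical plausibility cutoff; a real value would depend on the study area
MAX_PLAUSIBLE_KM = 100

print("mean with outlier:  ", round(distances.mean(), 2))
print("median with outlier:", distances.median())

# Keep only trips at or below the cutoff
filtered = distances[distances <= MAX_PLAUSIBLE_KM]

print("mean after filtering:  ", round(filtered.mean(), 2))
print("median after filtering:", round(filtered.median(), 2))
```

The mean moves by three orders of magnitude when the outlier is removed, while the median barely changes, which is why the median and the quartiles are the more trustworthy summary for this dataset.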

2. Trip distance distribution chart

[Figure unavailable: histogram of trip distances (km) against frequency]

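# III. Trip distance compared across time periods
# Split the day into periods by start hour.
# Note: pd.cut uses right-closed bins by default, so these bins are
# (0, 6], (6, 12], (12, 18], (18, 24]. Trips that start in hour 0 fall outside
# every bin (NaN) and are excluded from the group means below.
# Passing right=False would give [0, 6), [6, 12), [12, 18), [18, 24) instead.
df['time_period'] = pd.cut(df['stime'].dt.hour, bins=[0, 6, 12, 18, 24], labels=['early morning', 'morning', 'afternoon', 'evening'])

# Mean trip distance per period (rounded to two decimals)
avg_distance_by_period = df.groupby('time_period', observed=False)['distance'].mean().round(2)

# Bar chart of mean trip distance per period
sns.barplot(x=avg_distance_by_period.index, y=avg_distance_by_period.values)
plt.title('Mean trip distance by time period')
plt.xlabel('Time period')
plt.ylabel('Mean distance (km)')

# Add data labels
for i, v in enumerate(avg_distance_by_period.values):
    plt.text(i, v, str(v), ha='center', va='bottom')

plt.show()

print('Mean trip distance by time period:')
print(avg_distance_by_period)

Mean trip distance by time period:
time_period
early morning    5.29
morning          4.52
afternoon        4.38
evening          7.03
Name: distance, dtype: float64

III. Trip Distance Compared Across Time Periods

1. Mean trip distance by time period

Time period	Mean distance (km)
early morning	5.29
morning	4.52
afternoon	4.38
evening	7.03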

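The evening period has the longest mean distance (7.03 km) and the afternoon the shortest (4.38 km). One possible explanation is that evening trips tend to be longer, for example return journeys after an evening out, while afternoon trips cluster around workplaces and cover short distances. This reading should be treated with caution, because the mean is sensitive to extreme values. If the 16,626.01 km record falls in the evening period (6,877 trips under this binning), it alone adds about 2.4 km to the evening mean, so comparing medians would be more robust. Note also that, as written, the binning assigns start hours 1-6 to early morning, 7-12 to morning, 13-18 to afternoon, and 19-23 to evening, and leaves out the 1,076 trips that start in hour 0.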
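How `pd.cut` treats the bin edges determines which period each hour belongs to. The sketch below uses a handful of hand-picked hours to compare the default right-closed bins with left-closed bins (`right=False`); the English labels are illustrative.

```python
import pandas as pd

hours = pd.Series([0, 5, 6, 11, 12, 18, 23])
edges = [0, 6, 12, 18, 24]
labels = ["early morning", "morning", "afternoon", "evening"]

# Default: right-closed bins (0, 6], (6, 12], (12, 18], (18, 24]
right_closed = pd.cut(hours, bins=edges, labels=labels)

# Left-closed bins [0, 6), [6, 12), [12, 18), [18, 24)
left_closed = pd.cut(hours, bins=edges, labels=labels, right=False)

comparison = pd.DataFrame({
    "hour": hours,
    "right_closed": right_closed,
    "left_closed": left_closed,
})
print(comparison)
```

With the default, hour 0 gets no label and hours 6, 12, and 18 are grouped with the period that precedes them. With `right=False` every hour from 0 to 23 is labeled, and each boundary hour opens the next period.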

2. Bar chart of mean trip distance by time period

[Figure unavailable: bar chart of mean trip distance (km) by time period]
