PPO-Based Reinforcement Learning Trajectory Planning: Generalized Ray-Attention Local Planner

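This project combines 2D LiDAR local trajectory planning, a randomized simulation environment, an attention encoder, and PPO reinforcement learning to explore a new trajectory planning scheme that supports both local and global planning, with good real-time performance and generalization.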

怀念失去的天堂

Published 2025-11-23 10:52:53

GRALP (Generalized-depth Ray-Attention Local Planner)

GRALP trains a lightweight PPO local planner on a fully randomized, map-free GPU environment. Observations are vectorized generalized ray/depth (distance-to-obstacle) samples with kinematic history; actions are continuous planar velocity commands. The codebase also ships a one-command exporter that packages the trained policy into a standalone inference API.

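Figure: result from the legacy minimal simulation environment, which used the farthest visible point as the forward navigation target.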

Quickstart

Install dependencies

```
pip install -r requirements.txt
# then install torch matching your device, e.g.:
pip install torch --index-url https://download.pytorch.org/whl/cpu
# or
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

Configure
- config/env_config.json
  - limits: linear/yaw caps (vx_max, omega_max).
  - sim: dt, safe_distance, task redraw knobs (task_point_max_dist_m, task_point_success_radius_m, task_point_random_interval_max).
  - obs: view geometry (patch_meters, ray_max_gap for derived ray count) plus empty-ratio sampling controls (blank_ratio_base, blank_ratio_randmax, blank_ratio_std_ratio) and optional narrow-passage Gaussian distances (narrow_passage_gaussian, narrow_passage_std_ratio).
  - reward: collision/progress/time/jerk weights and orientation_verify gate.
- config/train_config.json
  - device: cuda:0 (default) or cpu; env_config file name to load.
  - sampling: batch_env (e.g., 2048 GPU envs), rollout_len, reset_each_rollout.
  - ppo: discount (gamma), gae_lambda, clipping, optimizer lrs (lr, value_lr), epochs, minibatch_size, entropy_coef, value_coef, max_grad_norm, AMP toggles (amp, amp_bf16), collision_done, log_std_min/max.
  - model: attention shape (num_queries, num_heads).
  - run: total_env_steps, ckpt_dir, log_interval.
Train
```
python -m rl_ppo.ppo_train --train_config config/train_config.json
```
On startup, answer y to create a new checkpoint, or n to resume from the latest checkpoint under run.ckpt_dir.

Standalone Inference Export

The network is CPU-friendly. Run python3 tools/setup_api.py to rebuild ppo_api/: the script copies the template, syncs key config fields (limits, timestep, FOV, attention shape), and picks up the newest checkpoint under runs/. After export, import the wrapper with from ppo_api.inference import PPOInference; see ppo_api/README.md for usage details.

Figure: illustration of tracking performance at 0.5 m.

Repository Layout

env/
- sim_gpu_env.py: Batched randomized ray environment (SimRandomGPUBatchEnv) with per-step FOV resampling and task-point rewards.
- ray.py: Ray count utilities used when n_rays==0.
- utils.py: Logging helpers and JSON config loader.
rl_ppo/
- ppo_train.py: PPO training entrypoint.
- ppo_models.py: Shared-encoder Gaussian policy and value head with tanh-squashed actions.
- encoder.py: RayEncoder backbone (ray convolutions + multi-query, multi-head attention) that outputs a 256-d latent.
- ppo_buffer.py: GAE-Lambda rollout buffer.
- ppo_utils.py: Discounted return helpers, checkpoint utilities, AMP guards, and reproducibility tools.
config/
- env_config.json: Environment, observation, and reward settings.
- train_config.json: PPO hyperparameters and run configuration.
tools/
- setup_api.py: One-command exporter that builds a self-contained ppo_api/ with the newest checkpoint and synced configs.
- api_example/: Template for the exported inference package.
runs/: Default checkpoint/output directory (created at runtime).
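ppo_buffer.py is described above as a GAE-Lambda rollout buffer. As a rough illustration of what such a buffer computes, here is a generic single-environment sketch of generalized advantage estimation. It is not the repository's batched implementation: the function name compute_gae, its argument layout, and the default gamma and gae_lambda values are assumptions made for this example.

```python
from typing import List, Tuple


def compute_gae(
    rewards: List[float],
    values: List[float],
    dones: List[bool],
    last_value: float,
    gamma: float = 0.99,       # assumed default, not taken from train_config.json
    gae_lambda: float = 0.95,  # assumed default, not taken from train_config.json
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one environment's rollout."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # bootstrap value for the state after the rollout
    running = 0.0            # discounted sum of TD residuals
    for t in reversed(range(len(rewards))):
        # A terminal step cuts both the bootstrap and the accumulated advantage.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
        next_value = values[t]
    # Value targets are the advantages added back onto the value estimates.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With gae_lambda set to 1 the estimate reduces to the discounted return minus the value baseline; lower values trade variance for bias.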

GPU Randomized Environment

Per-step FOV resampling: Each GPU sub-environment redraws per-ray distances every step using an empty/obstacle mask derived from blank_ratio_base plus Gaussian jitter (blank_ratio_randmax, blank_ratio_std_ratio). Empty rays are filled with the full view radius while obstacle rays sample distances.
Gaussian narrow passages (optional): When narrow_passage_gaussian is true, obstacle distances follow a half-Gaussian with std = patch_meters * narrow_passage_std_ratio, producing clustered close obstacles; otherwise distances are uniform within the view radius.
Task points without global maps: Task points are sampled within task_point_max_dist_m and clipped to line of sight (LOS) along the sampled rays; redraw cadence is controlled by task_point_random_interval_max.
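The per-step ray resampling described above can be sketched for a single environment as follows. This is a scalar, CPU-only illustration rather than the batched GPU code in sim_gpu_env.py. The function name sample_rays is hypothetical, the single blank_ratio_std parameter stands in for however blank_ratio_randmax and blank_ratio_std_ratio are actually combined, and treating view_radius as a separate argument from patch_meters is also an assumption.

```python
import random
from typing import List, Optional


def sample_rays(
    n_rays: int,
    view_radius: float,
    patch_meters: float,
    blank_ratio_base: float,
    blank_ratio_std: float,          # hypothetical stand-in for the jitter controls
    narrow_passage_gaussian: bool,
    narrow_passage_std_ratio: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw one fresh set of per-ray distances for a single environment."""
    rng = rng or random.Random()
    # Jitter the empty-ray probability around its base value, clamped to [0, 1].
    blank_ratio = min(1.0, max(0.0, rng.gauss(blank_ratio_base, blank_ratio_std)))
    rays = []
    for _ in range(n_rays):
        if rng.random() < blank_ratio:
            # Empty ray: nothing within view, so report the full view radius.
            rays.append(view_radius)
        elif narrow_passage_gaussian:
            # Half-Gaussian: clusters obstacles close to the robot.
            d = abs(rng.gauss(0.0, patch_meters * narrow_passage_std_ratio))
            rays.append(min(d, view_radius))
        else:
            # Uniform obstacle distance within the view radius.
            rays.append(rng.uniform(0.0, view_radius))
    return rays
```

Because every step draws a new set of distances, consecutive observations are not tied to any persistent map, which is what makes the environment map-free.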

Observation & Action

Observation layout (dimension R + 7): [rays_norm(R), sin_ref, cos_ref, prev_vx/lim, prev_omega/lim, Δvx/(2·lim), Δomega/(2·omega_max), dist_to_task/patch_meters].
Actions are (vx, vy, omega), clipped by limits each step. When only two columns are provided, vy is zeroed inside the environment.
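A minimal sketch of how the R + 7 observation and the clipped action might be assembled is shown below. Several details are assumptions made for illustration: rays are normalized by a view_radius argument, lim is read as the matching per-axis limit (vx_max or omega_max), ref_angle is taken to be the bearing to the task point, a two-column action is interpreted as (vx, omega), and vy_max is a hypothetical limit that does not appear in the config description above.

```python
import math
from typing import List, Sequence, Tuple


def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def build_observation(
    rays: Sequence[float],
    view_radius: float,
    ref_angle: float,                    # assumed: bearing to the task point
    prev_cmd: Tuple[float, float],       # (vx, omega) applied last step
    prev_prev_cmd: Tuple[float, float],  # (vx, omega) applied the step before
    dist_to_task: float,
    patch_meters: float,
    vx_max: float,
    omega_max: float,
) -> List[float]:
    """Return a flat vector of length len(rays) + 7."""
    rays_norm = [clip(d / view_radius, 0.0, 1.0) for d in rays]
    prev_vx, prev_omega = prev_cmd
    d_vx = prev_vx - prev_prev_cmd[0]
    d_omega = prev_omega - prev_prev_cmd[1]
    tail = [
        math.sin(ref_angle),
        math.cos(ref_angle),
        prev_vx / vx_max,
        prev_omega / omega_max,
        d_vx / (2.0 * vx_max),        # a delta spans at most 2 * limit
        d_omega / (2.0 * omega_max),
        dist_to_task / patch_meters,
    ]
    return rays_norm + tail


def clip_action(
    action: Sequence[float], vx_max: float, vy_max: float, omega_max: float
) -> Tuple[float, float, float]:
    """Clip (vx, vy, omega); a two-column action gets vy = 0."""
    if len(action) == 2:
        vx, omega = action
        vy = 0.0
    else:
        vx, vy, omega = action
    return (
        clip(vx, -vx_max, vx_max),
        clip(vy, -vy_max, vy_max),
        clip(omega, -omega_max, omega_max),
    )
```

Dividing the deltas by twice the limit keeps them in [-1, 1], since a command can swing from one limit to the other in a single step.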

GRALP Network (policy/value)

RayEncoder backbone (rl_ppo/encoder.py)
- Ray branch: 1D depthwise-separable convolutions with GELU + squeeze-excite blocks to embed per-ray distances.
- Attention fusion: Multi-query, multi-head attention over the ray features; pose/history MLP provides query bias; outputs [B, num_queries, d_model] plus global averages.
- Fusion head: Concatenates attended rays, global averages, and mean queries → two-layer MLP → 256-d latent.
Policy head (rl_ppo/ppo_models.py)
- Linear map from 256-d latent to mean action, global learnable log_std clamped to [log_std_min, log_std_max].
- Tanh-squashed Gaussian; scaled by per-axis limits; supports evaluation and log-prob correction for PPO.
Value head: Two-layer MLP from the shared latent to a scalar state value.
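The log-prob correction mentioned for the policy head is the standard change-of-variables term for a tanh-squashed Gaussian. The sketch below shows it for one action axis in plain Python. It is not the code in ppo_models.py: the function names and the default log_std bounds are assumptions, and the real model works on batched tensors.

```python
import math
import random
from typing import Optional, Tuple


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_log_prob(u: float, mean: float, log_std: float, limit: float) -> float:
    """Log-density of a = limit * tanh(u) where u ~ N(mean, exp(log_std)^2)."""
    std = math.exp(log_std)
    log_p_u = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # |da/du| = limit * (1 - tanh(u)^2), and
    # log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)) avoids cancellation.
    log_det = math.log(limit) + 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return log_p_u - log_det


def sample_action(
    mean: float,
    log_std: float,
    limit: float,
    log_std_min: float = -5.0,  # assumed bounds, not taken from train_config.json
    log_std_max: float = 2.0,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Sample one squashed, scaled action and its log-probability."""
    rng = rng or random.Random()
    log_std = max(log_std_min, min(log_std_max, log_std))  # clamp as described above
    u = rng.gauss(mean, math.exp(log_std))                 # pre-squash sample
    action = limit * math.tanh(u)                          # bounded to the axis limit
    return action, squashed_log_prob(u, mean, log_std, limit)
```

When evaluating stored actions during PPO updates, the same log-density has to be used for both the old and the new policy, otherwise the probability ratio is biased.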

Reward Highlights (SimRandomGPUBatchEnv)

Progress toward the task point: -Δd / (vx_max · dt), optionally gated by orientation_verify.
Collision penalty: - w_collision * (1 + |v_world| / vx_max) when the traveled path exceeds the available ray distance (>0).
Jerk penalties on vx and omega, saturation penalty w_limits, and time penalty reward_time per step.
collision_done (default true) resets only the collided sub-env; there is no timeout termination.
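To make the progress and collision terms concrete, here is a simplified per-step sketch for one environment. It is an illustration under assumptions, not the reward code in SimRandomGPUBatchEnv: the function name step_reward is hypothetical, the progress term is given unit weight, reward_time is assumed to be a signed (negative) value that is added directly, and the jerk, saturation, and orientation_verify terms are left out.

```python
from typing import Tuple


def step_reward(
    prev_dist: float,      # distance to the task point before the step
    new_dist: float,       # distance to the task point after the step
    speed: float,          # |v_world| during the step
    traveled: float,       # path length covered along the motion direction
    ray_clearance: float,  # free distance reported by the ray in that direction
    vx_max: float,
    dt: float,
    w_collision: float,
    reward_time: float,    # assumed to be negative, e.g. a small per-step cost
) -> Tuple[float, bool]:
    """Return (reward, collided) for a single step."""
    # Progress: -delta_d / (vx_max * dt) is +1 when closing in at full speed.
    delta_d = new_dist - prev_dist
    progress = -delta_d / (vx_max * dt)
    # Collision: triggered when the step overshoots the free ray distance,
    # with a penalty that grows with speed.
    collided = traveled > ray_clearance
    collision = -w_collision * (1.0 + abs(speed) / vx_max) if collided else 0.0
    return progress + collision + reward_time, collided
```

Normalizing progress by vx_max · dt keeps the term on a comparable scale across different speed limits and timesteps.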
