① Advantage 1: it can handle a more diverse range of scenarios. ② Advantage 2: an upstream error does not necessarily corrupt downstream planning; for example, in some Tesla videos the perception is clearly wrong, yet planning is unaffected. ③ Advantage 3: the performance ceiling is high and the model design space is large. For instance, the model can be combined with large foundation models, or with unsupervised training: unsupervised training means the features carry no task bias, and a large enough data volume means the features generalize well. A staged (modular) pipeline, by contrast, is generally trained with supervision; one can of course pretrain a backbone without supervision, but supervised fine-tuning is still required.
1. Classification of end-to-end technical routes and representative works
① Direct end-to-end: no intermediate perception or prediction modules are needed. Examples: MILE, DriveWorld, Dreamer-v1, Dreamer-v2, SEM2, BEVPlanner, TransFuser, DriveTransformer. Intermediate supervision may or may not be used, but in all cases the intermediate modules are gone.
② Modular end-to-end: represented by UniAD; FusionAD, VAD, and GenAD also belong here.
③ Large-language-model route: Drive Like a Human, DriveGPT4, LMDrive, EMMA, Senna. In my view, the success of this line in NLP and multimodal learning makes it instructive for driving.
④ World-model-based route: World Models, Dreamer-v1/v2, SEM2, MILE, DriveWorld; the state-transition models in these works are, in effect, world models. The world models discussed today, such as GAIA-1 and DriveWM, can also be combined with end-to-end models; DriveWM makes a fairly rough first attempt at such a combination. I believe this is the trend and the future.
⑤ Diffusion-based route: DiffusionDrive is a representative example.
By learning paradigm, the field can also be divided into imitation learning and reinforcement learning; the two are not in conflict and can be used together.
The partition above is made purely for convenience of exposition and is for reference only; a different perspective yields a different partition. I believe every research direction has a life of its own and cannot be forced into a fixed set of bins.
Take Drive Like a Human, DriveGPT4, LMDrive, EMMA, and Senna as examples. First, I believe VLMs and LLMs are genuinely useful: their ability to understand and reason about complex scenes is strong, and within autonomous driving, tasks such as trajectory explanation and VQA can probably only be done with VLM-style techniques. But how exactly should they be used: should they replace modular end-to-end systems outright, or be combined with them? I believe it is the latter. VLMs excel at scene understanding and reasoning; in complex scenes where a modular end-to-end model gets stuck, a VLM still generalizes well and retains a basic understanding of the scene. In such scenes, having the VLM issue decision suggestions, or a coarse trajectory, to the modular end-to-end model (or directly to downstream modules) should be very useful.
(1) Dual-stream architectures:
That is, a fast model and a slow model running in parallel. How the two divide the work and interact differs from paper to paper, and those details are open to discussion. Representative works include DriveVLM, LeapAD, and AsyncDriver. On the Road does not build such a system, but its future-work section proposes a dual-stream design in which perception fuses a traditional pipeline with a VLM. Senna does end-to-end planning, and its logic and philosophy are consistent with On the Road. Both On the Road and Senna argue that VLMs are suited to coarse-grained scene understanding and reasoning, and should be combined with task-specific models to strengthen the generalization of those expert models. I strongly agree with this view. A minimal sketch of the fast/slow division of labor follows the paper list below.
2024.03, DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
2024.05, Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving (LeapAD)
2024.06, Asynchronous Large Language Model Enhanced Planner for Autonomous Driving (AsyncDriver). Unlike DriveVLM, the two systems here are fused adaptively, whereas DriveVLM switches between them.
2023.11, On the Road with GPT-4V(ision): Explorations of Utilizing Visual-Language Model as Autonomous Driving Agent. Its conclusion sums it up especially well: VLMs suit coarse-grained scene understanding and reasoning, and can be combined with task-specific (expert) models so that both play to their strengths.
2024.10, Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
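To make the fast/slow division of labor concrete, here is a minimal runnable sketch in Python. It is my own toy construction with hypothetical interfaces (FastPlanner, SlowVLMPlanner), not the architecture of any paper above: the fast planner runs every frame, while the slow VLM stand-in runs asynchronously in a background thread and publishes a coarse trajectory that the fast planner blends in as guidance; this is closer in spirit to AsyncDriver's fusion than to DriveVLM's switching.

```python
import threading
import time
import queue

class FastPlanner:
    """Stand-in for a modular end-to-end planner; runs every frame."""
    def plan(self, obs, guidance=None):
        base = [(i * 1.0, 0.0) for i in range(6)]  # straight 6-waypoint path
        if guidance is not None:
            # Hypothetical fusion rule: average with the VLM's coarse trajectory.
            base = [((x + gx) / 2, (y + gy) / 2)
                    for (x, y), (gx, gy) in zip(base, guidance)]
        return base

class SlowVLMPlanner:
    """Stand-in for a VLM doing coarse scene understanding; runs at low rate."""
    def plan(self, obs):
        time.sleep(0.5)                               # simulate VLM latency
        return [(i * 1.0, 0.2 * i) for i in range(6)]  # coarse "nudge left" plan

def slow_loop(vlm, obs_ref, out_q):
    while True:
        out_q.put(vlm.plan(obs_ref["latest"]))        # publish newest coarse plan

fast, slow = FastPlanner(), SlowVLMPlanner()
obs_ref, guidance_q = {"latest": None}, queue.Queue()
threading.Thread(target=slow_loop, args=(slow, obs_ref, guidance_q),
                 daemon=True).start()

guidance = None
for frame in range(20):                               # fast loop, e.g. 10 Hz
    obs_ref["latest"] = f"frame-{frame}"
    while not guidance_q.empty():                     # take freshest VLM output
        guidance = guidance_q.get()                   # never block on the VLM
    traj = fast.plan(obs_ref["latest"], guidance)
    time.sleep(0.1)
print("final trajectory:", traj)
```

The key design point is that the fast loop never blocks on the VLM: it simply uses the freshest guidance available, so VLM latency degrades guidance freshness rather than control frequency.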
(2) 3D information:
Several works support the view that 3D information is needed. Whether this 3D understanding should come from explicit 3D supervision or from 2D self-supervision (as in DINOv2) is open to discussion. See "Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?", "Language-Image Models with 3D Understanding" (Cube-LLM), and "On the Road with GPT-4V(ision): Explorations of Utilizing Visual-Language Model as Autonomous Driving Agent". The first two support the claim positively, showing that adding 3D helps; the third supports it negatively, showing that without 3D, localization and spatial reasoning are weak. A toy sketch of the 3D-token idea follows the list below.
2024.05, Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?
2024.05, Language-Image Models with 3D Understanding (Cube-LLM)
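To illustrate what "3D tokenization" means at the interface level, here is a toy PyTorch sketch. It is my own simplification with an assumed box encoding and made-up dimensions, not the actual 3D-Tokenized LLM or Cube-LLM architecture: detected 3D boxes are projected into the language model's embedding space and prepended to the text tokens, so the model can attend to explicit geometry when answering driving questions.

```python
import torch
import torch.nn as nn

class ToyDrivingLM(nn.Module):
    """Toy model that consumes 3D object tokens alongside text tokens."""
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)
        # Hypothetical box encoding: (x, y, z, l, w, h, yaw, class_id) -> one "3D token"
        self.box_proj = nn.Linear(8, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, boxes, text_ids):
        box_tok = self.box_proj(boxes)               # (B, N_obj, d)
        txt_tok = self.text_emb(text_ids)            # (B, N_txt, d)
        seq = torch.cat([box_tok, txt_tok], dim=1)   # 3D tokens prefix the prompt
        return self.head(self.encoder(seq))          # logits per position

model = ToyDrivingLM()
boxes = torch.randn(1, 5, 8)                         # 5 detected 3D objects
text = torch.randint(0, 1000, (1, 12))               # a 12-token question
print(model(boxes, text).shape)                      # torch.Size([1, 17, 1000])
```

The point of the interface is that geometry enters as first-class tokens rather than being squeezed through 2D image features, which is what the papers above argue current VLMs lack.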
(3) Summary:
Overall, this route's development is likely to follow two trends: ① forming dual-stream architectures together with non-LLM approaches; ② supplementing 3D information. In addition, both On the Road with GPT-4V and Image Textualization note that current VLM perception of the environment amounts to rather coarse-grained scene understanding. Methods such as Image Textualization are beginning to close this fine-grained gap. This route is worth watching continuously.
2.3 The world-model-based end-to-end route
World models fall into two categories: world models inside end-to-end driving models, and world models for data generation. For the definition of a world model, see: 2018, World Models. A world model must have three properties: prediction, representation, and controllability.
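These three properties map directly onto the components of a latent dynamics model. Below is a toy PyTorch sketch, my own simplification of the Dreamer/MILE family with made-up dimensions: the encoder provides the representation, the action-conditioned transition provides prediction, and conditioning the rollout on an action sequence is what makes the imagined future controllable.

```python
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    """Minimal latent world model: representation + prediction + control."""
    def __init__(self, obs_dim=64, act_dim=2, z_dim=32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, z_dim)       # representation: obs -> latent
        self.transition = nn.Sequential(               # prediction: (z_t, a_t) -> z_{t+1}
            nn.Linear(z_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.decoder = nn.Linear(z_dim, obs_dim)       # reconstruct obs for training

    def imagine(self, obs, actions):
        """Roll forward in latent space; different action sequences yield
        different futures -- this is the 'controllable' property."""
        z = self.encoder(obs)
        futures = []
        for a in actions:                              # actions: list of (B, act_dim)
            z = self.transition(torch.cat([z, a], dim=-1))
            futures.append(self.decoder(z))
        return torch.stack(futures)                    # (T, B, obs_dim)

wm = ToyWorldModel()
obs = torch.randn(4, 64)                               # batch of 4 observations
plan = [torch.randn(4, 2) for _ in range(10)]          # a 10-step action plan
print(wm.imagine(obs, plan).shape)                     # torch.Size([10, 4, 64])
```

Training objectives (reconstruction, reward, and KL terms in Dreamer; segmentation and trajectory losses in MILE) are omitted here; the sketch only shows the rollout structure that all three properties share.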
(MILE) Hu A, Corrado G, Griffiths N, et al. Model-based imitation learning for urban driving[J]. Advances in Neural Information Processing Systems, 2022, 35: 20703-20716.
(DriveWorld) Min C, Zhao D, Xiao L, et al. DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 15522-15533.
(Dreamer-v1) Hafner D, Lillicrap T, Ba J, et al. Dream to control: Learning behaviors by latent imagination[J]. arXiv preprint arXiv:1912.01603, 2019.
(Dreamer-v2) Hafner D, Lillicrap T, Norouzi M, et al. Mastering atari with discrete world models[J]. arXiv preprint arXiv:2010.02193, 2020.
(SEM2) Gao Z, Mu Y, Chen C, et al. Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model[J]. IEEE Transactions on Intelligent Transportation Systems, 2024.
(BEVPlanner) Li Z, Yu Z, Lan S, et al. Is ego status all you need for open-loop end-to-end autonomous driving?[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 14864-14873.
(TransFuser) Chitta K, Prakash A, Jaeger B, et al. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(11): 12878-12895.
(UniAD) Hu Y, Yang J, Chen L, et al. Planning-oriented autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 17853-17862.
(FusionAD) Ye T, Jing W, Hu C, et al. FusionAD: Multi-modality fusion for prediction and planning tasks of autonomous driving[J]. arXiv preprint arXiv:2308.01006, 2023.
(VAD) Jiang B, Chen S, Xu Q, et al. VAD: Vectorized scene representation for efficient autonomous driving[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 8340-8350.
(GenAD) Zheng W, Song R, Guo X, et al. GenAD: Generative end-to-end autonomous driving[C]//European Conference on Computer Vision. Springer, Cham, 2025: 87-104.
(Drive like a human) Fu D, Li X, Wen L, et al. Drive like a human: Rethinking autonomous driving with large language models[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024: 910-919.
(DriveGPT4) Xu Z, Zhang Y, Xie E, et al. DriveGPT4: Interpretable end-to-end autonomous driving via large language model[J]. IEEE Robotics and Automation Letters, 2024.
(LMDrive) Shao H, Hu Y, Wang L, et al. LMDrive: Closed-loop end-to-end driving with large language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 15120-15130.
(EMMA) Hwang J J, Xu R, Lin H, et al. EMMA: End-to-end multimodal model for autonomous driving[J]. arXiv preprint arXiv:2410.23262, 2024.
(Senna) Jiang B, Chen S, Liao B, et al. Senna: Bridging large vision-language models and end-to-end autonomous driving[J]. arXiv preprint arXiv:2410.22313, 2024.
(World Models) Ha D, Schmidhuber J. World models[J]. arXiv preprint arXiv:1803.10122, 2018.
(GAIA-1) Hu A, Russell L, Yeo H, et al. GAIA-1: A generative world model for autonomous driving[J]. arXiv preprint arXiv:2309.17080, 2023.
(DriveWM) Wang Y, He J, Fan L, et al. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 14749-14759.
(DiffusionDrive) Liao B, Chen S, Yin H, et al. DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving[J]. arXiv preprint arXiv:2411.15139, 2024.
(DriveVLM) Tian X, Gu J, Li B, et al. DriveVLM: The convergence of autonomous driving and large vision-language models[J]. arXiv preprint arXiv:2402.12289, 2024.
(LeapAD) Mei J, Ma Y, Yang X, et al. Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving[J]. arXiv preprint arXiv:2405.15324, 2024.
(AsyncDriver) Chen Y, Ding Z, Wang Z, et al. Asynchronous large language model enhanced planner for autonomous driving[C]//European Conference on Computer Vision. Springer, Cham, 2025: 22-38.
(On the road) Wen L, Yang X, Fu D, et al. On the road with GPT-4V(ision): Early explorations of visual-language model on autonomous driving[J]. arXiv preprint arXiv:2311.05332, 2023.
(3D-Tokenized LLM) Bai Y, Wu D, Liu Y, et al. Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?[J]. arXiv preprint arXiv:2405.18361, 2024.
(Cube-LLM) Cho J H, Ivanovic B, Cao Y, et al. Language-Image Models with 3D Understanding[J]. arXiv preprint arXiv:2405.03685, 2024.
(Image Textualization) Pi R, Zhang J, Zhang J, et al. Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions[J]. arXiv preprint arXiv:2406.07502, 2024.
(FIERY) Hu A, Murez Z, Mohan N, et al. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 15273-15282.