I am currently a PhD student at NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA), under the supervision of Prof. Jin Gao and Prof. Weiming Hu. Prior to that, I received my B.E. degree from Tianjin University. I also work closely with Prof. Zhipeng Zhang through ongoing internships at the AI School of Shanghai Jiao Tong University and at KargoBot. My research focuses on perception, scene understanding, and action planning in autonomous driving and embodied intelligence scenarios.
Publications
Motivation: In camera–LiDAR multi-modal 3D detection, current methods struggle to achieve efficiency, long-range modeling, and full retention of scene information at the same time. Method: Inspired by linear-attention mechanisms, we propose the first efficient, set-based camera–LiDAR fusion strategy built on linear attention, balancing speed, range, and completeness. Qualitative and quantitative analyses lead us to adopt the Mamba attention module for our main experiments. By combining Height-Fidelity LiDAR encoding with a Hybrid Mamba Block, we align the two modalities while preserving height cues and capturing both local and global context. Results: On the nuScenes validation set, we achieve 75.0 NDS, surpassing SOTA methods that rely on sampled-resolution inputs while delivering faster inference.
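For readers unfamiliar with the underlying mechanism, here is a minimal sketch of generic linear attention, the family of methods this fusion strategy builds on: replacing softmax attention's O(N²) pairwise interaction with a kernel feature map so cost scales linearly in sequence length. This is an illustrative sketch of the standard technique (using the common elu(x)+1 feature map), not the paper's Hybrid Mamba Block.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic linear attention: O(N * d^2) instead of O(N^2 * d).

    q, k: (B, N, d) queries/keys; v: (B, N, e) values.
    """
    q = F.elu(q) + 1.0  # positive feature map phi(Q)
    k = F.elu(k) + 1.0  # positive feature map phi(K)
    # Accumulate sum_n phi(k_n) v_n^T once, reused by every query.
    kv = torch.einsum("bnd,bne->bde", k, v)
    # Per-query normalizer: phi(q_m) . sum_n phi(k_n)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```

Because the key-value summary `kv` is computed once and shared across queries, long-range token sets (e.g., full-resolution LiDAR and image features) can attend globally without the quadratic cost that forces other methods to downsample.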
Motivation: Existing methods require the teacher and student models to share identical architectures and input formats, which weakens knowledge distillation and under-utilizes temporal information. Method: We introduce the first online asymmetric semi-supervised 3D detection framework, removing the constraint that teacher and student match in structure and input format. An attention-based refinement module is integrated, and past and future temporal cues are exploited in a divide-and-conquer strategy to correct inaccurate detections, missed objects, and false positives. Results: On the Waymo dataset, our approach improves mAP (L1) by 4.7 points over the previous SOTA while requiring fewer training resources.
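To make the "asymmetric" setup concrete, below is a toy sketch of online pseudo-labeling where teacher and student deliberately differ in architecture and input format, so the only coupling is through predictions. The stand-in models, the `to_student_format` adapter, and the confidence threshold are all hypothetical placeholders, not the paper's components.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in practice these would be different 3D detectors
# with different backbones and input formats.
teacher = nn.Linear(16, 8).eval()   # frozen teacher, its own architecture
student = nn.Linear(32, 8)          # student consuming a different input format
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

def to_student_format(x_teacher):
    # Hypothetical adapter between input formats (toy duplication here);
    # asymmetric setups need some mapping, but no weight sharing.
    return torch.cat([x_teacher, x_teacher], dim=-1)

for _ in range(3):                   # toy unlabeled data stream
    x = torch.randn(4, 16)
    with torch.no_grad():
        pseudo = teacher(x)          # online pseudo-labels from the teacher
        keep = pseudo.abs().max(dim=-1).values > 0.5  # confidence filter
    if keep.any():
        pred = student(to_student_format(x[keep]))
        loss = nn.functional.mse_loss(pred, pseudo[keep])
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Since supervision flows only through pseudo-labels, nothing forces the two models to share structure, which is what the framework's asymmetry exploits.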
Motivation: Semi-supervised monocular 3D detection suffers from noisy pseudo-labels and low learning efficiency due to the lack of high-quality unlabeled samples. Method: We introduce an Augment-Criticize strategy that automatically learns image transformations and aggregates their predictions to mine more reliable pseudo-labels. A Critical Retraining Strategy (CRS) dynamically evaluates each pseudo-label's contribution during training to suppress noisy samples. Results: Applied to MonoDLE and MonoFlex, our approach yields significant performance gains, demonstrating its effectiveness and generality.
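As a rough illustration of the aggregation idea, the sketch below mines pseudo-labels by running a model over several augmented views, mapping predictions back to the original frame, and keeping only predictions the views agree on. The function name, the variance-based agreement test, and the threshold are assumptions for illustration; the paper's Augment-Criticize strategy additionally learns the transformations themselves.

```python
import torch

def mine_pseudo_labels(model, image, augs, inv_augs, agree_thresh=0.1):
    """Toy consensus mining over augmented views.

    augs / inv_augs: paired callables applying an image transformation
    and mapping predictions back to the original frame.
    Returns aggregated predictions and a mask of reliable ones.
    """
    views = []
    with torch.no_grad():
        for aug, inv in zip(augs, inv_augs):
            views.append(inv(model(aug(image))))  # back-project each view
    views = torch.stack(views)                    # (num_views, N, d)
    mean, var = views.mean(dim=0), views.var(dim=0)
    mask = var.mean(dim=-1) < agree_thresh        # keep low-variance consensus
    return mean, mask
```

A retraining loop in the spirit of CRS would then weight or drop each mined pseudo-label according to how much it helps the student, rather than trusting all of them equally.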