ActFormer: Scalable Collaborative Perception via Active Queries

May 1, 2024
Suozhi Huang*, Juexiao Zhang, Yiming Li, Chen Feng
Abstract
We present ActFormer, a Transformer that learns bird’s eye view (BEV) representations by using predefined BEV queries to interact with multi-robot, multi-camera inputs. Instead of interacting with all cameras indiscriminately, each BEV query actively selects the cameras relevant to it for information aggregation, based on pose information. Experiments on the V2X-Sim dataset show that ActFormer improves detection performance from 29.89% to 45.15% while using about 50% fewer queries, demonstrating its effectiveness for multi-agent collaborative 3D object detection.
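The abstract does not spell out how a query decides which cameras are relevant. The following is a minimal Python sketch of the general idea of pose-based camera selection, assuming a simplified 2D frustum test (horizontal field of view plus a maximum range) on camera poses already expressed in a shared BEV frame. The names `Camera`, `sees`, `select_cameras`, and `aggregate`, the frustum parameters, and the mean pooling over selected cameras are all hypothetical illustrations; they are not ActFormer's learned selection or attention mechanism.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical camera: pose in a shared BEV frame plus an assumed 2D frustum."""
    agent_id: int     # robot that carries this camera
    x: float          # camera position (metres) in the shared BEV frame
    y: float
    yaw: float        # optical-axis heading (radians) in the shared BEV frame
    hfov: float       # horizontal field of view (radians)
    max_range: float  # assumed useful viewing distance (metres)


def sees(cam: Camera, qx: float, qy: float) -> bool:
    """True if the BEV query location (qx, qy) lies inside the camera's 2D frustum."""
    dx, dy = qx - cam.x, qy - cam.y
    dist = math.hypot(dx, dy)
    if dist > cam.max_range:
        return False
    if dist == 0.0:
        return True
    # Angle between the optical axis and the direction to the query, wrapped to [-pi, pi].
    bearing = math.atan2(dy, dx) - cam.yaw
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))
    return abs(bearing) <= cam.hfov / 2


def select_cameras(queries, cameras):
    """Map each BEV query index to the indices of cameras whose pose lets them observe it."""
    return {
        i: [j for j, cam in enumerate(cameras) if sees(cam, qx, qy)]
        for i, (qx, qy) in enumerate(queries)
    }


def aggregate(queries, cameras, features):
    """Mean-pool per-camera features over the selected cameras only.

    features[j] is a hypothetical feature vector for camera j. Queries that no
    camera observes get None instead of attending to every camera.
    """
    selection = select_cameras(queries, cameras)
    out = []
    for i in range(len(queries)):
        chosen = selection[i]
        if not chosen:
            out.append(None)
            continue
        dim = len(features[chosen[0]])
        out.append([sum(features[j][k] for j in chosen) / len(chosen) for k in range(dim)])
    return out
```

For example, with one camera at the origin facing +x and a second robot's camera at (20, 0) facing -x (both with a 90° field of view and 50 m range), the query at (10, 0) is matched to both cameras, the query at (-10, 0) only to the second, and the query at (0, 30) to neither. Only 3 of the 6 possible query-camera pairs are evaluated, which is the kind of saving that selective interaction aims for compared with dense all-camera attention.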
Type
Publication
In ICRA 2024