In the past few years, foundation models and generative-AI models — particularly large language models (LLMs) — have become a major topic of AI research. That's true even in the field of computer vision, with its increased focus on vision-language models that couple LLMs with image encoders.
This shift can be seen in the topics of the Amazon papers accepted to this year’s Computer Vision and Pattern Recognition Conference (CVPR 2024). A plurality of the papers deal with vision-language models, while a number of others concern related topics such as visual question answering, hallucination mitigation, and retrieval-augmented generation. At the same time, however, classical computer vision topics such as 3-D reconstruction, object tracking, and pose estimation remain well represented.
3-D reconstruction
No more ambiguity in 360° room layout via bi-layout estimation
Yu-Ju Tsai, Jin-Cheng Jhang, Jingjing Zheng, Wei Wang, Albert Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang
ViewFusion: Towards multi-view consistency via interpolated denoising
Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, Anton van den Hengel
Algorithmic information theory
Interpretable measures of conceptual similarity by complexity-constrained descriptive auto-encoding
Alessandro Achille, Greg Ver Steeg, Tian Yu Liu, Matthew Trager, Carson Klingenberg, Stefano Soatto
Geospatial analysis
Bridging remote sensors with multisensor geospatial foundation models
Boran Han, Shuai Zhang, Xingjian Shi, Markus Reichstein
Hallucination mitigation
Multi-modal hallucination control by visual information grounding
Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto
THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models
Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, C. J. Taylor, Stefano Soatto
Metric learning
Learning for transductive threshold calibration in open-world recognition
Qin Zhang, Dongsheng An, Tianjun Xiao, Tong He, Qingming Tang, Ying Nian Wu, Joe Tighe, Yifan Xing, Stefano Soatto
Model robustness
GDA: Generalized diffusion for robust test-time adaptation
Yun Yun Tsai, Fu-Chen Chen, Albert Chen, Junfeng Yang, Che-Chun Su, Min Sun, Cheng-Hao Kuo
Object-centric learning
Adaptive slot attention: Object discovery with dynamic slot number
Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, Zheng Zhang
Object tracking
Self-supervised multi-object tracking with path consistency
Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo
Pose estimation
MRC-Net: 6-DoF pose estimation with multiscale residual correlation
Yuelong Li, Yafei Mao, Raja Bala, Sunil Hadap
Responsible AI
FairRAG: Fair human generation via fair retrieval augmentation
Robik Shrestha, Yang Zou, James Chen, Zhiheng Li, Yusheng Xie, Tiffany Deng
Retrieval-augmented generation
CPR: Retrieval augmented generation for copyright protection
Aditya Golatkar, Alessandro Achille, Luca Zancato, Yu-Xiang Wang, Ashwin Swaminathan, Stefano Soatto
Security
Sharpness-aware optimization for real-world adversarial attacks for diverse compute platforms with enhanced transferability
Muchao Ye, Xiang Xu, Qin Zhang, Jon Wu
Video-language models
VidLA: Video-language alignment at scale
Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi
Vision-language models
Accept the modality gap: An exploration in the hyperbolic space
Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, Ajanthan Thalaiyasingam
Enhancing vision-language pre-training with rich supervisions
Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
GROUNDHOG: Grounding large language models to holistic segmentation
Yichi Zhang, Martin Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi (QZ) Gao, Joyce Chai
Hyperbolic learning with synthetic captions for open-world detection
Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo
Non-autoregressive sequence-to-sequence vision-language models
Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto
On the scalability of diffusion-based text-to-image generation
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto
Visual question answering
GRAM: Global reasoning for multi-page VQA
Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman
Question aware vision transformer for multimodal reasoning
Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman
Synthesize step-by-step: Tools, templates and LLMs as data generators for reasoning-based chart VQA
Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar