1 code implementation • 1 May 2025 • Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand
To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models.
1 code implementation • 12 Dec 2024 • Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand
Our dataset Neptune covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning.
Ranked #1 on Multiple-choice on Neptune-Full
no code implementations • Neural Information Processing Systems 2024 • Nitesh Bharadwaj Gundavarapu, Luke Friedman, Raghav Goyal, Chaitra Hegde, Eirikur Agustsson, Sagar M. Waghmare, Mikhail Sirotenko, Ming-Hsuan Yang, Tobias Weyand, Boqing Gong, Leonid Sigal
Nevertheless, the majority of prior works that leverage MAE pre-training have focused on relatively short video representations (16 or 32 frames in length), largely because hardware memory and compute requirements scale poorly with video length under dense, memory-intensive self-attention decoding.
Ranked #1 on Action Recognition on Diving-48 (using extra training data)
no code implementations • 20 Feb 2024 • Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
1 code implementation • 6 Jul 2023 • Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks.
no code implementations • 2 Jun 2022 • Zu Kim, André Araujo, Bingyi Cao, Cam Askew, Jack Sim, Mike Green, N'Mah Fodiatu Yilla, Tobias Weyand
We showcase its application to the landmark recognition domain, presenting a detailed analysis and the final fairer landmark rankings.
no code implementations • 19 Aug 2021 • Zu Kim, André Araujo, Bingyi Cao, Cam Askew, Jack Sim, Mike Green, N'Mah Fodiatu Yilla, Tobias Weyand
To create a more comprehensive and equitable dataset, we start by defining the fair relevance of a landmark to the world population.
1 code implementation • CVPR 2021 • Quin Thames, Arjun Karpur, Wade Norris, Fangting Xia, Liviu Panait, Tobias Weyand, Jack Sim
Understanding the nutritional content of food from visual data is a challenging computer vision problem, with the potential to have a positive and widespread impact on public health.
1 code implementation • CVPR 2020 • Tobias Weyand, Andre Araujo, Bingyi Cao, Jack Sim
GLDv2 is the largest such dataset to date by a large margin, including over 5M images and 200k distinct instance labels.
5 code implementations • 3 Apr 2020 • Tobias Weyand, Andre Araujo, Bingyi Cao, Jack Sim
GLDv2 is the largest such dataset to date by a large margin, including over 5M images and 200k distinct instance labels.
Ranked #1 on Landmark Recognition on Google Landmarks Dataset v2 (recognition, validation) (using extra training data)
no code implementations • ECCV 2018 • Paul Hongsuck Seo, Tobias Weyand, Jack Sim, Bohyung Han
Image geolocalization is the task of identifying the location depicted in a photo based only on its visual information.
Ranked #1 on Photo geolocation estimation on Im2GPS (Reference images metric)
160 code implementations • 17 Apr 2017 • Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam
We present a class of efficient models called MobileNets for mobile and embedded vision applications.
Ranked #1018 on Image Classification on ImageNet
13 code implementations • ICCV 2017 • Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, Bohyung Han
We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature).
Ranked #2 on Image Retrieval on Oxf105k
1 code implementation • 17 Feb 2016 • Tobias Weyand, Ilya Kostrikov, James Philbin
Is it possible to build a system to determine the location where a photo was taken using just its pixels?
Ranked #1 on Photo geolocation estimation on Im2GPS (Reference images metric)
no code implementations • 18 Sep 2014 • Tobias Weyand, Bastian Leibe
We evaluate how different choices of methods and parameters for the individual pipeline steps affect overall system performance and examine their effects for different query categories such as buildings, paintings or sculptures.