no code implementations • 30 Sep 2024 • Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, BoWen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, ZiRui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.
Ranked #39 on Visual Question Answering on MM-Vet
no code implementations • 9 Aug 2024 • Ning Li, Huaikang Zhou, Mingze Xu
This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations.
1 code implementation • 22 Jul 2024 • Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding.
no code implementations • 20 Sep 2023 • Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo
It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.
no code implementations • ICCV 2023 • Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo
It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in the wild.
Ranked #3 on Human Interaction Recognition on NTU RGB+D
1 code implementation • 30 Sep 2022 • Jun Fang, Mingze Xu, Hao Chen, Bing Shuai, Zhuowen Tu, Joseph Tighe
In this paper, we provide an in-depth study of Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks.
1 code implementation • CVPR 2022 • Feng Cheng, Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Li, Wei Xia
We propose a memory efficient method, named Stochastic Backpropagation (SBP), for training deep neural networks on videos.
no code implementations • CVPR 2022 • Jiarui Cai, Mingze Xu, Wei Li, Yuanjun Xiong, Wei Xia, Zhuowen Tu, Stefano Soatto
We propose an online tracking algorithm that performs the object detection and data association under a common framework, capable of linking objects after a long time span.
2 code implementations • NeurIPS 2021 • Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, Stefano Soatto
We present Long Short-term TRansformer (LSTR), a temporal modeling algorithm for online action detection, which employs a long- and short-term memory mechanism to model prolonged sequence data.
Ranked #3 on Online Action Detection on TVSeries
no code implementations • 6 Jul 2021 • Wei Li, Yuanjun Xiong, Shuo Yang, Mingze Xu, Yongxin Wang, Wei Xia
We design a new instance-to-track matching objective to learn an appearance embedding that compares a candidate detection to the embeddings of the tracks persisted in the tracker.
1 code implementation • CVPR 2022 • Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Bing Shuai, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe
We propose TubeR: a simple solution for spatio-temporal video action detection.
1 code implementation • 25 Mar 2021 • Chuhua Wang, Yuchen Wang, Mingze Xu, David J. Crandall
We propose to predict the future trajectories of observed agents (e.g., pedestrians or vehicles) by estimating and using their goals at multiple time scales.
Ranked #1 on Trajectory Prediction on HEV-I
1 code implementation • ICCV 2021 • Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, Wei Xia
We propose a new method to detect deepfake images using the cue of the source feature inconsistency within the forged images.
no code implementations • 8 Oct 2020 • Yuchen Wang, Mingze Xu, John Paden, Lora Koenig, Geoffrey Fox, David Crandall
Understanding the structure of Earth's polar ice sheets is important for modeling how global warming will impact polar ice and, in turn, the Earth's climate.
3 code implementations • 6 Apr 2020 • Yu Yao, Xizi Wang, Mingze Xu, Zelin Pu, Ella Atkins, David Crandall
A new spatial-temporal area under curve (STAUC) evaluation metric is proposed and used with DoTA.
no code implementations • 9 Apr 2019 • Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David Crandall, Devi Parikh, Dhruv Batra
Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded.
no code implementations • ICCV 2019 • Mingfei Gao, Mingze Xu, Larry S. Davis, Richard Socher, Caiming Xiong
We propose StartNet to address Online Detection of Action Start (ODAS) where action starts and their associated categories are detected in untrimmed, streaming videos.
3 code implementations • 2 Mar 2019 • Yu Yao, Mingze Xu, Yuchen Wang, David J. Crandall, Ella M. Atkins
Recognizing abnormal events such as traffic violations and accidents in natural driving scenes is essential for successful autonomous driving and advanced driver assistance systems.
Ranked #1 on Traffic Accident Detection on A3D
2 code implementations • ICCV 2019 • Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S. Davis, David J. Crandall
Most work on temporal action detection is formulated as an offline problem, in which the start and end times of actions are determined after the entire video is fully observed.
Ranked #12 on Online Action Detection on TVSeries
2 code implementations • 19 Sep 2018 • Yu Yao, Mingze Xu, Chiho Choi, David J. Crandall, Ella M. Atkins, Behzad Dariush
Predicting the future location of vehicles is essential for safety-critical applications such as advanced driver assistance systems (ADAS) and autonomous driving.
no code implementations • ECCV 2018 • Mingze Xu, Chenyou Fan, Yuchen Wang, Michael S. Ryoo, David J. Crandall
In this paper, we wish to solve two specific problems: (1) given two or more synchronized third-person videos of a scene, produce a pixel-level segmentation of each visible person and identify corresponding people across different views (i.e., determine who in camera A corresponds with whom in camera B), and (2) given one or more synchronized third-person videos as well as a first-person video taken by a mobile or wearable camera, segment and identify the camera wearer in the third-person videos.
1 code implementation • 11 Jan 2018 • Mingze Xu, Chenyou Fan, John D Paden, Geoffrey C. Fox, David J. Crandall
Deep learning methods have surpassed the performance of traditional techniques on a wide range of problems in computer vision, but nearly all of this work has studied consumer photos, where precisely correct output is often not critical.
no code implementations • 11 Jan 2018 • Mingze Xu, Aidean Sharghi, Xin Chen, David J. Crandall
A major emerging challenge is how to protect people's privacy as cameras and computer vision are increasingly integrated into our daily lives, including in smart devices inside homes.
no code implementations • 21 Dec 2017 • Mingze Xu, David J. Crandall, Geoffrey C. Fox, John D Paden
Ground-penetrating radar on planes and satellites now makes it practical to collect 3D observations of the subsurface structure of the polar ice sheets, providing crucial data for understanding and tracking global climate change.
no code implementations • CVPR 2017 • Chenyou Fan, Jang-Won Lee, Mingze Xu, Krishna Kumar Singh, Yong Jae Lee, David J. Crandall, Michael S. Ryoo
We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene.