no code implementations • 11 Feb 2025 • Xiaoyu Yang, Jie Lu, En Yu
The evolution of large-scale contrastive pre-training propelled by top-tier datasets has reached a transition point in the scaling law.
no code implementations • 5 Feb 2025 • Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang
This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs).
no code implementations • 3 Feb 2025 • Tianlin Zhang, En Yu, Yi Shao, Shuai Li, Sujuan Hou, Jiande Sun
Multimodal fake news detection has garnered significant attention due to its profound implications for social security.
1 code implementation • 23 Dec 2024 • Sijia Chen, En Yu, Wenbing Tao
It introduces a cross-view design to obtain object appearances from multiple viewpoints, avoiding the problem of invisible object appearances in the RMOT task.
1 code implementation • 12 Dec 2024 • Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning.
1 code implementation • 22 May 2024 • Xiaoyu Yang, Jie Lu, En Yu
This mainly includes gradual drift due to long-tailed data and sudden drift from Out-Of-Distribution (OOD) data, both of which have increasingly drawn the attention of the research community.
1 code implementation • CVPR 2024 • Sijia Chen, En Yu, Jinyang Li, Wenbing Tao
In this study, we pioneer an exploration into the distribution patterns of tracking data and identify a pronounced long-tail distribution issue within existing MOT datasets.
no code implementations • 23 Jan 2024 • Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang
In Vary-toy, we introduce an improved vision vocabulary, allowing the model not only to possess all features of Vary but also to exhibit greater generality.
Ranked #209 on Visual Question Answering on MM-Vet
no code implementations • 17 Dec 2023 • En Yu, Jie Lu, Bin Zhang, Guangquan Zhang
Specifically, OBAL operates in a dual-phase mechanism. In the first phase, we design an Adaptive COvariate Shift Adaptation (AdaCOSA) algorithm that constructs an initialized ensemble model from archived data of various source streams, mitigating covariate shift while learning the dynamic correlations via an adaptive re-weighting strategy.
no code implementations • 30 Nov 2023 • En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them.
Ranked #160 on Visual Question Answering on MM-Vet
no code implementations • 18 Jul 2023 • Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, HongYu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
Based on precise referring instructions, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity, including mouse clicks, drag-and-drop, and drawing boxes, providing a more flexible and seamless interactive experience.
no code implementations • 18 Jul 2023 • Zhuoling Li, Chunrui Han, Zheng Ge, Jinrong Yang, En Yu, Haoqian Wang, Hengshuang Zhao, Xiangyu Zhang
Besides, GroupLane with ResNet18 still surpasses PersFormer by a 4.9% F1 score, while its inference speed is nearly 7x faster and its FLOPs are only 13.3% of PersFormer's.
no code implementations • 23 May 2023 • En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao
Although end-to-end multi-object trackers like MOTR enjoy the merits of simplicity, they suffer severely from the conflict between detection and association, resulting in unsatisfactory convergence dynamics.
no code implementations • 3 Dec 2022 • En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong Han, Wenbing Tao
VLM joins the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions, which are domain-invariant across different tracking scenes.
no code implementations • 1 Sep 2022 • Pan Wang, Liangliang Ren, Shengkai Wu, Jinrong Yang, En Yu, Hangcheng Yu, Xiaoping Li
Point-cloud-based 3D single object tracking has drawn increasing attention.
no code implementations • 23 Aug 2022 • Jinrong Yang, En Yu, Zeming Li, Xiaoping Li, Wenbing Tao
Recent advanced works generally employ a series of object attributes, e.g., position, size, velocity, and appearance, to provide clues for association in 3D MOT.
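As a minimal sketch of how such attribute clues can be combined into an association cost, the snippet below mixes position and velocity distances with hypothetical weights (`w_pos`, `w_vel`); the dictionary keys and weighting scheme are illustrative assumptions, not the method of the paper above.

```python
import math

def pairwise_cost(track, det, w_pos=1.0, w_vel=0.5):
    """Illustrative 3D MOT association cost combining two attribute
    clues: Euclidean distance in position and in velocity. The weights
    are hypothetical and would normally be tuned per dataset."""
    dp = math.dist(track["pos"], det["pos"])  # position clue
    dv = math.dist(track["vel"], det["vel"])  # velocity clue
    return w_pos * dp + w_vel * dv
```

In a full tracker this scalar cost would feed a matching step (e.g. Hungarian assignment) over all track-detection pairs.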
no code implementations • 8 Jun 2022 • Zhuoling Li, Chuanrui Zhang, En Yu, Haoqian Wang
(2) Combining depth estimation and 2D object detection is a promising M3OD pre-training baseline.
no code implementations • CVPR 2022 • En Yu, Zhuoling Li, Shoudong Han
To this end, we propose a strategy, namely multi-view trajectory contrastive learning, in which each trajectory is represented as a center vector.
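The idea of representing a trajectory as a center vector can be sketched as follows: maintain each trajectory's center as a moving average of its per-frame detection embeddings, and train with an InfoNCE-style loss that pulls a detection toward its own center and away from the others. The momentum update and temperature value here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def update_center(center, embedding, momentum=0.9):
    """Update a trajectory's center vector as an exponential moving
    average of its detection embeddings (hypothetical scheme),
    re-normalized onto the unit sphere."""
    center = momentum * center + (1.0 - momentum) * embedding
    return center / np.linalg.norm(center)

def contrastive_loss(embedding, centers, target_id, temperature=0.1):
    """InfoNCE-style loss: the detection embedding should be most
    similar to its own trajectory center (row `target_id`)."""
    logits = centers @ embedding / temperature  # similarities to all centers
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_id])
```

Embeddings and centers are assumed to be L2-normalized so the dot product acts as cosine similarity.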
no code implementations • 10 May 2021 • En Yu, Zhuoling Li, Shoudong Han, Hongwei Wang
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID).
no code implementations • 10 Sep 2020 • Shoudong Han, Piao Huang, Hongwei Wang, En Yu, Donghaisheng Liu, Xiaofeng Pan, Jun Zhao
Modern multi-object tracking (MOT) systems usually model the trajectories by associating per-frame detections.
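The association step mentioned above can be sketched with a minimal greedy IoU matcher; this is a generic tracking-by-detection baseline under assumed `(x1, y1, x2, y2)` box coordinates, not the specific method of the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily match existing tracks to new detections, highest IoU
    first, skipping pairs whose overlap falls below the threshold."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)),
                   reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    return matches
```

Unmatched tracks and detections would then be handled by track termination and initialization logic, respectively.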
no code implementations • 16 Mar 2020 • Piao Huang, Shoudong Han, Jun Zhao, Donghaisheng Liu, Hongwei Wang, En Yu, Alex ChiChung Kot
Modern multi-object tracking (MOT) systems usually involve separate modules, such as a motion model for localization and an appearance model for data association.
no code implementations • 25 Apr 2019 • Li Wang, Lei Zhu, En Yu, Jiande Sun, Huaxiang Zhang
Deep hashing has recently received attention in cross-modal retrieval for its impressive advantages.