no code implementations • 30 May 2025 • Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan
With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
no code implementations • 26 May 2025 • Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, HuiLing Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, Yunhe Wang
While structured pruning offers a promising avenue for model compression, existing methods often struggle with the detrimental effects of aggressive, simultaneous width and depth reductions, leading to substantial performance degradation.
no code implementations • 21 May 2025 • Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks.
no code implementations • 30 Apr 2025 • Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent.
no code implementations • 17 Apr 2025 • Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li
Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks.
no code implementations • 19 Feb 2025 • Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Shengen Yan, Guohao Dai, Yu Wang
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni).
1 code implementation • 24 Dec 2024 • Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding
In this paper, we explore the idea that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), which may be a key barrier to zero-shot task generalization.
no code implementations • 20 Dec 2024 • Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as a controller to call external tools, providing a feasible way to solve practical tasks.
no code implementations • 21 Oct 2024 • Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen
In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pretraining to downstream tasks.
no code implementations • 16 Jul 2024 • Pengxiang Li, Zhi Gao, Bofei Zhang, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, Qing Li
Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction.
1 code implementation • 6 Feb 2024 • Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang
In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation.
1 code implementation • EMNLP 2020 • Liang Qiu, Yizhou Zhao, Weiyan Shi, Yuan Liang, Feng Shi, Tao Yuan, Zhou Yu, Song-Chun Zhu
Inducing a meaningful structural representation from one or a set of dialogues is a crucial but challenging task in computational linguistics.
no code implementations • 25 Apr 2020 • Tao Yuan, Hangxin Liu, Lifeng Fan, Zilong Zheng, Tao Gao, Yixin Zhu, Song-Chun Zhu
Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs.
no code implementations • NeurIPS 2019 • Siyuan Huang, Yixin Chen, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
Detecting 3D objects from a single RGB image is intrinsically ambiguous, thus requiring appropriate prior knowledge and intermediate representations as constraints to reduce the uncertainties and improve the consistencies between the 2D image plane and the 3D world coordinate.
Ranked #2 on Monocular 3D Object Detection on SUN RGB-D (AP@0.15 (10 / PNet-30) metric)
no code implementations • ICCV 2019 • Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, Song-Chun Zhu
We propose a new 3D holistic++ scene understanding problem, which jointly tackles two tasks from a single-view image: (i) holistic scene parsing and reconstruction (3D estimations of object bounding boxes, camera pose, and room layout), and (ii) 3D human pose estimation.
3D Human Pose Estimation
Human-Object Interaction Detection
no code implementations • 25 Jul 2019 • Feng Shi, Ziheng Xu, Tao Yuan, Song-Chun Zhu
In this work, we propose a Highly Untangled Generative-model Engine for Edge-computing or HUGE2 for accelerating these two special convolutions on the edge-computing platform by decomposing the kernels and untangling these smaller convolutions by performing basic matrix multiplications.
no code implementations • 16 Sep 2017 • Hang Qi, Yuanlu Xu, Tao Yuan, Tianfu Wu, Song-Chun Zhu
The proposed joint parsing framework represents such correlations and constraints explicitly and generates semantic scene-centric parse graphs.