no code implementations • 6 Dec 2024 • Yongxin Wang, Meng Cao, Haokun Lin, Mingfei Han, Liang Ma, Jin Jiang, Yuhao Cheng, Xiaodan Liang
Remarkably, EACO also shows promising critique ability in open-source MLLMs, demonstrating that EACO is a viable path to boosting the competence of MLLMs.
1 code implementation • 28 Jun 2024 • Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen
To address this problem, we propose $\texttt{Web2Code}$, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs.
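To make the dataset/benchmark pairing more concrete, here is a rough, hypothetical sketch of what one webpage-to-code instruction-tuning sample might look like; the field names and the chat-style layout are illustrative assumptions, not the released Web2Code format.

```python
# Hypothetical schema for one webpage-to-code instruction-tuning sample.
# Field names are illustrative assumptions, not the released Web2Code format.
sample = {
    "image": "renders/page_000001.png",          # rendered webpage screenshot
    "instruction": "Generate the HTML code that reproduces this webpage.",
    "response": "<!DOCTYPE html>\n<html>\n  <body>\n    <h1>Hello</h1>\n  </body>\n</html>",
}

def to_conversation(s):
    """Convert a sample into the chat-style pair many MLLM trainers expect."""
    return [
        {"role": "user", "content": [{"type": "image", "path": s["image"]},
                                     {"type": "text", "text": s["instruction"]}]},
        {"role": "assistant", "content": s["response"]},
    ]

print(to_conversation(sample)[0]["content"][1]["text"])
```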
no code implementations • 27 Oct 2022 • Jiale Liu, Yu-Wei Zhan, Xin Luo, Zhen-Duo Chen, Yongxin Wang, Xin-Shun Xu
Moreover, due to statistical heterogeneity, model heterogeneity, and the constraint that every client must accept the same parameters, applying federated learning to cross-modal hash learning becomes very challenging.
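To see why the "same parameters for every client" constraint is restrictive, here is a minimal federated-averaging sketch in plain NumPy. This is the generic FedAvg baseline, not the paper's method; the size-weighted average and the toy parameter shapes are assumptions.

```python
import numpy as np

# Minimal FedAvg sketch (not the paper's method): every client is forced to
# accept the same averaged parameters, which is exactly what becomes
# problematic under statistical and model heterogeneity.
def fedavg(client_params, client_sizes):
    """Weighted average of per-client parameter dicts with identical shapes."""
    total = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(client_sizes, client_params)) / total
            for k in keys}

clients = [{"W": np.random.randn(8, 4)} for _ in range(3)]   # toy hash projections
global_params = fedavg(clients, client_sizes=[100, 50, 25])
print(global_params["W"].shape)  # (8, 4) -- one model shape imposed on all clients
```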
no code implementations • 12 Apr 2022 • Yu-Wei Zhan, Xin Luo, Yongxin Wang, Zhen-Duo Chen, Xin-Shun Xu
To narrow the domain gap between sketches and images, we extract edge maps from natural images and treat them as a bridge between the two domains, since edge maps share content with images and style with sketches.
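As a rough illustration of the edge-map bridge, the snippet below converts a toy natural image into an edge map; the Canny detector is an assumed choice, since the excerpt does not name a specific edge extractor.

```python
import numpy as np
import cv2  # OpenCV

# Sketch of the "edge map as a bridge" idea: natural images are converted to
# edge maps whose content matches the photo but whose style resembles a sketch.
def to_edge_map(image_bgr, low=100, high=200):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, low, high)   # binary-like edge image, sketch-style

# Toy image: a white rectangle on a black background stands in for a photo.
img = np.zeros((128, 128, 3), dtype=np.uint8)
cv2.rectangle(img, (32, 32), (96, 96), (255, 255, 255), thickness=-1)
edges = to_edge_map(img)
print(edges.shape, edges.max())   # (128, 128) 255
```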
no code implementations • 24 Mar 2022 • Zi-Chao Zhang, Zhen-Duo Chen, Yongxin Wang, Xin Luo, Xin-Shun Xu
Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC). These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks. However, there are some limitations when applying ViT directly to FGVC. First, ViT needs to split images into patches and compute attention between every pair of patches, which can lead to heavily redundant computation and unsatisfactory performance on fine-grained images with complex backgrounds and small objects. Second, a standard ViT only utilizes the class token in the final layer for classification, which is not enough to extract comprehensive fine-grained information.
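A quick back-of-the-envelope calculation makes the first limitation concrete: the number of pairwise attention entries grows quadratically with the patch count, so higher-resolution fine-grained images become expensive even though many patches only cover background. The 16x16 patch size below is an assumption.

```python
# Back-of-the-envelope view of the first limitation: splitting an image into
# patches makes self-attention cost grow quadratically with the patch count.
def attention_pairs(image_size, patch_size):
    n = (image_size // patch_size) ** 2 + 1   # patches + class token
    return n, n * n                           # tokens, pairwise attention entries

for size in (224, 448):
    tokens, pairs = attention_pairs(size, 16)
    print(f"{size}x{size} image -> {tokens} tokens, {pairs:,} attention pairs per head")
# 224x224 -> 197 tokens, 38,809 pairs; 448x448 -> 785 tokens, 616,225 pairs
```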
no code implementations • 6 Jul 2021 • Wei Li, Yuanjun Xiong, Shuo Yang, Mingze Xu, Yongxin Wang, Wei Xia
We design a new instance-to-track matching objective that learns appearance embeddings by comparing a candidate detection to the embeddings of the tracks persisted in the tracker.
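A hedged sketch of what such an instance-to-track matching objective could look like: each detection embedding is scored against the embeddings of the persisted tracks and trained with a softmax cross-entropy loss. The cosine-similarity formulation and the temperature are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

# Sketch of an instance-to-track matching loss: each candidate detection
# embedding is classified against the embeddings of currently persisted tracks.
def instance_to_track_loss(det_emb, track_emb, target_track, temperature=0.1):
    det = F.normalize(det_emb, dim=-1)          # (N_det, D)
    trk = F.normalize(track_emb, dim=-1)        # (N_trk, D)
    logits = det @ trk.t() / temperature        # (N_det, N_trk) cosine similarities
    return F.cross_entropy(logits, target_track)

det_emb = torch.randn(4, 128)
track_emb = torch.randn(6, 128)
target = torch.tensor([0, 2, 5, 1])             # ground-truth track index per detection
print(instance_to_track_loss(det_emb, track_emb, target).item())
```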
3 code implementations • ICCV 2021 • Yifan Xing, Tong He, Tianjun Xiao, Yongxin Wang, Yuanjun Xiong, Wei Xia, David Wipf, Zheng Zhang, Stefano Soatto
Our hierarchical GNN uses a novel approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level.
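The merge step can be pictured with a small union-find: edges the model predicts as "same cluster" join nodes into connected components, and each component becomes one node of the next-level graph. This is only an illustration of the component-merging idea, not the paper's exact procedure.

```python
# Sketch of the level-to-level merge: nodes joined by predicted "link" edges are
# collapsed into connected components, and each component becomes one node of
# the next, coarser graph. A plain union-find stands in for the GNN's decisions.
def merge_components(num_nodes, link_edges):
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for u, v in link_edges:                 # edges predicted as "same cluster"
        parent[find(u)] = find(v)

    roots = {find(i) for i in range(num_nodes)}
    relabel = {r: k for k, r in enumerate(sorted(roots))}
    return [relabel[find(i)] for i in range(num_nodes)]   # next-level node id per node

print(merge_components(6, [(0, 1), (1, 2), (4, 5)]))       # [0, 0, 0, 1, 2, 2]
```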
1 code implementation • NAACL 2021 • Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, Louis-Philippe Morency
Human communication is multimodal in nature; it is through multiple modalities such as language, voice, and facial expressions that opinions and emotions are expressed.
no code implementations • 16 Sep 2020 • Yu-Wei Zhan, Xin Luo, Yu Sun, Yongxin Wang, Zhen-Duo Chen, Xin-Shun Xu
However, existing hashing methods for social image retrieval operate in batch mode, which violates the nature of social images, i.e., social images are usually generated periodically or collected in a streaming fashion.
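To illustrate the batch-versus-stream distinction, the sketch below maintains running statistics over arriving chunks and re-derives PCA-style hash functions from them instead of retraining on the full dataset. This is a generic online-hashing illustration under those assumptions, not the paper's algorithm.

```python
import numpy as np

# Illustrative online-hashing sketch: statistics are updated chunk by chunk as
# images stream in, and hash functions are re-derived from the running
# mean/covariance instead of re-training on the full batch.
class StreamingPCAHash:
    def __init__(self, dim, bits):
        self.n, self.bits = 0, bits
        self.mean = np.zeros(dim)
        self.cov = np.zeros((dim, dim))

    def update(self, chunk):                     # chunk: (m, dim) new image features
        for x in chunk:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.cov += np.outer(delta, x - self.mean)   # Welford-style covariance

    def hash(self, feats):
        _, vecs = np.linalg.eigh(self.cov / max(self.n - 1, 1))
        proj = vecs[:, -self.bits:]              # top principal directions
        return ((feats - self.mean) @ proj > 0).astype(np.uint8)

hasher = StreamingPCAHash(dim=16, bits=8)
for _ in range(5):                               # five arriving chunks of the stream
    hasher.update(np.random.randn(100, 16))
print(hasher.hash(np.random.randn(3, 16)))
```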
no code implementations • 20 Aug 2020 • Xinshuo Weng, Yongxin Wang, Yunze Man, Kris Kitani
3D Multi-object tracking (MOT) is crucial to autonomous systems.
no code implementations • 7 Jul 2020 • Jianing Yang, Yuying Zhu, Yongxin Wang, Ruitao Yi, Amir Zadeh, Louis-Philippe Morency
In this paper, we analyze QA biases in popular video question answering datasets and discover that pretrained language models can answer 37-48% of questions correctly without using any multimodal context, far exceeding the 20% random-guess baseline for 5-choose-1 multiple-choice questions.
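The bias probe can be pictured as scoring a question-only predictor against the 1/K chance baseline; the stand-in predictor below is hypothetical (a real pretrained LM would replace it), and the toy data exists only so the snippet runs.

```python
import random

# Sketch of the bias check: score a model that sees only the question text
# (no video) and compare against the 1/K random-guess baseline.
def question_only_predict(question, num_choices):
    return random.randrange(num_choices)          # placeholder; a real LM goes here

def bias_gap(examples, num_choices=5):
    correct = sum(question_only_predict(q, num_choices) == y for q, y in examples)
    acc = correct / len(examples)
    return acc, acc - 1.0 / num_choices           # accuracy and its margin over chance

examples = [("What happens next?", 2)] * 1000     # toy (question, answer-index) pairs
acc, gap = bias_gap(examples)
print(f"question-only accuracy {acc:.1%}, {gap:+.1%} relative to the 20% random baseline")
```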
1 code implementation • 23 Jun 2020 • Yongxin Wang, Kris Kitani, Xinshuo Weng
Despite the fact that the two components are dependent on each other, prior works often design the detection and data association modules separately and train them with separate objectives (a joint-training sketch follows the leaderboard note below).
Ranked #1 on Multi-Object Tracking on 2D MOT 2015
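A minimal sketch of the joint alternative: detection and association losses share one computation graph and one optimizer step, so gradients from the association objective also shape the detector. The toy modules, loss choices, and weighting below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Sketch of training detection and data association under one objective instead
# of two separate ones: both losses share a graph and one optimizer step
# updates both modules. The toy modules below are placeholders.
class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(32, 4)
    def forward(self, feats, gt_boxes):
        boxes = self.head(feats)                               # (N, 4) predicted boxes
        return boxes, nn.functional.l1_loss(boxes, gt_boxes)   # detection loss

class ToyAssociator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(4, 16)
    def forward(self, boxes, gt_ids):
        emb = nn.functional.normalize(self.embed(boxes), dim=-1)
        logits = emb @ emb.t() / 0.1                           # pairwise similarities
        same_id = (gt_ids[:, None] == gt_ids[None, :]).float()
        return nn.functional.binary_cross_entropy_with_logits(logits, same_id)

detector, associator = ToyDetector(), ToyAssociator()
optimizer = torch.optim.SGD(
    list(detector.parameters()) + list(associator.parameters()), lr=0.01)

feats, gt_boxes = torch.randn(6, 32), torch.rand(6, 4)
gt_ids = torch.tensor([0, 0, 1, 1, 2, 2])
boxes, det_loss = detector(feats, gt_boxes)
loss = det_loss + associator(boxes, gt_ids)                    # one joint objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```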
no code implementations • 15 Jun 2020 • Yongxin Wang, Duminda Wijesekera
Despite the recent success of object detectors based on deep neural networks, their deployment in safety-critical applications such as self-driving cars remains questionable.
1 code implementation • 12 Jun 2020 • Xinshuo Weng, Yongxin Wang, Yunze Man, Kris Kitani
As a result, the feature of each object is informed by the features of the other objects, so that it can lean towards objects with similar features (i.e., objects likely sharing the same ID) and deviate from objects with dissimilar features (i.e., objects likely with different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining features from either 2D or 3D space as in prior work, we propose a novel joint feature extractor that learns appearance and motion features from 2D and 3D space simultaneously.
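The attraction side of this aggregation can be sketched as a softmax-weighted feature update, where each object's feature is pulled toward the features of its most similar neighbors; the temperature, the single update round, and the omission of an explicit repulsion term are simplifying assumptions rather than the paper's exact GNN.

```python
import torch
import torch.nn.functional as F

# Rough sketch of the aggregation idea: each object's feature is pulled toward
# features of objects it is most similar to (likely the same ID) via a
# softmax-weighted sum over all detected objects.
def aggregate_features(obj_feats, temperature=0.1):
    normed = F.normalize(obj_feats, dim=-1)                            # (N, D)
    affinity = F.softmax(normed @ normed.t() / temperature, dim=-1)    # (N, N)
    return affinity @ obj_feats            # similar objects dominate each update

feats = torch.randn(5, 64)                 # 5 detected objects
print(aggregate_features(feats).shape)     # torch.Size([5, 64])
```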
1 code implementation • CVPR 2020 • Eunji Chong, Yongxin Wang, Nataniel Ruiz, James M. Rehg
We address the problem of detecting attention targets in video.
no code implementations • ECCV 2018 • Eunji Chong, Nataniel Ruiz, Yongxin Wang, Yun Zhang, Agata Rozga, James Rehg
This paper addresses the challenging problem of estimating the general visual attention of people in images.