no code implementations • 5 Nov 2024 • Ying Zhou, Xinyao Wang, Yulei Niu, Yaojie Shen, Lexin Tang, Fan Chen, Ben He, Le Sun, Longyin Wen
Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis.
1 code implementation • 22 Sep 2024 • Hung-Ting Su, Ya-Ching Hsu, Xudong Lin, Xiang-Qian Shi, Yulei Niu, Han-Yuan Hsu, Hung-Yi Lee, Winston H. Hsu
Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic.
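For readers unfamiliar with the technique, the sketch below illustrates generic chain-of-thought prompting; the exemplar, the question, and the "Let's think step by step" trigger phrase are illustrative assumptions, not taken from this paper.

```python
# Minimal illustration of chain-of-thought (CoT) prompting: the model is asked
# to spell out intermediate reasoning steps before giving a final answer.
# The exemplar and question below are made-up placeholders, not from the paper.

FEW_SHOT_EXEMPLAR = (
    "Q: A pack has 12 pencils and 3 are removed. How many remain?\n"
    "A: Let's think step by step. The pack starts with 12 pencils. "
    "Removing 3 leaves 12 - 3 = 9. The answer is 9.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar and a step-by-step trigger to the question."""
    return f"{FEW_SHOT_EXEMPLAR}\nQ: {question}\nA: Let's think step by step."

if __name__ == "__main__":
    print(build_cot_prompt("If a train travels 60 km in 1.5 hours, what is its average speed?"))
```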
1 code implementation • 13 Sep 2024 • Yaojie Shen, Xinyao Wang, Yulei Niu, Ying Zhou, Lexin Tang, Libo Zhang, Fan Chen, Longyin Wen
Despite its success, our study shows that the length exploitation issue present in preference optimization (PO) is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process.
no code implementations • 16 Jun 2024 • Hung-Ting Su, Chun-Tong Chao, Ya-Ching Hsu, Xudong Lin, Yulei Niu, Hung-Yi Lee, Winston H. Hsu
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames.
no code implementations • 28 May 2024 • Jiawei Ma, Yulei Niu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang
Language has proven useful for extending the vision encoder to data from diverse distributions without requiring empirical discovery of those distributions in the training domains.
no code implementations • 27 Mar 2024 • Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-Fu Chang
Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states.
no code implementations • 3 Mar 2024 • Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang
We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations.
no code implementations • 7 Apr 2023 • Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang
Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video.
1 code implementation • CVPR 2023 • Jiawei Ma, Yulei Niu, Jincheng Xu, Shiyuan Huang, Guangxing Han, Shih-Fu Chang
Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data.
no code implementations • 29 Jan 2023 • Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, Hanwang Zhang
We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
no code implementations • 6 Jan 2023 • Andrew Lu, Xudong Lin, Yulei Niu, Shih-Fu Chang
Understanding event relationships in videos requires a model to understand the underlying structures of events (i.e., the event type, the associated argument roles, and corresponding entities) and factual knowledge for reasoning.
no code implementations • 23 Oct 2022 • Yulei Niu, Long Chen, Chang Zhou, Hanwang Zhang
The network response serves as additional supervision to formulate the machine domain, which uses the data collected from the human domain as a transfer set.
1 code implementation • 22 Oct 2022 • Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han, Christopher Thomas, Hammad Ayyubi, Heng Ji, Shih-Fu Chang
Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences to the video, and these sentences are possibly at different semantic scales.
1 code implementation • 20 Jul 2022 • Zhen Wang, Long Chen, Wenbo Ma, Guangxing Han, Yulei Niu, Jian Shao, Jun Xiao
Given an image and a reference caption, the image caption editing task aims to correct the misalignment errors and generate a refined caption.
1 code implementation • ICLR 2022 • Xinting Hu, Yulei Niu, Chunyan Miao, Xian-Sheng Hua, Hanwang Zhang
Our method is three-fold: 1) We propose Class-Aware Propensity (CAP) that exploits the unlabeled data to train an improved classifier using the biased labeled data.
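The snippet only names Class-Aware Propensity (CAP), so as a rough, generic illustration of the underlying idea, the sketch below shows inverse-propensity weighting of a cross-entropy loss; the function name, the per-class propensity estimates, and the weighting scheme are assumptions for illustration and not the paper's exact CAP formulation.

```python
import torch
import torch.nn.functional as F

def propensity_weighted_loss(logits: torch.Tensor,
                             labels: torch.Tensor,
                             class_propensity: torch.Tensor) -> torch.Tensor:
    """Generic inverse-propensity-weighted cross-entropy (illustrative only).

    class_propensity[c] is an estimate of how likely an example of class c
    is to appear in the (biased) labeled set; rarely labeled classes get
    up-weighted so the classifier is less skewed by the labeling bias.
    """
    per_example_loss = F.cross_entropy(logits, labels, reduction="none")
    weights = 1.0 / class_propensity[labels].clamp(min=1e-6)
    return (weights * per_example_loss).mean()

# Example usage with random tensors (3 classes, 4 labeled examples)
logits = torch.randn(4, 3)
labels = torch.tensor([0, 0, 1, 2])
class_propensity = torch.tensor([0.7, 0.2, 0.1])  # assumed estimates
print(propensity_weighted_loss(logits, labels, class_propensity))
```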
no code implementations • 14 Jun 2022 • Hammad A. Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray, Shih-Fu Chang
To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset.
1 code implementation • ICCV 2023 • Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, Hanwang Zhang
Thanks to large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by "prompt", e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM-provided similarity measure between the image and the prompt sentence "a photo of a [CLASS]".
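A minimal sketch of the zero-shot recipe described here, using the Hugging Face transformers CLIP interface; the checkpoint name, class list, and image path are placeholders, and this illustrates only the standard prompt-based classifier, not the debiasing method the paper proposes.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Standard CLIP zero-shot classification: score an image against
# "a photo of a [CLASS]" prompts via image-text similarity.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "bicycle"]          # placeholder classes
prompts = [f"a photo of a {c}" for c in class_names]
image = Image.open("example.jpg")                # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # confidence per class
print(dict(zip(class_names, probs[0].tolist())))
```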
1 code implementation • 29 Dec 2021 • Beier Zhu, Yulei Niu, Xian-Sheng Hua, Hanwang Zhang
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of tail over head, as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data -- let the test respect Zipf's law of nature -- the tail bias is no longer beneficial overall because it hurts the head majorities.
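To make the "let the test respect Zipf's law" setting concrete, here is a small sketch that draws per-class test frequencies from a Zipf distribution; the exponent, class count, and test-set size are arbitrary assumptions.

```python
import numpy as np

num_classes = 100        # assumed number of classes
test_size = 10_000       # assumed total number of test samples
s = 1.0                  # assumed Zipf exponent

# Class frequencies proportional to 1 / rank^s (Zipf's law),
# so the test set is as long-tailed as the training data.
ranks = np.arange(1, num_classes + 1)
probs = ranks.astype(float) ** (-s)
probs /= probs.sum()
samples_per_class = np.random.multinomial(test_size, probs)
print(samples_per_class[:10])  # head classes dominate, tail classes are rare
```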
1 code implementation • CVPR 2022 • Kaifeng Gao, Long Chen, Yulei Niu, Jian Shao, Jun Xiao
To this end, we propose a new classification-then-grounding framework for VidSGG, which can avoid all the three overlooked drawbacks.
1 code implementation • NeurIPS 2021 • Yulei Niu, Hanwang Zhang
Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension.
1 code implementation • 3 Oct 2021 • Long Chen, Yuhang Zheng, Yulei Niu, Hanwang Zhang, Jun Xiao
Specifically, CSST is composed of two parts: Counterfactual Samples Synthesizing (CSS) and Counterfactual Samples Training (CST).
1 code implementation • ACL 2021 • Sicheng Yu, Hao Zhang, Yulei Niu, Qianru Sun, Jing Jiang
Pre-trained multilingual language models, e.g., multilingual BERT, are widely used in cross-lingual tasks, yielding state-of-the-art performance.
1 code implementation • 12 Oct 2020 • Sicheng Yu, Yulei Niu, Shuohang Wang, Jing Jiang, Qianru Sun
We then apply two novel CVC inference methods to trained models to capture the effect of comprehensive reasoning as the final prediction.
1 code implementation • CVPR 2021 • Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen
VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language.
1 code implementation • 19 Mar 2020 • An Zhao, Mingyu Ding, Zhiwu Lu, Tao Xiang, Yulei Niu, Jiechao Guan, Ji-Rong Wen, Ping Luo
Existing few-shot learning (FSL) methods make the implicit assumption that the few target class samples are from the same domain as the source class samples.
6 code implementations • CVPR 2020 • Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach".
Ranked #3 on Scene Graph Generation on Visual Genome.
1 code implementation • CVPR 2020 • Jiaxin Qi, Yulei Niu, Jianqiang Huang, Hanwang Zhang
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial).
no code implementations • 27 Aug 2019 • Yuqi Huo, Xiaoli Xu, Yao Lu, Yulei Niu, Zhiwu Lu, Ji-Rong Wen
In addition to motion vectors, we also provide a temporal fusion method to explicitly induce the temporal context.
no code implementations • 8 Jul 2019 • Yulei Niu, Hanwang Zhang, Zhiwu Lu, Shih-Fu Chang
Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences estimation of the posterior distribution of the other, thereby greatly reducing the search space of the context.
1 code implementation • CVPR 2019 • Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen
Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image.
Ranked #13 on Visual Dialog on VisDial v0.9 val.
1 code implementation • CVPR 2018 • Hanwang Zhang, Yulei Niu, Shih-Fu Chang
This is a general yet challenging vision-language task since it requires not only the localization of objects but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category.
no code implementations • 5 Sep 2017 • Yulei Niu, Zhiwu Lu, Ji-Rong Wen, Tao Xiang, Shih-Fu Chang
In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts; 2) how to annotate an image with the optimal number of class labels.