no code implementations • ACL 2022 • Moxin Li, Fuli Feng, Hanwang Zhang, Xiangnan He, Fengbin Zhu, Tat-Seng Chua
Neural discrete reasoning (NDR) has shown remarkable progress in combining deep models with discrete reasoning.
no code implementations • 6 Dec 2024 • Qingshan Xu, Jiequan Cui, Xuanyu Yi, Yuxuan Wang, Yuan Zhou, Yew-Soon Ong, Hanwang Zhang
To address this problem, we propose Hard Gaussian Splatting, dubbed HGS, which considers multi-view significant positional gradients and rendering errors to grow hard Gaussians that fill the gaps of classical Gaussian Splatting on 3D scenes, thus achieving superior NVS results.
1 code implementation • 5 Dec 2024 • Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Shuicheng Yan
HumanEdit bridges this gap by employing human annotators to construct data pairs and administrators to provide feedback.
no code implementations • 28 Nov 2024 • Xue Song, Jiequan Cui, Hanwang Zhang, Jiaxin Shi, Jingjing Chen, Chi Zhang, Yu-Gang Jiang
Furthermore, generalizable models for image editing with visual instructions typically require quad data, i.e., a before-after image pair, along with query and target images.
1 code implementation • 25 Nov 2024 • Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen
For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations.
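For readers unfamiliar with this cache-and-reuse pattern, a minimal sketch of causal attention over frame tokens is given below; the module layout, tensor shapes, and the simple list-based cache are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of caching conditional-frame features in causal generation.
# Shapes, names, and the caching policy are illustrative assumptions only.
import torch
import torch.nn.functional as F

class CausalFrameAttention(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.cache_k, self.cache_v = [], []   # computed once, reused afterwards

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, tokens, dim) for the *current* frame only.
        q = self.to_q(frame_tokens)
        k_new, v_new = self.to_k(frame_tokens), self.to_v(frame_tokens)
        # Unidirectional: the current frame attends to cached past frames + itself.
        k = torch.cat(self.cache_k + [k_new], dim=1)
        v = torch.cat(self.cache_v + [v_new], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        # Features of conditional frames are never recomputed in later steps.
        self.cache_k.append(k_new.detach())
        self.cache_v.append(v_new.detach())
        return out

attn = CausalFrameAttention(dim=64)
for step in range(4):                 # autoregressive frame steps
    cur = torch.randn(1, 16, 64)      # 16 tokens for the newly generated frame
    _ = attn(cur)
```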
no code implementations • 25 Nov 2024 • Yuan Zhou, Qingshan Xu, Jiequan Cui, Junbao Zhou, Jing Zhang, Richang Hong, Hanwang Zhang
In this paper, we propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism, revealing that feature decoupling and interaction can fully unleash the power of linear attention.
no code implementations • 24 Nov 2024 • Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, Yueting Zhuang
Instruction-based image editing aims to modify specific image elements with natural language instructions.
1 code implementation • 11 Nov 2024 • Beier Zhu, Jiequan Cui, Hanwang Zhang
When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel on both in-distribution (ID) and out-of-distribution (OOD) data.
no code implementations • 1 Nov 2024 • Wei Chow, Juncheng Li, Qifan Yu, Kaihang Pan, Hao Fei, Zhiqi Ge, Shuai Yang, Siliang Tang, Hanwang Zhang, Qianru Sun
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation.
no code implementations • 25 Oct 2024 • Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
no code implementations • 23 Oct 2024 • Qingshan Xu, Xuanyu Yi, Jianyao Xu, Wenbing Tao, Yew-Soon Ong, Hanwang Zhang
In this work, we reveal that there exists an inconsistency between the frequency regularization of PE and rendering loss.
no code implementations • 8 Oct 2024 • Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Recent developments of vision large language models (LLMs) have seen remarkable progress, yet still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks.
1 code implementation • 30 Sep 2024 • Kaihang Pan, Zhaoyu Fan, Juncheng Li, Qifan Yu, Hao Fei, Siliang Tang, Richang Hong, Hanwang Zhang, Qianru Sun
In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting.
no code implementations • 9 Aug 2024 • Dongsheng Wang, Jiequan Cui, Miaoge Li, Wang Lin, Bo Chen, Hanwang Zhang
However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives.
1 code implementation • 24 Jul 2024 • Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang
Vision-language models such as CLIP are capable of mapping the different modality data into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts.
1 code implementation • 14 Jul 2024 • Wei Suo, Lanqing Lai, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang
As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level.
1 code implementation • 16 Jun 2024 • Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao
Inspired by the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs, and use past frames as prompts to generate future frames.
no code implementations • 13 Jun 2024 • Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
Recent advancements in image generation have enabled the creation of high-quality images from text conditions.
2 code implementations • 10 Jun 2024 • Xuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang Zhang
Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors.
no code implementations • 7 Jun 2024 • Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
Ranked #72 on Visual Question Answering on MM-Vet
2 code implementations • 27 May 2024 • Kai Wang, Mingjia Shi, Yukun Zhou, Zekai Li, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Hanwang Zhang, Yang You
Training diffusion models is always a computation-intensive task.
no code implementations • 11 May 2024 • Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
1 code implementation • 3 May 2024 • Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, Shuicheng Yan, Tat-Seng Chua, Yueting Zhuang, Hanwang Zhang
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge.
no code implementations • 29 Apr 2024 • Liying Gao, Bingliang Jiao, Peng Wang, Shizhou Zhang, Hanwang Zhang, Yanning Zhang
In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval.
1 code implementation • CVPR 2024 • Xuanyu Yi, Zike Wu, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Hanwang Zhang
Score distillation sampling (SDS) has been widely adopted to overcome the absence of unseen views in reconstructing 3D objects from a single image.
2 code implementations • 27 Mar 2024 • Qiuhong Shen, Zike Wu, Xuanyu Yi, Pan Zhou, Hanwang Zhang, Shuicheng Yan, Xinchao Wang
We tackle the challenge of efficiently reconstructing a 3D asset from a single image at millisecond speed.
no code implementations • 18 Mar 2024 • Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang
However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS.
no code implementations • 17 Mar 2024 • Yuan Zhou, Richang Hong, Yanrong Guo, Lin Liu, Shijie Hao, Hanwang Zhang
In this paper, we propose to tackle Few-Shot Class-Incremental Learning (FSCIL) from a new perspective, i.e., relation disentanglement, which means enhancing FSCIL via disentangling spurious relations between categories.
1 code implementation • CVPR 2024 • Fengda Zhang, Qianpei He, Kun Kuang, Jiashuo Liu, Long Chen, Chao Wu, Jun Xiao, Hanwang Zhang
This work proposes a novel, generation-based two-stage framework to train a fair FAC model on biased data without additional annotation.
no code implementations • CVPR 2024 • Leigang Qu, Wenjie Wang, Yongqi Li, Hanwang Zhang, Liqiang Nie, Tat-Seng Chua
We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment.
1 code implementation • CVPR 2024 • Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, Yu-Gang Jiang
Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning.
1 code implementation • CVPR 2024 • Zhongqi Yue, Pan Zhou, Richang Hong, Hanwang Zhang, Qianru Sun
To this end, we find an inductive bias that the time-steps of a Diffusion Model (DM) can isolate the nuanced class attributes, i.e., as the forward diffusion adds noise to an image at each time-step, nuanced attributes are usually lost at an earlier time-step than the spurious attributes that are visually prominent.
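The inductive bias above builds on the standard DDPM-style forward process q(x_t | x_0); a minimal sketch follows, with an illustrative linear beta schedule, showing how larger time-steps wipe out nuanced attributes before the visually prominent ones.

```python
# Standard forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
# The linear beta schedule is a common default, used here only for illustration.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps

x0 = torch.rand(1, 3, 64, 64)               # a toy image
early, late = add_noise(x0, 50), add_noise(x0, 800)
# At t=50 the nuanced attributes are already hard to recover, while coarse,
# visually prominent attributes survive until much larger t.
```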
1 code implementation • CVPR 2024 • Jiequan Cui, Beier Zhu, Xin Wen, Xiaojuan Qi, Bei Yu, Hanwang Zhang
Second, with the proposed concept of Model Prediction Bias, we investigate the origins of problematic representation during optimization.
1 code implementation • 21 Jan 2024 • Zhongqi Yue, Jiankun Wang, Qianru Sun, Lei Ji, Eric I-Chao Chang, Hanwang Zhang
Representation learning is all about discovering the hidden modular attributes that generate the data faithfully.
1 code implementation • CVPR 2024 • Zike Wu, Pan Zhou, Xuanyu Yi, Xiaoding Yuan, Hanwang Zhang
To solve this issue, we first deeply analyze the SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as a guidance to optimize a 3D model.
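For context, a rough sketch of a single DreamFusion-style SDS update is shown below (the paper's SDE-based reformulation is not reproduced); `render`, `unet`, `optimizer`, the weighting w(t), and the time-step range are placeholders.

```python
# Rough sketch of one SDS step; `render`, `unet`, and `optimizer` stand in for a
# differentiable renderer, a frozen pretrained diffusion model, and an optimizer
# over the 3D parameters theta.
import torch

def sds_step(theta, render, unet, text_emb, alphas_bar, optimizer):
    x = render(theta)                                   # image rendered from 3D params
    t = torch.randint(20, 980, (1,))
    eps = torch.randn_like(x)
    x_t = alphas_bar[t].sqrt() * x + (1 - alphas_bar[t]).sqrt() * eps
    with torch.no_grad():
        eps_pred = unet(x_t, t, text_emb)               # frozen score network
    w = 1.0 - alphas_bar[t]                             # one common choice of w(t)
    grad = w * (eps_pred - eps)                         # SDS "gradient" on the image
    optimizer.zero_grad()
    x.backward(gradient=grad)                           # push the guidance through the renderer
    optimizer.step()
```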
1 code implementation • 10 Jan 2024 • Yucheng Han, Na Zhao, Weiling Chen, Keng Teck Ma, Hanwang Zhang
Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective.
no code implementations • 10 Jan 2024 • Luanyuan Dai, Xiaoyu Du, Hanwang Zhang, Jinhui Tang
To obtain information that integrates implicit and explicit local graphs, we construct local graphs from the implicit and explicit aspects and combine them effectively; the combined result is then used to build a global graph.
2 code implementations • 15 Dec 2023 • Xu Yang, Yingzhe Peng, Haoxuan Ma, Shuo Xu, Chi Zhang, Yucheng Han, Hanwang Zhang
As Archimedes famously said, "Give me a lever long enough and a fulcrum on which to place it, and I shall move the world", in this study, we propose to use a tiny Language Model (LM), e.g., a Transformer with 67M parameters, to lever much larger Vision-Language Models (LVLMs) with 9B parameters.
no code implementations • 27 Nov 2023 • Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang
Next, we introduce ChartLlama, a multi-modal large language model that we've trained using our created dataset.
3 code implementations • ICCV 2023 • Jiali Ma, Zhongqi Yue, Kagaya Tomoyuki, Suzuki Tomoki, Karlekar Jayashree, Sugiri Pranata, Hanwang Zhang
Unfortunately, face datasets inevitably capture the imbalanced demographic attributes that are ubiquitous in real-world observations, and the model learns biased features that generalize poorly in the minority group.
1 code implementation • NeurIPS 2023 • Beier Zhu, Kaihua Tang, Qianru Sun, Hanwang Zhang
In this study, we systematically examine the biases in foundation models and demonstrate the efficacy of our proposed Generalized Logit Adjustment (GLA) method.
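As a reference point, the vanilla logit adjustment that GLA generalizes can be written in a few lines; the temperature `tau` and the class prior below are illustrative assumptions, and GLA's estimation of the foundation model's pre-training label bias is not reproduced here.

```python
# Standard logit adjustment: subtract tau * log(prior) from the zero-shot logits
# so that frequent (head) classes are no longer favored by default.
import torch

def adjust_logits(logits: torch.Tensor, class_prior: torch.Tensor, tau: float = 1.0):
    return logits - tau * torch.log(class_prior + 1e-12)

logits = torch.randn(8, 1000)                    # zero-shot logits for 8 images
prior = torch.full((1000,), 1.0 / 1000)          # assumed (estimated) label prior
preds = adjust_logits(logits, prior).argmax(dim=-1)
```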
1 code implementation • NeurIPS 2023 • Zhongqi Yue, Hanwang Zhang, Qianru Sun
Domain Adaptation (DA) is always challenged by the spurious correlation between domain-invariant features (e.g., class identity) and domain-specific features (e.g., environment) that does not generalize to the target domain.
no code implementations • 17 Sep 2023 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Many studies focus on improving pretraining or developing new backbones in text-video retrieval.
no code implementations • CVPR 2024 • Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua
In this work, we investigate strengthening the awareness of video dynamics for DMs, for high-quality T2V generation.
no code implementations • ICCV 2023 • Xuanyu Yi, Jiajun Deng, Qianru Sun, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model.
1 code implementation • 8 Aug 2023 • Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang
This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task.
1 code implementation • ICCV 2023 • Yanghao Wang, Zhongqi Yue, Xian-Sheng Hua, Hanwang Zhang
First, as the randomization is independent of the distribution of the limited known objects, the random proposals become the instrumental variable that prevents the training from being confounded by the known objects.
1 code implementation • CVPR 2024 • Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds, and poses from different sources.
1 code implementation • 12 Jun 2023 • Zike Wu, Pan Zhou, Kenji Kawaguchi, Hanwang Zhang
In this paper, we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a stochastic optimization perspective for both faster training and sampling.
no code implementations • 7 Jun 2023 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Text-video retrieval contains various challenges, including biases coming from diverse sources.
4 code implementations • 23 May 2023 • Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu, Hanwang Zhang
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels.
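For readers unfamiliar with the loss being decomposed, a plain KL-divergence distillation loss between teacher and student logits is sketched below; the wMSE-plus-soft-label-cross-entropy decomposition itself is the paper's contribution and is not reproduced here.

```python
# Plain KL-divergence loss between teacher and student distributions, the
# quantity that the paper rewrites as a weighted MSE plus a soft-label CE term.
import torch
import torch.nn.functional as F

def kl_loss(student_logits, teacher_logits, temperature: float = 4.0):
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
loss = kl_loss(student, teacher)
loss.backward()
```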
1 code implementation • ICCV 2023 • Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes.
Ranked #7 on Visual Reasoning on Winoground
1 code implementation • CVPR 2023 • Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, Hanwang Zhang
At each MIL training iteration, we use the current detector to divide the samples into two groups with different context biases: the most confident abnormal/normal snippets and the rest ambiguous ones.
1 code implementation • CVPR 2023 • Fengyun Wang, Dong Zhang, Hanwang Zhang, Jinhui Tang, Qianru Sun
SSC is a well-known ill-posed problem as the prediction model has to "imagine" what is behind the visible surface, which is usually represented by Truncated Signed Distance Function (TSDF).
1 code implementation • 1 Feb 2023 • Kaifeng Gao, Long Chen, Hanwang Zhang, Jun Xiao, Qianru Sun
Without bells and whistles, our RePro achieves a new state-of-the-art performance on two VidVRD benchmarks of not only the base training object and predicate categories, but also the unseen ones.
no code implementations • 29 Jan 2023 • Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, Hanwang Zhang
We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
1 code implementation • 5 Jan 2023 • Zihua Wang, Xu Yang, Hanwang Zhang, Haiyang Xu, Ming Yan, Fei Huang, Yu Zhang
In this gradual clustering process, a parsing tree is generated which embeds the hierarchical knowledge of the input sequence.
no code implementations • ICCV 2023 • Xu Yang, Zhangzikang Li, Haiyang Xu, Hanwang Zhang, Qinghao Ye, Chenliang Li, Ming Yan, Yu Zhang, Fei Huang, Songfang Huang
To amend this, we propose a novel TW-BERT to learn Trajectory-Word alignment by a newly designed trajectory-to-word (T2W) attention for solving video-language tasks.
1 code implementation • CVPR 2023 • Muli Yang, Liancheng Wang, Cheng Deng, Hanwang Zhang
Novel Class Discovery (NCD) aims to discover unknown classes without any annotation, by exploiting the transferable knowledge already learned from a base set of known classes.
1 code implementation • ICCV 2023 • Haoxin Li, YuAn Liu, Hanwang Zhang, Boyang Li
The video background is clearly a source of static bias, but the video foreground, such as the clothing of the actor, can also provide static bias.
no code implementations • 20 Nov 2022 • Jianqiang Huang, Jian Wang, Qianru Sun, Hanwang Zhang
An intuitive solution is "coupling" the CAM with the long-range attention matrix of visual transformers (ViT). We find that the direct "coupling", e.g., pixel-wise multiplication of attention and activation, achieves a more global coverage (on the foreground), but unfortunately goes with a great increase of false positives, i.e., background pixels are mistakenly included.
Weakly-Supervised Semantic Segmentation
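A toy sketch of such a direct coupling between a CAM and a ViT attention matrix is given below; the shapes, the softmax normalization, and the matrix-product form of the coupling are illustrative assumptions and may differ from the coupling analyzed in the paper.

```python
# Illustrative "coupling" of a class activation map (CAM) with a ViT attention
# matrix; purely a sketch of the idea, not the paper's exact formulation.
import torch

H = W = 14
cam = torch.rand(H * W)                                  # activation per patch, flattened
attn = torch.softmax(torch.rand(H * W, H * W), dim=-1)   # long-range attention matrix

# Propagating the CAM through the attention matrix spreads activation to
# far-away patches (more global coverage), but background patches that attend
# to the foreground also get activated, i.e., false positives increase.
coupled = (attn @ cam).reshape(H, W)
```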
no code implementations • 23 Oct 2022 • Yulei Niu, Long Chen, Chang Zhou, Hanwang Zhang
The network response serves as additional supervision to formulate the machine domain, which uses the data collected from the human domain as a transfer set.
1 code implementation • 4 Oct 2022 • Xu Yang, Hanwang Zhang, Chongyang Gao, Jianfei Cai
This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of image captioning.
1 code implementation • 6 Aug 2022 • Jiaxin Qi, Kaihua Tang, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang
If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context.
no code implementations • 27 Jul 2022 • Lin Li, Long Chen, Hanrong Shi, Hanwang Zhang, Yi Yang, Wei Liu, Jun Xiao
To this end, we propose a novel NoIsy label CorrEction and Sample Training strategy for SGG: NICEST.
1 code implementation • 27 Jul 2022 • Xuanyu Yi, Kaihua Tang, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang
Such imbalanced training data makes a classifier less discriminative for the tail classes, whose previously "easy" noises are now turned into "hard" ones -- they are almost as outliers as the clean tail samples.
1 code implementation • 25 Jul 2022 • Tan Wang, Qianru Sun, Sugiri Pranata, Karlekar Jayashree, Hanwang Zhang
We are interested in learning robust models from insufficient data, without the need for any externally pre-trained checkpoints.
1 code implementation • 19 Jul 2022 • Kaihua Tang, Mingyuan Tao, Jiaxin Qi, Zhenguang Liu, Hanwang Zhang
In fact, even if the class is balanced, samples within each class may still be long-tailed due to the varying attributes.
Ranked #1 on Long-tail Learning on ImageNet-GLT
1 code implementation • ICLR 2022 • Xinting Hu, Yulei Niu, Chunyan Miao, Xian-Sheng Hua, Hanwang Zhang
Our method is three-fold: 1) We propose Class-Aware Propensity (CAP) that exploits the unlabeled data to train an improved classifier using the biased labeled data.
1 code implementation • 29 Jun 2022 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022.
Ranked #10 on Multi-Instance Retrieval on EPIC-KITCHENS-100
1 code implementation • 26 Jun 2022 • Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim
Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality.
Ranked #12 on Video Retrieval on YouCook2
1 code implementation • ICCV 2023 • Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, Hanwang Zhang
Thanks to the large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by "prompt", e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM provided similarity measure between the image and the prompt sentence "a photo of a [CLASS]".
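A minimal zero-shot prompt classifier in this style, assuming OpenAI's `clip` package is installed; the class names, prompt template, and image path are placeholders.

```python
# Zero-shot classification with a prompt, assuming the `clip` package
# (https://github.com/openai/CLIP) is installed; classes and the image path are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(classes, scores[0].tolist())))   # confidence per prompted class
```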
1 code implementation • 24 May 2022 • Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng, Hanwang Zhang
Deep learning models have achieved great success in many fields, yet they are vulnerable to adversarial examples.
1 code implementation • CVPR 2022 • Zhaozheng Chen, Tan Wang, Xiongwei Wu, Xian-Sheng Hua, Hanwang Zhang, Qianru Sun
Specifically, due to the sum-over-class pooling nature of BCE, each pixel in CAM may be responsive to multiple classes co-occurring in the same receptive field.
Weakly-Supervised Semantic Segmentation
no code implementations • 31 Dec 2021 • Jianqiang Huang, Yu Qin, Jiaxin Qi, Qianru Sun, Hanwang Zhang
We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that the bias is the major visual reasoning bottleneck.
1 code implementation • 29 Dec 2021 • Beier Zhu, Yulei Niu, Xian-Sheng Hua, Hanwang Zhang
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of tail over head, as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data -- let the test respect Zipf's law of nature -- the tail bias is no longer beneficial overall because it hurts the head majorities.
1 code implementation • NeurIPS 2021 • Xinshuai Dong, Luu Anh Tuan, Min Lin, Shuicheng Yan, Hanwang Zhang
The fine-tuning of pre-trained language models has achieved great success in many NLP fields.
1 code implementation • NeurIPS 2021 • Yulei Niu, Hanwang Zhang
Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension.
1 code implementation • NeurIPS 2021 • Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, Hanwang Zhang
A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics).
1 code implementation • 3 Oct 2021 • Long Chen, Yuhang Zheng, Yulei Niu, Hanwang Zhang, Jun Xiao
Specifically, CSST is composed of two parts: Counterfactual Samples Synthesizing (CSS) and Counterfactual Samples Training (CST).
no code implementations • ICCV 2021 • Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai
We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures for improving the effectiveness of the Transformer-based vision-language systems.
1 code implementation • ICCV 2021 • Tan Wang, Chang Zhou, Qianru Sun, Hanwang Zhang
Attention module does not always help deep models learn causal features that are robust in any confounding context, e. g., a foreground object feature is invariant to different backgrounds.
1 code implementation • ACL 2021 • Yixin Cao, Xiang Ji, Xin Lv, Juanzi Li, Yonggang Wen, Hanwang Zhang
We present InferWiki, a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns.
1 code implementation • ICCV 2021 • Zhongqi Yue, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang
However, the theoretical solution provided by transportability is far from practical for UDA, because it requires the stratification and representation of the unobserved confounder that is the cause of the domain gap.
2 code implementations • 17 Jun 2021 • Kaihua Tang, Mingyuan Tao, Hanwang Zhang
As these visual confounders are imperceptible in general, we propose to use the instrumental variable that achieves causal intervention without the need for confounder observation.
1 code implementation • Findings (ACL) 2021 • Fuli Feng, Jizhi Zhang, Xiangnan He, Hanwang Zhang, Tat-Seng Chua
Present language understanding methods have demonstrated extraordinary ability of recognizing patterns in texts via machine learning.
no code implementations • 12 May 2021 • Chenchi Zhang, Wenbo Ma, Jun Xiao, Hanwang Zhang, Jian Shao, Yueting Zhuang, Long Chen
In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware).
1 code implementation • EMNLP 2021 • Jiaxin Shi, Shulin Cao, Lei Hou, Juanzi Li, Hanwang Zhang
Multi-hop Question Answering (QA) is a challenging task because it requires precise reasoning with entity relations at every step towards the answer.
1 code implementation • CVPR 2021 • YuAn Liu, Jingyuan Chen, Zhenfang Chen, Bing Deng, Jianqiang Huang, Hanwang Zhang
The key challenge is how to distinguish the action of interest segments from the background, which is unlabelled even on the video-level.
Weakly-Supervised Temporal Action Localization
no code implementations • CVPR 2021 • Xu Yang, Hanwang Zhang, GuoJun Qi, Jianfei Cai
Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention.
1 code implementation • CVPR 2021 • Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, Hanwang Zhang
We propose a causal framework to explain the catastrophic forgetting in Class-Incremental Learning (CIL) and then derive a novel distillation method that is orthogonal to the existing anti-forgetting techniques, such as data replay and feature/label distillation.
1 code implementation • CVPR 2021 • Zhongqi Yue, Tan Wang, Hanwang Zhang, Qianru Sun, Xian-Sheng Hua
We show that the key reason is that the generation is not Counterfactual Faithful, and thus we propose a faithful one, whose generation is from the sample-specific counterfactual question: What would the sample look like, if we set its class attribute to a certain class, while keeping its sample attribute unchanged?
no code implementations • ACM International Conference on Multimedia 2020 • Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai
We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs.
2 code implementations • NeurIPS 2020 • Kaihua Tang, Jianqiang Huang, Hanwang Zhang
On one hand, it has a harmful causal effect that misleads the tail prediction biased towards the head.
Ranked #36 on Long-tail Learning on CIFAR-10-LT (ρ=10)
1 code implementation • NeurIPS 2020 • Zhongqi Yue, Hanwang Zhang, Qianru Sun, Xian-Sheng Hua
Specifically, we develop three effective IFSL algorithmic implementations based on the backdoor adjustment, which is essentially a causal intervention towards the SCM of many-shot learning: the upper-bound of FSL in a causal view.
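The backdoor adjustment referred to above is the standard causal-intervention formula P(y | do(x)) = Σ_z P(y | x, z) P(z); below is a toy numeric sketch with a fabricated joint distribution, included only to make the formula concrete.

```python
# Toy backdoor adjustment: P(y | do(x)) = sum_z P(y | x, z) P(z).
# The joint table below is made up purely for illustration.
import numpy as np

# p[z, x, y]: joint distribution over a binary confounder z, treatment x, outcome y.
p = np.array([[[0.20, 0.05], [0.05, 0.10]],
              [[0.05, 0.10], [0.10, 0.35]]])
p /= p.sum()

p_z = p.sum(axis=(1, 2))                          # P(z)
p_y_given_xz = p / p.sum(axis=2, keepdims=True)   # P(y | x, z)

x = 1
p_y_do_x = (p_y_given_xz[:, x, :] * p_z[:, None]).sum(axis=0)
print("P(y | do(x=1)) =", p_y_do_x)               # compare with observational P(y | x=1)
```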
1 code implementation • NeurIPS 2020 • Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua, Qianru Sun
We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS).
Ranked #37 on Weakly-Supervised Semantic Segmentation on COCO 2014 val
1 code implementation • 21 Sep 2020 • Wenjie Wang, Fuli Feng, Xiangnan He, Hanwang Zhang, Tat-Seng Chua
However, we argue that there is a significant gap between clicks and user satisfaction -- it is common that a user is "cheated" to click an item by the attractive title/cover of the item.
1 code implementation • 3 Sep 2020 • Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, Shih-Fu Chang
The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals.
1 code implementation • ECCV 2020 • Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun
Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales.
2 code implementations • ACL 2022 • Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He, Hanwang Zhang
To this end, we introduce KQA Pro, a dataset for Complex KBQA including ~120K diverse natural language questions.
1 code implementation • CVPR 2021 • Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen
VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language.
1 code implementation • CVPR 2020 • Dan Guo, Hui Wang, Hanwang Zhang, Zheng-Jun Zha, Meng Wang
Visual dialog is a challenging task that requires the comprehension of the semantic dependencies among implicit visual and textual contexts.
Ranked #12 on Visual Dialog on VisDial v0.9 val
1 code implementation • CVPR 2020 • Xinting Hu, Yi Jiang, Kaihua Tang, Jingyuan Chen, Chunyan Miao, Hanwang Zhang
Real-world visual recognition requires handling the extreme sample imbalance in large-scale long-tailed data.
1 code implementation • CVPR 2020 • Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang
To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect the word-region alignment as strong supervision.
2 code implementations • CVPR 2020 • Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, ShiLiang Pu, Yueting Zhuang
To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of targeted VQA model, and achieve dominating performance on VQA-CP.
Ranked #1 on Visual Question Answering (VQA) on VQA-CP (using extra training data)
no code implementations • 9 Mar 2020 • Xu Yang, Hanwang Zhang, Jianfei Cai
Dataset bias in vision-language tasks is becoming one of the main problems hindering the progress of our community.
no code implementations • 5 Mar 2020 • Fuli Feng, Xiangnan He, Hanwang Zhang, Tat-Seng Chua
Graph Convolutional Network (GCN) is an emerging technique that performs learning and reasoning on graph data.
6 code implementations • CVPR 2020 • Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach".
Ranked #3 on Scene Graph Generation on Visual Genome
1 code implementation • CVPR 2020 • Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun
We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA.
Ranked #23 on Image Captioning on COCO Captions
no code implementations • 5 Jan 2020 • Brian Chen, Bo Wu, Alireza Zareian, Hanwang Zhang, Shih-Fu Chang
Compared to the traditional Partial Label Learning (PLL) problem, GPLL relaxes the supervision assumption from instance-level -- a label set partially labels an instance -- to group-level: 1) a label set partially labels a group of instances, where the within-group instance-label link annotations are missing, and 2) cross-group links are allowed -- instances in a group may be partially linked to the label set from another group.
Ranked #1 on Partial Label Learning on MPII Movie Description
1 code implementation • CVPR 2020 • Jiaxin Qi, Yulei Niu, Jianqiang Huang, Hanwang Zhang
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial).
no code implementations • 8 Jul 2019 • Yulei Niu, Hanwang Zhang, Zhiwu Lu, Shih-Fu Chang
Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced.
no code implementations • 9 Jun 2019 • Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Meng Wang, Qianru Sun
In this paper, we alleviate the missing-annotation problem and enable the joint reasoning by leveraging the language scene graph which covers both labeled referent and unlabeled contexts (other objects, attributes, and relationships).
1 code implementation • 6 Jun 2019 • Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, Feng Wu
With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning.
no code implementations • 5 Jun 2019 • Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, Hanwang Zhang
Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained and compositional language space.
no code implementations • ICCV 2019 • Xu Yang, Hanwang Zhang, Jianfei Cai
To this end, we make the following technical contributions for CNM training: 1) compact module design: one module for function words and three for visual content words (e.g., noun, adjective, and verb), 2) soft module fusion and multi-step module execution, robustifying the visual reasoning in partial observation, 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective comes before a noun).
no code implementations • ICCV 2019 • Tianhao Yang, Zheng-Jun Zha, Hanwang Zhang
We study the multi-round response generation in visual dialog, where a response is generated according to a visually grounded conversational history.
Ranked #10 on Visual Dialog on VisDial v0.9 val
no code implementations • ICCV 2019 • Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha
In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed.
2 code implementations • CVPR 2019 • Xu Yang, Kaihua Tang, Hanwang Zhang, Jianfei Cai
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions.
no code implementations • ICCV 2019 • Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, ShiLiang Pu, Shih-Fu Chang
CMAT is a multi-agent policy gradient method that frames objects as cooperative agents, and then directly maximizes a graph-level metric as the reward.
1 code implementation • CVPR 2019 • Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen
Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image.
Ranked #13 on Visual Dialog on VisDial v0.9 val
6 code implementations • CVPR 2019 • Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
Ranked #7 on Panoptic Scene Graph Generation on PSG Dataset
2 code implementations • CVPR 2019 • Jiaxin Shi, Hanwang Zhang, Juanzi Li
We aim to dismantle the prevalent black-box neural architectures used in complex visual reasoning tasks, into the proposed eXplainable and eXplicit Neural Modules (XNMs), which advance beyond existing neural module networks towards using scene graphs (objects as nodes and the pairwise relationships as edges) for explainable and explicit reasoning with structured knowledge.
Ranked #11 on Visual Question Answering (VQA) on CLEVR
2 code implementations • 6 Nov 2018 • Jiaxin Shi, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang
Sentence embedding is an effective feature representation for most deep learning-based NLP tasks.
1 code implementation • 6 Nov 2018 • Jiaxin Shi, Chen Liang, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang
We propose DeepChannel, a robust, data-efficient, and interpretable neural model for extractive document summarization.
no code implementations • NeurIPS 2018 • Hang Gao, Zheng Shou, Alireza Zareian, Hanwang Zhang, Shih-Fu Chang
Deep neural networks suffer from over-fitting and catastrophic forgetting when trained with small data.
no code implementations • 1 Sep 2018 • Qiangeng Xu, Hanwang Zhang, Weiyue Wang, Peter N. Belhumeur, Ulrich Neumann
In this paper, we introduce a stochastic dynamics video infilling (SDVI) framework to generate frames between long intervals in a video.
1 code implementation • 16 Aug 2018 • Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, Feng Wu
To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning.
1 code implementation • ECCV 2018 • Xu Yang, Hanwang Zhang, Jianfei Cai
By "agnostic", we mean that the feature is less likely biased to the classes of paired objects.
1 code implementation • 6 May 2018 • Han Liu, Xiangnan He, Fuli Feng, Liqiang Nie, Rui Liu, Hanwang Zhang
In this paper, we develop a generic feature-based recommendation model, called Discrete Factorization Machine (DFM), for fast and accurate recommendation.
no code implementations • 3 Apr 2018 • Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu
Recently, much progress has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task.
no code implementations • 7 Feb 2018 • Jingkuan Song, Hanwang Zhang, Xiangpeng Li, Lianli Gao, Meng Wang, Richang Hong
Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss.
1 code implementation • CVPR 2018 • Hanwang Zhang, Yulei Niu, Shih-Fu Chang
This is a general yet challenging vision-language task since it requires not only the localization of objects, but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category.
1 code implementation • CVPR 2018 • Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, Shih-Fu Chang
We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training.
43 code implementations • WWW 2017 • Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, Tat-Seng Chua
When it comes to modeling the key factor in collaborative filtering -- the interaction between user and item features -- they still resorted to matrix factorization and applied an inner product on the latent features of users and items.
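A minimal sketch of the alternative this line of work argues for, i.e., replacing the inner product over user/item latent features with a small MLP; all sizes are illustrative and this is not the paper's exact NeuMF architecture.

```python
# Minimal neural collaborative filtering head: user/item embeddings fed to an
# MLP instead of a fixed inner product. Sizes are illustrative only.
import torch
import torch.nn as nn

class NeuralCF(nn.Module):
    def __init__(self, num_users: int, num_items: int, dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)
        self.item_emb = nn.Embedding(num_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, users: torch.Tensor, items: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # predicted interaction probability

model = NeuralCF(num_users=1000, num_items=2000)
score = model(torch.tensor([3, 7]), torch.tensor([42, 17]))
```

The MLP lets the interaction function be learned from data rather than fixed to a linear inner product, which is the contrast the excerpt draws.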
3 code implementations • 16 Aug 2017 • Xiangnan He, Hanwang Zhang, Min-Yen Kan, Tat-Seng Chua
To address this, we specifically design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique, for efficiently optimizing a MF model with variably-weighted missing data.
7 code implementations • 15 Aug 2017 • Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, Tat-Seng Chua
Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating the second-order feature interactions.
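For reference, the second-order FM score that this paper builds on (its attentional weighting of the interactions is not reproduced) can be computed with the classic O(nk) reformulation; feature and latent sizes below are arbitrary.

```python
# Factorization Machine score: w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j,
# with the pairwise term computed as 0.5 * (||V^T x||^2 - sum_i ||v_i||^2 x_i^2).
import torch

n, k = 10, 4                      # number of features, latent dimension
w0 = torch.zeros(1)
w = torch.randn(n)
V = torch.randn(n, k)

def fm_score(x: torch.Tensor) -> torch.Tensor:
    linear = w0 + x @ w
    xv = x @ V                                              # shape (k,)
    pair = 0.5 * (xv.pow(2).sum() - ((x ** 2) @ V.pow(2)).sum())
    return linear + pair

x = torch.rand(n)
print(fm_score(x))
```

The reformulation avoids the naive O(n^2 k) loop over feature pairs, which is why FMs scale to sparse, high-dimensional inputs.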
no code implementations • ICCV 2017 • Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, Shih-Fu Chang
We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation groundtruths available only at the image level.
1 code implementation • 14 May 2017 • Lizi Liao, Xiangnan He, Hanwang Zhang, Tat-Seng Chua
For social networks, besides the network structure, there also exists rich information about social actors, such as user profiles of friendship networks and textual content of citation networks.
Social and Information Networks
2 code implementations • CVPR 2017 • Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, Tat-Seng Chua
To the best of our knowledge, VTransE is the first end-to-end relation detection network.
2 code implementations • CVPR 2017 • Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua
Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image.
no code implementations • CVPR 2016 • Hanwang Zhang, Xindi Shang, Wenzhuo Yang, Huan Xu, Huanbo Luan, Tat-Seng Chua
Leveraging on the structure of the proposed collaborative learning formulation, we develop an efficient online algorithm that can jointly learn the label embeddings and visual classifiers.
no code implementations • ICCV 2015 • Xue Geng, Hanwang Zhang, Jingwen Bian, Tat-Seng Chua
It is often a great challenge for traditional recommender systems to learn representative features of both users and images in large social networks, in particular social curation networks, which are characterized by extremely sparse links between users and images and extremely diverse visual contents of images.