Search Results for author: Hanwang Zhang

Found 104 papers, 70 papers with code

Tuning Multi-mode Token-level Prompt Alignment across Modalities

no code implementations 25 Sep 2023 Dongsheng Wang, Miaoge Li, Xinyang Liu, MingSheng Xu, Bo Chen, Hanwang Zhang

To address the limitation, we propose a multi-mode token-level tuning framework that leverages the optimal transportation to learn and align a set of prompt tokens across modalities.

Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation

1 code implementation 22 Sep 2023 Zhongqi Yue, Hanwang Zhang, Qianru Sun

Domain Adaptation (DA) is always challenged by the spurious correlation between domain-invariant features (e.g., class identity) and domain-specific features (e.g., environment) that does not generalize to the target domain.

Unsupervised Domain Adaptation

Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models

no code implementations 26 Aug 2023 Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua

In this work, we investigate strengthening the awareness of video dynamics for DMs, for high-quality T2V generation.

Video Generation

Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

no code implementations 18 Aug 2023 Xuanyu Yi, Jiajun Deng, Qianru Sun, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang

We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model.

3D Shape Classification · Retrieval

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

1 code implementation 8 Aug 2023 Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang

To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM.

Image Captioning · Instruction Following

Random Boxes Are Open-world Object Detectors

1 code implementation 17 Jul 2023 Yanghao Wang, Zhongqi Yue, Xian-Sheng Hua, Hanwang Zhang

First, as the randomization is independent of the distribution of the limited known objects, the random proposals become the instrumental variable that prevents the training from being confounded by the known objects.

object-detection · Open World Object Detection

DisCo: Disentangled Control for Referring Human Dance Generation in Real World

1 code implementation 30 Jun 2023 Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

Generative AI has made significant strides in computer vision, particularly in image/video synthesis conditioned on text descriptions.

Fast Diffusion Model

1 code implementation 12 Jun 2023 Zike Wu, Pan Zhou, Kenji Kawaguchi, Hanwang Zhang

To mitigate this, we propose a Fast Diffusion Model (FDM) which improves the diffusion process of DMs from a stochastic optimization perspective to speed up both training and sampling.

Image Generation

An Overview of Challenges in Egocentric Text-Video Retrieval

no code implementations 7 Jun 2023 Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Text-video retrieval contains various challenges, including biases coming from diverse sources.

Retrieval · Video Retrieval

Decoupled Kullback-Leibler Divergence Loss

4 code implementations 23 May 2023 Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojuan Qi, Bei Yu, Hanwang Zhang

In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and observe that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels.

Adversarial Defense · Adversarial Robustness +1
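The decoupling claim can be sanity-checked on its best-known special case: with fixed soft labels p, the KL loss splits exactly into a cross-entropy term with those soft labels minus the (constant) label entropy. A minimal NumPy sketch with toy logits, not the paper's DKL training code:

```python
import numpy as np

# Verify the classic identity KL(p || q) = CE(p, q) - H(p): the KL loss with
# soft labels p decomposes into a cross-entropy term plus a term that is
# constant w.r.t. the student prediction q. Logits below are made up.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([2.0, 0.5, -1.0]))   # "teacher" soft labels
q = softmax(np.array([1.0, 1.0, 0.0]))    # "student" prediction

kl = np.sum(p * np.log(p / q))            # KL divergence loss
ce = -np.sum(p * np.log(q))               # cross-entropy with soft labels
h = -np.sum(p * np.log(p))                # entropy of the soft labels

assert np.isclose(kl, ce - h)             # KL = CE - H(p)
```

The paper's wMSE + CE decomposition is a finer, gradient-level analysis; this identity only illustrates the general decoupling idea.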

Equivariant Similarity for Vision-Language Foundation Models

1 code implementation 25 Mar 2023 Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes.

Retrieval · Text Retrieval +1

Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection

1 code implementation CVPR 2023 Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, Hanwang Zhang

At each MIL training iteration, we use the current detector to divide the samples into two groups with different context biases: the most confident abnormal/normal snippets and the rest ambiguous ones.

Anomaly Detection · Multiple Instance Learning +1

Semantic Scene Completion with Cleaner Self

1 code implementation CVPR 2023 Fengyun Wang, Dong Zhang, Hanwang Zhang, Jinhui Tang, Qianru Sun

SSC is a well-known ill-posed problem as the prediction model has to "imagine" what is behind the visible surface, which is usually represented by Truncated Signed Distance Function (TSDF).

Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection

1 code implementation 1 Feb 2023 Kaifeng Gao, Long Chen, Hanwang Zhang, Jun Xiao, Qianru Sun

Without bells and whistles, our RePro achieves a new state-of-the-art performance on two VidVRD benchmarks of not only the base training object and predicate categories, but also the unseen ones.

Video Visual Relation Detection

Debiased Fine-Tuning for Vision-language Models by Prompt Regularization

no code implementations 29 Jan 2023 Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, Hanwang Zhang

We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).

Learning Trajectory-Word Alignments for Video-Language Tasks

no code implementations 5 Jan 2023 Xu Yang, Zhangzikang Li, Haiyang Xu, Hanwang Zhang, Qinghao Ye, Chenliang Li, Ming Yan, Yu Zhang, Fei Huang, Songfang Huang

To amend this, we propose a novel TW-BERT to learn Trajectory-Word alignment by a newly designed trajectory-to-word (T2W) attention for solving video-language tasks.

Question Answering · Retrieval +4

Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery

1 code implementation CVPR 2023 Muli Yang, Liancheng Wang, Cheng Deng, Hanwang Zhang

Novel Class Discovery (NCD) aims to discover unknown classes without any annotation, by exploiting the transferable knowledge already learned from a base set of known classes.

Novel Class Discovery

Mitigating and Evaluating Static Bias of Action Representations in the Background and the Foreground

no code implementations 23 Nov 2022 Haoxin Li, YuAn Liu, Hanwang Zhang, Boyang Li

In video action recognition, shortcut static features can interfere with the learning of motion features, resulting in poor out-of-distribution (OOD) generalization.

Action Recognition · Data Augmentation +1

Attention-based Class Activation Diffusion for Weakly-Supervised Semantic Segmentation

no code implementations 20 Nov 2022 Jianqiang Huang, Jian Wang, Qianru Sun, Hanwang Zhang

An intuitive solution is "coupling" the CAM with the long-range attention matrix of visual transformers (ViT). We find that the direct "coupling", e.g., pixel-wise multiplication of attention and activation, achieves a more global coverage (on the foreground), but unfortunately goes with a great increase of false positives, i.e., background pixels are mistakenly included.

Weakly supervised Semantic Segmentation · Weakly-Supervised Semantic Segmentation

Respecting Transfer Gap in Knowledge Distillation

no code implementations 23 Oct 2022 Yulei Niu, Long Chen, Chang Zhou, Hanwang Zhang

The network response serves as additional supervision to formulate the machine domain, which uses the data collected from the human domain as a transfer set.

Knowledge Distillation

Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

1 code implementation 4 Oct 2022 Xu Yang, Hanwang Zhang, Chongyang Gao, Jianfei Cai

This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of image captioning.

Image Captioning · Visual Question Answering (VQA) +1

Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization

1 code implementation 6 Aug 2022 Jiaxin Qi, Kaihua Tang, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang

If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context.

Out-of-Distribution Generalization

Identifying Hard Noise in Long-Tailed Sample Distribution

1 code implementation 27 Jul 2022 Xuanyu Yi, Kaihua Tang, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang

Such imbalanced training data makes a classifier less discriminative for the tail classes, whose previously "easy" noises are now turned into "hard" ones -- they appear almost as outlying as the clean tail samples.

Equivariance and Invariance Inductive Bias for Learning from Insufficient Data

1 code implementation 25 Jul 2022 Tan Wang, Qianru Sun, Sugiri Pranata, Karlekar Jayashree, Hanwang Zhang

We are interested in learning robust models from insufficient data, without the need for any externally pre-trained checkpoints.

Inductive Bias

Invariant Feature Learning for Generalized Long-Tailed Classification

1 code implementation 19 Jul 2022 Kaihua Tang, Mingyuan Tao, Jiaxin Qi, Zhenguang Liu, Hanwang Zhang

In fact, even if the class is balanced, samples within each class may still be long-tailed due to the varying attributes.

Classification · Long-tail Learning

On Non-Random Missing Labels in Semi-Supervised Learning

1 code implementation ICLR 2022 Xinting Hu, Yulei Niu, Chunyan Miao, Xian-Sheng Hua, Hanwang Zhang

Our method is three-fold: 1) We propose Class-Aware Propensity (CAP) that exploits the unlabeled data to train an improved classifier using the biased labeled data.

Imputation · Pseudo Label

RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

1 code implementation 26 Jun 2022 Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality.

Retrieval · Text to Video Retrieval +1

Prompt-aligned Gradient for Prompt Tuning

1 code implementation 30 May 2022 Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, Hanwang Zhang

Thanks to the large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by "prompt", e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM-provided similarity measure between the image and the prompt sentence "a photo of a [CLASS]".

Domain Adaptation · Few-Shot Learning +1
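The prompt-based zero-shot classifier described above can be sketched as follows, with hypothetical toy embeddings standing in for CLIP's image and text encoders (the temperature of 100 follows CLIP's convention; everything else is illustrative):

```python
import numpy as np

# Zero-shot classification sketch: the score for class c is the cosine
# similarity between the image embedding and the embedding of the prompt
# "a photo of a [CLASS]", softmax-normalized over classes. Toy random
# vectors replace the real encoders here.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=64)           # stand-in for the image encoder output
text_embs = rng.normal(size=(3, 64))      # one stand-in embedding per class prompt

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

sims = l2norm(text_embs) @ l2norm(image_emb)           # cosine similarities
probs = np.exp(100 * sims) / np.exp(100 * sims).sum()  # temperature 100, as in CLIP

pred = int(np.argmax(probs))              # predicted class index
```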

Certified Robustness Against Natural Language Attacks by Causal Intervention

1 code implementation 24 May 2022 Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng, Hanwang Zhang

Deep learning models have achieved great success in many fields, yet they are vulnerable to adversarial examples.

Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation

1 code implementation CVPR 2022 Zhaozheng Chen, Tan Wang, Xiongwei Wu, Xian-Sheng Hua, Hanwang Zhang, Qianru Sun

Specifically, due to the sum-over-class pooling nature of BCE, each pixel in CAM may be responsive to multiple classes co-occurring in the same receptive field.

Weakly supervised Semantic Segmentation · Weakly-Supervised Semantic Segmentation

Deconfounded Visual Grounding

1 code implementation 31 Dec 2021 Jianqiang Huang, Yu Qin, Jiaxin Qi, Qianru Sun, Hanwang Zhang

We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that the bias is the major visual reasoning bottleneck.

Referring Expression · Visual Grounding +1

Cross-Domain Empirical Risk Minimization for Unbiased Long-tailed Classification

1 code implementation 29 Dec 2021 Beier Zhu, Yulei Niu, Xian-Sheng Hua, Hanwang Zhang

We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of tail over head, as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data -- let the test respect Zipf's law of nature -- the tail bias is no longer beneficial overall because it hurts the head majorities.
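As a toy illustration of a test distribution that "respects Zipf's law", class priors can be drawn proportional to 1/rank^s (the exponent s = 1 below is an assumption for illustration, not necessarily the paper's setting):

```python
import numpy as np

# Zipf-like class priors: frequency proportional to 1 / rank^s, so the head
# classes dominate and the tail thins out, mirroring long-tailed test data.
num_classes, s = 10, 1.0
ranks = np.arange(1, num_classes + 1)
freqs = 1.0 / ranks**s
freqs = freqs / freqs.sum()   # normalized class priors, head first, tail last
```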

Introspective Distillation for Robust Question Answering

1 code implementation NeurIPS 2021 Yulei Niu, Hanwang Zhang

Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension.

Inductive Bias · Question Answering +2

Self-Supervised Learning Disentangled Group Representation as Feature

1 code implementation NeurIPS 2021 Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, Hanwang Zhang

A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics).

Colorization · Contrastive Learning +1

Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering

1 code implementation 3 Oct 2021 Long Chen, Yuhang Zheng, Yulei Niu, Hanwang Zhang, Jun Xiao

Specifically, CSST is composed of two parts: Counterfactual Samples Synthesizing (CSS) and Counterfactual Samples Training (CST).

Question Answering · Visual Question Answering

Auto-Parsing Network for Image Captioning and Visual Question Answering

no code implementations ICCV 2021 Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai

We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures for improving the effectiveness of the Transformer-based vision-language systems.

Image Captioning · Question Answering +1

Causal Attention for Unbiased Visual Recognition

1 code implementation ICCV 2021 Tan Wang, Chang Zhou, Qianru Sun, Hanwang Zhang

The attention module does not always help deep models learn causal features that are robust in any confounding context, e.g., a foreground object feature is invariant to different backgrounds.

Are Missing Links Predictable? An Inferential Benchmark for Knowledge Graph Completion

1 code implementation ACL 2021 Yixin Cao, Xiang Ji, Xin Lv, Juanzi Li, Yonggang Wen, Hanwang Zhang

We present InferWiki, a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns.

Knowledge Graph Completion

Transporting Causal Mechanisms for Unsupervised Domain Adaptation

1 code implementation ICCV 2021 Zhongqi Yue, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang

However, the theoretical solution provided by transportability is far from practical for UDA, because it requires the stratification and representation of the unobserved confounder that is the cause of the domain gap.

Unsupervised Domain Adaptation

Adversarial Visual Robustness by Causal Intervention

2 code implementations 17 Jun 2021 Kaihua Tang, Mingyuan Tao, Hanwang Zhang

As these visual confounders are imperceptible in general, we propose to use the instrumental variable that achieves causal intervention without the need for confounder observation.

Adversarial Robustness

Empowering Language Understanding with Counterfactual Reasoning

1 code implementation Findings (ACL) 2021 Fuli Feng, Jizhi Zhang, Xiangnan He, Hanwang Zhang, Tat-Seng Chua

Present language understanding methods have demonstrated an extraordinary ability to recognize patterns in texts via machine learning.

Natural Language Inference · Sentiment Analysis

VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching

no code implementations 12 May 2021 Chenchi Zhang, Wenbo Ma, Jun Xiao, Hanwang Zhang, Jian Shao, Yueting Zhuang, Long Chen

In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware).

Image-text matching · Referring Expression +2

TransferNet: An Effective and Transparent Framework for Multi-hop Question Answering over Relation Graph

1 code implementation EMNLP 2021 Jiaxin Shi, Shulin Cao, Lei Hou, Juanzi Li, Hanwang Zhang

Multi-hop Question Answering (QA) is a challenging task because it requires precise reasoning with entity relations at every step towards the answer.

Multi-hop Question Answering · Question Answering

Causal Attention for Vision-Language Tasks

no code implementations CVPR 2021 Xu Yang, Hanwang Zhang, GuoJun Qi, Jianfei Cai

Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention.

Distilling Causal Effect of Data in Class-Incremental Learning

1 code implementation CVPR 2021 Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, Hanwang Zhang

We propose a causal framework to explain the catastrophic forgetting in Class-Incremental Learning (CIL) and then derive a novel distillation method that is orthogonal to the existing anti-forgetting techniques, such as data replay and feature/label distillation.

class-incremental learning · Class Incremental Learning +1

Counterfactual Zero-Shot and Open-Set Visual Recognition

1 code implementation CVPR 2021 Zhongqi Yue, Tan Wang, Hanwang Zhang, Qianru Sun, Xian-Sheng Hua

We show that the key reason is that the generation is not Counterfactual Faithful, and thus we propose a faithful one, whose generation is from the sample-specific counterfactual question: What would the sample look like, if we set its class attribute to a certain class, while keeping its sample attribute unchanged?

Binary Classification · Open Set Learning +1

Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning

no code implementations ACM International Conference on Multimedia 2020 Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai

We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs.

Image Paragraph Captioning

Interventional Few-Shot Learning

1 code implementation NeurIPS 2020 Zhongqi Yue, Hanwang Zhang, Qianru Sun, Xian-Sheng Hua

Specifically, we develop three effective IFSL algorithmic implementations based on the backdoor adjustment, which is essentially a causal intervention towards the SCM of many-shot learning: the upper-bound of FSL in a causal view.

Few-Shot Learning
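The backdoor adjustment that the IFSL algorithms above build on has a simple discrete form, P(y | do(x)) = Σ_z P(y | x, z) P(z). A toy sketch with a binary confounder and made-up numbers, not the paper's implementation:

```python
import numpy as np

# Backdoor adjustment on a toy discrete SCM: average the conditional
# P(y | x, z) over the confounder's marginal P(z) instead of over
# P(z | x), which removes the confounding path. All numbers are made up.
p_z = np.array([0.7, 0.3])                # P(z), marginal of the confounder
p_y_given_xz = np.array([[0.9, 0.4],      # P(y=1 | x=0, z=0), P(y=1 | x=0, z=1)
                         [0.2, 0.6]])     # P(y=1 | x=1, z=0), P(y=1 | x=1, z=1)

def p_y_do_x(x):
    # Interventional probability of y=1 under do(X = x)
    return float(np.sum(p_y_given_xz[x] * p_z))

effect = p_y_do_x(1) - p_y_do_x(0)        # average causal effect of x on y
```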

Clicks can be Cheating: Counterfactual Recommendation for Mitigating Clickbait Issue

1 code implementation 21 Sep 2020 Wenjie Wang, Fuli Feng, Xiangnan He, Hanwang Zhang, Tat-Seng Chua

However, we argue that there is a significant gap between clicks and user satisfaction -- it is common that a user is "cheated" to click an item by the attractive title/cover of the item.

Click-Through Rate Prediction · Counterfactual Inference

Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

1 code implementation 3 Sep 2020 Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, Shih-Fu Chang

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals.

Referring Expression · Vocal Bursts Valence Prediction

Feature Pyramid Transformer

1 code implementation ECCV 2020 Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun

Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales.

Instance Segmentation · object-detection +2

Counterfactual VQA: A Cause-Effect Look at Language Bias

1 code implementation CVPR 2021 Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen

VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language.

Counterfactual Inference · Question Answering +1

Iterative Context-Aware Graph Inference for Visual Dialog

1 code implementation CVPR 2020 Dan Guo, Hui Wang, Hanwang Zhang, Zheng-Jun Zha, Meng Wang

Visual dialog is a challenging task that requires the comprehension of the semantic dependencies among implicit visual and textual contexts.

Graph Attention · Graph Embedding +1

Learning to Segment the Tail

1 code implementation CVPR 2020 Xinting Hu, Yi Jiang, Kaihua Tang, Jingyuan Chen, Chunyan Miao, Hanwang Zhang

Real-world visual recognition requires handling the extreme sample imbalance in large-scale long-tailed data.

Few-Shot Learning · Incremental Learning

More Grounded Image Captioning by Distilling Image-Text Matching Model

1 code implementation CVPR 2020 Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang

To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect the word-region alignment as strong supervision.

Image Captioning · Image-text matching +3

Counterfactual Samples Synthesizing for Robust Visual Question Answering

2 code implementations CVPR 2020 Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, ShiLiang Pu, Yueting Zhuang

To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of targeted VQA model, and achieve dominating performance on VQA-CP.

Ranked #1 on Visual Question Answering (VQA) on VQA-CP (using extra training data)

Question Answering · Visual Question Answering

Deconfounded Image Captioning: A Causal Retrospect

no code implementations 9 Mar 2020 Xu Yang, Hanwang Zhang, Jianfei Cai

The dataset bias in vision-language tasks is becoming one of the main problems that hinder the progress of our community.

Causal Inference · Image Captioning

Cross-GCN: Enhancing Graph Convolutional Network with $k$-Order Feature Interactions

no code implementations 5 Mar 2020 Fuli Feng, Xiangnan He, Hanwang Zhang, Tat-Seng Chua

Graph Convolutional Network (GCN) is an emerging technique that performs learning and reasoning on graph data.

Document Classification

Visual Commonsense R-CNN

2 code implementations CVPR 2020 Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA.

Image Captioning · Representation Learning +1

Unbiased Scene Graph Generation from Biased Training

6 code implementations CVPR 2020 Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang

Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach".

Causal Inference · Graph Generation +1

General Partial Label Learning via Dual Bipartite Graph Autoencoder

no code implementations 5 Jan 2020 Brian Chen, Bo Wu, Alireza Zareian, Hanwang Zhang, Shih-Fu Chang

Compared to the traditional Partial Label Learning (PLL) problem, GPLL relaxes the supervision assumption from instance-level -- a label set partially labels an instance -- to group-level: 1) a label set partially labels a group of instances, where the within-group instance-label link annotations are missing, and 2) cross-group links are allowed -- instances in a group may be partially linked to the label set from another group.

Partial Label Learning

Two Causal Principles for Improving Visual Dialog

1 code implementation CVPR 2020 Jiaxin Qi, Yulei Niu, Jianqiang Huang, Hanwang Zhang

This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial).

Visual Dialog · Vocal Bursts Valence Prediction

Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

no code implementations 8 Jul 2019 Yulei Niu, Hanwang Zhang, Zhiwu Lu, Shih-Fu Chang

Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced.

Multiple Instance Learning · Referring Expression

Joint Visual Grounding with Language Scene Graphs

no code implementations 9 Jun 2019 Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Meng Wang, Qianru Sun

In this paper, we alleviate the missing-annotation problem and enable the joint reasoning by leveraging the language scene graph which covers both labeled referent and unlabeled contexts (other objects, attributes, and relationships).

Referring Expression · Visual Grounding

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

1 code implementation 6 Jun 2019 Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, Feng Wu

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning.

Image Captioning · Image Paragraph Captioning +1

Learning to Compose and Reason with Language Tree Structures for Visual Grounding

no code implementations 5 Jun 2019 Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, Hanwang Zhang

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained and compositional language space.

Visual Grounding · Visual Reasoning

Learning to Collocate Neural Modules for Image Captioning

no code implementations ICCV 2019 Xu Yang, Hanwang Zhang, Jianfei Cai

To this end, we make the following technical contributions for CNM training: 1) compact module design --- one for function words and three for visual content words (e.g., noun, adjective, and verb), 2) soft module fusion and multi-step module execution, robustifying the visual reasoning in partial observation, 3) a linguistic loss for module controller being faithful to part-of-speech collocations (e.g., adjective is before noun).

Image Captioning · Visual Question Answering (VQA) +1

Making History Matter: History-Advantage Sequence Training for Visual Dialog

no code implementations ICCV 2019 Tianhao Yang, Zheng-Jun Zha, Hanwang Zhang

We study the multi-round response generation in visual dialog, where a response is generated according to a visually grounded conversational history.

Answer Generation · Response Generation +2

Learning to Assemble Neural Module Tree Networks for Visual Grounding

no code implementations ICCV 2019 Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha

In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed.

Dependency Parsing · Natural Language Visual Grounding +4

Auto-Encoding Scene Graphs for Image Captioning

2 code implementations CVPR 2019 Xu Yang, Kaihua Tang, Hanwang Zhang, Jianfei Cai

We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions.

Image Captioning · Inductive Bias

Recursive Visual Attention in Visual Dialog

1 code implementation CVPR 2019 Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen

Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image.

Question Answering · Visual Dialog +1

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

no code implementations ICCV 2019 Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, ShiLiang Pu, Shih-Fu Chang

CMAT is a multi-agent policy gradient method that frames objects as cooperative agents, and then directly maximizes a graph-level metric as the reward.

Graph Generation · Scene Graph Generation +1

Learning to Compose Dynamic Tree Structures for Visual Contexts

6 code implementations CVPR 2019 Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, Wei Liu

We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.

Graph Generation · Panoptic Scene Graph Generation +2

Explainable and Explicit Visual Reasoning over Scene Graphs

2 code implementations CVPR 2019 Jiaxin Shi, Hanwang Zhang, Juanzi Li

We aim to dismantle the prevalent black-box neural architectures used in complex visual reasoning tasks, into the proposed eXplainable and eXplicit Neural Modules (XNMs), which advance beyond existing neural module networks towards using scene graphs --- objects as nodes and the pairwise relationships as edges --- for explainable and explicit reasoning with structured knowledge.

Inductive Bias · Visual Question Answering (VQA) +1

Learning to Embed Sentences Using Attentive Recursive Trees

2 code implementations 6 Nov 2018 Jiaxin Shi, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang

Sentence embedding is an effective feature representation for most deep learning-based NLP tasks.

Sentence Embedding · Sentence-Embedding

Stochastic Dynamics for Video Infilling

no code implementations 1 Sep 2018 Qiangeng Xu, Hanwang Zhang, Weiyue Wang, Peter N. Belhumeur, Ulrich Neumann

In this paper, we introduce a stochastic dynamics video infilling (SDVI) framework to generate frames between long intervals in a video.

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

1 code implementation 16 Aug 2018 Daqing Liu, Zheng-Jun Zha, Hanwang Zhang, Yongdong Zhang, Feng Wu

To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning.

Image Captioning · Reinforcement Learning (RL)

Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features

1 code implementation ECCV 2018 Xu Yang, Hanwang Zhang, Jianfei Cai

By "agnostic", we mean that the feature is less likely biased to the classes of paired objects.

Discrete Factorization Machines for Fast Feature-based Recommendation

1 code implementation 6 May 2018 Han Liu, Xiangnan He, Fuli Feng, Liqiang Nie, Rui Liu, Hanwang Zhang

In this paper, we develop a generic feature-based recommendation model, called Discrete Factorization Machine (DFM), for fast and accurate recommendation.

Binarization · Quantization

Learning to Guide Decoding for Image Captioning

no code implementations 3 Apr 2018 Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu

Recently, much advance has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task.

Image Captioning

Self-Supervised Video Hashing with Hierarchical Binary Auto-encoder

no code implementations 7 Feb 2018 Jingkuan Song, Hanwang Zhang, Xiangpeng Li, Lianli Gao, Meng Wang, Richang Hong

Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization, which have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss.

Binarization Retrieval +1
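A short sketch of the relaxed-then-binarize baseline that the abstract argues against (illustrative only; variable names are assumptions): real-valued codes are learned first and signs are taken afterwards, and the gap between the two is the quantization error a joint binary optimization tries to avoid.

```python
import numpy as np

def binarize(embeddings):
    """Two-stage baseline: take the sign of relaxed real-valued codes.
    Returns the binary codes and the quantization error introduced."""
    codes = np.where(embeddings >= 0, 1, -1)
    error = float(np.linalg.norm(embeddings - codes))
    return codes, error

emb = np.array([[0.9, -0.2, 0.1],
                [-1.1, 0.4, -0.6]])
codes, err = binarize(emb)
```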

Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks

1 code implementation CVPR 2018 Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, Shih-Fu Chang

We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training.

General Classification Zero-Shot Learning

Grounding Referring Expressions in Images by Variational Context

1 code implementation CVPR 2018 Hanwang Zhang, Yulei Niu, Shih-Fu Chang

This is a general yet challenging vision-language task since it requires not only the localization of objects, but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help distinguish the referent from other objects, especially those of the same category.

Multiple Instance Learning Referring Expression

Neural Collaborative Filtering

42 code implementations WWW 2017 Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, Tat-Seng Chua

When it comes to modeling the key factor in collaborative filtering, namely the interaction between user and item features, existing methods still resort to matrix factorization and apply an inner product on the latent features of users and items.

Collaborative Filtering Recommendation Systems
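The contrast the abstract draws can be sketched in a few lines (a toy illustration, not the paper's NCF architecture; the weights here are random stand-ins for trained parameters): matrix factorization scores with a fixed inner product, while the neural alternative lets a learned nonlinearity model the user-item interaction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # latent dimension (illustrative)
u = rng.normal(size=d)         # user latent factors
v = rng.normal(size=d)         # item latent factors

# Matrix factorization: a fixed inner product on the latent features.
mf_score = float(u @ v)

# Neural interaction: a one-hidden-layer MLP over the concatenated
# factors replaces the inner product (weights are untrained stand-ins).
W1 = rng.normal(size=(16, 2 * d))
w2 = rng.normal(size=16)
h = np.maximum(W1 @ np.concatenate([u, v]), 0.0)  # ReLU hidden layer
ncf_score = float(w2 @ h)
```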

Fast Matrix Factorization for Online Recommendation with Implicit Feedback

3 code implementations16 Aug 2017 Xiangnan He, Hanwang Zhang, Min-Yen Kan, Tat-Seng Chua

To address this, we specifically design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique, for efficiently optimizing a MF model with variably-weighted missing data.
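The element-wise idea can be sketched as a coordinate-descent sweep for weighted matrix factorization (in the spirit of eALS, but not the paper's optimized implementation; function and variable names are assumptions): each latent factor of a user is updated in closed form while the others are held fixed, with per-entry weights so that missing data can be down-weighted rather than dropped.

```python
import numpy as np

def eals_user_update(u, V, r, w, lam):
    """One element-wise ALS sweep over the factors of a single user.
    u: (k,) user factors; V: (n_items, k) item factors;
    r: (n_items,) observed values; w: (n_items,) per-entry weights;
    lam: L2 regularizer. Each coordinate update is exact, so the
    weighted squared loss never increases."""
    pred = V @ u
    for f in range(len(u)):
        # Residual with factor f's own contribution removed.
        resid = r - (pred - V[:, f] * u[f])
        num = np.sum(w * resid * V[:, f])
        den = np.sum(w * V[:, f] ** 2) + lam
        new_uf = num / den
        pred += V[:, f] * (new_uf - u[f])   # keep predictions in sync
        u[f] = new_uf
    return u
```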

Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

6 code implementations15 Aug 2017 Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, Tat-Seng Chua

Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating the second-order feature interactions.
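The second-order term the abstract refers to can be written down directly, together with an attention-weighted variant in the spirit of AFM (a sketch only: the attention matrix is supplied here, whereas the paper learns it with a small attention network).

```python
import numpy as np

def fm_second_order(x, V):
    """Sum of pairwise interactions <v_i, v_j> x_i x_j, via the
    O(n*k) identity 0.5 * sum_f ((Vx)_f^2 - sum_i v_{if}^2 x_i^2)."""
    s = V.T @ x
    return 0.5 * float(s @ s - np.sum((V ** 2).T @ (x ** 2)))

def afm_second_order(x, V, attn):
    """AFM-style variant: attn[i, j] rescales each pairwise term
    (here a given matrix; learned by an attention net in the paper)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += attn[i, j] * float(V[i] @ V[j]) * x[i] * x[j]
    return total
```

With all attention weights set to 1, the AFM term collapses back to the plain FM term, which is a handy sanity check.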

PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN

no code implementations ICCV 2017 Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, Shih-Fu Chang

We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation groundtruths available only at the image level.

Object Detection Weakly Supervised Object Detection

Attributed Social Network Embedding

1 code implementation14 May 2017 Lizi Liao, Xiangnan He, Hanwang Zhang, Tat-Seng Chua

For social networks, besides the network structure, there also exists rich information about social actors, such as user profiles of friendship networks and textual content of citation networks.

Social and Information Networks

SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning

2 code implementations CVPR 2017 Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, Tat-Seng Chua

Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding the input image.

Image Captioning
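What "spatial probabilities that re-weight the feature map" means can be sketched as follows (shapes and the attention parameterization are simplifying assumptions, not SCA-CNN itself): a softmax over the H x W locations of a conv feature map yields one attended feature vector.

```python
import numpy as np

def spatial_attention(feat, query_w):
    """Attend over the spatial grid of a conv feature map.
    feat: (C, H, W) feature map; query_w: (C,) stand-in for learned
    attention parameters. Returns a C-dim attended feature vector."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)                   # C x L
    scores = query_w @ flat                         # L attention logits
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # spatial probabilities
    return flat @ probs                             # weighted average

feat = np.random.default_rng(1).normal(size=(4, 3, 3))
out = spatial_attention(feat, np.ones(4))
```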

Online Collaborative Learning for Open-Vocabulary Visual Classifiers

no code implementations CVPR 2016 Hanwang Zhang, Xindi Shang, Wenzhuo Yang, Huan Xu, Huanbo Luan, Tat-Seng Chua

Leveraging on the structure of the proposed collaborative learning formulation, we develop an efficient online algorithm that can jointly learn the label embeddings and visual classifiers.

Learning Image and User Features for Recommendation in Social Networks

no code implementations ICCV 2015 Xue Geng, Hanwang Zhang, Jingwen Bian, Tat-Seng Chua

It is often a great challenge for traditional recommender systems to learn representative features of both users and images in large social networks, in particular social curation networks, which are characterized by extremely sparse links between users and images and extremely diverse visual content.

Recommendation Systems
