Search Results for author: Xiaohan Wang

Found 55 papers, 36 papers with code

Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models

1 code implementation · 19 Mar 2024 · Elaine Sui, Xiaohan Wang, Serena Yeung-Levy

Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting.

Prompt Engineering Zero-shot Generalization +1

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

no code implementations · 15 Mar 2024 · Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.

Language Modelling Large Language Model +2

Editing Conceptual Knowledge for Large Language Models

1 code implementation · 10 Mar 2024 · Xiaohan Wang, Shengyu Mao, Ningyu Zhang, Shumin Deng, Yunzhi Yao, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen

Recently, there has been a growing interest in knowledge editing for Large Language Models (LLMs).

knowledge editing

DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

1 code implementation · 19 Jan 2024 · Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang

(2) Equipping the visual and text encoder with separated prompts failed to mitigate the visual-text modality gap.

Retrieval Video Retrieval

Describing Differences in Image Sets with Natural Language

1 code implementation · 5 Dec 2023 · Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy

To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning.

Language Modelling

Exploring Large Language Models for Human Mobility Prediction under Public Events

no code implementations · 29 Nov 2023 · Yuebing Liang, Yichao Liu, Xiaohan Wang, Zhan Zhao

Accurate human mobility prediction for public events is thus crucial for event planning as well as traffic or crowd management.

Misinformation

IcoCap: Improving Video Captioning by Compounding Images

no code implementations · IEEE Transactions on Multimedia 2023 · Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, Yi Yang

Video captioning is a more challenging task compared to image captioning, primarily due to differences in content density.

Ranked #5 on Video Captioning on VATEX (using extra training data)

Image Captioning Video Captioning

Editing Personality for Large Language Models

1 code implementation · 3 Oct 2023 · Shengyu Mao, Xiaohan Wang, Mengru Wang, Yong Jiang, Pengjun Xie, Fei Huang, Ningyu Zhang

This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits.

DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion

1 code implementation · 4 Sep 2023 · Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang

We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite the recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity.

Ranked #2 on Motion Synthesis on HumanML3D (using extra training data)

Language Modelling Motion Synthesis

EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models

2 code implementations · 14 Aug 2023 · Peng Wang, Ningyu Zhang, Bozhong Tian, Zekun Xi, Yunzhi Yao, Ziwen Xu, Mengru Wang, Shengyu Mao, Xiaohan Wang, Siyuan Cheng, Kangwei Liu, Yuansheng Ni, Guozhou Zheng, Huajun Chen

Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data.

knowledge editing

Bird's-Eye-View Scene Graph for Vision-Language Navigation

1 code implementation · ICCV 2023 · Rui Liu, Xiaohan Wang, Wenguan Wang, Yi Yang

Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has shown great advances.

Navigate Vision-Language Navigation

Methods for Acquiring and Incorporating Knowledge into Stock Price Prediction: A Survey

no code implementations · 9 Aug 2023 · Liping Wang, Jiawei Li, Lifan Zhao, Zhizhuo Kou, Xiaohan Wang, Xinyi Zhu, Hao Wang, Yanyan Shen, Lei Chen

Predicting stock prices presents a challenging research problem due to the inherent volatility and non-linear nature of the stock market.

Stock Price Prediction

JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery

1 code implementation · ICCV 2023 · Jiahao Li, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang

Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations for achieving 2D&3D aligned results in a coarse-to-fine manner and a novel 3D joint contrastive learning approach for adding explicitly global supervision for the 3D feature space.

Contrastive Learning Human Mesh Recovery

Clustering based Point Cloud Representation Learning for 3D Analysis

1 code implementation · ICCV 2023 · Tuo Feng, Wenguan Wang, Xiaohan Wang, Yi Yang, Qinghua Zheng

The mined patterns are, in turn, used to repaint the embedding space, so as to respect the underlying distribution of the entire training dataset and improve the robustness to the variations.

Clustering Point Cloud Segmentation +2

Action Sensitivity Learning for the Ego4D Episodic Memory Challenge 2023

1 code implementation · 15 Jun 2023 · Jiayi Shao, Xiaohan Wang, Ruijie Quan, Yi Yang

This report presents the ReLER submission to two tracks in the Ego4D Episodic Memory Benchmark in CVPR 2023, including Natural Language Queries and Moment Queries.

Moment Queries Natural Language Queries

Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval

no code implementations · 3 Jun 2023 · Xu Zhang, Zhedong Zheng, Xiaohan Wang, Yi Yang

We propose a novel Consensus Network (Css-Net) that self-adaptively learns from noisy triplets to minimize the negative effects of triplet ambiguity.

Image Retrieval Image Retrieval with Multi-Modal Query +1

Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

1 code implementation · 29 May 2023 · Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and sampled results from the VLM output distribution.

Image Captioning Image Classification +5

Whitening-based Contrastive Learning of Sentence Embeddings

1 code implementation · 28 May 2023 · Wenjie Zhuo, Yifan Sun, Xiaohan Wang, Linchao Zhu, Yi Yang

Consequently, using multiple positive samples with enhanced diversity further improves contrastive learning due to better alignment.

Contrastive Learning Semantic Textual Similarity +4

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

1 code implementation · 23 May 2023 · Shuai Zhao, Xiaohan Wang, Linchao Zhu, Ruijie Quan, Yi Yang

With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.

Ranked #1 on Scene Text Recognition on WOST (using extra training data)

Language Modelling Scene Text Recognition

Gloss-Free End-to-End Sign Language Translation

1 code implementation · 22 May 2023 · Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, Yi Yang

In this paper, we tackle the problem of sign language translation (SLT) without gloss annotations.

Sign Language Translation Translation

LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities

1 code implementation · 22 May 2023 · Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang

We engage in experiments across eight diverse datasets, focusing on four representative tasks encompassing entity and relation extraction, event extraction, link prediction, and question-answering, thereby thoroughly exploring LLMs' performance in the domain of construction and inference.

Event Extraction graph construction +4

Continual Multimodal Knowledge Graph Construction

1 code implementation · 15 May 2023 · Xiang Chen, Ningyu Zhang, Jintian Zhang, Xiaohan Wang, Tongtong Wu, Xi Chen, Yongheng Wang, Huajun Chen

Multimodal Knowledge Graph Construction (MKGC) involves creating structured representations of entities and relations using multiple modalities, such as text and images.

Continual Learning graph construction +1

How to Unleash the Power of Large Language Models for Few-shot Relation Extraction?

2 code implementations · 2 May 2023 · Xin Xu, Yuqi Zhu, Xiaohan Wang, Ningyu Zhang

Scaling language models has revolutionized a wide range of NLP tasks, yet few-shot relation extraction with large language models has not been comprehensively explored.

In-Context Learning Language Modelling +3

Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation

1 code implementation · CVPR 2023 · Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang

However, using a single kind of modeling structure makes it difficult to balance the learning of short-term and long-term temporal correlations, and may bias the network toward one of them, leading to undesirable predictions such as global location shift, temporal inconsistency, and insufficient local details.

3D human pose and shape estimation

Lana: A Language-Capable Navigator for Instruction Following and Generation

1 code implementation · CVPR 2023 · Xiaohan Wang, Wenguan Wang, Jiayi Shao, Yi Yang

Recently, visual-language navigation (VLN), which requires robot agents to follow navigation instructions, has shown great advances.

Instruction Following Text Generation

MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects

1 code implementation · ICCV 2023 · Yuanzhi Liang, Xiaohan Wang, Linchao Zhu, Yi Yang

Experimental results and visualizations, based on a large-scale dataset PartNet-Mobility, show the effectiveness of MAAL in learning multi-modal data and solving the 3D articulated object affordance problem.

Object

Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation

no code implementations · CVPR 2023 · Guangrui Li, Guoliang Kang, Xiaohan Wang, Yunchao Wei, Yi Yang

With the help of adversarial training, the masking module can learn to generate source masks to mimic the pattern of irregular target noise, thereby narrowing the domain gap.

Point Cloud Segmentation Semantic Segmentation

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

5 code implementations · CVPR 2023 · Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.

Action Classification Action Recognition +3

EASpace: Enhanced Action Space for Policy Transfer

1 code implementation · 7 Dec 2022 · Zheng Zhang, Qingrui Zhang, Bo Zhu, Xiaohan Wang, Tianjiang Hu

In this paper, a novel algorithm named EASpace (Enhanced Action Space) is proposed, which formulates macro actions in an alternative form to accelerate the learning process using multiple available sub-optimal expert policies.

Q-Learning Transfer Learning

ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022

1 code implementation · 17 Nov 2022 · Jiayi Shao, Xiaohan Wang, Yi Yang

Moreover, in order to better capture the long-term temporal dependencies in the long videos, we propose a segment-level recurrence mechanism.

Moment Queries Temporal Action Localization

LambdaKG: A Library for Pre-trained Language Model-Based Knowledge Graph Embeddings

2 code implementations · 1 Oct 2022 · Xin Xie, Zhoubo Li, Xiaohan Wang, Zekun Xi, Ningyu Zhang

Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information.

Graph Representation Learning Knowledge Graph Embeddings +3

Slimmable Networks for Contrastive Self-supervised Learning

no code implementations · 30 Sep 2022 · Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang

In this work, we present a one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR).

Contrastive Learning Knowledge Distillation +1

ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022

1 code implementation · 1 Jul 2022 · Naiyuan Liu, Xiaohan Wang, Xiaobo Li, Yi Yang, Yueting Zhuang

In this report, we present the ReLER@ZJU-Alibaba submission to the Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2022.

Data Augmentation Natural Language Queries

CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

1 code implementation · 2 May 2022 · Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.

Ranked #11 on Video Retrieval on MSVD (using extra training data)

Clustering Retrieval +1

Scalable Video Object Segmentation with Identification Mechanism

2 code implementations · 22 Mar 2022 · Zongxin Yang, Jiaxu Miao, Yunchao Wei, Wenguan Wang, Xiaohan Wang, Yi Yang

This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).

Object Segmentation +3

Multi-robot Cooperative Pursuit via Potential Field-Enhanced Reinforcement Learning

no code implementations · 9 Mar 2022 · Zheng Zhang, Xiaohan Wang, Qingrui Zhang, Tianjiang Hu

It is shown by numerical simulations that the proposed hybrid design outperforms the pursuit policies either learned from vanilla reinforcement learning or designed by the potential field method.

reinforcement-learning Reinforcement Learning (RL)

Action Keypoint Network for Efficient Video Recognition

no code implementations · 17 Jan 2022 · Xu Chen, Yahong Han, Xiaohan Wang, Yifan Sun, Yi Yang

An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods.

Action Recognition Point Cloud Classification +1

Reasoning Through Memorization: Nearest Neighbor Knowledge Graph Embeddings

1 code implementation · 14 Jan 2022 · Peng Wang, Xin Xie, Xiaohan Wang, Ningyu Zhang

Previous knowledge graph embedding approaches usually map entities to representations and utilize score functions to predict the target entities, yet they typically struggle to reason about rare or emerging unseen entities.

Knowledge Graph Embedding Knowledge Graph Embeddings +2

Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark

1 code implementation · CVPR 2022 · Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang

In contrast, our large-scale VIdeo Panoptic Segmentation in the Wild (VIPSeg) dataset provides 3,536 videos and 84,750 frames with pixel-level panoptic annotations, covering a wide range of real-world scenarios and categories.

Segmentation Video Panoptic Segmentation

Self-supervised Point Cloud Representation Learning via Separating Mixed Shapes

1 code implementation · 1 Sep 2021 · Chao Sun, Zhedong Zheng, Xiaohan Wang, Mingliang Xu, Yi Yang

Albeit simple, the pre-trained encoder can capture the key points of an unseen point cloud and surpasses the encoder trained from scratch on downstream tasks.

3D Part Segmentation 3D Point Cloud Classification +3

PR-RRN: Pairwise-Regularized Residual-Recursive Networks for Non-rigid Structure-from-Motion

no code implementations · ICCV 2021 · Haitian Zeng, Yuchao Dai, Xin Yu, Xiaohan Wang, Yi Yang

As NRSfM is a highly under-constrained problem, we propose two new pairwise regularizations to further regularize the reconstruction.

Less is More: Sparse Sampling for Dense Reaction Predictions

no code implementations · 3 Jun 2021 · Kezhou Lin, Xiaohan Wang, Zhedong Zheng, Linchao Zhu, Yi Yang

Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze the video performance and improve the future user experience.

Connecting Language and Vision for Natural Language-Based Vehicle Retrieval

1 code implementation · 31 May 2021 · Shuai Bai, Zhedong Zheng, Xiaohan Wang, Junyang Lin, Zhu Zhang, Chang Zhou, Yi Yang, Hongxia Yang

In this paper, we apply one new modality, i.e., the language description, to search the vehicle of interest and explore the potential of this task in the real-world scenario.

Language Modelling Management +2

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

1 code implementation · CVPR 2021 · Xiaohan Wang, Linchao Zhu, Yi Yang

Moreover, a global alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective.

Retrieval Video Retrieval

Learning to Anticipate Egocentric Actions by Imagination

no code implementations · 13 Jan 2021 · Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, Fei Wu

We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content.

Action Anticipation Autonomous Driving +1

Interactive Prototype Learning for Egocentric Action Recognition

no code implementations · ICCV 2021 · Xiaohan Wang, Linchao Zhu, Heng Wang, Yi Yang

To avoid these additional costs, we propose an end-to-end Interactive Prototype Learning (IPL) framework to learn better active object representations by leveraging the motion cues from the actor.

Action Recognition Object +1

Variable-Viewpoint Representations for 3D Object Recognition

no code implementations · 8 Feb 2020 · Tengyu Ma, Joel Michelson, James Ainooson, Deepayan Sanyal, Xiaohan Wang, Maithilee Kunda

For the problem of 3D object recognition, researchers using deep learning methods have developed several very different input representations, including "multi-view" snapshots taken from discrete viewpoints around an object, as well as "spherical" representations consisting of a dense map of essentially ray-traced samples of the object from all directions.

3D Object Recognition Object

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

no code implementations · 8 Feb 2020 · Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other branch for noun classification.

Action Recognition Egocentric Activity Recognition +5

Baidu-UTS Submission to the EPIC-Kitchens Action Recognition Challenge 2019

no code implementations · 22 Jun 2019 · Xiaohan Wang, Yu Wu, Linchao Zhu, Yi Yang

In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019.

Action Recognition Object +2

The Toybox Dataset of Egocentric Visual Object Transformations

no code implementations · 15 Jun 2018 · Xiaohan Wang, Tengyu Ma, James Ainooson, Seunghwan Cha, Xiaotian Wang, Azhar Molla, Maithilee Kunda

In object recognition research, many commonly used datasets (e.g., ImageNet and similar) contain relatively sparse distributions of object instances and views, e.g., one might see a thousand different pictures of a thousand different giraffes, mostly taken from a few conventionally photographed angles.

Object Object Recognition +1
