Search Results for author: Wei Ji

Found 62 papers, 37 papers with code

Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images

no code implementations 6 Oct 2018 Yiming Wu, Wei Ji, Xi Li, Gang Wang, Jianwei Yin, Fei Wu

As a fundamental and challenging problem in computer vision, hand pose estimation aims to estimate the hand joint locations from depth images.

Hand Pose Estimation

An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety

no code implementations 30 Apr 2020 Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller

In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety.

Sleep Quality

Accurate RGB-D Salient Object Detection via Collaborative Learning

2 code implementations ECCV 2020 Wei Ji, Jingjing Li, Miao Zhang, Yongri Piao, Huchuan Lu

The explicitly extracted edge information goes together with saliency to give more emphasis to the salient regions and object boundaries.

Object object-detection +5

ChemistryQA: A Complex Question Answering Dataset from Chemistry

no code implementations 1 Jan 2021 Zhuoyu Wei, Wei Ji, Xiubo Geng, Yining Chen, Baihua Chen, Tao Qin, Daxin Jiang

We notice that some real-world QA tasks are more complex and can neither be solved by end-to-end neural networks nor translated into any kind of formal representation.

Machine Reading Comprehension Math +1

Boundary Proposal Network for Two-Stage Natural Language Video Localization

no code implementations 15 Mar 2021 Shaoning Xiao, Long Chen, Songyang Zhang, Wei Ji, Jian Shao, Lu Ye, Jun Xiao

State-of-the-art NLVL methods almost all follow a one-stage design and can typically be grouped into two categories: 1) anchor-based approaches, which first pre-define a series of video segment candidates (e.g., by sliding window) and then classify each candidate; 2) anchor-free approaches, which directly predict the probability of each video frame being a boundary or an intermediate frame of the positive segment.
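For readers unfamiliar with the anchor-based branch described above, here is a minimal sketch of sliding-window candidate generation; the window sizes and stride are illustrative assumptions, not values from this paper.

```python
def sliding_window_candidates(num_frames, window_sizes=(16, 32, 64), stride=8):
    """Enumerate (start, end) segment proposals, the typical first step of an
    anchor-based NLVL pipeline; each candidate is later scored against the query."""
    candidates = []
    for w in window_sizes:
        for start in range(0, max(num_frames - w + 1, 1), stride):
            candidates.append((start, min(start + w, num_frames)))
    return candidates

# A 128-frame video yields a multi-scale pool of candidate segments.
print(len(sliding_window_candidates(128)))
```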

Vocal Bursts Valence Prediction

Conditional Hyper-Network for Blind Super-Resolution with Multiple Degradations

1 code implementation 8 Apr 2021 Guanghao Yin, Wei Wang, Zehuan Yuan, Wei Ji, Dongdong Yu, Shouqian Sun, Tat-Seng Chua, Changhu Wang

We extract degradation prior at task-level with the proposed ConditionNet, which will be used to adapt the parameters of the basic SR network (BaseNet).
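The module names ConditionNet and BaseNet come from the abstract, but their internals are not given here; the sketch below is one plausible reading in which a task-level degradation embedding rescales the channels of the SR trunk, not the released implementation.

```python
import torch
import torch.nn as nn

class ConditionNet(nn.Module):
    """Toy degradation-prior extractor: pools LR patches of one task into a single embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, lr_patches):                     # (N, 3, h, w) patches, same degradation
        return self.encoder(lr_patches).mean(dim=0)    # (dim,) task-level prior

class BaseNet(nn.Module):
    """Toy SR trunk whose features are modulated by the degradation prior."""
    def __init__(self, dim=64, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, dim, 3, padding=1)
        self.to_gamma = nn.Linear(dim, dim)            # prior -> per-channel scaling
        self.tail = nn.Sequential(nn.Conv2d(dim, 3 * scale * scale, 3, padding=1),
                                  nn.PixelShuffle(scale))

    def forward(self, lr, prior):
        feat = torch.relu(self.head(lr))
        gamma = self.to_gamma(prior).view(1, -1, 1, 1)
        return self.tail(feat * gamma)

lr = torch.randn(2, 3, 24, 24)
sr = BaseNet()(lr, ConditionNet()(lr))                 # (2, 3, 48, 48)
```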

Blind Super-Resolution Image Super-Resolution

Deep Learning for Weakly-Supervised Object Detection and Object Localization: A Survey

no code implementations 26 May 2021 Feifei Shao, Long Chen, Jian Shao, Wei Ji, Shaoning Xiao, Lu Ye, Yueting Zhuang, Jun Xiao

With the success of deep neural networks in object detection, both WSOD and WSOL have received unprecedented attention.

Object object-detection +2

Deconfounded Video Moment Retrieval with Causal Intervention

1 code implementation 3 Jun 2021 Xun Yang, Fuli Feng, Wei Ji, Meng Wang, Tat-Seng Chua

To fill the research gap, we propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
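The abstract does not spell out the intervention, so the snippet below only illustrates the generic backdoor-adjustment recipe, P(Y|do(X)) = Σ_z P(Y|X,z)P(z); the confounder dictionary and the predictor are hypothetical placeholders, not the paper's components.

```python
import torch

def backdoor_adjusted_score(query_feat, video_feat, confounders, prior, predictor):
    """Approximate P(Y | do(query, video)) by averaging the predictor's output
    over a fixed dictionary of confounder embeddings weighted by their priors."""
    scores = [p_z * predictor(query_feat, video_feat, z)
              for z, p_z in zip(confounders, prior)]   # z: (d,) embedding, p_z: scalar P(z)
    return torch.stack(scores).sum(dim=0)
```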

Moment Retrieval Retrieval

Calibrated RGB-D Salient Object Detection

1 code implementation CVPR 2021 Wei Ji, Jingjing Li, Shuang Yu, Miao Zhang, Yongri Piao, Shunyu Yao, Qi Bi, Kai Ma, Yefeng Zheng, Huchuan Lu, Li Cheng

Complex backgrounds and similar appearances between objects and their surroundings are generally recognized as challenging scenarios in Salient Object Detection (SOD).

Object object-detection +3

Advancing biological super-resolution microscopy through deep learning: a brief review

no code implementations 24 Jun 2021 Tianjie Yang, Yaoru Luo, Wei Ji, Ge Yang

We conclude with an outlook on how deep learning could shape the future of this new generation of light microscopy technology.

Specificity Super-Resolution

Decoupling Strategy and Surface Realization for Task-oriented Dialogues

no code implementations 29 Sep 2021 Chenchen Ye, Lizi Liao, Fuli Feng, Wei Ji, Tat-Seng Chua

The core is to construct a latent content space for strategy optimization and disentangle the surface style from it.

Reinforcement Learning (RL) Style Transfer +1

Meeting Summarization with Pre-training and Clustering Methods

1 code implementation 16 Nov 2021 Andras Huebner, Wei Ji, Xiang Xiao

Lastly, we compare the performance of our baseline models with BART, a state-of-the-art language model that is effective for summarization.

Clustering Language Modelling +2

Rethinking the Two-Stage Framework for Grounded Situation Recognition

1 code implementation 10 Dec 2021 Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua

Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.

Grounded Situation Recognition Object Recognition +1

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

1 code implementation 12 Dec 2021 Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, Tat-Seng Chua

To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues.

Question Answering Video Question Answering +1

Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization

no code implementations CVPR 2022 Jingjing Li, Tianyu Yang, Wei Ji, Jue Wang, Li Cheng

Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting.
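As a rough illustration of the contrastive objective (not the paper's exact denoising scheme), the sketch below computes a snippet-level InfoNCE loss in which a boolean mask stands in for filtering out noisy pseudo-labelled positives.

```python
import torch
import torch.nn.functional as F

def denoised_snippet_infonce(anchor, positives, negatives, keep_mask, tau=0.07):
    """InfoNCE over video snippets; `keep_mask` drops cross-video positives whose
    pseudo-labels are judged unreliable (a stand-in for the denoising step)."""
    anchor = F.normalize(anchor, dim=-1)                 # (d,)
    pos = F.normalize(positives[keep_mask], dim=-1)      # (P', d) retained positives
    neg = F.normalize(negatives, dim=-1)                 # (N, d)
    pos_logits = pos @ anchor / tau
    neg_logits = neg @ anchor / tau
    denom = torch.logsumexp(torch.cat([pos_logits, neg_logits]), dim=0)
    return (denom - pos_logits).mean()                   # one term per retained positive

loss = denoised_snippet_infonce(torch.randn(128), torch.randn(6, 128),
                                torch.randn(64, 128),
                                torch.tensor([1, 1, 0, 1, 0, 1]).bool())
```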

Contrastive Learning Denoising +4

Video Question Answering: Datasets, Algorithms and Challenges

1 code implementation 2 Mar 2022 Yaoyao Zhong, Junbin Xiao, Wei Ji, Yicong Li, Weihong Deng, Tat-Seng Chua

Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.

Question Answering Video Question Answering

3D Magic Mirror: Clothing Reconstruction from a Single Image via a Causal Perspective

1 code implementation 27 Apr 2022 Zhedong Zheng, Jiayin Zhu, Wei Ji, Yi Yang, Tat-Seng Chua

This research aims to study a self-supervised 3D clothing reconstruction method, which recovers the geometry shape and texture of human clothing from a single image.

3D Reconstruction Person Re-Identification +2

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

1 code implementation 23 May 2022 Yuan YAO, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.
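Position-sensitive here means that box coordinates are exposed to the language model as discrete tokens; the helper below shows that discretization in miniature, with the bin count and the `<pos_k>` token format being assumptions rather than PEVL's exact vocabulary.

```python
def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Map an (x1, y1, x2, y2) box to discrete position tokens that can be
    interleaved with the caption text, in the spirit of position-enhanced pre-training."""
    coords = (box[0] / image_w, box[1] / image_h, box[2] / image_w, box[3] / image_h)
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [f"<pos_{b}>" for b in bins]

# e.g. "a dog <pos_51> <pos_204> <pos_307> <pos_409> chasing a ball"
print(box_to_position_tokens((64, 256, 384, 512), 640, 640))
```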

Language Modelling Object +7

Invariant Grounding for Video Question Answering

1 code implementation CVPR 2022 Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua

At its core is understanding the alignments between visual scenes in video and linguistic semantics in question to yield the answer.

Question Answering Video Question Answering

Structured and Natural Responses Co-generation for Conversational Search

1 code implementation ACM SIGIR Conference on Research and Development in Information Retrieval 2022 Chenchen Ye, Lizi Liao, Fuli Feng, Wei Ji, Tat-Seng Chua

Existing approaches either 1) predict structured dialog acts first and then generate natural response; or 2) map conversation context to natural responses directly in an end-to-end manner.

Conversational Search

MetaComp: Learning to Adapt for Online Depth Completion

no code implementations 21 Jul 2022 Yang Chen, Shanshan Zhao, Wei Ji, Mingming Gong, Liping Xie

However, when facing a new environment where the test data arrives online and differs from the training data in RGB image content and depth sparsity, the trained model might suffer a severe performance drop.

Depth Completion Meta-Learning +1

Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization

1 code implementation 14 Nov 2022 Yiyang Chen, Zhedong Zheng, Wei Ji, Leigang Qu, Tat-Seng Chua

The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively.
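One way to read "small and large fluctuations" is as two noise scales injected into the target features during matching; the sketch below follows that reading and is not the released implementation.

```python
import torch
import torch.nn.functional as F

def multi_grained_matching_loss(query, target, sigma_fine=0.05, sigma_coarse=0.5, tau=0.07):
    """Fine-grained matching uses lightly perturbed targets, coarse-grained matching
    heavily perturbed ones; both contrastive terms are averaged."""
    q = F.normalize(query, dim=-1)
    labels = torch.arange(q.size(0))
    losses = []
    for sigma in (sigma_fine, sigma_coarse):
        t = F.normalize(target + sigma * torch.randn_like(target), dim=-1)
        losses.append(F.cross_entropy(q @ t.t() / tau, labels))
    return sum(losses) / len(losses)

loss = multi_grained_matching_loss(torch.randn(8, 256), torch.randn(8, 256))
```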

Composed Image Retrieval (CoIR) Image Retrieval with Multi-Modal Query +1

Driving Style Recognition at First Impression for Online Trajectory Prediction

no code implementations 21 Dec 2022 Tu Xu, Kan Wu, Yongdong Zhu, Wei Ji

This paper proposes a new driving style recognition approach that allows autonomous vehicles (AVs) to perform trajectory predictions for surrounding vehicles with minimal data.

Autonomous Vehicles Trajectory Prediction

Multi-queue Momentum Contrast for Microvideo-Product Retrieval

1 code implementation 22 Dec 2022 Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, Liqiang Nie

The booming development and huge market of micro-videos bring new e-commerce channels for merchants.

Representation Learning Retrieval

MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding

no code implementations 26 Dec 2022 Wei Ji, Long Chen, Yinwei Wei, Yiming Wu, Tat-Seng Chua

In this work, we propose a novel multi-resolution temporal video sentence grounding network: MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module.

Descriptive Sentence

Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning

1 code implementation CVPR 2023 Wei Ji, Renjie Liang, Zhedong Zheng, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Mengze Li, Tat-Seng Chua

Moreover, we treat the uncertainty score of frames in a video as a whole, and estimate the difficulty of each video, which can further relieve the burden of video selection.
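A simplified view of the video-level difficulty estimate: per-frame uncertainty (here, prediction entropy) is pooled over the whole video, and the most uncertain videos are prioritised for annotation. The pooling and selection rule below are illustrative, not the paper's exact hierarchy.

```python
import torch

def video_difficulty(frame_probs):
    """frame_probs: (T, C) per-frame class probabilities; returns mean entropy over frames."""
    entropy = -(frame_probs * frame_probs.clamp_min(1e-8).log()).sum(dim=-1)
    return entropy.mean()

def select_videos_for_labeling(all_frame_probs, budget):
    """Rank videos by pooled frame uncertainty and pick the hardest ones to annotate."""
    scores = torch.tensor([video_difficulty(p) for p in all_frame_probs])
    return scores.topk(budget).indices

pool = [torch.softmax(torch.randn(50, 4), dim=-1) for _ in range(10)]
print(select_videos_for_labeling(pool, budget=3))
```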

Active Learning Moment Retrieval +1

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

no code implementations CVPR 2023 Mengze Li, Han Wang, Wenqiao Zhang, Jiaxu Miao, Zhou Zhao, Shengyu Zhang, Wei Ji, Fei Wu

WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos.

Contrastive Learning Spatio-Temporal Video Grounding +1

MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer

2 code implementations 19 Jan 2023 Junde Wu, Wei Ji, Huazhu Fu, Min Xu, Yueming Jin, Yanwu Xu

To effectively integrate these two cutting-edge techniques for the Medical image segmentation, we propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.

Image Generation Image Segmentation +3

Scalable Attribution of Adversarial Attacks via Multi-Task Learning

no code implementations 25 Feb 2023 Zhongyi Guo, Keji Han, Yao Ge, Wei Ji, Yun Li

In this paper, AAP is defined as the recognition of three signatures, i.e., the attack algorithm, the victim model, and the hyperparameter.
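Architecturally, recognising the three signatures jointly lends itself to a shared encoder with one classification head per signature; the sketch below is a minimal multi-task layout with placeholder layer sizes, not the paper's network.

```python
import torch
import torch.nn as nn

class AttributionNet(nn.Module):
    """Multi-task head for adversarial attack attribution: one classifier per signature."""
    def __init__(self, n_attacks, n_victims, n_hparams, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
        self.heads = nn.ModuleDict({"attack": nn.Linear(dim, n_attacks),
                                    "victim": nn.Linear(dim, n_victims),
                                    "hparam": nn.Linear(dim, n_hparams)})

    def forward(self, adv_example):
        h = self.encoder(adv_example)
        return {name: head(h) for name, head in self.heads.items()}

logits = AttributionNet(5, 4, 3)(torch.randn(2, 3, 32, 32))   # three sets of logits per input
```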

Multi-Task Learning

Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

no code implementations ICCV 2023 Juncheng Li, Minghe Gao, Longhui Wei, Siliang Tang, Wenqiao Zhang, Mengze Li, Wei Ji, Qi Tian, Tat-Seng Chua, Yueting Zhuang

Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way, by learning "soft prompts" to condition frozen pre-training models.

Domain Generalization Few-Shot Learning +1

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

1 code implementation ICCV 2023 Qifan Yu, Juncheng Li, Yu Wu, Siliang Tang, Wei Ji, Yueting Zhuang

Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner.

Graph Generation Language Modelling +1

Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world Applications

1 code implementation 12 Apr 2023 Wei Ji, Jingjing Li, Qi Bi, TingWei Liu, Wenbo Li, Li Cheng

Recently, Meta AI Research introduced a general, promptable Segment Anything Model (SAM) pre-trained on an unprecedentedly large segmentation dataset (SA-1B).

Image Segmentation Segmentation +1

Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation

3 code implementations 25 Apr 2023 Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, Yueming Jin

In Med-SA, we propose Space-Depth Transpose (SD-Trans) to adapt 2D SAM to 3D medical images and Hyper-Prompting Adapter (HyP-Adpt) to achieve prompt-conditioned adaptation.
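SD-Trans is described only at a high level, so the snippet below merely illustrates the shape manipulation implied by a space-depth transpose: per-slice spatial tokens are re-grouped so the same attention blocks can run along the depth axis. The actual adapter placement inside SAM is not reproduced.

```python
import torch

def space_depth_transpose(slice_tokens):
    """slice_tokens: (D, N, C) -- N spatial tokens per slice from a 2D encoder.
    The transposed view (N, D, C) treats each spatial location as a depth-wise sequence."""
    return slice_tokens.permute(1, 0, 2).contiguous()

slices = torch.randn(32, 196, 768)          # 32 slices, 14x14 patch tokens, ViT-B width
depth_view = space_depth_transpose(slices)  # (196, 32, 768)
```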

Image Segmentation Medical Image Segmentation +2

VPGTrans: Transfer Visual Prompt Generator across LLMs

1 code implementation NeurIPS 2023 Ao Zhang, Hao Fei, Yuan YAO, Wei Ji, Li Li, Zhiyuan Liu, Tat-Seng Chua

While developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm.

Transfer Learning

Generating Visual Spatial Description via Holistic 3D Scene Understanding

1 code implementation 19 May 2023 Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, Tat-Seng Chua

With an external 3D scene extractor, we obtain the 3D objects and scene features for input images, based on which we construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.

Scene Understanding Text Generation

Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

no code implementations 20 May 2023 Shengqiong Wu, Hao Fei, Wei Ji, Tat-Seng Chua

Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency issues, due to the inconsistencies of the semantic scene and syntax attributes during transfer.

Image Captioning Translation

In Defense of Clip-based Video Relation Detection

no code implementations 18 Jul 2023 Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets.

Feature Compression Object Tracking +2

Panoptic Scene Graph Generation with Semantics-Prototype Learning

1 code implementation 28 Jul 2023 Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann

To promise consistency and accuracy during the transfer process, we propose to measure the invariance of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities.
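A minimal sketch of the prototype idea, assuming every predicate class appears in the batch: prototypes are mean normalised relation features, and a class's "invariance" is read off as the average cosine similarity of its samples to their prototype. The paper's exact measure may differ.

```python
import torch
import torch.nn.functional as F

def predicate_prototypes(features, labels, num_classes):
    """Per-predicate prototype = mean of L2-normalised relation features of that class."""
    feats = F.normalize(features, dim=-1)
    protos = torch.stack([feats[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def class_invariance(features, labels, prototypes):
    """Average sample-to-prototype cosine similarity per class (tighter class -> higher score)."""
    feats = F.normalize(features, dim=-1)
    sims = (feats * prototypes[labels]).sum(dim=-1)
    return torch.stack([sims[labels == c].mean() for c in range(prototypes.size(0))])

feats, labels = torch.randn(60, 128), torch.randint(0, 5, (60,))
protos = predicate_prototypes(feats, labels, num_classes=5)
print(class_invariance(feats, labels, protos))
```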

Graph Generation Panoptic Scene Graph Generation

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

1 code implementation 8 Aug 2023 Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Yueting Zhuang

This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task.

Caption Generation Image Captioning +1

Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

1 code implementation 8 Aug 2023 Wei Ji, Xiangyan Liu, An Zhang, Yinwei Wei, Yongxin Ni, Xiang Wang

To be specific, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features.
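The ID-aware fusion can be pictured as a tiny Transformer over three tokens per item (ID embedding, projected image feature, projected text feature); the dimensions and raw feature sizes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class IDAwareItemEncoder(nn.Module):
    """Fuse an item-ID embedding with its visual/text features so ID and content
    signals interact before the sequential recommender sees the item."""
    def __init__(self, num_items, dim=64, img_dim=512, txt_dim=384):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, dim)
        self.proj_img = nn.Linear(img_dim, dim)
        self.proj_txt = nn.Linear(txt_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, item_ids, img_feat, txt_feat):
        tokens = torch.stack([self.id_emb(item_ids),
                              self.proj_img(img_feat),
                              self.proj_txt(txt_feat)], dim=1)   # (B, 3, dim)
        return self.fuser(tokens).mean(dim=1)                    # (B, dim) fused item vector

rep = IDAwareItemEncoder(1000)(torch.randint(0, 1000, (4,)),
                               torch.randn(4, 512), torch.randn(4, 384))
```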

Collaborative Filtering Representation Learning +1

ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval

no code implementations 19 Aug 2023 Kaihang Pan, Juncheng Li, Hongye Song, Hao Fei, Wei Ji, Shuo Zhang, Jun Lin, Xiaozhong Liu, Siliang Tang

Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different retrieval tasks often entail distinct search intents.

Retrieval Text-to-Image Generation

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

no code implementations 26 Aug 2023 Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua

In this work, we investigate strengthening the awareness of video dynamics in diffusion models (DMs) for high-quality T2V generation.

In-Context Learning Video Generation

NExT-GPT: Any-to-Any Multimodal LLM

1 code implementation 11 Sep 2023 Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities.

Towards Complex-query Referring Image Segmentation: A Novel Benchmark

no code implementations 29 Sep 2023 Wei Ji, Li Li, Hao Fei, Xiangyan Liu, Xun Yang, Juncheng Li, Roger Zimmermann

Referring Image Segmentation (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms.

Image Segmentation Semantic Segmentation

Domain-wise Invariant Learning for Panoptic Scene Graph Generation

no code implementations 9 Oct 2023 Li Li, You Qin, Wei Ji, Yuxiao Zhou, Roger Zimmermann

Panoptic Scene Graph Generation (PSG) involves the detection of objects and the prediction of their corresponding relationships (predicates).

Graph Generation Panoptic Scene Graph Generation

Towards Robust Multi-Modal Reasoning via Model Selection

1 code implementation 12 Oct 2023 Xiangyan Liu, Rongxue Li, Wei Ji, Tao Lin

The reasoning capabilities of LLM (Large Language Model) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents.

Language Modelling Large Language Model +1

NExT-Chat: An LMM for Chat, Detection and Segmentation

1 code implementation 8 Nov 2023 Ao Zhang, Yuan YAO, Wei Ji, Zhiyuan Liu, Tat-Seng Chua

The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs).

Referring Expression Referring Expression Segmentation +1

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

no code implementations 21 Nov 2023 Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks.

Logical Reasoning

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

no code implementations 21 Nov 2023 Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua

Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data.

Drone navigation Language Modelling +2

Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization

no code implementations 16 Jan 2024 Qi Bi, Wei Ji, Jingjun Yi, Haolan Zhan, Gui-Song Xia

To comprehensively learn the relation between informative patches and fine-grained semantics, multi-instance knowledge distillation is applied both to the region/image-crop pairs across the teacher and student networks and to the region-image crops inside each network, which we term intra-level and inter-level multi-instance distillation, respectively.
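The common building block for both distillation levels is a soft-label term between matched student/teacher instances; the sketch below shows only that generic term, with the pairing of regions and image crops left unspecified since the abstract does not detail it.

```python
import torch
import torch.nn.functional as F

def distillation_term(student_logits, teacher_logits, tau=4.0):
    """KL distillation between matched instances (region-region, image-image, or
    region-image pairs, depending on which level is being distilled)."""
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau

loss = distillation_term(torch.randn(16, 200), torch.randn(16, 200))
```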

Fine-Grained Visual Categorization Knowledge Distillation +2

GOOD: Towards Domain Generalized Orientated Object Detection

no code implementations 20 Feb 2024 Qi Bi, Beichen Zhou, Jingjun Yi, Wei Ji, Haolan Zhan, Gui-Song Xia

In this paper, we propose the task of domain generalized oriented object detection, which intends to explore the generalization of oriented object detectors on arbitrary unseen target domains.

Hallucination Object +3
