Search Results for author: Wei Ji

Found 87 papers, 48 papers with code

SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

no code implementations • 15 May 2025 Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong

Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity.
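The O(T) temporal complexity claimed above is the hallmark of linear attention: because spike tensors are binary and no softmax is applied, attention can be reassociated as Q(KᵀV) instead of (QKᵀ)V. The toy sketch below (my own illustration, not the paper's SDHA or Hamming attention) shows only this reassociation on binary spike inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16  # time steps, feature dimension

# Binary spike tensors in {0, 1}, as in spike-driven transformers.
Q = (rng.random((T, d)) > 0.5).astype(np.int64)
K = (rng.random((T, d)) > 0.5).astype(np.int64)
V = (rng.random((T, d)) > 0.5).astype(np.int64)

def quadratic_attention(Q, K, V):
    # Standard order: (Q K^T) V costs O(T^2 * d) in the temporal length.
    return (Q @ K.T) @ V

def linear_attention(Q, K, V):
    # Reassociated order: Q (K^T V) costs O(T * d^2),
    # i.e. linear in the temporal length T.
    return Q @ (K.T @ V)

# Matrix multiplication is associative, and no softmax is involved,
# so both orders agree exactly on integer spike inputs.
assert np.array_equal(quadratic_attention(Q, K, V), linear_attention(Q, K, V))
print(linear_attention(Q, K, V).shape)  # (8, 16)
```

The actual SDHA replaces the dot product with a Hamming-based similarity; the reassociation trick that buys linearity in T is the same.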

Pose Tracking Semantic Segmentation

DGFamba: Learning Flow Factorized State Space for Visual Domain Generalization

no code implementations • 10 Apr 2025 Qi Bi, Jingjun Yi, Hao Zheng, Haolan Zhan, Wei Ji, Yawen Huang, Yuexiang Li

By aligning these probability paths in the latent space, the state embeddings are able to represent the same content distribution regardless of the style differences.

Domain Generalization

TAIL: Text-Audio Incremental Learning

no code implementations • 6 Mar 2025 Yingfei Sun, Xu Gu, Wei Ji, Hanbin Zhao, Hao Fei, Yifang Yin, Roger Zimmermann

Many studies combine text and audio to capture multi-modal information but they overlook the model's generalization ability on new datasets.

AudioCaps Incremental Learning +1

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

1 code implementation • 17 Feb 2025 Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, HongYu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, SiQi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He, Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu

Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following.

Instruction Following Voice Cloning

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

3 code implementations • 14 Feb 2025 Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, SiQi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length.

Video Generation Video Reconstruction

WisdomBot: Tuning Large Language Models with Artificial Intelligence Knowledge

no code implementations • 22 Jan 2025 Jingyuan Chen, Tao Wu, Wei Ji, Fei Wu

Large language models (LLMs) have emerged as powerful tools in natural language processing (NLP), showing a promising future of artificial general intelligence (AGI).

Retrieval

A Generalizable 3D Diffusion Framework for Low-Dose and Few-View Cardiac SPECT

no code implementations • 21 Dec 2024 Huidong Xie, Weijie Gan, Wei Ji, Xiongchao Chen, Alaa Alashi, Stephanie L. Thorn, Bo Zhou, Qiong Liu, Menghua Xia, Xueqi Guo, Yi-Hwa Liu, Hongyu An, Ulugbek S. Kamilov, Ge Wang, Albert J. Sinusas, Chi Liu

This work introduced DiffSPECT-3D, a diffusion framework for 3D cardiac SPECT imaging that effectively adapts to different acquisition settings without requiring further network re-training or fine-tuning.

Diagnostic

MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks

no code implementations • 29 Nov 2024 Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu

Recently, human motion analysis has experienced great improvement due to inspiring generative models such as the denoising diffusion model and large language model.

Decoder Denoising +5

Discretized Gaussian Representation for Tomographic Reconstruction

no code implementations • 7 Nov 2024 Shaokai Wu, Yuxiang Lu, Wei Ji, Suizhi Huang, Fengyu Yang, Shalayiding Sirejiding, Qichen He, Jing Tong, Yanbiao Ji, Yue Ding, Hongtao Lu

To further enhance computational efficiency, we introduce a Fast Volume Reconstruction technique that aggregates the contributions of these Gaussians into a discretized volume in a highly parallelized fashion.
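The aggregation step described above (splatting many Gaussians onto a shared discretized volume) can be pictured with a minimal NumPy sketch. Everything here (isotropic Gaussians, the grid size, the parameter names) is my own illustrative assumption, not the paper's implementation, which parallelizes this accumulation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, G = 50, 32                       # number of Gaussians, voxel grid resolution
means = rng.random((N, 3))          # Gaussian centers in the unit cube
sigmas = 0.05 + 0.05 * rng.random(N)
weights = rng.random(N)

# Voxel-center coordinates of a G^3 grid over [0, 1]^3.
axis = (np.arange(G) + 0.5) / G
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)  # (G, G, G, 3)

# Accumulate every Gaussian's density contribution into the shared volume.
volume = np.zeros((G, G, G))
for mu, s, w in zip(means, sigmas, weights):
    sq_dist = np.sum((grid - mu) ** 2, axis=-1)
    volume += w * np.exp(-0.5 * sq_dist / s**2)

print(volume.shape)  # (32, 32, 32)
```

A fast implementation would restrict each Gaussian to the voxels within a few sigmas of its center and process Gaussians in parallel, since the additions commute.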

3DGS Computational Efficiency +4

Towards Small Object Editing: A Benchmark Dataset and A Training-Free Approach

1 code implementation • 3 Nov 2024 Qihe Pan, Zhen Zhao, Zicheng Wang, Sifan Long, Yiming Wu, Wei Ji, Haoran Liang, Ronghua Liang

A plethora of text-guided image editing methods has recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models especially Stable Diffusion.

Image Generation Object +1

Grounding is All You Need? Dual Temporal Grounding for Video Dialog

no code implementations • 8 Oct 2024 You Qin, Wei Ji, Xinze Lan, Hao Fei, Xun Yang, Dan Guo, Roger Zimmermann, Lizi Liao

In the realm of video dialog response generation, the understanding of video content and the temporal nuances of conversation history are paramount.

All Contrastive Learning +1

Personalized Knowledge Tracing through Student Representation Reconstruction and Class Imbalance Mitigation

no code implementations • 10 Sep 2024 Zhiyu Chen, Wei Ji, Jing Xiao, Zitao Liu

Extensive experimental results on four publicly available educational datasets demonstrate the advanced predictive performance of PKT in comparison with 16 state-of-the-art models.

Knowledge Tracing

Semantic Alignment for Multimodal Large Language Models

no code implementations • 23 Aug 2024 Tao Wu, Mengze Li, Jingyuan Chen, Wei Ji, Wang Lin, Jinyang Gao, Kun Kuang, Zhou Zhao, Fei Wu

By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis and align the semantics of different images before feeding them into LLM.

Large Language Model Visual Storytelling

DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving

1 code implementation • 22 Jul 2024 Jiahang Tu, Wei Ji, Hanbin Zhao, Chao Zhang, Roger Zimmermann, Hui Qian

Such datasets are expected to cover various driving scenarios with adverse weather, lighting conditions and diverse moving objects.

Autonomous Driving Diversity +2

Described Spatial-Temporal Video Detection

no code implementations • 8 Jul 2024 Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann

However, in the video domain, the existing setting, i.e., spatial-temporal video grounding (STVG), is formulated to detect only one pre-existing object in each frame, ignoring the fact that language descriptions can involve none or multiple entities within a video.

Multi-class Classification Temporal Localization +1

Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration

no code implementations • 21 May 2024 Wei Ji, Li Li, Zheqi Lv, Wenqiao Zhang, Mengze Li, Zhen Wan, Wenqiang Lei, Roger Zimmermann

As these systems grapple with shifting data distributions between the cloud and devices, the traditional approach of fine-tuning-based adaptation (FTA) suffers from the following issues: the costly and time-consuming data annotation required by FTA and the looming risk of model overfitting.

Question Answering Video Question Answering

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

no code implementations • 7 May 2024 Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu

Existing research of video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension.

Large Language Model Multimodal Large Language Model +2

Spider: A Unified Framework for Context-dependent Concept Segmentation

2 code implementations • 2 May 2024 Xiaoqi Zhao, Youwei Pang, Wei Ji, Baicheng Sheng, Jiaming Zuo, Lihe Zhang, Huchuan Lu

Different from the context-independent (CI) concepts such as human, car, and airplane, context-dependent (CD) concepts require higher visual understanding ability, such as camouflaged object and medical lesion.

Transparent objects

GOOD: Towards Domain Generalized Orientated Object Detection

no code implementations • 20 Feb 2024 Qi Bi, Beichen Zhou, Jingjun Yi, Wei Ji, Haolan Zhan, Gui-Song Xia

In this paper, we propose the task of domain generalized oriented object detection, which intends to explore the generalization of oriented object detectors on arbitrary unseen target domains.

Hallucination Object +3

Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization

no code implementations • 16 Jan 2024 Qi Bi, Wei Ji, Jingjun Yi, Haolan Zhan, Gui-Song Xia

To comprehensively learn the relation between informative patches and fine-grained semantics, multi-instance knowledge distillation is implemented both on the region/image crop pairs from the teacher and student nets and on the region-image crops inside the teacher/student net, which we term intra-level multi-instance distillation and inter-level multi-instance distillation, respectively.

Fine-Grained Visual Categorization Knowledge Distillation +2

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

2 code implementations • 21 Nov 2023 Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua

Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data.

Drone navigation geo-localization +4

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

no code implementations • 21 Nov 2023 Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Zheqi Lv, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks.

Logical Reasoning

NExT-Chat: An LMM for Chat, Detection and Segmentation

1 code implementation • 8 Nov 2023 Ao Zhang, Yuan YAO, Wei Ji, Zhiyuan Liu, Tat-Seng Chua

The development of large language models (LLMs) has greatly advanced the field of multimodal understanding, leading to the emergence of large multimodal models (LMMs).

Referring Expression Referring Expression Segmentation +1

Towards Robust Multi-Modal Reasoning via Model Selection

1 code implementation • 12 Oct 2023 Xiangyan Liu, Rongxue Li, Wei Ji, Tao Lin

The reasoning capabilities of LLM (Large Language Model) are widely acknowledged in recent research, inspiring studies on tool learning and autonomous agents.

Language Modelling Large Language Model +2

Domain-wise Invariant Learning for Panoptic Scene Graph Generation

no code implementations • 9 Oct 2023 Li Li, You Qin, Wei Ji, Yuxiao Zhou, Roger Zimmermann

Panoptic Scene Graph Generation (PSG) involves the detection of objects and the prediction of their corresponding relationships (predicates).

Graph Generation Panoptic Scene Graph Generation

Towards Complex-query Referring Image Segmentation: A Novel Benchmark

no code implementations • 29 Sep 2023 Wei Ji, Li Li, Hao Fei, Xiangyan Liu, Xun Yang, Juncheng Li, Roger Zimmermann

Referring Image Segmentation (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms.

Image Segmentation Semantic Segmentation

NExT-GPT: Any-to-Any Multimodal LLM

1 code implementation • 11 Sep 2023 Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities.

AI Agent

I3: Intent-Introspective Retrieval Conditioned on Instructions

no code implementations • 19 Aug 2023 Kaihang Pan, Juncheng Li, Wenjie Wang, Hao Fei, Hongye Song, Wei Ji, Jun Lin, Xiaozhong Liu, Tat-Seng Chua, Siliang Tang

Recent studies indicate that dense retrieval models struggle to perform well on a wide variety of retrieval tasks that lack dedicated training data, as different retrieval tasks often entail distinct search intents.

Retrieval Text-to-Image Generation

Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

1 code implementation • 8 Aug 2023 Wei Ji, Xiangyan Liu, An Zhang, Yinwei Wei, Yongxin Ni, Xiang Wang

To be specific, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features.

Collaborative Filtering Multi-modal Recommendation +2

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

1 code implementation • 8 Aug 2023 Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, Yueting Zhuang

This shortcoming results in MLLMs' underperformance in comprehending demonstrative instructions consisting of multiple, interleaved, and multimodal instructions that demonstrate the required context to complete a task.

Caption Generation Image Captioning +2

Panoptic Scene Graph Generation with Semantics-Prototype Learning

1 code implementation • 28 Jul 2023 Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann

To promise consistency and accuracy during the transfer process, we propose to measure the invariance of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities.

Graph Generation Panoptic Scene Graph Generation

In Defense of Clip-based Video Relation Detection

no code implementations • 18 Jul 2023 Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets.

Feature Compression Object Tracking +2

Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

no code implementations • 20 May 2023 Shengqiong Wu, Hao Fei, Wei Ji, Tat-Seng Chua

Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency issues, due to the inconsistencies of the semantic scene and syntax attributes during transfer.

Image Captioning Translation

Generating Visual Spatial Description via Holistic 3D Scene Understanding

1 code implementation • 19 May 2023 Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, Tat-Seng Chua

With an external 3D scene extractor, we obtain the 3D objects and scene features for input images, based on which we construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.

Scene Understanding Text Generation

VPGTrans: Transfer Visual Prompt Generator across LLMs

1 code implementation • NeurIPS 2023 Ao Zhang, Hao Fei, Yuan YAO, Wei Ji, Li Li, Zhiyuan Liu, Tat-Seng Chua

While developing a new multimodal LLM (MLLM) by pre-training on tremendous image-text pairs from scratch can be exceedingly resource-consuming, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm.

Transfer Learning

Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation

4 code implementations • 25 Apr 2023 Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, Yueming Jin

In Med-SA, we propose Space-Depth Transpose (SD-Trans) to adapt 2D SAM to 3D medical images and Hyper-Prompting Adapter (HyP-Adpt) to achieve prompt-conditioned adaptation.

Image Segmentation Medical Image Segmentation +2

Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world Applications

1 code implementation • 12 Apr 2023 Wei Ji, Jingjing Li, Qi Bi, TingWei Liu, Wenbo Li, Li Cheng

Recently, Meta AI Research released a general, promptable Segment Anything Model (SAM), pre-trained on an unprecedentedly large segmentation dataset (SA-1B).

Image Segmentation Segmentation +1

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

1 code implementation • ICCV 2023 Qifan Yu, Juncheng Li, Yu Wu, Siliang Tang, Wei Ji, Yueting Zhuang

Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner.

Graph Generation Language Modeling +2

Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer

1 code implementation • 16 Mar 2023 Shihao Zou, Yuxuan Mu, Wei Ji, Zi-An Wang, Xinxin Zuo, Sen Wang, Weixin Si, Li Cheng

Event camera, as an asynchronous vision sensor capturing scene dynamics, presents new opportunities for highly efficient 3D human pose tracking.

3D Human Pose Estimation 3D Human Pose Tracking

Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

no code implementations • ICCV 2023 Juncheng Li, Minghe Gao, Longhui Wei, Siliang Tang, Wenqiao Zhang, Mengze Li, Wei Ji, Qi Tian, Tat-Seng Chua, Yueting Zhuang

Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way, by learning "soft prompts" to condition frozen pre-training models.

Domain Generalization Few-Shot Learning +2

Scalable Attribution of Adversarial Attacks via Multi-Task Learning

no code implementations • 25 Feb 2023 Zhongyi Guo, Keji Han, Yao Ge, Wei Ji, Yun Li

In this paper, AAP is defined as the recognition of three signatures, i.e., attack algorithm, victim model, and hyperparameter.

Multi-Task Learning

MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer

2 code implementations • 19 Jan 2023 Junde Wu, Wei Ji, Huazhu Fu, Min Xu, Yueming Jin, Yanwu Xu

To effectively integrate these two cutting-edge techniques for the Medical image segmentation, we propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.

Image Generation Image Segmentation +4

Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning

1 code implementation • CVPR 2023 Wei Ji, Renjie Liang, Zhedong Zheng, Wenqiao Zhang, Shengyu Zhang, Juncheng Li, Mengze Li, Tat-Seng Chua

Moreover, we treat the uncertainty score of frames in a video as a whole, and estimate the difficulty of each video, which can further relieve the burden of video selection.
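Treating per-frame uncertainty scores "as a whole" to rank whole videos for annotation can be sketched as follows. The entropy-based score and the averaging rule here are my own illustrative choices, not necessarily the hierarchical scores used in the paper:

```python
import numpy as np

def frame_entropy(probs, eps=1e-12):
    """Per-frame predictive entropy as an uncertainty score."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def video_difficulty(frame_probs):
    """Treat a video's frame uncertainties as a whole: average entropy."""
    return float(np.mean(frame_entropy(frame_probs)))

rng = np.random.default_rng(0)
videos = []
for _ in range(5):
    logits = rng.normal(size=(30, 4))               # 30 frames, 4 classes
    p = np.exp(logits)
    p /= p.sum(axis=-1, keepdims=True)              # softmax over classes
    videos.append(p)

# Query the most difficult (most uncertain) videos first for annotation.
ranked = sorted(range(len(videos)), key=lambda i: -video_difficulty(videos[i]))
print(ranked)
```

Spending annotation budget on the highest-ranked videos is the standard uncertainty-sampling strategy in active learning.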

Active Learning Moment Retrieval +1

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

no code implementations • CVPR 2023 Mengze Li, Han Wang, Wenqiao Zhang, Jiaxu Miao, Zhou Zhao, Shengyu Zhang, Wei Ji, Fei Wu

WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos.

Contrastive Learning Spatio-Temporal Video Grounding +1

MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding

no code implementations • 26 Dec 2022 Wei Ji, Long Chen, Yinwei Wei, Yiming Wu, Tat-Seng Chua

In this work, we propose a novel multi-resolution temporal video sentence grounding network: MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module.

Decoder Descriptive +1

Multi-queue Momentum Contrast for Microvideo-Product Retrieval

1 code implementation • 22 Dec 2022 Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, Liqiang Nie

The booming development and huge market of micro-videos bring new e-commerce channels for merchants.

Representation Learning Retrieval

Driving Style Recognition at First Impression for Online Trajectory Prediction

no code implementations • 21 Dec 2022 Tu Xu, Kan Wu, Yongdong Zhu, Wei Ji

This paper proposes a new driving style recognition approach that allows autonomous vehicles (AVs) to perform trajectory predictions for surrounding vehicles with minimal data.

Autonomous Vehicles Trajectory Prediction

Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization

1 code implementation • 14 Nov 2022 Yiyang Chen, Zhedong Zheng, Wei Ji, Leigang Qu, Tat-Seng Chua

The key idea underpinning the proposed method is to integrate fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively.
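One simple way to picture "matching data points with small and large fluctuations" is to score retrieval under feature perturbations of different scales: small noise approximates fine-grained matching, large noise coarse-grained. The Gaussian-noise formulation below is my own illustrative stand-in for the paper's multi-grained uncertainty regularization:

```python
import numpy as np

rng = np.random.default_rng(0)

def match_with_fluctuation(query, gallery, sigma, n_samples=32):
    """Expected cosine similarity under Gaussian feature fluctuation of
    scale sigma: small sigma ~ fine-grained, large sigma ~ coarse-grained."""
    noise = rng.normal(scale=sigma, size=(n_samples,) + query.shape)
    q = query + noise                                   # perturbed query samples
    q /= np.linalg.norm(q, axis=-1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=-1, keepdims=True)
    return (q @ g.T).mean(axis=0)                       # average over samples

query = rng.normal(size=(64,))                          # composed query embedding
gallery = rng.normal(size=(10, 64))                     # candidate image embeddings
fine = match_with_fluctuation(query, gallery, sigma=0.01)
coarse = match_with_fluctuation(query, gallery, sigma=1.0)
print(fine.shape, coarse.shape)  # (10,) (10,)
```

Combining both score granularities during training is what lets a single model tolerate the inherent ambiguity of text-feedback queries.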

Composed Image Retrieval (CoIR) Image Retrieval with Multi-Modal Query +1

MetaComp: Learning to Adapt for Online Depth Completion

no code implementations • 21 Jul 2022 Yang Chen, Shanshan Zhao, Wei Ji, Mingming Gong, Liping Xie

However, facing a new environment where the test data occurs online and differs from the training data in the RGB image content and depth sparsity, the trained model might suffer severe performance drop.

Depth Completion Meta-Learning +1

Structured and Natural Responses Co-generation for Conversational Search

1 code implementation • ACM SIGIR Conference on Research and Development in Information Retrieval 2022 Chenchen Ye, Lizi Liao, Fuli Feng, Wei Ji, Tat-Seng Chua

Existing approaches either 1) predict structured dialog acts first and then generate natural response; or 2) map conversation context to natural responses directly in an end-to-end manner.

Conversational Search

Invariant Grounding for Video Question Answering

1 code implementation • CVPR 2022 Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua

At its core is understanding the alignments between visual scenes in video and linguistic semantics in question to yield the answer.

Question Answering Video Question Answering

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

1 code implementation • 23 May 2022 Yuan YAO, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

We show that PEVL enables state-of-the-art performance of detector-free VLP models on position-sensitive tasks such as referring expression comprehension and phrase grounding, and also improves the performance on position-insensitive tasks with grounded inputs.

Language Modeling Language Modelling +8

3D Magic Mirror: Clothing Reconstruction from a Single Image via a Causal Perspective

1 code implementation • 27 Apr 2022 Zhedong Zheng, Jiayin Zhu, Wei Ji, Yi Yang, Tat-Seng Chua

This research aims to study a self-supervised 3D clothing reconstruction method, which recovers the geometry shape and texture of human clothing from a single image.

3D Reconstruction Person Re-Identification +2

Video Question Answering: Datasets, Algorithms and Challenges

1 code implementation • 2 Mar 2022 Yaoyao Zhong, Junbin Xiao, Wei Ji, Yicong Li, Weihong Deng, Tat-Seng Chua

Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.

Question Answering Video Question Answering

Content-Variant Reference Image Quality Assessment via Knowledge Distillation

1 code implementation • 26 Feb 2022 Guanghao Yin, Wei Wang, Zehuan Yuan, Chuchu Han, Wei Ji, Shouqian Sun, Changhu Wang

The comparisons of distribution differences between HQ and LQ images can help our model better assess the image quality.

Knowledge Distillation

Exploring Denoised Cross-Video Contrast for Weakly-Supervised Temporal Action Localization

no code implementations • CVPR 2022 Jingjing Li, Tianyu Yang, Wei Ji, Jue Wang, Li Cheng

Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting.

Contrastive Learning Denoising +4

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

1 code implementation • 12 Dec 2021 Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, Tat-Seng Chua

To align with the multi-granular essence of linguistic concepts in language queries, we propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner, with the guidance of corresponding textual cues.

Question Answering Video Question Answering +1

Rethinking the Two-Stage Framework for Grounded Situation Recognition

1 code implementation • 10 Dec 2021 Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua

Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage.

Grounded Situation Recognition Object Recognition +2

Meeting Summarization with Pre-training and Clustering Methods

1 code implementation • 16 Nov 2021 Andras Huebner, Wei Ji, Xiang Xiao

Lastly, we compare the performance of our baseline models with BART, a state-of-the-art language model that is effective for summarization.

Clustering Language Modeling +3

Decoupling Strategy and Surface Realization for Task-oriented Dialogues

no code implementations • 29 Sep 2021 Chenchen Ye, Lizi Liao, Fuli Feng, Wei Ji, Tat-Seng Chua

The core is to construct a latent content space for strategy optimization and disentangle the surface style from it.

Reinforcement Learning (RL) Style Transfer +1

Advancing biological super-resolution microscopy through deep learning: a brief review

no code implementations • 24 Jun 2021 Tianjie Yang, Yaoru Luo, Wei Ji, Ge Yang

We conclude with an outlook on how deep learning could shape the future of this new generation of light microscopy technology.

Deep Learning Specificity +1

Calibrated RGB-D Salient Object Detection

1 code implementation • CVPR 2021 Wei Ji, Jingjing Li, Shuang Yu, Miao Zhang, Yongri Piao, Shunyu Yao, Qi Bi, Kai Ma, Yefeng Zheng, Huchuan Lu, Li Cheng

Complex backgrounds and similar appearances between objects and their surroundings are generally recognized as challenging scenarios in Salient Object Detection (SOD).

Object object-detection +3

Learning Calibrated Medical Image Segmentation via Multi-Rater Agreement Modeling

1 code implementation • CVPR 2021 Wei Ji, Shuang Yu, Junde Wu, Kai Ma, Cheng Bian, Qi Bi, Jingjing Li, Hanruo Liu, Li Cheng, Yefeng Zheng

To our knowledge, our work is the first in producing calibrated predictions under different expertise levels for medical image segmentation.

Diagnostic Image Segmentation +4

Deconfounded Video Moment Retrieval with Causal Intervention

1 code implementation • 3 Jun 2021 Xun Yang, Fuli Feng, Wei Ji, Meng Wang, Tat-Seng Chua

To fill the research gap, we propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.

Moment Retrieval Retrieval

Deep Learning for Weakly-Supervised Object Detection and Object Localization: A Survey

no code implementations • 26 May 2021 Feifei Shao, Long Chen, Jian Shao, Wei Ji, Shaoning Xiao, Lu Ye, Yueting Zhuang, Jun Xiao

With the success of deep neural networks in object detection, both WSOD and WSOL have received unprecedented attention.

Object object-detection +2

Conditional Hyper-Network for Blind Super-Resolution with Multiple Degradations

1 code implementation • 8 Apr 2021 Guanghao Yin, Wei Wang, Zehuan Yuan, Wei Ji, Dongdong Yu, Shouqian Sun, Tat-Seng Chua, Changhu Wang

We extract degradation prior at task-level with the proposed ConditionNet, which will be used to adapt the parameters of the basic SR network (BaseNet).
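Using one network's output to adapt another network's computation, as ConditionNet does for BaseNet, can be illustrated with a FiLM-style modulation. The single linear layer, dimensions, and names below are my own hypothetical simplifications, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dims: degradation prior vector -> per-channel (scale, shift).
prior_dim, channels = 8, 16
W = rng.normal(size=(prior_dim, 2 * channels)) * 0.1  # toy "ConditionNet" weights

def condition(features, degradation_prior):
    """FiLM-style adaptation: the degradation prior produces a per-channel
    scale and shift that modulate a BaseNet activation vector."""
    gamma_beta = degradation_prior @ W
    gamma, beta = gamma_beta[:channels], gamma_beta[channels:]
    return features * (1.0 + gamma) + beta

feats = rng.normal(size=(channels,))   # one BaseNet activation vector
prior = rng.normal(size=(prior_dim,))  # prior encoding e.g. blur/noise level
out = condition(feats, prior)
print(out.shape)  # (16,)
```

The appeal of this design is that a single SR backbone handles multiple degradations: only the small conditioning branch changes its behavior per task.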

Blind Super-Resolution Image Super-Resolution

Boundary Proposal Network for Two-Stage Natural Language Video Localization

no code implementations • 15 Mar 2021 Shaoning Xiao, Long Chen, Songyang Zhang, Wei Ji, Jian Shao, Lu Ye, Jun Xiao

State-of-the-art NLVL methods are almost all one-stage, and can typically be grouped into two categories: 1) anchor-based approaches, which first pre-define a series of video segment candidates (e.g., by sliding window) and then classify each candidate; 2) anchor-free approaches, which directly predict the probability of each video frame being a boundary or an intermediate frame inside the positive segment.
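The anchor-based candidate generation mentioned above can be sketched in a few lines; the window sizes and stride here are arbitrary illustrative values:

```python
def sliding_window_anchors(num_frames, window_sizes, stride):
    """Enumerate [start, end) segment candidates, as in anchor-based NLVL:
    slide windows of each size across the video at a fixed stride."""
    anchors = []
    for w in window_sizes:
        for start in range(0, num_frames - w + 1, stride):
            anchors.append((start, start + w))
    return anchors

cands = sliding_window_anchors(num_frames=32, window_sizes=[8, 16], stride=4)
print(len(cands), cands[:3])  # 12 (0, 8) (4, 12) (8, 16)
```

Each candidate is then scored against the language query; anchor-free methods skip this enumeration entirely and score frames directly.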

Vocal Bursts Valence Prediction

ChemistryQA: A Complex Question Answering Dataset from Chemistry

no code implementations • 1 Jan 2021 Zhuoyu Wei, Wei Ji, Xiubo Geng, Yining Chen, Baihua Chen, Tao Qin, Daxin Jiang

We notice that some real-world QA tasks are more complex, which cannot be solved by end-to-end neural networks or translated to any kind of formal representations.

Machine Reading Comprehension Math +1

Accurate RGB-D Salient Object Detection via Collaborative Learning

2 code implementations • ECCV 2020 Wei Ji, Jingjing Li, Miao Zhang, Yongri Piao, Huchuan Lu

The explicitly extracted edge information goes together with saliency to give more emphasis to the salient regions and object boundaries.

Object object-detection +5

An Early Study on Intelligent Analysis of Speech under COVID-19: Severity, Sleep Quality, Fatigue, and Anxiety

no code implementations • 30 Apr 2020 Jing Han, Kun Qian, Meishu Song, Zijiang Yang, Zhao Ren, Shuo Liu, Juan Liu, Huaiyuan Zheng, Wei Ji, Tomoya Koike, Xiao Li, Zixing Zhang, Yoshiharu Yamamoto, Björn W. Schuller

In particular, by analysing speech recordings from these patients, we construct audio-only-based models to automatically categorise the health state of patients from four aspects, including the severity of illness, sleep quality, fatigue, and anxiety.

Sleep Quality

Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images

no code implementations • 6 Oct 2018 Yiming Wu, Wei Ji, Xi Li, Gang Wang, Jianwei Yin, Fei Wu

As a fundamental and challenging problem in computer vision, hand pose estimation aims to estimate the hand joint locations from depth images.

Hand Pose Estimation
