Search Results for author: Dongxu Li

Found 32 papers, 21 papers with code

Automatic Gloss Dictionary for Sign Language Learners

no code implementations • ACL 2022 • Chenchen Xu, Dongxu Li, Hongdong Li, Hanna Suominen, Ben Swift

A multi-language dictionary is a fundamental tool for language learning, allowing the learner to look up unfamiliar words.

EZSR: Event-based Zero-Shot Recognition

no code implementations • 31 Jul 2024 • Yan Yang, Liyuan Pan, Dongxu Li, Liu Liu

Furthermore, to scale up the number of events and RGB data pairs for training, we also propose a pipeline for synthesizing event data from static RGB images.

Object, Object Recognition, +1

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

1 code implementation • 22 Jul 2024 • HaoNing Wu, Dongxu Li, Bei Chen, Junnan Li

In addition, our results indicate that model performance on the benchmark improves only when models are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.

Multiple-choice, Question Answering, +2

PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery

2 code implementations • 16 Jun 2024 • Libo Wang, Dongxu Li, Sijun Dong, Xiaoliang Meng, Xiaokang Zhang, Danfeng Hong

Semantic segmentation, as a basic tool for intelligent interpretation of remote sensing images, plays a vital role in many Earth Observation (EO) applications.

Decoder, Earth Observation, +4

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

2 code implementations • 3 Jan 2024 • David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo

This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text.

Image Animation, Video Editing, +1

Fundamental Limitation of Semantic Communications: Neural Estimation for Rate-Distortion

no code implementations • 2 Jan 2024 • Dongxu Li, Jianhao Huang, Chuan Huang, Xiaoqi Qin, Han Zhang, Ping Zhang

For the case where the semantic source distribution is unknown and only a set of source samples is available, we propose a neural-network-based method that leverages generative networks to learn the semantic source distribution.

X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

2 code implementations • 30 Nov 2023 • Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles

To enable this framework, we devise a scalable pipeline that automatically generates high-quality instruction-tuning datasets from readily available captioning data across different modalities, and we contribute 24K QA samples for audio and 250K QA samples for 3D.

Visual Reasoning

Linearized Relative Positional Encoding

no code implementations • 18 Jul 2023 • Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, Yiran Zhong

Meanwhile, it provides a general paradigm for designing a broader range of relative positional encoding methods that are applicable to linear transformers.

Image Classification, Language Modelling, +2

Toeplitz Neural Network for Sequence Modeling

2 code implementations • 8 May 2023 • Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, Yiran Zhong

Sequence modeling has important applications in natural language processing and computer vision.

Language Modelling, Position

Joint Task and Data Oriented Semantic Communications: A Deep Separate Source-channel Coding Scheme

no code implementations • 27 Feb 2023 • Jianhao Huang, Dongxu Li, Chuan Huang, Xiaoqi Qin, Wei Zhang

This paper proposes a deep separate source-channel coding (DSSCC) framework for the joint task and data oriented semantic communications (JTD-SC) and utilizes the variational autoencoder approach to solve the rate-distortion problem with semantic distortion.

Bayesian Inference, Data Compression

From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models

no code implementations • CVPR 2023 • Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, DaCheng Tao, Steven Hoi

To address this issue, we propose Img2Prompt, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training.

Question Answering, Visual Question Answering, +1

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

3 code implementations • 21 Dec 2022 • Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, DaCheng Tao, Steven C. H. Hoi

To address this issue, we propose Img2Prompt, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training.

Question Answering, Visual Question Answering, +1

PCRED: Zero-shot Relation Triplet Extraction with Potential Candidate Relation Selection and Entity Boundary Detection

no code implementations • 26 Nov 2022 • Yuquan Lan, Dongxu Li, Yunqi Zhang, Hui Zhao, Gang Zhao

To address the above issues, we propose a novel method named PCRED for ZeroRTE with Potential Candidate Relation Selection and Entity Boundary Detection.

Boundary Detection, Relation, +2

The Devil in Linear Transformer

1 code implementation • 19 Oct 2022 • Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, Yiran Zhong

In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation, which adversely impact the convergence of linear transformer models; and 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structures.

Language Modelling, Text Classification

TODE-Trans: Transparent Object Depth Estimation with Transformer

1 code implementation • 18 Sep 2022 • Kang Chen, Shaochen Wang, Beihao Xia, Dongxu Li, Zhen Kan, Bin Li

We observe that the global characteristics of the transformer make it easier to extract contextual information to perform depth estimation of transparent areas.

Depth Estimation, Object, +2

LAVIS: A Library for Language-Vision Intelligence

1 code implementation • 15 Sep 2022 • Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven C. H. Hoi

We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.

Benchmarking, Image Captioning, +8

cosFormer: Rethinking Softmax in Attention

3 code implementations • ICLR 2022 • Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong

As one of the transformer's core components, softmax attention helps capture long-range dependencies, yet its quadratic space and time complexity with respect to the sequence length prohibits scaling it up.

D4RL, Language Modelling, +1

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

9 code implementations • 28 Jan 2022 • Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.

Ranked #3 on Open Vocabulary Attribute Detection on OVAD-Box benchmark (using extra training data)

Image Captioning, Image-text matching, +5

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

1 code implementation • CVPR 2022 • Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C. H. Hoi

To achieve this, we first introduce an entity prompter module, which is trained with video-text contrastive (VTC) learning to produce the similarity between a video crop and text prompts instantiated with entity names.

cross-modal alignment, Entity Alignment, +4

Transcribing Natural Languages for The Deaf via Neural Editing Programs

1 code implementation • 17 Dec 2021 • Dongxu Li, Chenchen Xu, Liu Liu, Yiran Zhong, Rong Wang, Lars Petersson, Hongdong Li

This work studies the task of glossification, which aims to transcribe natural spoken language sentences for the Deaf (hard-of-hearing) community into ordered sign language glosses.

Sentence

Enhanced Spatio-Temporal Interaction Learning for Video Deraining: A Faster and Better Framework

1 code implementation • 23 Mar 2021 • Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren, Wei Liu

Video deraining is an important task in computer vision as the unwanted rain hampers the visibility of videos and deteriorates the robustness of most outdoor vision systems.

Rain Removal, Video deraining

Dual Attention-in-Attention Model for Joint Rain Streak and Raindrop Removal

no code implementations • 12 Mar 2021 • Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren

In addition, to further refine the result, a Differential-driven Dual Attention-in-Attention Model (D-DAiAM) is proposed with a "heavy-to-light" scheme that removes rain by addressing unsatisfactorily derained regions.

Raindrop Removal, Rain Removal

Benchmarking Ultra-High-Definition Image Super-Resolution

no code implementations • ICCV 2021 • Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren, Bjorn Stenger, Wei Liu, Hongdong Li, Ming-Hsuan Yang

Increasingly, modern mobile devices allow users to capture images at Ultra-High-Definition (UHD) resolution, which includes 4K and 8K images.

4k, 8k, +3

Transferring Cross-domain Knowledge for Video Sign Language Recognition

no code implementations • CVPR 2020 • Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li

To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align these two domain features.

Sign Language Recognition

Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison

3 code implementations • 24 Oct 2019 • Dongxu Li, Cristian Rodriguez Opazo, Xin Yu, Hongdong Li

Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performances in large scale scenarios.

Action Classification, Benchmarking, +3
