Search Results for author: Yu Qiao

Found 464 papers, 289 papers with code

Mining Inter-Video Proposal Relations for Video Object Detection

1 code implementation ECCV 2020 Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao

Recent studies have shown that aggregating contextual information from proposals in different frames can clearly enhance the performance of video object detection.

Object Detection +3

The Best of Both Worlds: Combining Engineered Features with Transformers for Improved Mental Health Prediction from Reddit Posts

no code implementations SMM4H (COLING) 2022 Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

In recent years, there has been increasing interest in the application of natural language processing and machine learning techniques to the detection of mental health conditions (MHC) based on social media data.

MANTIS at SMM4H’2022: Pre-Trained Language Models Meet a Suite of Psycholinguistic Features for the Detection of Self-Reported Chronic Stress

no code implementations SMM4H (COLING) 2022 Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

This paper describes our submission to Social Media Mining for Health (SMM4H) 2022 Shared Task 8, aimed at detecting self-reported chronic stress on Twitter.

Automated Classification of Written Proficiency Levels on the CEFR-Scale through Complexity Contours and RNNs

no code implementations EACL (BEA) 2021 Elma Kerz, Daniel Wiechmann, Yu Qiao, Emma Tseng, Marcus Ströbel

The key idea of the present paper is the combined use of what we refer to as ‘complexity contours’, series of measurements of L2 proficiency indices obtained by a computational tool that implements a sliding-window technique, and recurrent neural network (RNN) classifiers that capture the sequential information in those contours.
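
The pipeline is easy to picture as code. Below is a minimal sketch, assuming a single lexical-density index, a window of 10 tokens, and a small GRU; the actual tool computes a whole suite of complexity indices, and the classifier details here are illustrative, not the authors' configuration.

```python
# Minimal sketch: a sliding-window complexity contour fed to a GRU classifier.
# The single lexical-density index, window size, and network width are
# illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn

def complexity_contour(tagged_tokens, window=10):
    """Slide a fixed-size window over (word, POS) pairs and record one
    complexity measurement per step (here: lexical density)."""
    content = {"NOUN", "VERB", "ADJ", "ADV"}
    steps = [
        sum(1 for _, pos in tagged_tokens[i:i + window] if pos in content) / window
        for i in range(len(tagged_tokens) - window + 1)
    ]
    return torch.tensor(steps).unsqueeze(-1)        # (steps, 1 feature)

class ContourClassifier(nn.Module):
    """GRU over the contour; the last hidden state scores 6 CEFR levels."""
    def __init__(self, n_features=1, hidden=32, n_levels=6):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_levels)

    def forward(self, contour):                     # (batch, steps, n_features)
        _, h = self.rnn(contour)
        return self.head(h[-1])                     # (batch, n_levels)

tokens = [("the", "DET"), ("cat", "NOUN"), ("quickly", "ADV"), ("ran", "VERB")] * 10
logits = ContourClassifier()(complexity_contour(tokens).unsqueeze(0))
```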

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

1 code implementation ECCV 2020 Xiao Zhang, Rui Zhao, Yu Qiao, Hongsheng Li

To address this problem, this paper introduces a novel Radial Basis Function (RBF) distance to replace the commonly used inner product in the softmax loss function, so that it can adaptively assign losses to regularize the intra-class and inter-class distances by reshaping their relative differences, thus creating more representative class prototypes to improve optimization.
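
As a concrete reading of that sentence, here is a minimal sketch of an RBF-softmax loss, assuming squared Euclidean distances to learnable class prototypes and an exponential kernel with illustrative gamma and scale values; the paper's exact parameterization may differ.

```python
# Hedged sketch: replace the inner-product logit w_c . x with an RBF kernel
# over the feature-prototype distance. gamma and scale are illustrative.
import torch
import torch.nn.functional as F

def rbf_softmax_loss(features, prototypes, labels, gamma=1.0, scale=10.0):
    # Squared Euclidean distance between each feature and each class prototype.
    d2 = torch.cdist(features, prototypes).pow(2)     # (batch, n_classes)
    logits = scale * torch.exp(-gamma * d2)           # RBF kernel value as logit
    return F.cross_entropy(logits, labels)

features   = torch.randn(8, 64)                        # batch of embeddings
prototypes = torch.randn(10, 64, requires_grad=True)   # one learnable prototype per class
labels     = torch.randint(0, 10, (8,))
rbf_softmax_loss(features, prototypes, labels).backward()  # prototypes receive gradients
```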

Language that Captivates the Audience: Predicting Affective Ratings of TED Talks in a Multi-Label Classification Task

no code implementations EACL (WASSA) 2021 Elma Kerz, Yu Qiao, Daniel Wiechmann

The aim of the paper is twofold: (1) to automatically predict the ratings assigned by viewers to 14 categories available for TED talks in a multi-label classification task and (2) to determine what types of features drive classification accuracy for each of the categories.

Multi-Label Classification

A Language-Based Approach to Fake News Detection Through Interpretable Features and BRNN

no code implementations RDSM (COLING) 2020 Yu Qiao, Daniel Wiechmann, Elma Kerz

We demonstrate that our approach is promising, as it achieves results on these two datasets similar to those of the best-performing black-box models reported in the literature.

Explainable Models Fake News Detection +1

FANG-COVID: A New Large-Scale Benchmark Dataset for Fake News Detection in German

1 code implementation EMNLP (FEVER) 2021 Justus Mattern, Yu Qiao, Elma Kerz, Daniel Wiechmann, Markus Strohmaier

As the world continues to fight the COVID-19 pandemic, it is simultaneously fighting an ‘infodemic’ – a flood of disinformation and spread of conspiracy theories leading to health threats and the division of society.

Fake News Detection

A Preliminary Exploration Towards General Image Restoration

no code implementations27 Aug 2024 Xiangtao Kong, Jinjin Gu, Yihao Liu, Wenlong Zhang, Xiangyu Chen, Yu Qiao, Chao Dong

Existing deep models, tailored for specific individual image restoration tasks, often fall short in effectively addressing these challenges.

Deblurring Image Denoising +3

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

no code implementations20 Aug 2024 Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang

Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world.

Text-to-Image Generation

Learning A Low-Level Vision Generalist via Visual Task Prompt

1 code implementation16 Aug 2024 Xiangyu Chen, Yihao Liu, Yuandong Pu, Wenlong Zhang, Jiantao Zhou, Yu Qiao, Chao Dong

In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing.

Image Reconstruction Image Stylization

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

1 code implementation5 Aug 2024 Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, Yu Qiao

The need for improved diagnostic methods in ophthalmology is acute, especially in less-developed regions with limited access to specialists and advanced equipment.

Clinical Knowledge

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

2 code implementations5 Aug 2024 Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.

Decoder Depth Estimation +3

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

no code implementations5 Aug 2024 Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks.

Image Comprehension Multiple-choice

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

1 code implementation24 Jul 2024 Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji

To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control.

Image Inpainting Object

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

1 code implementation22 Jul 2024 Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, Yu Qiao

Diffusion models have made great progress in image animation thanks to their powerful generative capabilities.

Image Animation

The Better Angels of Machine Personality: How Personality Relates to LLM Safety

1 code implementation17 Jul 2024 Jie Zhang, Dongrui Liu, Chen Qian, Ziyue Gan, Yong liu, Yu Qiao, Jing Shao

In this paper, we discover that LLMs' personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness, based on the reliable MBTI-M scale.

Fairness Safety Alignment

GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity

no code implementations17 Jul 2024 Shuo Cao, Yihao Liu, Wenlong Zhang, Yu Qiao, Chao Dong

Based on degradation similarity, GRIDS assigns each restoration task to one of the optimal groups, where tasks within the same group are highly correlated.

Image Restoration Model Selection
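
A hedged sketch of the grouping step: given a pairwise task-similarity matrix, greedily merge tasks that are mutually similar. The similarity values and threshold rule below are assumptions for illustration; GRIDS derives its similarities from learned degradation representations.

```python
# Illustrative greedy grouping of restoration tasks by pairwise similarity.
import numpy as np

def group_tasks(similarity, names, threshold=0.6):
    """Assign each task to an existing group if it is similar enough to
    every member; otherwise start a new group."""
    groups = []
    for i, name in enumerate(names):
        for g in groups:
            if all(similarity[i, j] >= threshold for j in g["ids"]):
                g["ids"].append(i)
                g["names"].append(name)
                break
        else:
            groups.append({"ids": [i], "names": [name]})
    return [g["names"] for g in groups]

names = ["denoise", "deblur", "derain", "dehaze"]
sim = np.array([[1.0, 0.7, 0.3, 0.2],
                [0.7, 1.0, 0.4, 0.3],
                [0.3, 0.4, 1.0, 0.8],
                [0.2, 0.3, 0.8, 1.0]])
print(group_tasks(sim, names))   # [['denoise', 'deblur'], ['derain', 'dehaze']]
```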

Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond

no code implementations15 Jul 2024 Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin, Ping Yi, Yingchun Wang, Yu Qiao, Li Li, Fei-Yue Wang

Large Language Models (LLMs) are increasingly integrated into diverse industries, posing substantial security risks due to unauthorized replication and misuse.

Language Modelling Large Language Model

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

no code implementations11 Jul 2024 Wenshuo Peng, Kaipeng Zhang, Yue Yang, Hao Zhang, Yu Qiao

Due to the limitations of these weakly paired samples, the pre-trained model is unable to mine all the knowledge from the pre-training data.

Contrastive Learning Image Classification +1

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

1 code implementation10 Jul 2024 Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits.

Quantization

VEnhancer: Generative Space-Time Enhancement for Video Generation

no code implementations10 Jul 2024 Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu

Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously with arbitrary up-sampling space and time scales through a unified video diffusion model.

Data Augmentation Video Generation +1

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

1 code implementation8 Jul 2024 Yinquan Lu, Wenhao Zhu, Lei LI, Yu Qiao, Fei Yuan

To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages.

Data Augmentation Translation

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

1 code implementation26 Jun 2024 Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei HUANG, Yali Wang, Tong Lu, LiMin Wang, Yu Qiao

In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.

Ranked #1 on Long Term Action Anticipation on Ego4D (using extra training data)

Action Anticipation Action Recognition +6

OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

1 code implementation18 Jun 2024 Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, PengFei Liu

We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions.

Benchmarking scientific discovery

Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models

1 code implementation17 Jun 2024 Fangzhi Xu, Qiushi Sun, Kanzhi Cheng, Jun Liu, Yu Qiao, Zhiyong Wu

One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning.

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

1 code implementation17 Jun 2024 Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs).

Visual Question Answering

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

no code implementations17 Jun 2024 Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo

Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense, closely aligning with human assessments; and (3) current T2I models are primarily focused on text-to-image translation, lacking profound reasoning regarding physical commonsense.

Image Generation

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

1 code implementation13 Jun 2024 Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang

To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

1 code implementation12 Jun 2024 Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai

It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios.

Image Generation Language Modelling +6

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

1 code implementation12 Jun 2024 Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo

Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms.

Navigate

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

1 code implementation11 Jun 2024 Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang

In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator.

Needle In A Multimodal Haystack

1 code implementation11 Jun 2024 Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Retrieval

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

1 code implementation11 Jun 2024 Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie zhou, Yu Qiao, Jifeng Dai

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data.

Contrastive Learning

MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation

1 code implementation9 Jun 2024 Yan Ma, Yu Qiao, PengFei Liu

In the supplementary materials, we provide the MoPS code suite, along with 7.6k generated premises and 1k extended stories.

Diversity Sentence +1

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

no code implementations6 Jun 2024 Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

To this end, we meticulously designed a differential video captioning strategy that is stable, scalable, and efficient for generating captions for videos with arbitrary resolutions, aspect ratios, and lengths.

Video Captioning Video Generation +2

Parameter-Inverted Image Pyramid Networks

1 code implementation6 Jun 2024 Xizhou Zhu, Xue Yang, Zhaokai Wang, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance.

Computational Efficiency Image Classification +2
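
The core idea lends itself to a small sketch: pair each pyramid level's resolution with an inversely sized branch and fuse the results. The three levels, branch widths, and sum fusion below are assumptions, not the paper's architecture.

```python
# Parameter-inverted pairing: the largest branch sees the smallest image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBranch(nn.Module):
    def __init__(self, width, out_dim=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(width, out_dim)

    def forward(self, x):
        feats = self.body(x).mean(dim=(2, 3))   # global average pool
        return self.proj(feats)

# (input resolution, branch width): high res -> small net, low res -> big net.
levels = [(224, 64), (112, 256), (56, 1024)]
branches = nn.ModuleList(PyramidBranch(w) for _, w in levels)

img = torch.randn(1, 3, 224, 224)
fused = sum(b(F.interpolate(img, size=(r, r), mode="bilinear", align_corners=False))
            for (r, _), b in zip(levels, branches))   # (1, 256) fused feature
```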

Learning 1D Causal Visual Representation with De-focus Attention Networks

1 code implementation6 Jun 2024 Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie zhou, Jifeng Dai

The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization.

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

1 code implementation5 Jun 2024 Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions.

Point Cloud Generation Text-to-Image Generation

Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

1 code implementation5 Jun 2024 Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Yu Qiao, Lu Sheng

However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results.

3D Generation 3D Reconstruction +2

Learning Manipulation by Predicting Interaction

1 code implementation1 Jun 2024 Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object.

Representation Learning

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

no code implementations31 May 2024 Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao

We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations.

Video Generation

Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

1 code implementation29 May 2024 Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao

In this work, we introduce weak-to-strong search, framing the alignment of a large language model as a test-time greedy search to maximize the log-likelihood difference between small tuned and untuned models while sampling from the frozen large model.

Instruction Following Language Modelling +1
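
A minimal sketch of the idea, reduced to best-of-n reranking: sample candidates from the frozen large model and keep the one that maximizes log p_tuned(y|x) - log p_untuned(y|x) under the small models. The callables are hypothetical stand-ins for real language-model APIs, and the paper's actual method is a finer-grained search rather than this simplification.

```python
# Hedged sketch: rerank large-model samples by the small models' log-prob gap.
import random

def weak_to_strong_pick(prompt, sample_large, logp_tuned, logp_untuned, n=8):
    """sample_large(prompt) -> candidate text; logp_*(prompt, text) -> float.
    All three callables are stand-ins for real language-model APIs."""
    candidates = [sample_large(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: logp_tuned(prompt, y) - logp_untuned(prompt, y))

# Toy stand-ins: the "tuned" scorer prefers helpful completions.
def sample_large(prompt): return random.choice(["Sure, here you go.", "No."])
def logp_tuned(p, y):     return 0.0 if "Sure" in y else -5.0
def logp_untuned(p, y):   return -1.0

print(weak_to_strong_pick("Help me:", sample_large, logp_tuned, logp_untuned))
```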

Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving

1 code implementation24 May 2024 Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yong liu, Yu Qiao

Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4-powered Analytic Process and achieve continuous performance improvement.

Autonomous Driving Decision Making +1

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

no code implementations23 May 2024 Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

In this paper, we propose a plug-and-play framework for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed SearchLVLMs.

Question Answering RAG +1

FLoRA: Low-Rank Core Space for N-dimension

2 code implementations23 May 2024 Chongjie Si, Xuehui Wang, Xue Yang, Zhengqin Xu, Qingyun Li, Jifeng Dai, Yu Qiao, Xiaokang Yang, Wei Shen

To tackle the diversity of dimensional spaces across different foundation models and provide a more precise representation of the changes within these spaces, this paper introduces a generalized parameter-efficient fine-tuning framework, FLoRA, designed for parameter spaces of various dimensions.

parameter-efficient fine-tuning Tensor Decomposition
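
For intuition, a Tucker-style reconstruction shows how a small core plus one factor matrix per mode can represent an update for an N-dimensional weight such as a 4-D convolution kernel. The ranks and shapes below are illustrative assumptions, not FLoRA's prescribed settings.

```python
# Low-rank "core space" update for an N-dimensional weight (Tucker-style).
import numpy as np

def flora_delta(core, factors):
    """core: (r1,...,rN); factors[i]: (dim_i, r_i). Returns (dim_1,...,dim_N)."""
    delta = core
    for i, U in enumerate(factors):
        delta = np.tensordot(U, delta, axes=(1, i))   # contract mode i
        delta = np.moveaxis(delta, 0, i)              # restore mode order
    return delta

shape, ranks = (64, 32, 3, 3), (4, 4, 2, 2)
core = np.random.randn(*ranks) * 0.01
factors = [np.random.randn(d, r) for d, r in zip(shape, ranks)]
delta_w = flora_delta(core, factors)
assert delta_w.shape == shape   # full conv-kernel update from tiny factors
```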

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

2 code implementations9 May 2024 Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details.

Causal Evaluation of Language Models

2 code implementations1 May 2024 Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu

Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning.

Causal Discovery Causal Inference +1

Towards Real-world Video Face Restoration: A New Benchmark

no code implementations30 Apr 2024 Ziyan Chen, Jingwen He, Xinqi Lin, Yu Qiao, Chao Dong

Blind face restoration (BFR) on images has progressed significantly over the last several years, while real-world video face restoration (VFR), which is more challenging owing to more complex face motions such as changing gaze directions and facial orientations, remains unsolved.

Blind Face Restoration Image Quality Assessment +1

FedCCL: Federated Dual-Clustered Feature Contrast Under Domain Heterogeneity

no code implementations14 Apr 2024 Yu Qiao, Huy Q. Le, Mengchun Zhang, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong

First, we employ clustering on the local representations of each client, aiming to capture intra-class information based on these local clusters at a high level of granularity.

Clustering Federated Learning +1

Logit Calibration and Feature Contrast for Robust Federated Learning on Non-IID Data

no code implementations10 Apr 2024 Yu Qiao, Chaoning Zhang, Apurba Adhikary, Choong Seon Hong

Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks.

Adversarial Robustness Federated Learning +1

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

no code implementations CVPR 2024 Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun

We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that focuses on improving the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos.

Dense Video Captioning Diversity

Linear Attention Sequence Parallelism

1 code implementation3 Apr 2024 Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

In this paper, we introduce Linear Attention Sequence Parallel (LASP), an efficient SP method tailored to linear attention-based language models.
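
The property that makes linear attention amenable to sequence parallelism is that causal outputs decompose into per-chunk work plus a small running state, which is all that needs to be communicated between devices. A single-head, unnormalized sketch (my simplification, not LASP's implementation):

```python
# Chunked causal linear attention: each chunk needs only its own tokens plus
# a (d x d) running KV-state, the small message passed between chunks/devices.
import numpy as np

def chunked_linear_attention(Q, K, V, chunk=4):
    """O[t] = Q[t] @ sum_{s<=t} K[s]^T V[s], computed chunk by chunk."""
    n, d = Q.shape
    state = np.zeros((d, d))            # running sum of K^T V from earlier chunks
    out = np.zeros_like(V)
    for s in range(0, n, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        intra = np.tril(q @ k.T) @ v    # causal attention within the chunk
        out[s:s+chunk] = q @ state + intra
        state = state + k.T @ v         # hand this state to the next chunk/device
    return out

Q, K, V = (np.random.randn(16, 8) for _ in range(3))
ref = np.tril(Q @ K.T) @ V              # exact causal linear attention
assert np.allclose(chunked_linear_attention(Q, K, V), ref)
```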

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

no code implementations CVPR 2024 Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets.

Image Captioning Instruction Following +1

VideoDistill: Language-aware Vision Distillation for Video Question Answering

no code implementations1 Apr 2024 Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.

Answer Generation Question Answering +1

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

1 code implementation CVPR 2024 Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.

Diversity Language Modelling +1

Are We on the Right Way for Evaluating Large Vision-Language Models?

1 code implementation29 Mar 2024 Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

World Knowledge

RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents

no code implementations28 Mar 2024 Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng

The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments.

Motion Planning

InternLM2 Technical Report

3 code implementations26 Mar 2024 Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, FuKai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, JIA YU, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).

4k Long-Context Understanding

Assessment of Multimodal Large Language Models in Alignment with Human Values

1 code implementation26 Mar 2024 Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh).

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

1 code implementation CVPR 2024 Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao

Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.

Ranked #1 on Action Anticipation on EgoExoLearn (using extra training data)

Action Anticipation Action Quality Assessment +2

DreamDA: Generative Data Augmentation with Diffusion Models

1 code implementation19 Mar 2024 Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu

The acquisition of large-scale, high-quality data is a resource-intensive and time-consuming endeavor.

Data Augmentation Diversity

MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

1 code implementation18 Mar 2024 Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

It is a long-standing goal to design a generalist embodied agent that can follow diverse instructions in human-like ways.

Instruction Following

AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

no code implementations14 Mar 2024 Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others).

Fairness Language Modelling

Desigen: A Pipeline for Controllable Design Template Generation

no code implementations CVPR 2024 Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen

In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background.

VideoMamba: State Space Model for Efficient Video Understanding

2 code implementations11 Mar 2024 Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work adapts Mamba to the video domain.

Video Understanding

Embodied Understanding of Driving Scenarios

1 code implementation7 Mar 2024 Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.

Autonomous Driving Language Modelling +1

Towards Robust Federated Learning via Logits Calibration on Non-IID Data

no code implementations5 Mar 2024 Yu Qiao, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong

Meanwhile, the non-independent and identically distributed (non-IID) challenge of data distribution between edge devices can further degrade the performance of models.

Federated Learning Privacy Preserving

Position: Towards Implicit Prompt For Text-To-Image Models

no code implementations4 Mar 2024 Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

Position

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

1 code implementation4 Mar 2024 Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification while offering significantly faster speeds and lower memory usage when processing high-resolution inputs.

Image Classification

Efficient Action Counting with Dynamic Queries

1 code implementation3 Mar 2024 Zishi Li, Xiaoxuan Ma, Qiuyan Shang, Wentao Zhu, Hai Ci, Yu Qiao, Yizhou Wang

Temporal repetition counting aims to quantify the repeated action cycles within a video.

Contrastive Learning

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

1 code implementation29 Feb 2024 Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai

In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs.

Hallucination Object Localization +3

Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

1 code implementation29 Feb 2024 Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong liu, Jing Shao

This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.

Fairness Mutual Information Estimation

Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning

no code implementations27 Feb 2024 Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun

Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions.

Imitation Learning Quantization

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

no code implementations22 Feb 2024 Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

To bridge this "ideal-to-real" gap, this paper presents RoboScript, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language.

Code Generation Common Sense Reasoning +2

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

2 code implementations18 Feb 2024 Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question answering.

Question Answering Text Summarization

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

1 code implementation14 Feb 2024 Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao

Large Language Models (LLMs) are now commonplace in conversation applications.

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

1 code implementation CVPR 2024 Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.

Medical Visual Question Answering Question Answering +1

Safety of Multimodal Large Language Models on Images and Texts

2 code implementations1 Feb 2024 Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao

In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text.

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

1 code implementation29 Jan 2024 Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

CO2 is able to attain high scalability even on large multi-node clusters constrained by very limited communication bandwidth.

Cross-Modal Prototype based Multimodal Federated Learning under Severely Missing Modality

no code implementations25 Jan 2024 Huy Q. Le, Chu Myaet Thwal, Yu Qiao, Ye Lin Tun, Minh N. H. Nguyen, Choong Seon Hong

In this paper, we propose Multimodal Federated Cross Prototype Learning (MFCPL), a novel approach for MFL under severely missing modalities, which constructs complete prototypes to provide diverse modality knowledge at the modality-shared level through cross-modal regularization and at the modality-specific level through a cross-modal contrastive mechanism.

cross-modal alignment Federated Learning

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

no code implementations CVPR 2024 Fanghua Yu, Jinjin Gu, Zheyuan Li, JinFan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong

We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative priors and the power of model scaling.

Descriptive Image Restoration

SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning

1 code implementation24 Jan 2024 Guoxin Chen, Kexin Tang, Chao Yang, Fuying Ye, Yu Qiao, Yiming Qian

Moreover, existing reinforcement learning (RL) based methods overlook the structured relationships, underutilizing the potential of RL in structured reasoning.

Question Answering reinforcement-learning +1

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

1 code implementation22 Jan 2024 Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao

In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety.

Vlogger: Make Your Dream A Vlog

1 code implementation CVPR 2024 Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang

More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor.

Language Modelling Large Language Model +1

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

2 code implementations CVPR 2024 Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Image Classification Image Generation +1

ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

no code implementations CVPR 2024 Yuan Xu, Xiaoxuan Ma, Jiajun Su, Wentao Zhu, Yu Qiao, Yizhou Wang

Experimental results demonstrate that HypoNet outperforms existing state-of-the-art probabilistic methods as a multi-hypothesis mesh estimator.

Denoising

Language-aware Visual Semantic Distillation for Video Question Answering

no code implementations CVPR 2024 Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao

In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.

Answer Generation Question Answering +1

Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models

2 code implementations CVPR 2024 Bin Fu, Fanghua Yu, Anran Liu, Zixuan Wang, Jie Wen, Junjun He, Yu Qiao

Based on this observation, we generalize diffusion methods to model the font generation process by separating the reverse diffusion process into three stages with different functions: the structure construction stage first generates the structure information for the target character based on the source image, and the font transfer stage subsequently transforms the source font into the target font.

Disentanglement Font Generation +1

Critic-Guided Decision Transformer for Offline Reinforcement Learning

no code implementations21 Dec 2023 Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner.

D4RL Offline RL +3
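
For context, a minimal sketch of the RCSL paradigm the sentence refers to: fit a policy to logged actions conditioned on (state, target return), then condition on a high return at test time. The network, loss, and random data below are illustrative assumptions.

```python
# Return-Conditioned Supervised Learning: supervised regression of actions
# from (state, return-to-go) pairs in an offline dataset.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

states  = torch.randn(256, 4)          # logged offline transitions (toy data)
actions = torch.randn(256, 2)
returns = torch.randn(256, 1)          # return-to-go for each state

for _ in range(100):                   # supervised fit onto the logged actions
    pred = policy(torch.cat([states, returns], dim=-1))
    loss = nn.functional.mse_loss(pred, actions)
    opt.zero_grad()
    loss.backward()
    opt.step()

high_return = torch.tensor([[2.0]])    # condition on an ambitious target return
action = policy(torch.cat([torch.randn(1, 4), high_return], dim=-1))
```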

M-BEV: Masked BEV Perception for Robust Autonomous Driving

1 code implementation19 Dec 2023 Siran Chen, Yue Ma, Yu Qiao, Yali Wang

It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision, and reconstructs the masked ones with the distinct spatio-temporal context across views.

Autonomous Driving
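
A toy sketch of that training signal, assuming per-view feature vectors and a stand-in transformer layer as the cross-view fusion module (illustrative, not M-BEV's actual architecture):

```python
# Masked-view self-supervision: drop one camera view's features, reconstruct
# it from the remaining views, supervise with the original features.
import torch
import torch.nn as nn

views = torch.randn(1, 6, 256)                  # features from 6 camera views
mask_id = torch.randint(0, 6, (1,)).item()

masked = views.clone()
masked[:, mask_id] = 0.0                        # simulate a failed camera

fuser = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
reconstructed = fuser(masked)                   # cross-view context fills the gap

loss = nn.functional.mse_loss(reconstructed[:, mask_id], views[:, mask_id])
loss.backward()                                 # self-supervised reconstruction
```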

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

1 code implementation19 Dec 2023 Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language.

Text Generation Text-to-Image Generation

Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey

no code implementations15 Dec 2023 Xu Liu, Tong Zhou, Yuanxin Wang, Yuping Wang, Qinjingwen Cao, Weizhi Du, Yonghuan Yang, Junjun He, Yu Qiao, Yiqing Shen

The advent of foundation models, which are pre-trained on vast datasets, has ushered in a new era of computer vision, characterized by their robustness and remarkable zero-shot generalization capabilities.

Image Generation Image Segmentation +2