1 code implementation • ECCV 2020 • Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao
Recent studies have shown that aggregating context information from proposals in different frames can clearly enhance the performance of video object detection.
Ranked #13 on Video Object Detection on ImageNet VID
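The cross-frame aggregation described in this entry can be pictured as attention over proposal features from other frames. A minimal sketch of that general idea in PyTorch, with hypothetical shapes and no claim to match the paper's exact relation module:

```python
import torch

def aggregate_proposals(query_feats, support_feats):
    # query_feats: (Nq, D) proposal features from the current frame
    # support_feats: (Ns, D) proposal features gathered from support frames
    scale = query_feats.size(-1) ** 0.5
    attn = torch.softmax(query_feats @ support_feats.T / scale, dim=-1)
    return query_feats + attn @ support_feats  # context-enhanced proposals
```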
no code implementations • SMM4H (COLING) 2022 • Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz
In recent years, there has been increasing interest in the application of natural language processing and machine learning techniques to the detection of mental health conditions (MHC) based on social media data.
no code implementations • SMM4H (COLING) 2022 • Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz
This paper describes our submission to Social Media Mining for Health (SMM4H) 2022 Shared Task 8, aimed at detecting self-reported chronic stress on Twitter.
1 code implementation • LREC 2022 • Elma Kerz, Yu Qiao, Sourabh Zanwar, Daniel Wiechmann
In recent years, there has been increasing interest in automatic personality detection based on language.
no code implementations • EACL (BEA) 2021 • Elma Kerz, Daniel Wiechmann, Yu Qiao, Emma Tseng, Marcus Ströbel
The key to the present paper is the combined use of what we refer to as ‘complexity contours’, a series of measurements of indices of L2 proficiency obtained by a computational tool that implements a sliding window technique, and recurrent neural network (RNN) classifiers that adequately capture the sequential information in those contours.
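A complexity contour is essentially a sliding-window series over per-sentence measurements. A minimal sketch, assuming a hypothetical `scores` list rather than the tool's actual proficiency indices:

```python
def complexity_contour(scores, window=5):
    # Slide a fixed-size window over per-sentence complexity scores;
    # the series of window means forms the 'contour' fed to the RNN.
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# e.g. complexity_contour([2.1, 3.4, 2.8, 4.0, 3.1, 2.5], window=3)
```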
1 code implementation • ECCV 2020 • Xiao Zhang, Rui Zhao, Yu Qiao, Hongsheng Li
To address this problem, this paper introduces novel Radial Basis Function (RBF) distances to replace the commonly used inner products in the softmax loss function, so that losses can be adaptively assigned to regularize the intra-class and inter-class distances by reshaping the relative differences, creating more representative class prototypes and improving optimization.
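The core substitution — RBF similarities to class prototypes in place of inner-product logits — can be sketched as below; the names, scaling constant, and exact kernel form are assumptions, not the paper's definitive formulation:

```python
import torch
import torch.nn.functional as F

def rbf_logits(features, prototypes, gamma=1.0, scale=16.0):
    # features: (batch, dim); prototypes: (num_classes, dim)
    d2 = torch.cdist(features, prototypes).pow(2)  # squared Euclidean distances
    return scale * torch.exp(-gamma * d2)          # RBF similarity in (0, scale]

features, prototypes = torch.randn(8, 128), torch.randn(10, 128)
loss = F.cross_entropy(rbf_logits(features, prototypes),
                       torch.randint(0, 10, (8,)))
```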
no code implementations • EACL (WASSA) 2021 • Elma Kerz, Yu Qiao, Daniel Wiechmann
The aim of the paper is twofold: (1) to automatically predict the ratings assigned by viewers to 14 categories available for TED talks in a multi-label classification task and (2) to determine what types of features drive classification accuracy for each of the categories.
no code implementations • RDSM (COLING) 2020 • Yu Qiao, Daniel Wiechmann, Elma Kerz
We demonstrate that our approach is promising as it achieves similar results on these two datasets as the best performing black box models reported in the literature.
1 code implementation • EMNLP (FEVER) 2021 • Justus Mattern, Yu Qiao, Elma Kerz, Daniel Wiechmann, Markus Strohmaier
As the world continues to fight the COVID-19 pandemic, it is simultaneously fighting an ‘infodemic’ – a flood of disinformation and spread of conspiracy theories leading to health threats and the division of society.
no code implementations • 10 Sep 2024 • Junyi Chen, Weicai Ye, Yifan Wang, Danpeng Chen, Di Huang, Wanli Ouyang, Guofeng Zhang, Yu Qiao, Tong He
To this end, we propose GigaGS, the first work on high-quality surface reconstruction for large-scale scenes using 3DGS.
no code implementations • 27 Aug 2024 • Xiangtao Kong, Jinjin Gu, Yihao Liu, Wenlong Zhang, Xiangyu Chen, Yu Qiao, Chao Dong
Existing deep models, tailored for specific individual image restoration tasks, often fall short in effectively addressing these challenges.
no code implementations • 20 Aug 2024 • Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world.
1 code implementation • 16 Aug 2024 • Xiangyu Chen, Yihao Liu, Yuandong Pu, Wenlong Zhang, Jiantao Zhou, Yu Qiao, Chao Dong
In addition, these methods are sensitive to the content of the prompt image and often struggle with low-frequency information processing.
1 code implementation • 6 Aug 2024 • Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields.
1 code implementation • 5 Aug 2024 • Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, Yu Qiao
The need for improved diagnostic methods in ophthalmology is acute, especially in the less developed regions with limited access to specialists and advanced equipment.
2 code implementations • 5 Aug 2024 • Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.
no code implementations • 5 Aug 2024 • Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks.
1 code implementation • 24 Jul 2024 • Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji
To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control.
1 code implementation • 22 Jul 2024 • Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, Yu Qiao
Diffusion models have achieved great progress in image animation due to their powerful generative capabilities.
1 code implementation • 22 Jul 2024 • Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, Yu Qiao, Jifeng Dai
To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction.
1 code implementation • 18 Jul 2024 • Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
To bridge the gap between image and video, in this work, we propose a new video segmentation task - video reasoning segmentation.
1 code implementation • 17 Jul 2024 • Jie Zhang, Dongrui Liu, Chen Qian, Ziyue Gan, Yong liu, Yu Qiao, Jing Shao
In this paper, we discover that LLMs' personality traits are closely related to their safety abilities, i.e., toxicity, privacy, and fairness, based on the reliable MBTI-M scale.
no code implementations • 17 Jul 2024 • Shuo Cao, Yihao Liu, Wenlong Zhang, Yu Qiao, Chao Dong
Based on degradation similarity, GRIDS assigns each restoration task to one of the optimal groups, where tasks within the same group are highly correlated.
no code implementations • 15 Jul 2024 • Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin, Ping Yi, Yingchun Wang, Yu Qiao, Li Li, Fei-Yue Wang
Large Language Models (LLMs) are increasingly integrated into diverse industries, posing substantial security risks due to unauthorized replication and misuse.
1 code implementation • 15 Jul 2024 • Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben, Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun Chen, Sizhe Yang, Peizhou Cao, Wenye Yu, Zichao Ye, Jialun Li, Junfeng Long, ZiRui Wang, Huiling Wang, Ying Zhao, Zhongying Tu, Yu Qiao, Dahua Lin, Jiangmiao Pang
Recent works have been exploring the scaling laws in the field of Embodied AI.
no code implementations • 11 Jul 2024 • Wenshuo Peng, Kaipeng Zhang, Yue Yang, Hao Zhang, Yu Qiao
Due to the limitations of these weak-paired samples, the pre-trained model is unable to mine all the knowledge from the pre-training data.
1 code implementation • 10 Jul 2024 • Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo
Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits.
no code implementations • 10 Jul 2024 • Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu
Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously with arbitrary up-sampling space and time scales through a unified video diffusion model.
1 code implementation • 8 Jul 2024 • Yinquan Lu, Wenhao Zhu, Lei LI, Yu Qiao, Fei Yuan
To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to extensive multilingual continual pre-training of the LLaMA series models, enabling translation support across more than 100 languages.
1 code implementation • 3 Jul 2024 • Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts.
Ranked #25 on Visual Question Answering on MM-Vet
1 code implementation • 26 Jun 2024 • Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei HUANG, Yali Wang, Tong Lu, LiMin Wang, Yu Qiao
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Ranked #1 on Long Term Action Anticipation on Ego4D (using extra training data)
1 code implementation • 18 Jun 2024 • Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, PengFei Liu
We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions.
1 code implementation • 17 Jun 2024 • Fangzhi Xu, Qiushi Sun, Kanzhi Cheng, Jun Liu, Yu Qiao, Zhiyong Wu
One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning.
1 code implementation • 17 Jun 2024 • Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs).
Ranked #60 on Visual Question Answering on MM-Vet
no code implementations • 17 Jun 2024 • Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo
Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense, closely aligning with human assessments; and (3) current T2I models are primarily focused on text-to-image translation, lacking profound reasoning regarding physical commonsense.
1 code implementation • 17 Jun 2024 • Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao
Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data.
1 code implementation • 17 Jun 2024 • Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao
To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL.
1 code implementation • 13 Jun 2024 • Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang
To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.
1 code implementation • 12 Jun 2024 • Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai
It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios.
1 code implementation • 12 Jun 2024 • Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms.
1 code implementation • 12 Jun 2024 • Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Zhongying Tu, Tong Lu, Yali Wang, LiMin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He, Jifeng Dai
In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
1 code implementation • 11 Jun 2024 • Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang
In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator.
1 code implementation • 11 Jun 2024 • Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
1 code implementation • 11 Jun 2024 • Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie zhou, Yu Qiao, Jifeng Dai
Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data.
1 code implementation • 9 Jun 2024 • Yan Ma, Yu Qiao, PengFei Liu
In the supplementary materials, we provide the MoPS code suite, along with 7.6k generated premises and 1k extended stories.
no code implementations • 6 Jun 2024 • Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang
To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolutions, aspect ratios, and lengths.
1 code implementation • 6 Jun 2024 • Xizhou Zhu, Xue Yang, Zhaokai Wang, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai
Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance.
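A minimal sketch of that routing idea — each pyramid level paired with a model of a different size so that the largest model sees the cheapest resolution — using stand-in convolutions; the actual backbones and feature fusion are assumptions left out here:

```python
import torch
import torch.nn.functional as F

def pyramid_features(image, models, scales=(0.25, 0.5, 1.0)):
    # Pair each resolution level with a model: here the first (assumed
    # largest) model gets the lowest resolution, balancing compute.
    feats = []
    for model, s in zip(models, scales):
        x = F.interpolate(image, scale_factor=s, mode='bilinear',
                          align_corners=False)
        feats.append(model(x))
    return feats

# Stand-ins for big/medium/small vision backbones:
models = [torch.nn.Conv2d(3, 8, 3, padding=1) for _ in range(3)]
feats = pyramid_features(torch.randn(1, 3, 224, 224), models)
```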
1 code implementation • 6 Jun 2024 • Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie zhou, Jifeng Dai
The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization.
1 code implementation • 5 Jun 2024 • Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao
Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions.
1 code implementation • 5 Jun 2024 • Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Yu Qiao, Lu Sheng
However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results.
1 code implementation • 1 Jun 2024 • Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li
To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object, respectively.
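The two prediction targets can be pictured as a shared encoder with two heads. A sketch under stated assumptions — the encoder interface, a feature-space frame target, and unit loss weights are all illustrative, not the paper's definitive design:

```python
import torch.nn as nn
import torch.nn.functional as F

class MPIObjective(nn.Module):
    # Joint objective: predict the transition frame between two keyframes
    # and detect the interaction object (as a normalized box).
    def __init__(self, encoder, dim):
        super().__init__()
        self.encoder = encoder                 # fuses keyframes + instruction
        self.frame_head = nn.Linear(dim, dim)  # stand-in frame decoder
        self.box_head = nn.Linear(dim, 4)      # (cx, cy, w, h) in [0, 1]

    def forward(self, key0, key1, text_emb, target_feat, target_box):
        z = self.encoder(key0, key1, text_emb)             # (batch, dim)
        frame_loss = F.mse_loss(self.frame_head(z), target_feat)
        box_loss = F.l1_loss(self.box_head(z).sigmoid(), target_box)
        return frame_loss + box_loss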
no code implementations • 31 May 2024 • Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations.
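The "learnable module inside a frozen backbone" pattern is straightforward to sketch: freeze the pretrained weights and train only a temporal-attention insert. Shapes and module design are hypothetical and may differ from the paper's:

```python
import torch.nn as nn

class MotionModule(nn.Module):
    # Trainable temporal attention appended after a frozen spatial block;
    # x is (batch, frames, dim), so attention runs across the frame axis.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out  # residual keeps the frozen model's behavior reachable

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False  # only the motion modules stay trainable
```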
1 code implementation • 29 May 2024 • Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
In this work, we introduce weak-to-strong search, framing the alignment of a large language model as a test-time greedy search to maximize the log-likelihood difference between small tuned and untuned models while sampling from the frozen large model.
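One greedy step of that search can be sketched abstractly: sample candidates from the frozen large model, score each by the small models' log-likelihood gap, and keep the best. The `sample`/`logprob` interface below is a hypothetical stand-in, not a real library API:

```python
def weak_to_strong_step(large_lm, small_tuned, small_untuned, prefix, k=8):
    # The frozen large model proposes candidate continuations; the pair of
    # small models (tuned vs. untuned) scores them via the likelihood gap.
    candidates = [large_lm.sample(prefix) for _ in range(k)]

    def reward(text):
        return small_tuned.logprob(text) - small_untuned.logprob(text)

    return max(candidates, key=reward)
```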
1 code implementation • 24 May 2024 • Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Xinyu Cai, Xin Li, Daocheng Fu, Bo Zhang, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yong liu, Yu Qiao
Experiments also demonstrate that as the memory bank expands, the Heuristic Process with only 1.8B parameters can inherit the knowledge from a GPT-4-powered Analytic Process and achieve continuous performance improvement.
no code implementations • 23 May 2024 • Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang
In this paper, we propose a plug-and-play framework, dubbed SearchLVLMs, for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge.
2 code implementations • 23 May 2024 • Chongjie Si, Xuehui Wang, Xue Yang, Zhengqin Xu, Qingyun Li, Jifeng Dai, Yu Qiao, Xiaokang Yang, Wei Shen
To tackle the diversity of dimensional spaces across different foundation models and provide a more precise representation of the changes within these spaces, this paper introduces FLoRA, a generalized parameter-efficient fine-tuning framework designed for parameter spaces of various dimensions.
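For orientation, the 2-D low-rank update that this line of work generalizes is the familiar LoRA delta sketched below; how FLoRA extends it to parameter tensors of other dimensions is not reproduced here:

```python
import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    # The standard 2-D low-rank update x @ (W + A B)^T used by LoRA;
    # FLoRA-style methods generalize this to higher-dimensional tensors.
    def __init__(self, d_out, d_in, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_in))  # zero-init: no drift at start

    def forward(self, W, x):
        return x @ (W + self.A @ self.B).T
```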
2 code implementations • 9 May 2024 • Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li
Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details.
2 code implementations • 1 May 2024 • Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu
Recent advances in language models have expanded the horizons of artificial intelligence across various domains, sparking inquiries into their potential for causal reasoning.
no code implementations • 30 Apr 2024 • Ziyan Chen, Jingwen He, Xinqi Lin, Yu Qiao, Chao Dong
Blind face restoration (BFR) on images has progressed significantly over the last several years, while real-world video face restoration (VFR), which is more challenging because it involves more complex face motions such as changing gaze directions and facial orientations, remains unsolved.
1 code implementation • 25 Apr 2024 • Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.
Ranked #10 on Visual Question Answering on MM-Vet v2
no code implementations • 24 Apr 2024 • Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation.
no code implementations • 14 Apr 2024 • Yu Qiao, Huy Q. Le, Mengchun Zhang, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong
First, we employ clustering on the local representations of each client, aiming to capture intra-class information based on these local clusters at a high level of granularity.
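That first step — clustering each client's local representations into fine-grained local prototypes — can be sketched with scikit-learn; the cluster count and the choice of KMeans are assumptions, not the paper's specified algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def local_prototypes(client_reps, n_clusters=5):
    # Cluster one client's representations; the centroids serve as
    # fine-grained local prototypes capturing intra-class structure.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(client_reps)
    return km.cluster_centers_

protos = local_prototypes(np.random.randn(200, 64))
```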
no code implementations • 10 Apr 2024 • Yu Qiao, Chaoning Zhang, Apurba Adhikary, Choong Seon Hong
Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks.
1 code implementation • 9 Apr 2024 • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution.
Ranked #19 on Visual Question Answering on MM-Vet
no code implementations • CVPR 2024 • Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun
We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that improves the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos.
1 code implementation • 3 Apr 2024 • Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
In this paper, we introduce Linear Attention Sequence Parallel (LASP), an efficient SP method tailored to linear attention-based language models.
no code implementations • CVPR 2024 • Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao
LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions, thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets.
no code implementations • 1 Apr 2024 • Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao
In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
1 code implementation • CVPR 2024 • Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji
Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.
1 code implementation • 29 Mar 2024 • Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao
We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
no code implementations • 28 Mar 2024 • Yutong Chen, Yifan Zhan, Zhihang Zhong, Wei Wang, Xiao Sun, Yu Qiao, Yinqiang Zheng
Neural rendering techniques have significantly advanced 3D human body modeling.
no code implementations • 28 Mar 2024 • Zeren Chen, Zhelun Shi, Xiaoya Lu, Lehan He, Sucheng Qian, Hao Shu Fang, Zhenfei Yin, Wanli Ouyang, Jing Shao, Yu Qiao, Cewu Lu, Lu Sheng
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments.
3 code implementations • 26 Mar 2024 • Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, FuKai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, JIA YU, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).
Ranked #5 on Long-Context Understanding on Ada-LEval (BestAnswer)
1 code implementation • 26 Mar 2024 • Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao
Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh).
1 code implementation • CVPR 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
Ranked #1 on Action Anticipation on EgoExoLearn (using extra training data)
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Video Question Answering on Perception Test
1 code implementation • 19 Mar 2024 • Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu
The acquisition of large-scale, high-quality data is a resource-intensive and time-consuming endeavor.
1 code implementation • 18 Mar 2024 • Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao
It is a long-standing goal to design a generalist embodied agent that can follow diverse instructions in human-like ways.
no code implementations • 14 Mar 2024 • Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang
To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others).
no code implementations • CVPR 2024 • Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen
In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background.
2 code implementations • CVPR 2024 • Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li
In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline.
2 code implementations • 11 Mar 2024 • Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain.
1 code implementation • 7 Mar 2024 • Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li
Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.
no code implementations • 5 Mar 2024 • Yu Qiao, Apurba Adhikary, Chaoning Zhang, Choong Seon Hong
Meanwhile, the non-independent and identically distributed (non-IID) challenge of data distribution between edge devices can further degrade the performance of models.
no code implementations • 4 Mar 2024 • Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo
We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.
1 code implementation • 4 Mar 2024 • Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage when processing high-resolution inputs.
1 code implementation • 3 Mar 2024 • Zishi Li, Xiaoxuan Ma, Qiuyan Shang, Wentao Zhu, Hai Ci, Yu Qiao, Yizhou Wang
Temporal repetition counting aims to quantify the repeated action cycles within a video.
no code implementations • 29 Feb 2024 • BoYu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
Finally, we blend external multimodal knowledge in the Adapt stage by inserting multimodal knowledge adaptation modules into the networks.
1 code implementation • 29 Feb 2024 • Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai
In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs.
Ranked #49 on Visual Question Answering on MM-Vet
no code implementations • 29 Feb 2024 • Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, JIA YU, Chaobin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, Conghui He
To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb.
1 code implementation • 29 Feb 2024 • Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong liu, Jing Shao
This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field.
no code implementations • 27 Feb 2024 • Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun
Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions.
no code implementations • 25 Feb 2024 • Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI.
Ranked #112 on Visual Question Answering on MM-Vet
no code implementations • 22 Feb 2024 • Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo
To bridge this "ideal-to-real" gap, this paper presents RobotScript, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language.
1 code implementation • 19 Feb 2024 • Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao
Recently, many versatile Multi-modal Large Language Models (MLLMs) have been emerging in quick succession.
1 code implementation • 19 Feb 2024 • Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
2 code implementations • 18 Feb 2024 • Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question-answering.
1 code implementation • 14 Feb 2024 • Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
Large Language Models (LLMs) are now commonplace in conversation applications.
1 code implementation • CVPR 2024 • Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo
Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.
2 code implementations • 8 Feb 2024 • Shikun Ban, Juling Fan, Xiaoxuan Ma, Wentao Zhu, Yu Qiao, Yizhou Wang
Estimating robot pose from RGB images is a crucial problem in computer vision and robotics.
Ranked #1 on Robot Pose Estimation on DREAM-dataset
1 code implementation • 8 Feb 2024 • Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.
Ranked #9 on Video Question Answering on MVBench
1 code implementation • 7 Feb 2024 • Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, WangMeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount.
2 code implementations • 1 Feb 2024 • Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao
In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text.
1 code implementation • 29 Jan 2024 • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension.
Ranked #14 on Visual Question Answering on MM-Vet v2
1 code implementation • 29 Jan 2024 • Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
CO2 attains high scalability even on extensive multi-node clusters constrained by very limited communication bandwidth.
no code implementations • 26 Jan 2024 • Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, LiMin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang
Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents.
no code implementations • 25 Jan 2024 • Huy Q. Le, Chu Myaet Thwal, Yu Qiao, Ye Lin Tun, Minh N. H. Nguyen, Choong Seon Hong
In this paper, we propose Multimodal Federated Cross Prototype Learning (MFCPL), a novel approach for MFL under severely missing modalities that constructs complete prototypes to provide diverse modality knowledge at the modality-shared level with cross-modal regularization and at the modality-specific level with a cross-modal contrastive mechanism.
no code implementations • CVPR 2024 • Fanghua Yu, Jinjin Gu, Zheyuan Li, JinFan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, Chao Dong
We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image restoration method that harnesses generative priors and the power of model scaling.
1 code implementation • 24 Jan 2024 • Guoxin Chen, Kexin Tang, Chao Yang, Fuying Ye, Yu Qiao, Yiming Qian
Moreover, existing reinforcement learning (RL) based methods overlook the structured relationships, underutilizing the potential of RL in structured reasoning.
1 code implementation • 22 Jan 2024 • Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao
In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety.
1 code implementation • 18 Jan 2024 • Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie zhou, Hongsheng Li, Yu Qiao, Jifeng Dai
Developing generative models for interleaved image-text data has both research and practical value.
1 code implementation • CVPR 2024 • Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang
More importantly, Vlogger can generate vlogs of over 5 minutes from open-world descriptions, without losing video coherence in script or actor.
2 code implementations • CVPR 2024 • Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai
The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.
3 code implementations • 5 Jan 2024 • Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao
We propose a novel Latent Diffusion Transformer, namely Latte, for video generation.
1 code implementation • 4 Jan 2024 • Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo
Charts play a vital role in data visualization, understanding data patterns, and informed decision-making.
no code implementations • CVPR 2024 • Yuan Xu, Xiaoxuan Ma, Jiajun Su, Wentao Zhu, Yu Qiao, Yizhou Wang
Experimental results demonstrate that HypoNet outperforms existing state-of-the-art probabilistic methods as a multi-hypothesis mesh estimator.
no code implementations • CVPR 2024 • Bo Zou, Chao Yang, Yu Qiao, Chengbin Quan, Youjian Zhao
In this paper, we are inspired by the human recognition and learning pattern and propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
2 code implementations • CVPR 2024 • Bin Fu, Fanghua Yu, Anran Liu, Zixuan Wang, Jie Wen, Junjun He, Yu Qiao
Based on this observation, we generalize diffusion methods to model the font generation process by separating the reverse diffusion process into three stages with different functions: the structure construction stage first generates structural information for the target character based on the source image, and the font transfer stage subsequently transforms the source font into the target font.
1 code implementation • CVPR 2024 • Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao
This paper is not motivated to seek innovation within the attention mechanism.
2 code implementations • CVPR 2024 • Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)
no code implementations • 21 Dec 2023 • Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao
Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner.
1 code implementation • 19 Dec 2023 • Siran Chen, Yue Ma, Yu Qiao, Yali Wang
It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision, and reconstructs the masked ones with the distinct spatio-temporal context across views.
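A minimal sketch of that masking-and-reconstruction objective, with a hypothetical `reconstructor` standing in for the actual spatio-temporal network:

```python
import torch
import torch.nn.functional as F

def masked_view_loss(view_feats, reconstructor, mask_ratio=0.3):
    # view_feats: (num_views, T, D). Randomly drop whole camera views,
    # reconstruct them from the remaining views' spatio-temporal context,
    # and supervise with the original features.
    mask = torch.rand(view_feats.size(0)) < mask_ratio
    if not mask.any():
        mask[torch.randint(view_feats.size(0), (1,))] = True
    inp = view_feats.clone()
    inp[mask] = 0.0                      # simulate missing views
    rec = reconstructor(inp)             # same shape as view_feats
    return F.mse_loss(rec[mask], view_feats[mask])
```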
1 code implementation • 19 Dec 2023 • Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao
To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language.
no code implementations • 15 Dec 2023 • Xu Liu, Tong Zhou, Yuanxin Wang, Yuping Wang, Qinjingwen Cao, Weizhi Du, Yonghuan Yang, Junjun He, Yu Qiao, Yiqing Shen
The advent of foundation models, which are pre-trained on vast datasets, has ushered in a new era of computer vision, characterized by their robustness and remarkable zero-shot generalization capabilities.
3 code implementations • 15 Dec 2023 • Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao
This paper is not motivated to seek innovation within the attention mechanism.
Ranked #1 on 3D Semantic Segmentation on ScanNet++ (using extra training data)
1 code implementation • 14 Dec 2023 • Wenhai Wang, Jiangwei Xie, Chuanyang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai
In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD).