Search Results for author: Yu Qiao

Found 304 papers, 170 papers with code

FANG-COVID: A New Large-Scale Benchmark Dataset for Fake News Detection in German

1 code implementation EMNLP (FEVER) 2021 Justus Mattern, Yu Qiao, Elma Kerz, Daniel Wiechmann, Markus Strohmaier

As the world continues to fight the COVID-19 pandemic, it is simultaneously fighting an ‘infodemic’ – a flood of disinformation and spread of conspiracy theories leading to health threats and the division of society.

Fake News Detection

Automated Classification of Written Proficiency Levels on the CEFR-Scale through Complexity Contours and RNNs

no code implementations EACL (BEA) 2021 Elma Kerz, Daniel Wiechmann, Yu Qiao, Emma Tseng, Marcus Ströbel

The key to the present paper is the combined use of what we refer to as ‘complexity contours’, a series of measurements of indices of L2 proficiency obtained by a computational tool that implements a sliding window technique, and recurrent neural network (RNN) classifiers that adequately capture the sequential information in those contours.
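As a rough illustration of the sliding-window technique described above, the sketch below computes a toy complexity contour using the type-token ratio as a stand-in index; the window size, step, and measure are illustrative assumptions, not the paper's actual settings:

```python
def complexity_contour(tokens, window=4, step=1):
    """Slide a fixed-size window over a token sequence and record one
    complexity index per window (here: type-token ratio, a stand-in for
    the L2 proficiency indices used in the paper)."""
    contour = []
    for start in range(0, len(tokens) - window + 1, step):
        win = tokens[start:start + window]
        contour.append(len(set(win)) / len(win))  # unique tokens / window size
    return contour

tokens = "the cat sat on the mat and the dog sat too".split()
print(complexity_contour(tokens))  # dips where a window repeats a token
```

The resulting sequence of per-window scores is the kind of sequential input an RNN classifier would then consume.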

The Best of Both Worlds: Combining Engineered Features with Transformers for Improved Mental Health Prediction from Reddit Posts

no code implementations SMM4H (COLING) 2022 Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

In recent years, there has been increasing interest in the application of natural language processing and machine learning techniques to the detection of mental health conditions (MHC) based on social media data.

MANTIS at SMM4H’2022: Pre-Trained Language Models Meet a Suite of Psycholinguistic Features for the Detection of Self-Reported Chronic Stress

no code implementations SMM4H (COLING) 2022 Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

This paper describes our submission to Social Media Mining for Health (SMM4H) 2022 Shared Task 8, aimed at detecting self-reported chronic stress on Twitter.

Language that Captivates the Audience: Predicting Affective Ratings of TED Talks in a Multi-Label Classification Task

no code implementations EACL (WASSA) 2021 Elma Kerz, Yu Qiao, Daniel Wiechmann

The aim of the paper is twofold: (1) to automatically predict the ratings assigned by viewers to 14 categories available for TED talks in a multi-label classification task and (2) to determine what types of features drive classification accuracy for each of the categories.

Multi-Label Classification

A Language-Based Approach to Fake News Detection Through Interpretable Features and BRNN

no code implementations RDSM (COLING) 2020 Yu Qiao, Daniel Wiechmann, Elma Kerz

We demonstrate that our approach is promising, as it achieves results on these two datasets comparable to those of the best-performing black-box models reported in the literature.

Explainable Models Fake News Detection +1

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

1 code implementation ECCV 2020 Xiao Zhang, Rui Zhao, Yu Qiao, Hongsheng Li

To address this problem, this paper introduces novel Radial Basis Function (RBF) distances to replace the commonly used inner products in the softmax loss function, such that it can adaptively assign losses to regularize the intra-class and inter-class distances by reshaping the relative differences, thus creating more representative class prototypes to improve optimization.
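A minimal sketch of the core idea: inner-product logits are replaced by an RBF kernel over distances to class prototypes, so that closer prototypes yield larger logits. The `gamma` and `scale` values here are illustrative hyperparameters, not the paper's settings:

```python
import numpy as np

def rbf_softmax_logits(x, prototypes, gamma=1.0, scale=4.0):
    """Logits from RBF distances to class prototypes instead of inner products."""
    d2 = ((x[None, :] - prototypes) ** 2).sum(axis=1)  # squared distance to each prototype
    return scale * np.exp(-d2 / gamma)                 # RBF kernel as similarity

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

prototypes = np.array([[0.0, 0.0], [3.0, 3.0]])  # one prototype per class
p = softmax(rbf_softmax_logits(np.array([0.1, -0.1]), prototypes))
print(p)  # the nearest prototype (class 0) receives the higher probability
```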

Mining Inter-Video Proposal Relations for Video Object Detection

1 code implementation ECCV 2020 Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao

Recent studies have shown that aggregating contextual information from proposals in different frames can clearly enhance the performance of video object detection.

object-detection Relation Network +1

DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

no code implementations 28 Sep 2023 Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao

Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability.

Autonomous Driving Common Sense Reasoning +1

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

no code implementations 26 Sep 2023 Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model.

Video Generation Video Super-Resolution

StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

1 code implementation 20 Sep 2023 Renqiu Xia, Bo Zhang, Haoyang Peng, Ning Liao, Peng Ye, Botian Shi, Junchi Yan, Yu Qiao

Charts are common in literature across different scientific fields, conveying rich information easily accessible to readers.

Language Modelling Large Language Model +1

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

1 code implementation 19 Sep 2023 Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao

Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks.

3D Object Detection Autonomous Driving +3

HAT: Hybrid Attention Transformer for Image Restoration

2 code implementations 11 Sep 2023 Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, Chao Dong

In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model.

Image Compression Image Denoising +2

ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

1 code implementation 11 Sep 2023 Bo Zhang, Xinyu Cai, Jiakang Yuan, Donglin Yang, Jianfei Guo, Xiangchao Yan, Renqiu Xia, Botian Shi, Min Dou, Tao Chen, Si Liu, Junchi Yan, Yu Qiao

Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), posing a challenge: an AD model relying on previous-domain knowledge can hardly be deployed directly to a new domain without additional costs.

Autonomous Driving Domain Generalization

A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

1 code implementation 7 Sep 2023 Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao

To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation.

Organ Segmentation

SEAL: A Framework for Systematic Evaluation of Real-World Super-Resolution

1 code implementation 6 Sep 2023 Wenlong Zhang, Xiaohui Li, Xiangyu Chen, Yu Qiao, Xiao-Ming Wu, Chao Dong

Next, we propose a coarse-to-fine evaluation protocol to measure the distributed and relative performance of real-SR methods on the test set.


OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

1 code implementation 25 Aug 2023 Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

To tackle this issue, we introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings while maintaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters.

Common Sense Reasoning Large Language Model +2
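For orientation, a plain uniform symmetric weight quantizer is sketched below. OmniQuant goes further by learning parameters such as the weight-clipping threshold; this toy version only hints at that with a fixed `clip` argument, and the bit width and values are illustrative:

```python
import numpy as np

def quantize_weights(w, n_bits=4, clip=1.0):
    """Uniform symmetric quantization with a clipping threshold.
    OmniQuant would optimize `clip` (among other parameters); here it is fixed."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = clip / qmax                   # step size of the quantization grid
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                      # dequantized ("fake-quantized") weights

w = np.array([0.05, -0.3, 0.7, 1.4])
w_q = quantize_weights(w, n_bits=4, clip=1.0)
print(w_q)  # values beyond the clip threshold saturate at 1.0
```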

MGMAE: Motion Guided Masking for Video Masked Autoencoding

no code implementations ICCV 2023 Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, LiMin Wang

Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporal consistent cubes from videos.

Optical Flow Estimation Representation Learning

Foundation Model is Efficient Multimodal Multitask Model Selector

1 code implementation 11 Aug 2023 Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering.

Model Selection Question Answering +1

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

1 code implementation 7 Aug 2023 Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach.

Visual Reasoning

Scaling Data Generation in Vision-and-Language Navigation

1 code implementation ICCV 2023 Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.

Imitation Learning Vision and Language Navigation +1

Scaling TransNormer to 175 Billion Parameters

no code implementations 27 Jul 2023 Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo, Yu Qiao, Yiran Zhong

TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization.

Language Modelling Large Language Model

FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning

no code implementations 25 Jul 2023 Huy Q. Le, Minh N. H. Nguyen, Chu Myaet Thwal, Yu Qiao, Chaoning Zhang, Choong Seon Hong

Bringing this concept into a system, we develop a distillation-based multimodal embedding knowledge transfer mechanism, namely FedMEKT, which allows the server and clients to exchange the joint knowledge of their learning models extracted from a small multimodal proxy dataset.

Federated Learning Human Activity Recognition +1

Meta-Transformer: A Unified Framework for Multimodal Learning

1 code implementation 20 Jul 2023 Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

Multimodal learning aims to build models that can process and relate information from multiple modalities.

Time Series

Boosting Federated Learning Convergence with Prototype Regularization

no code implementations 20 Jul 2023 Yu Qiao, Huy Q. Le, Choong Seon Hong

As a distributed machine learning technique, federated learning (FL) requires clients to collaboratively train a shared model with an edge server without leaking their local data.

Federated Learning

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

1 code implementation 14 Jul 2023 Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao

In this paper, we explore the potential of using a large language model (LLM) to understand the driving environment in a human-like manner and analyze its ability to reason, interpret, and memorize when facing complex scenarios.

Autonomous Driving Common Sense Reasoning +3

LimSim: A Long-term Interactive Multi-scenario Traffic Simulator

1 code implementation 13 Jul 2023 Licheng Wen, Daocheng Fu, Song Mao, Pinlong Cai, Min Dou, Yikang Li, Yu Qiao

With the growing popularity of digital twin and autonomous driving in transportation, the demand for simulation systems capable of generating high-fidelity and reliable scenarios is increasing.

Autonomous Driving

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

1 code implementation 10 Jul 2023 Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai

With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost.

Image Animation

JourneyDB: A Benchmark for Generative Image Understanding

no code implementations 3 Jul 2023 Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

We further design 4 benchmarks to quantify the performance of generated image understanding in terms of both content and style interpretation.

Image Captioning Question Answering +2

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

2 code implementations 25 Jun 2023 Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong

Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM.

Image Segmentation Instance Segmentation +1
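The encoder-distillation idea can be sketched with a toy linear "student" regressing onto frozen teacher embeddings via MSE. The real SAM encoders are deep ViTs, so the linear maps, sizes, and learning rate below are purely illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def distill_step(student_w, images, teacher_embed, lr=0.1):
    """One gradient step minimizing MSE between student and teacher embeddings."""
    pred = images @ student_w
    grad = 2 * images.T @ (pred - teacher_embed) / len(images)
    return student_w - lr * grad

teacher_w = rng.normal(size=(8, 4))       # stand-in for the frozen heavy encoder
images = rng.normal(size=(32, 8))         # toy "image" features
teacher_embed = images @ teacher_w        # frozen teacher outputs (targets)

student_w = np.zeros((8, 4))              # lightweight student, trained from scratch
initial_mse = ((images @ student_w - teacher_embed) ** 2).mean()
for _ in range(200):
    student_w = distill_step(student_w, images, teacher_embed)
final_mse = ((images @ student_w - teacher_embed) ** 2).mean()
```

Because the student is trained only to match embeddings, it stays compatible with any downstream module (here, SAM's mask decoder) that consumed the teacher's embeddings.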

Align, Adapt and Inject: Sound-guided Unified Image Generation

no code implementations 20 Jun 2023 Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

Then, we propose the audio adapter to adapt audio representation into an audio token enriched with specific semantics, which can be injected into a frozen T2I model flexibly.

Image Generation Retrieval +1

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

no code implementations 15 Jun 2023 Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, Hongsheng Li

Video Question Answering (VideoQA) has been significantly advanced from the scaling of recent Large Language Models (LLMs).

Ranked #2 on Temporal/Causal QA on NExT-QA (using extra training data)

Domain Generalization Retrieval +2

Robustness of SAM: Segment Anything Under Corruptions and Beyond

no code implementations 13 Jun 2023 Yu Qiao, Chaoning Zhang, Taegoo Kang, Donghun Kim, Chenshuang Zhang, Choong Seon Hong

Interpreting the effects of synthetic corruption as style changes, we then conduct a comprehensive evaluation of its robustness against 15 types of common corruption.

Style Transfer

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling

1 code implementation 2 Jun 2023 Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang

In this paper, we propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a recently-developed denoising diffusion generative model.

Denoising Semantic Segmentation

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

1 code implementation 1 Jun 2023 Jiakang Yuan, Bo Zhang, Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

It is a long-term vision for Autonomous Driving (AD) community that the perception models can learn from a large-scale point cloud dataset, to obtain unified representations that can achieve promising results on different tasks or benchmarks.

Autonomous Driving Point Cloud Pre-training

DiffRoom: Diffusion-based High-Quality 3D Room Reconstruction and Generation with Occupancy Prior

no code implementations 1 Jun 2023 Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, Hongsheng Li

We present DiffRoom, a novel framework for tackling the problem of high-quality 3D indoor room reconstruction and generation, both of which are challenging due to the complexity and diversity of the room geometry.

Image Generation

DiffRate : Differentiable Compression Rate for Efficient Vision Transformers

1 code implementation ICCV 2023 Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, Ping Luo

Token compression aims to speed up large-scale vision transformers (e.g., ViTs) by pruning (dropping) or merging tokens.

Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

no code implementations 24 May 2023 Jinjin Gu, Xianzheng Ma, Xiangtao Kong, Yu Qiao, Chao Dong

A prevailing perspective in deep learning encourages the use of highly complex training data, with the expectation that a richer image content knowledge will facilitate overcoming the generalization problem.

Rain Removal

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

1 code implementation 24 May 2023 Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.

Image Captioning Language Modelling +3

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation 22 May 2023 Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Video Understanding

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

1 code implementation 18 May 2023 Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, Hongsheng Li

This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks.

Language Modelling Large Language Model +1

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations 9 May 2023 Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Unlike existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions, significantly improving the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than two.

Language Modelling

Causal Discovery with Unobserved Variables: A Proxy Variable Approach

no code implementations 9 May 2023 Mingzhou Liu, Xinwei Sun, Yu Qiao, Yizhou Wang

Our observation is that discretizing continuous variables can lead to serious errors and compromise the power of the proxy.

Causal Discovery Causal Identification

LEO: Generative Latent Image Animator for Human Video Synthesis

2 code implementations 6 May 2023 Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, Yu Qiao

Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance.

Disentanglement Video Editing

Long-Term Rhythmic Video Soundtracker

1 code implementation 2 May 2023 Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao

To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms.

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

3 code implementations 28 Apr 2023 Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

Instruction Following Optical Character Recognition (OCR) +6

Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation

no code implementations 24 Apr 2023 Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu

To mitigate those limitations, we propose Hierarchical Diffusion Autoencoders (HDAE) that exploit the fine-grained-to-abstract and low-level-to-high-level feature hierarchy for the latent space of diffusion models.

Image Generation Image Manipulation +1

Perception Imitation: Towards Synthesis-free Simulator for Autonomous Vehicles

no code implementations 19 Apr 2023 Xiaoliang Ju, Yiyang Sun, Yiming Hao, Yikang Li, Yu Qiao, Hongsheng Li

We propose a perception imitation method to simulate results of a certain perception model, and discuss a new heuristic route of autonomous driving simulator without data synthesis.

Autonomous Driving

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

1 code implementation CVPR 2023 LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).

Ranked #1 on Temporal Action Localization on FineAction (using extra training data)

Action Classification Action Recognition +2

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

1 code implementation ICCV 2023 Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao

Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.

Ranked #1 on Zero-Shot Video Retrieval on LSMDC (using extra training data)

Action Classification Action Recognition +5

Prototype Helps Federated Learning: Towards Faster Convergence

no code implementations 22 Mar 2023 Yu Qiao, Seong-Bae Park, Sun Moo Kang, Choong Seon Hong

In this paper, a prototype-based federated learning framework is proposed, which can achieve better inference performance with only a few changes to the last global iteration of the typical federated learning process.

Federated Learning
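A minimal sketch of the prototype exchange underlying such frameworks: each client sends class-mean embeddings, and the server averages them per class. The 1-D embeddings, toy data, and helper names below are illustrative assumptions, not the paper's actual protocol:

```python
import numpy as np

def local_prototypes(embeddings, labels, num_classes):
    """Per-client class prototypes: mean embedding of each locally present class."""
    return {c: embeddings[labels == c].mean(axis=0)
            for c in range(num_classes) if (labels == c).any()}

def aggregate(protos_per_client):
    """Server side: average each class prototype across the clients that have it."""
    agg = {}
    for protos in protos_per_client:
        for c, p in protos.items():
            agg.setdefault(c, []).append(p)
    return {c: np.mean(ps, axis=0) for c, ps in agg.items()}

# two toy clients with 1-D embeddings, both holding only class 0
c1 = local_prototypes(np.array([[1.0], [3.0]]), np.array([0, 0]), num_classes=2)
c2 = local_prototypes(np.array([[5.0]]), np.array([0]), num_classes=2)
global_protos = aggregate([c1, c2])
print(global_protos[0])  # mean of client means: ([2.0] + [5.0]) / 2 = [3.5]
```

Only prototypes (not raw data) cross the network, which is why such schemes can cut communication while still steering local models toward a shared representation.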

SCPNet: Semantic Scene Completion on Point Cloud

1 code implementation CVPR 2023 Zhaoyang Xia, Youquan Liu, Xin Li, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao

We propose a simple yet effective label rectification strategy, which uses off-the-shelf panoptic segmentation labels to remove the traces of dynamic objects in completion labels, greatly improving the performance of deep models especially for those moving objects.

3D Semantic Scene Completion Knowledge Distillation +2

Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields

1 code implementation 10 Mar 2023 Ziteng Cui, Lin Gu, Xiao Sun, Yu Qiao, Tatsuya Harada

Scenes captured under common low-light conditions are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF).

Rethinking Range View Representation for LiDAR Segmentation

no code implementations ICCV 2023 Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, Ziwei Liu

We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI.

3D Semantic Segmentation Autonomous Driving +3

FCN+: Global Receptive Convolution Makes FCN Great Again

no code implementations 8 Mar 2023 Zhongying Deng, Xiaoyu Ren, Jin Ye, Junjun He, Yu Qiao

The motivation of GRC is that different channels of a convolutional filter can have different grid sampling locations across the whole input feature map.

Semantic Segmentation

OpenICL: An Open-Source Framework for In-context Learning

2 code implementations 6 Mar 2023 Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Jiangtao Feng, Jingjing Xu, Yu Qiao, Zhiyong Wu

However, the implementation of ICL is sophisticated due to the diverse retrieval and inference methods involved, as well as the varying pre-processing requirements for different models, datasets, and tasks.

Language Modelling Large Language Model +3

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

2 code implementations CVPR 2023 Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, Peng Gao

Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge.

Few-Shot Learning Representation Learning

Uncertainty-Estimation with Normalized Logits for Out-of-Distribution Detection

no code implementations 15 Feb 2023 Mouxiao Huang, Yu Qiao

However, neural networks often suffer from the overconfidence issue, producing high confidence for OOD data that are never seen during training and may be irrelevant to the training data, namely the in-distribution (ID) data.

Autonomous Driving Medical Diagnosis +2

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

1 code implementation CVPR 2023 Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie

The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities.

Open Vocabulary Semantic Segmentation Semantic Segmentation

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

1 code implementation CVPR 2023 Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, Wenping Wang

For the first time, our pre-trained network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08% mIoU on nuScenes and ScanNet, respectively.

3D Semantic Segmentation Contrastive Learning +4

Policy Pre-training for Autonomous Driving via Self-supervised Geometric Modeling

1 code implementation 3 Jan 2023 Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, Yu Qiao

Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving.

Autonomous Driving Decision Making

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

no code implementations CVPR 2023 Yurui Zhu, Tianyu Wang, Xueyang Fu, Xuanyu Yang, Xin Guo, Jifeng Dai, Yu Qiao, Xiaowei Hu

Inspired by this observation, we design an efficient unified framework with a two-stage training strategy to explore the weather-general and weather-specific features.

Image Restoration

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

no code implementations ICCV 2023 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao

The strong performance of Vision Transformers (ViTs) on image tasks has prompted research into adapting image ViTs for video tasks.

Video Understanding

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation

no code implementations ICCV 2023 Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, Yu Qiao

To tackle this problem, we propose a concise Hybrid Temporal-scale Multimodal Learning (HTML) framework, which can effectively align lingual and visual features to discover core object semantics in the video, by learning multimodal interaction hierarchically from different temporal scales.

Referring Video Object Segmentation Semantic Segmentation +1

Neural Transformation Fields for Arbitrary-Styled Font Generation

1 code implementation CVPR 2023 Bin Fu, Junjun He, Jianjun Wang, Yu Qiao

Few-shot font generation (FFG), which aims at generating font images from a few samples, has become an emerging topic in recent years due to its academic and commercial value.

Disentanglement Font Generation

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

no code implementations CVPR 2023 Jia Zeng, Li Chen, Hanming Deng, Lewei Lu, Junchi Yan, Yu Qiao, Hongyang Li

Specifically, a set of queries are leveraged to locate the instance-level areas for masked feature generation, to intensify feature representation ability in these areas.

3D Object Detection Knowledge Distillation +2

DegAE: A New Pretraining Paradigm for Low-Level Vision

no code implementations CVPR 2023 Yihao Liu, Jingwen He, Jinjin Gu, Xiangtao Kong, Yu Qiao, Chao Dong

However, we argue that pretraining is more significant for high-cost tasks, where data acquisition is more challenging.


Multi-view Spectral Polarization Propagation for Video Glass Segmentation

no code implementations ICCV 2023 Yu Qiao, Bo Dong, Ao Jin, Yu Fu, Seung-Hwan Baek, Felix Heide, Pieter Peers, Xiaopeng Wei, Xin Yang

In this paper, we present the first polarization-guided video glass segmentation propagation solution (PGVS-Net) that can robustly and coherently propagate glass segmentation in RGB-P video sequences.

Image Segmentation Semantic Segmentation

Content Rating Classification for Fan Fiction

no code implementations 23 Dec 2022 Yu Qiao, James Pope

The problem is to take fan fiction text and determine the appropriate content rating.

Binary Classification Classification

Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation

1 code implementation 20 Dec 2022 Fei Yuan, Yinquan Lu, Wenhao Zhu, Lingpeng Kong, Lei LI, Yu Qiao, Jingjing Xu

To address the needs of learning representations for all languages in a unified space, we propose a novel efficient training recipe, upon which we build an effective detachable model, Lego-MT.

Machine Translation Translation

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency

no code implementations CVPR 2023 Mingye Xu, Mutian Xu, Tong He, Wanli Ouyang, Yali Wang, Xiaoguang Han, Yu Qiao

Besides, such scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, requiring the model to learn consistent representations from unmasked areas.

object-detection Object Detection +2

Planning-oriented Autonomous Driving

1 code implementation CVPR 2023 Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li

Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning.

Autonomous Driving Philosophy

Improving the Generalizability of Text-Based Emotion Detection by Leveraging Transformers with Psycholinguistic Features

no code implementations 19 Dec 2022 Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

In recent years, there has been increased interest in building predictive models that harness natural language processing and machine learning techniques to detect emotions from various text sources, including social media posts, micro-blogs or news articles.

Emotion Recognition Transfer Learning

MANTIS at TSAR-2022 Shared Task: Improved Unsupervised Lexical Simplification with Pretrained Encoders

no code implementations 19 Dec 2022 Xiaofei Li, Daniel Wiechmann, Yu Qiao, Elma Kerz

In this paper we present our contribution to the TSAR-2022 Shared Task on Lexical Simplification of the EMNLP 2022 Workshop on Text Simplification, Accessibility, and Readability.

Language Modelling Lexical Simplification +4

Exploring Hybrid and Ensemble Models for Multiclass Prediction of Mental Health Status on Social Media

no code implementations19 Dec 2022 Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz

In recent years, there has been a surge of interest in research on automatic mental health detection (MHD) from social media data leveraging advances in natural language processing and machine learning techniques.

Binary Classification

(Psycho-)Linguistic Features Meet Transformer Models for Improved Explainable and Controllable Text Simplification

no code implementations19 Dec 2022 Yu Qiao, Xiaofei Li, Daniel Wiechmann, Elma Kerz

State-of-the-art text simplification (TS) systems adopt end-to-end neural network models to directly generate the simplified version of the input text, and usually function as a black box.

Text Simplification

Diff-Font: Diffusion Model for Robust One-Shot Font Generation

1 code implementation12 Dec 2022 Haibin He, Xinyuan Chen, Chaoyue Wang, Juhua Liu, Bo Du, Dacheng Tao, Yu Qiao

Specifically, a large stroke-wise dataset is constructed, and a stroke-wise diffusion model is proposed to preserve the structure and completeness of each generated character.

Font Generation

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

1 code implementation6 Dec 2022 Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

Ranked #1 on Temporal Action Localization on THUMOS’14 (using extra training data)

Action Classification Contrastive Learning +8
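The pretraining recipe above pairs a masked video modeling objective with video-language contrastive learning. As a rough illustration (not InternVideo's actual code; the function names, the MSE-style masked loss, and the fixed mixing weight `alpha` are all assumptions — the paper coordinates the two frameworks in a learnable manner), the two objectives might be combined like this:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    idx = np.arange(len(v))                         # matching pairs on the diagonal
    log_p_v2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2v = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (log_p_v2t[idx, idx].mean() + log_p_t2v[idx, idx].mean())

def masked_modeling_loss(pred_patches, target_patches, mask):
    """MSE computed only on the masked patch positions."""
    diff = (pred_patches - target_patches) ** 2
    return (diff.mean(axis=-1) * mask).sum() / mask.sum()

def combined_loss(mm_loss, nce_loss, alpha=0.5):
    """Fixed convex combination; InternVideo instead learns the coordination."""
    return alpha * mm_loss + (1 - alpha) * nce_loss
```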

ResFormer: Scaling ViTs with Multi-Resolution Training

1 code implementation CVPR 2023 Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang

We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions.

Action Recognition Image Classification +4

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

1 code implementation CVPR 2023 Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai

It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models.

Ranked #2 on Object Detection on LVIS v1.0 minival (using extra training data)

Image Classification Long-tailed Object Detection +3

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

2 code implementations CVPR 2023 Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance.

Language Modelling Multi-Task Learning

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

2 code implementations17 Nov 2022 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao

UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format.

Video Understanding

Stare at What You See: Masked Image Modeling without Reconstruction

no code implementations CVPR 2023 Hongwei Xue, Peng Gao, Hongyang Li, Yu Qiao, Hao Sun, Houqiang Li, Jiebo Luo

However, unlike low-level features such as pixel values, we argue that the features extracted by powerful teacher models already encode rich semantic correlation across regions of an intact image. This raises one question: is reconstruction necessary in Masked Image Modeling (MIM) with a teacher model?
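The argument above suggests supervising masked tokens with the teacher's features rather than reconstructing pixels. A minimal feature-alignment sketch (toy numpy code; the function name and the cosine-distance choice are illustrative assumptions, not the paper's exact method):

```python
import numpy as np

def feature_alignment_loss(student_feats, teacher_feats, mask):
    """Cosine-distance alignment of student vs. frozen-teacher token features.

    student_feats, teacher_feats: (num_tokens, dim) token features
    mask: (num_tokens,) 1.0 where the token was masked in the student's input
    Only masked positions contribute, mirroring the idea of supervising
    MIM with teacher features instead of raw pixels.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    cos_sim = (s * t).sum(axis=1)          # per-token cosine similarity
    return ((1.0 - cos_sim) * mask).sum() / mask.sum()
```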

Demystify Transformers & Convolutions in Modern Image Deep Networks

1 code implementation10 Nov 2022 Jifeng Dai, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Xiaowei Hu

Although the novel feature transformation designs are often claimed as the source of gain, some backbones may benefit from advanced engineering techniques, which makes it hard to identify the real gain from the key feature transformation operators.

Image Deep Networks Spatial Token Mixer

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

2 code implementations CVPR 2023 Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.

Ranked #1 on Instance Segmentation on COCO test-dev (using extra training data)

2D Object Detection Classification +4

VideoPipe 2022 Challenge: Real-World Video Understanding for Urban Pipe Inspection

no code implementations20 Oct 2022 Yi Liu, Xuan Zhang, Ying Li, Guixin Liang, Yabing Jiang, Lixia Qiu, Haiping Tang, Fei Xie, Wei Yao, Yi Dai, Yu Qiao, Yali Wang

For this reason, we propose to advance research areas of video understanding, with a shift from traditional action recognition to industrial anomaly analysis.

Temporal Defect Localization Video Defect Classification

Wider and Higher: Intensive Integration and Global Foreground Perception for Image Matting

no code implementations13 Oct 2022 Yu Qiao, Ziqi Wei, Yuhao Liu, Yuxin Wang, Dongsheng Zhou, Qiang Zhang, Xin Yang

This paper reviews recent deep-learning-based matting research and presents the "wider and higher" motivation behind our approach to image matting.

Image Matting

Hierarchical and Progressive Image Matting

no code implementations13 Oct 2022 Yu Qiao, Yuhao Liu, Ziqi Wei, Yuxin Wang, Qiang Cai, Guofeng Zhang, Xin Yang

In this paper, we propose an end-to-end Hierarchical and Progressive Attention Matting Network (HAttMatting++), which can better predict the opacity of the foreground from single RGB images without additional input.

Image Matting SSIM
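Matting networks such as the one above predict a per-pixel opacity matte, which composites a foreground over a background via the standard equation I = alpha * F + (1 - alpha) * B. A minimal compositing sketch:

```python
import numpy as np

def composite(alpha, foreground, background):
    """Alpha compositing: I = alpha * F + (1 - alpha) * B.

    alpha: (H, W) opacity matte in [0, 1], as predicted by a matting network
    foreground, background: (H, W, 3) images
    """
    a = alpha[..., None]                    # broadcast matte over RGB channels
    return a * foreground + (1.0 - a) * background
```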

Collaboration of Pre-trained Models Makes Better Few-shot Learner

no code implementations25 Sep 2022 Renrui Zhang, Bohao Li, Wei Zhang, Hao Dong, Hongsheng Li, Peng Gao, Yu Qiao

In this paper, we propose CoMo, a Collaboration of pre-trained Models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning.

Few-Shot Learning Representation Learning

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

2 code implementations12 Sep 2022 Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao

As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important.

Autonomous Driving

Recurrent Bilinear Optimization for Binary Neural Networks

2 code implementations4 Sep 2022 Sheng Xu, Yanjing Li, Tiancheng Wang, Teli Ma, Baochang Zhang, Peng Gao, Yu Qiao, Jinhu Lv, Guodong Guo

To address this issue, Recurrent Bilinear Optimization is proposed to improve the learning process of BNNs (RBONNs) by associating the intrinsic bilinear variables in the back propagation process.

Object Detection

Frozen CLIP Models are Efficient Video Learners

2 code implementations6 Aug 2022 Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos.

Ranked #20 on Action Classification on Kinetics-400 (using extra training data)

Action Classification Video Recognition

Vision-Centric BEV Perception: A Survey

1 code implementation4 Aug 2022 Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, Xinge Zhu

In recent years, vision-centric Bird's Eye View (BEV) perception has garnered significant interest from both industry and academia due to its inherent advantages, such as providing an intuitive representation of the world and being conducive to data fusion.

GenText: Unsupervised Artistic Text Generation via Decoupled Font and Texture Manipulation

no code implementations20 Jul 2022 Qirui Huang, Bin Fu, Aozhong Zhang, Yu Qiao

Specifically, our current work incorporates three stages (stylization, destylization, and font transfer) into a unified platform with a single powerful encoder network and two separate style generator networks: one for font transfer, the other for stylization and destylization.

Style Transfer Text Style Transfer

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

1 code implementation19 Jul 2022 Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

On top of that, the performance of Tip-Adapter can be further boosted to be state-of-the-art on ImageNet by fine-tuning the cache model for 10x fewer epochs than existing methods, which is both effective and efficient.

Retrieval Transfer Learning
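Tip-Adapter's cache model stores few-shot image features as keys and their one-hot labels as values, then adds similarity-weighted label evidence to CLIP's zero-shot logits without any training. A hedged numpy sketch of that idea (the `alpha` and `beta` values here are illustrative defaults, not the tuned hyperparameters):

```python
import numpy as np

def tip_adapter_logits(test_feat, cache_keys, cache_values, clip_logits,
                       alpha=1.0, beta=5.5):
    """Training-free cache-model logits added to zero-shot CLIP logits.

    test_feat:    (D,) L2-normalized test image feature
    cache_keys:   (N, D) L2-normalized few-shot training features
    cache_values: (N, C) one-hot labels of the few-shot examples
    clip_logits:  (C,) zero-shot logits from CLIP's text classifier
    """
    affinity = cache_keys @ test_feat              # cosine similarities
    weights = np.exp(-beta * (1.0 - affinity))     # sharpened similarity
    cache_logits = weights @ cache_values          # (C,) label evidence
    return clip_logits + alpha * cache_logits
```

Fine-tuning would update `cache_keys` as learnable parameters; the training-free variant uses them as-is.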

HQANN: Efficient and Robust Similarity Search for Hybrid Queries with Structured and Unstructured Constraints

no code implementations16 Jul 2022 Wei Wu, Junlin He, Yu Qiao, Guoheng Fu, Li Liu, Jin Yu

The in-memory approximate nearest neighbor search (ANNS) algorithms have achieved great success for fast high-recall query processing, but are extremely inefficient when handling hybrid queries with unstructured (i.e., feature vectors) and structured (i.e., related attributes) constraints.
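A hybrid query combines vector similarity with structured attribute constraints. The toy brute-force sketch below (not HQANN's actual index or API) fuses the L2 vector distance with a large penalty per violated constraint, so attribute-matching items always rank first:

```python
import numpy as np

def hybrid_search(query_vec, query_attrs, vectors, attrs, k=1, penalty=1e6):
    """Brute-force hybrid nearest-neighbor search over a small collection.

    vectors: iterable of feature vectors; attrs: matching list of dicts
    mapping attribute name -> value. Returns the indices of the top-k items
    by fused score (vector distance + penalty * number of attribute misses).
    """
    scores = []
    for vec, item_attrs in zip(vectors, attrs):
        dist = float(np.linalg.norm(query_vec - vec))
        mismatches = sum(1 for key, val in query_attrs.items()
                         if item_attrs.get(key) != val)
        scores.append(dist + penalty * mismatches)
    order = np.argsort(scores)
    return order[:k].tolist()
```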

CP3: Unifying Point Cloud Completion by Pretrain-Prompt-Predict Paradigm

no code implementations12 Jul 2022 Mingye Xu, Yali Wang, Yihao Liu, Tong He, Yu Qiao

Inspired by prompting approaches from NLP, we creatively reinterpret point cloud generation and refinement as the prompting and predicting stages, respectively.

Point Cloud Completion

Level 2 Autonomous Driving on a Single Device: Diving into the Devils of Openpilot

no code implementations16 Jun 2022 Li Chen, Tutian Tang, Zhitian Cai, Yang Li, Penghao Wu, Hongyang Li, Jianping Shi, Junchi Yan, Yu Qiao

Equipped with a wide span of sensors, predominant autonomous driving solutions are becoming more modular-oriented for safe system design.

Autonomous Driving

Siamese Image Modeling for Self-Supervised Vision Representation Learning

2 code implementations CVPR 2023 Chenxin Tao, Xizhou Zhu, Weijie Su, Gao Huang, Bin Li, Jie Zhou, Yu Qiao, Xiaogang Wang, Jifeng Dai

Driven by these analysis, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations.

Representation Learning Self-Supervised Learning +1

You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction

1 code implementation30 May 2022 Ziteng Cui, Kunchang Li, Lin Gu, Shenghan Su, Peng Gao, Zhengkai Jiang, Yu Qiao, Tatsuya Harada

Challenging illumination conditions in the real world (low light, under-exposure, and over-exposure) not only produce an unpleasant visual appearance but also degrade performance on computer vision tasks.

Low-Light Image Enhancement Object Detection +2

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

3 code implementations28 May 2022 Renrui Zhang, Ziyu Guo, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li, Peng Gao

By fine-tuning on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, +3.36% over the second-best, and largely benefits few-shot classification, part segmentation and 3D object detection with the hierarchical pre-training scheme.

Ranked #3 on 3D Point Cloud Linear Classification on ModelNet40 (using extra training data)

3D Object Detection 3D Point Cloud Linear Classification +5

Evaluating the Generalization Ability of Super-Resolution Networks

no code implementations14 May 2022 Yihao Liu, Hengyuan Zhao, Jinjin Gu, Yu Qiao, Chao Dong

However, research on the generalization ability of Super-Resolution (SR) networks is currently absent.

Blueprint Separable Residual Network for Efficient Image Super-Resolution

1 code implementation12 May 2022 Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Jinjin Gu, Yu Qiao, Chao Dong

One is the usage of blueprint separable convolution (BSConv), which takes the place of the redundant convolution operation.

Image Super-Resolution
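Blueprint separable convolution factorizes a standard convolution into a 1x1 pointwise channel-mixing step followed by a per-channel "blueprint" depthwise filter. A minimal numpy sketch of that factorization (stride 1, no padding; the shapes and naming are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def bsconv(x, pointwise_w, depthwise_w):
    """Blueprint separable convolution: 1x1 pointwise, then depthwise.

    x:           (C_in, H, W) input feature map
    pointwise_w: (C_out, C_in) 1x1 convolution weights
    depthwise_w: (C_out, k, k) one k x k blueprint kernel per output channel
    Returns a (C_out, H-k+1, W-k+1) feature map; stride 1, no padding.
    """
    # 1x1 conv: a linear mix of input channels at every spatial location
    mixed = np.tensordot(pointwise_w, x, axes=([1], [0]))    # (C_out, H, W)
    c_out, k = depthwise_w.shape[0], depthwise_w.shape[1]
    h_out, w_out = mixed.shape[1] - k + 1, mixed.shape[2] - k + 1
    out = np.zeros((c_out, h_out, w_out))
    for c in range(c_out):                  # one spatial filter per channel
        for i in range(h_out):
            for j in range(w_out):
                out[c, i, j] = (mixed[c, i:i+k, j:j+k] * depthwise_w[c]).sum()
    return out
```

Compared with a full k x k convolution, this cuts the parameter count from C_out * C_in * k * k to C_out * C_in + C_out * k * k, which is the source of BSConv's efficiency.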