Search Results for author: Zhongang Qi

Found 39 papers, 15 papers with code

DOGE: Towards Versatile Visual Document Grounding and Referring

no code implementations 26 Nov 2024 Yinan Zhou, Yuxin Chen, Haokun Lin, Shuyu Yang, Li Zhu, Zhongang Qi, Chen Ma, Ying Shan

In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction.

document understanding

mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

no code implementations 22 Nov 2024 Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, Weiming Hu

Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope.

RAG Retrieval +1
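The retrieval step that mRAG adds in front of generation can be sketched in a few lines. This is a toy illustration only, not the paper's method: the corpus, the word-overlap scoring, and the prompt format are all invented for the example, and a real mRAG system would use a learned multimodal retriever.

```python
import re

# Toy sketch of the retrieval step in a retrieval-augmented generation
# pipeline: score a small corpus against a query by word overlap, then
# prepend the top-k passages to the prompt before generation.

def tokens(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def score(query, passage):
    """Jaccard-style overlap between query and passage word sets."""
    q, p = tokens(query), tokens(passage)
    return len(q & p) / len(q | p)

def retrieve(query, corpus, k=2):
    """Return the k highest-scoring passages for the query."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query, corpus, k=2):
    """Augment the query with retrieved context before generation."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The Eiffel Tower is located in Paris.",
    "Photosynthesis occurs in plant chloroplasts.",
    "Paris is the capital of France.",
]
print(build_prompt("Where is the Eiffel Tower located?", corpus, k=2))
```

The paper's reflection stages would sit around this step, first deciding whether retrieval is needed at all and then filtering the retrieved evidence before generation.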

Taming Rectified Flow for Inversion and Editing

1 code implementation 7 Nov 2024 Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, Ying Shan

To address this issue, we propose RF-Solver, a novel training-free sampler that effectively enhances inversion precision by mitigating the errors in the ODE-solving process of rectified flow.

Text-to-Image Generation Video Editing +1

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

1 code implementation 26 Sep 2024 Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a large-scale and high-quality benchmark for open-ended event-level video understanding.

Question Answering Video Understanding

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

no code implementations 23 Aug 2024 Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li

However, since it is only trained on static images, the fine-tuning process of subject learning disrupts abilities of video diffusion models (VDMs) to combine concepts and generate motions.

Denoising Motion Generation +1

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

no code implementations 3 Aug 2024 Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu

Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes as input multiple paragraphs and a long video for grounding each paragraph query to its temporal interval.

Natural Language Queries Video Grounding

EA-VTR: Event-Aware Video-Text Retrieval

no code implementations 10 Jul 2024 Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu

EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events.

Action Recognition Contrastive Learning +6

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

no code implementations CVPR 2024 Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu

Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy, while cross-encoder models offer higher accuracy at the expense of efficiency.

Contrastive Learning Image-text Retrieval +3
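The efficiency gap the abstract refers to is easy to see in code. In the minimal sketch below (hand-made unit vectors stand in for real model embeddings), a dual encoder precomputes image embeddings once, so a query costs one text-encoder pass plus N dot products, whereas a cross-encoder would need a joint forward pass for every (text, image) pair.

```python
import numpy as np

# Dual-encoder retrieval sketch: images are embedded once offline;
# a query is answered with N dot products against the cached matrix.
# The embeddings below are illustrative, not outputs of a real model.

image_embs = np.array([   # precomputed, L2-normalized image embeddings
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

def retrieve(text_emb, image_embs):
    """Rank images by cosine similarity (dot products of unit vectors)."""
    sims = image_embs @ text_emb   # N dot products, O(N * d)
    return np.argsort(-sims)       # best match first

text_emb = np.array([0.9, 0.1, 0.0])
text_emb /= np.linalg.norm(text_emb)   # normalize the query embedding
ranking = retrieve(text_emb, image_embs)
print(ranking)  # image 0 is the closest match
```

The paper's question is how to distill the cross-encoder's finer-grained matching into this cheap dot-product scorer without losing its speed advantage.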

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

1 code implementation 5 Jun 2024 Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

This system integrates our proposed layout generation method as the core component, demonstrating its effectiveness in practical scenarios.

Language Modelling Large Language Model

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

no code implementations 15 Mar 2024 Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

Controllable spherical panoramic image generation holds substantial application potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, which result in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to better generate high-quality and precisely controllable spherical panoramic images.

For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the text-object correspondence to better exploit the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion.

For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images.

With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, reducing FID by around 35% on average.

Denoising Diversity +1

StyleAdapter: A Unified Stylized Image Generation Model

no code implementations 4 Sep 2023 Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo

In this work, we propose StyleAdapter, a unified stylized image generation model capable of producing a variety of stylized images that match both the content of a given prompt and the style of reference images, without the need for per-style fine-tuning.

Image Generation

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

no code implementations 23 Jun 2023 Qianji Di, Wenxi Ma, Zhongang Qi, Tianxiang Hou, Ying Shan, Hanzi Wang

In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models.

Graph Generation Scene Graph Generation +1

SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

1 code implementation 6 Jun 2023 XueWei Li, Tao Wu, Zhongang Qi, Gaoang Wang, Ying Shan, Xi Li

Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude.

Semantic Segmentation

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

3 code implementations ICCV 2023 Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, XiaoHu Qie, Yinqiang Zheng

Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results.

Text-based Image Editing

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

2 code implementations CVPR 2023 Guangcong Zheng, Xianpan Zhou, XueWei Li, Zhongang Qi, Ying Shan, Xi Li

To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.

Layout-to-Image Generation Object

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

2 code implementations 16 Feb 2023 Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, XiaoHu Qie

In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly.

Image Generation Style Transfer

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

no code implementations CVPR 2023 Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Weiming Hu, XiaoHu Qie, Jianping Wu

ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words by resorting to image information.

Contrastive Learning Image-text Retrieval +3

Do we really need temporal convolutions in action segmentation?

1 code implementation 26 May 2022 Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan

Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models.

Action Classification Action Segmentation +1

Accelerating the Training of Video Super-Resolution Models

no code implementations 10 May 2022 Lijian Lin, Xintao Wang, Zhongang Qi, Ying Shan

In this work, we show that it is possible to gradually train video models from small to large spatial/temporal sizes, i.e., in an easy-to-hard manner.

Video Super-Resolution

CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

no code implementations 31 Mar 2022 Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing Li, Ying Shan, Weiming Hu

In this paper, we propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.

Retrieval Video Captioning +1

BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild

no code implementations CVPR 2022 Xixi Xu, Zhongang Qi, Jianqi Ma, Honglun Zhang, Ying Shan, XiaoHu Qie

Current research mainly focuses on English characters and digits, while few works study Chinese characters due to the lack of public large-scale, high-quality Chinese datasets, which limits the practical application scenarios of text segmentation.

Segmentation Style Transfer +2

From Heatmaps to Structural Explanations of Image Classifiers

no code implementations 13 Sep 2021 Li Fuxin, Zhongang Qi, Saeed Khorram, Vivswan Shitole, Prasad Tadepalli, Minsuk Kahng, Alan Fern

This paper summarizes our endeavors in the past few years in terms of explaining image classifiers, with the aim of including negative results and insights we have gained.

Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution

1 code implementation NeurIPS 2021 Liangbin Xie, Xintao Wang, Chao Dong, Zhongang Qi, Ying Shan

Unlike previous integral gradient methods, our FAIG aims at finding the most discriminative filters instead of input pixels/features for degradation removal in blind SR networks.

Blind Super-Resolution Super-Resolution

Stochastic Block-ADMM for Training Deep Networks

no code implementations 1 May 2021 Saeed Khorram, Xiao Fu, Mohamad H. Danesh, Zhongang Qi, Li Fuxin

We prove the convergence of our proposed method and justify its capabilities through experiments in supervised and weakly-supervised settings.

Open-book Video Captioning with Retrieve-Copy-Generate Network

no code implementations CVPR 2021 Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Ying Deng, Weiming Hu

Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.

Decoder Retrieval +1

Visualizing Point Cloud Classifiers by Curvature Smoothing

1 code implementation 23 Nov 2019 Chen Ziwen, Wenxuan Wu, Zhongang Qi, Li Fuxin

In this paper, we propose a novel approach to visualize features important to the point cloud classifiers.

Data Augmentation General Classification

Visualizing Deep Networks by Optimizing with Integrated Gradients

1 code implementation 2 May 2019 Zhongang Qi, Saeed Khorram, Li Fuxin

Understanding and interpreting the decisions made by deep learning models is valuable in many domains.
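The integrated-gradients idea this paper builds on can be shown on an analytic toy function. The sketch below is plain integrated gradients, not the paper's optimization-based method: attributions are the input-baseline difference times the average gradient along a straight-line path, and they sum to f(x) - f(baseline) (the completeness property). The function f(x) = sum(x^2), with gradient 2x, is chosen only so the result can be checked by hand.

```python
import numpy as np

# Plain integrated gradients on f(x) = sum(x**2), whose gradient is 2x.
# Illustrative only; the paper extends this idea with an
# optimization-based visualization.

def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2.0 * x

def integrated_gradients(x, baseline, steps=100):
    """Attribution_i = (x_i - b_i) * average gradient along the
    straight-line path from baseline b to input x (midpoint rule)."""
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.array([grad_f(baseline + a * (x - baseline))
                           for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([3.0, -1.0, 2.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(x, baseline)
# Completeness: attributions sum to f(x) - f(baseline) = 14.
print(attr, attr.sum())  # [9. 1. 4.] 14.0
```

For this quadratic the average path gradient is exactly x_i (the midpoint alphas average to 0.5), so each attribution is x_i squared and the completeness check is exact.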

Interactive Naming for Explaining Deep Neural Networks: A Formative Study

no code implementations 18 Dec 2018 Mandana Hamidi-Haines, Zhongang Qi, Alan Fern, Fuxin Li, Prasad Tadepalli

For this purpose, we developed a user interface for "interactive naming," which allows a human annotator to manually cluster significant activation maps in a test set into meaningful groups called "visual concepts".

General Classification

PointConv: Deep Convolutional Networks on 3D Point Clouds

9 code implementations CVPR 2019 Wenxuan Wu, Zhongang Qi, Li Fuxin

Besides, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure.

3D Part Segmentation 3D Point Cloud Classification +1

Deep Air Learning: Interpolation, Prediction, and Feature Analysis of Fine-grained Air Quality

no code implementations 2 Nov 2017 Zhongang Qi, Tianchun Wang, Guojie Song, Weisong Hu, Xi Li, Zhongfei Zhang

The interpolation, prediction, and feature analysis of fine-grained air quality are three important topics in the area of urban air computing.

feature selection

Embedding Deep Networks into Visual Explanations

no code implementations 15 Sep 2017 Zhongang Qi, Saeed Khorram, Fuxin Li

The XNN works by learning a nonlinear embedding of a high-dimensional activation vector of a deep network layer into a low-dimensional explanation space, while retaining faithfulness, i.e., the original deep learning predictions can be constructed from the few concepts extracted by our explanation network.

Image Classification
