Search Results for author: Jiayi Ji

Found 47 papers, 35 papers with code

Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

1 code implementation • 16 Apr 2025 • Lvpan Cai, Haowei Wang, Jiayi Ji, YanShu ZhouMen, Yiwei Ma, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji

In addition, we further propose NFA-ViT, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying forgery-related features across the entire image.

An Efficient and Mixed Heterogeneous Model for Image Restoration

1 code implementation • 15 Apr 2025 • Yubin Gu, Yuan Meng, Kaihang Zheng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Liujuan Cao, Rongrong Ji

This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement.

Image Restoration • Mamba
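
The three-stage division of labor described above (CNNs for local feature extraction, Mamba for global context, attention for dynamic refinement) can be caricatured with a toy 1-D pipeline. Everything here — the averaging filter, the decaying recurrent scan, the softmax reweighting — is an invented stand-in for illustration, not the paper's architecture:

```python
import numpy as np

def local_conv(x, k=3):
    """CNN-like stage: small sliding-window filter for local features."""
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")

def global_scan(x, decay=0.9):
    """Mamba-like stage: a linear recurrent scan carrying global context."""
    h, out = 0.0, np.empty_like(x)
    for i, v in enumerate(x):
        h = decay * h + (1 - decay) * v
        out[i] = h
    return out

def attention_refine(x):
    """Attention-like stage: reweight positions by softmax scores."""
    scores = np.exp(x * x.mean())
    return x * (scores / scores.sum()) * len(x)

# toy noisy signal standing in for a degraded image row
signal = np.sin(np.linspace(0, 6, 64)) \
    + 0.1 * np.random.default_rng(1).standard_normal(64)
restored = attention_refine(global_scan(local_conv(signal)))
```

The point of the sketch is only the composition order: cheap local filtering first, a linear-cost global pass second, and content-dependent reweighting last.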

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

no code implementations • 30 Mar 2025 • Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG).

Video Generation

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

no code implementations • 26 Mar 2025 • Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Jiayi Ji, Jie Lou, Debing Zhang, Rongrong Ji

Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly.

Diversity

ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

no code implementations • 22 Mar 2025 • Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, Rongrong Ji

Second, we propose FlowAgent, an LLM-based workflow generation agent that uses both supervised fine-tuning (SFT) and reinforcement learning (RL) to improve workflow generation accuracy.

Image Generation • Reinforcement Learning (RL)

QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

1 code implementation • 11 Mar 2025 • Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji

Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers a plug-and-play functionality that extends to existing LVLMs.

AutoML • Decoder +3
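
The score-then-budget idea in (i) above — frame-level importance scores driving a one-time visual token assignment — can be sketched in a few lines. This is a minimal illustration of proportional budgeting, not QuoTA's implementation; the relevance scores and `total_budget` are hypothetical inputs:

```python
import numpy as np

def allocate_tokens(frame_scores, total_budget):
    """Distribute a fixed visual-token budget across video frames
    in proportion to their query-relevance scores."""
    scores = np.asarray(frame_scores, dtype=float)
    weights = scores / scores.sum()              # normalize to a distribution
    alloc = np.floor(weights * total_budget).astype(int)
    # hand any leftover tokens to the highest-scoring frames
    leftover = total_budget - alloc.sum()
    for idx in np.argsort(-scores)[:leftover]:
        alloc[idx] += 1
    return alloc

# e.g. 4 frames with query-relevance scores and a 100-token budget
print(allocate_tokens([0.9, 0.1, 0.5, 0.5], 100))
```

Because the assignment happens once, before the decoder's cross-modal layers, irrelevant frames are pruned without touching the LVLM itself — which is what makes the approach plug-and-play.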

Towards General Visual-Linguistic Face Forgery Detection (V2)

1 code implementation • 28 Feb 2025 • Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji

Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust.

Hallucination • Language Modeling +3

IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation

1 code implementation • 9 Jan 2025 • Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Danni Yang, Xiaoshuai Sun

To tackle intent ambiguity, we designed a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features.

Decoder • Referring Expression +1

Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding

no code implementations • 25 Nov 2024 • Yubin Gu, Yuan Meng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Rongrong Ji

In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations.

Decoder • Diversity +1

Any-to-3D Generation via Hybrid Diffusion Supervision

no code implementations • 22 Nov 2024 • Yijun Fan, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

To our knowledge, this is the first method to generate 3D objects from any modality prompts.

3D Generation • Image to 3D

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

no code implementations • 20 Oct 2024 • Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form.

Image to text

γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

no code implementations • 17 Oct 2024 • Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji

In γ-MoD, a novel metric, the rank of attention maps (ARank), is proposed to guide the deployment of MoDs in the MLLM.

Visual Question Answering
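
The ARank metric — scoring a layer's redundancy by the matrix rank of its attention maps, to decide where mixture-of-depth skipping is safe — lends itself to a direct sketch. The shapes, tolerance, and random inputs below are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def arank(attn_maps, tol=1e-3):
    """Average matrix rank across a layer's attention heads.
    A low average rank suggests the layer's attention is redundant,
    making it a candidate for mixture-of-depth token skipping."""
    ranks = [np.linalg.matrix_rank(m, tol=tol) for m in attn_maps]
    return float(np.mean(ranks))

rng = np.random.default_rng(0)
# 4 rank-1 heads (maximally redundant) vs. 4 unstructured heads
low = np.stack([np.outer(rng.random(16), rng.random(16)) for _ in range(4)])
high = rng.random((4, 16, 16))
print(arank(low), arank(high))
```

Layers whose heads collapse to near-rank-1 maps contribute little distinct mixing, so routing most of their tokens around them costs little accuracy.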

I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

1 code implementation • 26 Aug 2024 • Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, Rongrong Ji

We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.

TraDiffusion: Trajectory-Based Training-Free Image Generation

1 code implementation • 19 Aug 2024 • Mingrui Wu, Oucheng Huang, Jiayi Ji, Jiale Li, Xinyue Cai, Huafeng Kuang, Jianzhuang Liu, Xiaoshuai Sun, Rongrong Ji

In this work, we propose a training-free, trajectory-based controllable T2I approach, termed TraDiffusion.

Image Generation

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

1 code implementation • 31 Jul 2024 • Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji

We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results.

Domain Generalization • Prompt Learning

3D-GRES: Generalized 3D Referring Expression Segmentation

2 code implementations • 30 Jul 2024 • Changli Wu, Yihang Liu, Jiayi Ji, Yiwei Ma, Haowei Wang, Gen Luo, Henghui Ding, Xiaoshuai Sun, Rongrong Ji

3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description.

Object • Referring Expression +3

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

1 code implementation • 7 Jul 2024 • Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.

Segmentation • Sentence +1

Multi-branch Collaborative Learning Network for 3D Visual Grounding

1 code implementation • 7 Jul 2024 • Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji

3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration.

3D visual grounding • Referring Expression +1

HRSAM: Efficient Interactive Segmentation in High-Resolution Images

1 code implementation • 2 Jul 2024 • You Huang, Wenbin Lai, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

Within FLA, we implement Flash Swin attention, achieving over a 35% speedup compared to traditional Swin attention, and propose a KV-only padding mechanism to enhance extrapolation.

Data Augmentation • Interactive Segmentation +2

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

1 code implementation • 24 Jun 2024 • Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, Rongrong Ji

We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object.

Common Sense Reasoning • Hallucination +1

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

no code implementations • 9 Jun 2024 • Yiwei Ma, Xiaoshuai Sun, Jiayi Ji, Guannan Jiang, Weilin Zhuang, Rongrong Ji

To address this issue, we propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample, thus mitigating the optimization problem.

Image-text Retrieval • Person Retrieval +3

Towards Semantic Equivalence of Tokenization in Multimodal LLM

no code implementations • 7 Jun 2024 • Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.

Visual Question Answering

SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation

1 code implementation • 3 Jun 2024 • Danni Yang, Jiayi Ji, Yiwei Ma, Tianyu Guo, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

These strategies are designed to extract the most accurate masks from SAM's output, thus guiding the training of the student model with enhanced precision.

Pseudo Label • Referring Expression +1

Image Captioning via Dynamic Path Customization

1 code implementation • 1 Jun 2024 • Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Xiaopeng Hong, Yongjian Wu, Rongrong Ji

This paper explores a novel dynamic network for vision and language tasks, where the inferring structure is customized on the fly for different inputs.

Diversity • Image Captioning

Toward Open-Set Human Object Interaction Detection

1 code implementation • Proceedings of the AAAI Conference on Artificial Intelligence 2024 • Mingrui Wu, Yuqi Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

To address this challenge, we introduce a simple Disentangled HOI Detection (DHD) model for detecting novel relationships by integrating an open-set object detector with a Visual Language Model (VLM).

Contrastive Learning • Human-Object Interaction Detection +3

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

1 code implementation • CVPR 2024 • Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing, delineating specific regions in aerial images as described by textual queries.

ARC • Image Segmentation +2

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

1 code implementation • 30 Nov 2023 • Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji

Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects.

3D Generation • Text to 3D

Semi-Supervised Panoptic Narrative Grounding

1 code implementation • 27 Oct 2023 • Danni Yang, Jiayi Ji, Xiaoshuai Sun, Haowei Wang, Yinan Li, Yiwei Ma, Rongrong Ji

Remarkably, our SS-PNG-NW+ outperforms fully-supervised models with only 30% and 50% supervision data, exceeding their performance by 0.8% and 1.1% respectively.

Data Augmentation • Pseudo Label

NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning

1 code implementation • 17 Oct 2023 • Haowei Wang, Jiayi Ji, Tianyu Guo, Yilong Yang, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

To address this, we introduce two cascading modules based on the barycenter of the mask, which are Coordinate Guided Aggregation (CGA) and Barycenter Driven Localization (BDL), responsible for segmentation and detection, respectively.

Segmentation • Visual Grounding

JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues

1 code implementation • 14 Oct 2023 • Jiayi Ji, Haowei Wang, Changli Wu, Yiwei Ma, Xiaoshuai Sun, Rongrong Ji

The rising importance of 3D understanding, pivotal in computer vision, autonomous driving, and robotics, is evident.

Autonomous Driving • Representation Learning

3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation

1 code implementation • 31 Aug 2023 • Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun

In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions.

Navigate • Referring Expression +3

MMAPS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization

1 code implementation • 22 Aug 2023 • Tao Chen, Ze Lin, Hui Li, Jiayi Ji, Yiyi Zhou, Guanbin Li, Rongrong Ji

Furthermore, we model product attributes based on both text and image modalities so that multi-modal product characteristics can be manifested in the generated summaries.

Attribute

X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance

1 code implementation • ICCV 2023 • Yiwei Ma, Xiaoqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, Rongrong Ji

Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV) and computer graphics (CG), aimed at transforming a bare mesh to fit a target text.

Attribute

Towards Local Visual Modeling for Image Captioning

1 code implementation • 13 Feb 2023 • Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

In this paper, we study the local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions.

Image Captioning • Object Recognition

Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

1 code implementation • 9 Jan 2023 • Haowei Wang, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Xiaoshuai Sun

Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method.

RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words

1 code implementation • CVPR 2021 • Xuying Zhang, Xiaoshuai Sun, Yunpeng Luo, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Feiyue Huang, Rongrong Ji

Then, we build a BERT-based language model to extract language context and propose an Adaptive-Attention (AA) module on top of a transformer decoder to adaptively measure the contribution of visual and language cues before making decisions for word prediction.

Decoder • Image Captioning +3
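
The adaptive-attention idea — a learned gate deciding how much a predicted word should rely on visual versus language cues — can be sketched in plain NumPy. The weights below are random stand-ins and the gating form is a simplification for illustration, not RSTNet's actual AA module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attend(visual_feats, lang_ctx, W_gate):
    """Mix attended visual features with the language context via a
    sigmoid gate: 'how visual is the next word?'"""
    # attend over visual regions with the language context as the query
    att = softmax(visual_feats @ lang_ctx)           # (num_regions,)
    v = att @ visual_feats                           # attended visual vector
    # scalar gate computed from both cues
    beta = 1.0 / (1.0 + np.exp(-(W_gate @ np.concatenate([v, lang_ctx]))))
    return beta * v + (1.0 - beta) * lang_ctx, beta

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))   # 5 visual regions, feature dim 8
ctx = rng.standard_normal(8)          # language-context vector
W = rng.standard_normal(16)
fused, beta = adaptive_attend(feats, ctx, W)
print(beta)  # gate near 1 leans visual, near 0 leans linguistic
```

A gate like this lets function words ("of", "the") draw mostly on language context while content words draw on the attended visual vector.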

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

1 code implementation • 6 Apr 2021 • Liangyuan Hu, Jung-Yi Joyce Lin, Jiayi Ji

We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns.

BIG-bench Machine Learning • Imputation +2

Dual-Level Collaborative Transformer for Image Captioning

1 code implementation • 16 Jan 2021 • Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, Rongrong Ji

Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning.

Descriptive • Image Captioning +2

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

1 code implementation • 13 Dec 2020 • Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian Wu, Yue Gao, Rongrong Ji

The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide the caption generation.

Caption Generation • Decoder +1

Estimating heterogeneous survival treatment effect in observational data using machine learning

1 code implementation • 17 Aug 2020 • Liangyuan Hu, Jiayi Ji, Fan Li

Methods for estimating heterogeneous treatment effect in observational data have largely focused on continuous or binary outcomes, and have been relatively less vetted with survival outcomes.

BIG-bench Machine Learning • Causal Inference +1

Variational Structured Semantic Inference for Diverse Image Captioning

no code implementations • NeurIPS 2019 • Fuhai Chen, Rongrong Ji, Jiayi Ji, Xiaoshuai Sun, Baochang Zhang, Xuri Ge, Yongjian Wu, Feiyue Huang, Yan Wang

To model these two inherent diversities in image captioning, we propose a Variational Structured Semantic Inferring model (termed VSSI-cap) executed in a novel structured encoder-inferer-decoder schema.

Decoder • Diversity +1

Semantic-aware Image Deblurring

no code implementations • 9 Oct 2019 • Fuhai Chen, Rongrong Ji, Chengpeng Dai, Xiaoshuai Sun, Chia-Wen Lin, Jiayi Ji, Baochang Zhang, Feiyue Huang, Liujuan Cao

Specifically, we propose a novel Structured-Spatial Semantic Embedding model for image deblurring (termed S3E-Deblur), which introduces a Structured-Spatial Semantic tree model (S3-tree) to bridge two basic tasks in computer vision: image deblurring (ImD) and image captioning (ImC).

Deblurring • Image Captioning +1
