Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
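As a rough illustration of the matching step, here is a minimal sketch of attention-based readout from a memory bank, in the spirit of space-time memory matching; the function name, shapes, and scaling are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def memory_readout(query_key, memory_key, memory_value):
    """Spatio-temporal matching: read from a memory bank via attention.

    query_key:    (B, C, H, W)    key features of the current frame
    memory_key:   (B, C, T*H*W)   keys of the memorized frames
    memory_value: (B, D, T*H*W)   values of the memorized frames
    """
    B, C, H, W = query_key.shape
    q = query_key.flatten(2)                                  # (B, C, HW)
    affinity = torch.einsum('bck,bcm->bkm', q, memory_key)    # (B, HW, THW)
    affinity = F.softmax(affinity / C ** 0.5, dim=-1)         # match scores
    out = torch.einsum('bkm,bdm->bdk', affinity, memory_value)
    return out.view(B, -1, H, W)                              # (B, D, H, W)
```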
In this paper, we present TridentSE, a novel architecture for speech enhancement, which is capable of efficiently capturing both global information and local details.
This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
Continuous Speech Keyword Spotting (CSKWS) is the task of detecting predefined keywords in continuous speech.
In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody.
We propose to incorporate peripheral position encoding into the multi-head self-attention layers, letting the network learn from the training data to partition the visual field into diverse peripheral regions.
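As a hedged sketch of how such a positional term can enter self-attention, the layer below adds a learnable relative-position bias to the attention logits; the class name PeripheralAttention and the bias parameterization are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class PeripheralAttention(nn.Module):
    """Self-attention with a learnable relative-position bias on the logits,
    so the network can learn its own partitioning of the visual field."""
    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one bias per head and per relative offset on a window x window grid
        self.bias = nn.Parameter(torch.zeros(num_heads, (2 * window - 1) ** 2))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window), torch.arange(window), indexing='ij'))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]
        rel = (rel[0] + window - 1) * (2 * window - 1) + (rel[1] + window - 1)
        self.register_buffer('rel_index', rel)                # (N, N)

    def forward(self, x):                                     # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.bias[:, self.rel_index]            # positional bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```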
Finally, a novel link attention module serves as the decoder, reconstructing data from the decomposed content and style with the help of the linking keys.
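One plausible reading of such a decoder is cross-attention in which content features form the queries, the linking keys form the keys, and style features form the values; the sketch below follows that assumption and is not the paper's actual module.

```python
import torch.nn as nn

class LinkAttentionDecoder(nn.Module):
    """Cross-attention decoder: content queries attend over linking keys to
    gather style values (an illustrative reading of 'link attention')."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, content, linking_keys, style):
        # content: (B, N, D), linking_keys / style: (B, M, D)
        mixed, _ = self.attn(query=content, key=linking_keys, value=style)
        return self.out(mixed + content)      # residual reconstruction
```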
It can even be replaced by a zero-parameter operation.
Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module.
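For concreteness, here is a minimal sketch of an axis-wise sparse token-mixing module: tokens are mixed only along rows and only along columns, and the two directions plus an identity path are fused by a 1x1 convolution. Module and parameter names are assumptions; the original sMLP may differ in detail.

```python
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    """Token mixing with axis-wise linear layers instead of a full token MLP."""
    def __init__(self, channels, h, w):
        super().__init__()
        self.mix_h = nn.Linear(h, h)   # mixes tokens along the height axis
        self.mix_w = nn.Linear(w, w)   # mixes tokens along the width axis
        self.fuse = nn.Conv2d(channels * 3, channels, kernel_size=1)

    def forward(self, x):              # x: (B, C, H, W)
        xh = self.mix_h(x.transpose(2, 3)).transpose(2, 3)   # mix over H
        xw = self.mix_w(x)                                   # mix over W
        return self.fuse(torch.cat([x, xh, xw], dim=1))
```

Each token thus interacts with only 2N-1 others on an NxN grid rather than all N^2, which is what makes the mixing sparse.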
Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript.
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision.
In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea.
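A minimal sketch of the contrastive objective such a framework could use: the feature predicted for a masked region is pulled toward the true feature of that region and pushed away from distractors via an InfoNCE loss. Shapes, names, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mask_contrast_loss(pred, pos, negs, tau=0.07):
    """InfoNCE over masked-region features.

    pred: (B, D)    features predicted for the masked regions
    pos:  (B, D)    target features of the same regions
    negs: (B, K, D) distractor features from other locations
    """
    pred = F.normalize(pred, dim=-1)
    pos = F.normalize(pos, dim=-1)
    negs = F.normalize(negs, dim=-1)
    l_pos = (pred * pos).sum(-1, keepdim=True)        # (B, 1)
    l_neg = torch.einsum('bd,bkd->bk', pred, negs)    # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = logits.new_zeros(len(logits), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)
```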
The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame.
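Under that framing, the training signal reduces to box regression: predict the patch's centre and size in each later frame and penalize the deviation. A minimal sketch, assuming boxes parameterized as normalized (cx, cy, w, h):

```python
import torch.nn.functional as F

def proxy_task_loss(pred_boxes, true_boxes):
    """pred_boxes, true_boxes: (B, T, 4) as normalized (cx, cy, w, h)
    for T frames following the annotated first frame."""
    return F.smooth_l1_loss(pred_boxes, true_boxes)
```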
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
We believe that VAE$^2$ is also applicable to other stochastic sequence prediction problems where the training data lack stochasticity.
Based on the probability space, we further derive new fusion strategies that achieve state-of-the-art performance on four well-known action recognition datasets.
Accordingly, a hybrid network representation is presented that enables us to leverage Variational Dropout, so that the approximation of the posterior distribution becomes fully gradient-based and highly efficient.
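For reference, here is a generic sketch of Variational Dropout with the local reparameterization trick, where a learnable log-variance (log_alpha) governs the multiplicative noise so the posterior approximation is trained purely by gradients; this is the standard formulation, not necessarily the paper's exact variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDropoutLinear(nn.Module):
    """Linear layer with multiplicative Gaussian noise on the weights,
    sampled at the pre-activations (local reparameterization)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.log_alpha = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        if self.training:
            mean = F.linear(x, self.weight)
            var = F.linear(x ** 2, torch.exp(self.log_alpha) * self.weight ** 2)
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        return F.linear(x, self.weight)
```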
The two stages are connected in series, as the input proposals of the FM stage are generated by the CM stage.
no code implementations • 2 Dec 2018 • Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Ré, Rob Malkin
Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications.
Recently, Siamese-network-based trackers have attracted tremendous interest for their fast tracking speed and high performance.