1 code implementation • 2 Apr 2025 • Kecen Li, Chen Gong, Xiaochen Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang
In this work, inspired by curriculum learning, we propose a two-stage DP image synthesis framework, where diffusion models learn to generate DP synthetic images from easy to hard.
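A minimal sketch of the easy-to-hard idea, assuming a per-image difficulty score is available; the helper names (curriculum_split, train_step) are illustrative, and the DP noise injection (e.g., DP-SGD) that provides the actual privacy guarantee is omitted:

    import numpy as np

    def curriculum_split(difficulty, easy_fraction=0.5):
        """Split sample indices into an easy stage and a hard stage by difficulty score."""
        order = np.argsort(difficulty)          # ascending: easiest first
        cut = int(len(order) * easy_fraction)
        return order[:cut], order[cut:]

    def train_two_stage(samples, difficulty, train_step):
        """Stage 1: fit the diffusion model on easy samples only.
        Stage 2: continue training on the full (easy + hard) set."""
        easy_idx, hard_idx = curriculum_split(difficulty)
        for i in easy_idx:                      # stage 1: easy-to-learn images
            train_step(samples[i])
        for i in np.concatenate([easy_idx, hard_idx]):  # stage 2: all images
            train_step(samples[i])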
no code implementations • 27 Mar 2025 • Jingye Chen, Yuzhong Zhao, Yupan Huang, Lei Cui, Li Dong, Tengchao Lv, Qifeng Chen, Furu Wei
Recent advances in generative models have significantly impacted game generation.
1 code implementation • 27 Dec 2024 • Meiqi Wu, Kaiqi Huang, Yuanqiang Cai, Shiyu Hu, Yuzhong Zhao, Weiqiang Wang
This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras, eliminating the need for complex sensors.
no code implementations • 28 Nov 2024 • Feng Liu, Shiwei Zhang, XiaoFeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, Fang Wan
As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising.
1 code implementation • 1 Jul 2024 • Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, WangMeng Zuo, Qixiang Ye, Jingdong Wang
For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video.
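As an illustration of scoring dynamics at several temporal granularities, the stand-in metric below averages absolute frame differences at increasing temporal strides; it is not the benchmark's actual score definition:

    import numpy as np

    def dynamics_scores(frames, strides=(1, 4, 16)):
        """Illustrative dynamics scores at several temporal granularities.

        frames: array of shape (T, H, W, C), values in [0, 1].
        Returns the mean absolute frame difference at each temporal stride,
        so larger strides capture coarser, longer-range motion.
        """
        frames = np.asarray(frames, dtype=np.float32)
        scores = {}
        for s in strides:
            if len(frames) <= s:
                scores[s] = 0.0
                continue
            diffs = np.abs(frames[s:] - frames[:-s])
            scores[s] = float(diffs.mean())
        return scores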
1 code implementation • 25 May 2024 • Yuzhong Zhao, Feng Liu, Yue Liu, Mingxiang Liao, Chen Gong, Qixiang Ye, Fang Wan
Unfortunately, most existing methods rely on fixed visual inputs and lack the resolution adaptability needed to produce precise language descriptions.
1 code implementation • 31 Jan 2024 • Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Fang Wan, Qixiang Ye
The multimodal model is constrained to generate captions within a few sub-spaces containing the control words, which increases the chance of producing less frequent captions and alleviates the caption degeneration issue (a toy decoding sketch follows the leaderboard entry below).
Ranked #1 on Dense Captioning on Visual Genome
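A toy sketch of the control-word constraint, assuming a caption sampler generate_fn is available; the real method constrains decoding inside the model rather than filtering sampled candidates:

    def constrained_captions(generate_fn, control_words, num_candidates=8):
        """Toy version of control-word constrained captioning.

        generate_fn() -> one sampled caption string from the multimodal model.
        Keep only candidates that fall in the sub-space defined by the control
        words, i.e. captions that actually mention at least one of them.
        """
        kept = []
        for _ in range(num_candidates):
            caption = generate_fn()
            if any(w.lower() in caption.lower() for w in control_words):
                kept.append(caption)
        # Fall back to an unconstrained output if no candidate hits a control word.
        return kept or [generate_fn()]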
12 code implementations • 18 Jan 2024 • Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, YaoWei Wang, Qixiang Ye, Jianbin Jiao, Yunfan Liu
At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
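A rough NumPy sketch of the SS2D cross-scan idea: the 2D feature map is unfolded along four scanning routes, a 1D sequence operator runs over each route, and the outputs are merged back into 2D. The selective state-space (S6) scan itself is replaced by np.cumsum purely as a placeholder:

    import numpy as np

    def cross_scan(x):
        """Unfold an (H, W, C) feature map into four 1D scanning routes:
        row-major, column-major, and their two reverses."""
        h, w, c = x.shape
        routes = [
            x.reshape(h * w, c),                       # left-to-right, top-to-bottom
            x.transpose(1, 0, 2).reshape(h * w, c),    # top-to-bottom, left-to-right
        ]
        routes += [r[::-1] for r in routes]            # the two reversed routes
        return routes

    def ss2d(x, scan_fn=np.cumsum):
        """Illustrative SS2D: apply a 1D sequence operator to each route and
        merge the four results back into the 2D layout."""
        h, w, c = x.shape
        outs = []
        for i, r in enumerate(cross_scan(x)):
            y = scan_fn(r, axis=0)
            if i >= 2:                                  # undo the reversal
                y = y[::-1]
            if i % 2 == 1:                              # undo the column-major route
                y = y.reshape(w, h, c).transpose(1, 0, 2).reshape(h * w, c)
            outs.append(y)
        return sum(outs).reshape(h, w, c)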
1 code implementation • 29 Nov 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Lianlei Shan, Hong Zhou, Mike Zheng Shou
Image segmentation based on continual learning exhibits a critical drop in performance, mainly due to catastrophic forgetting and background shift, as models are required to incorporate new classes continually.
1 code implementation • 19 Oct 2023 • Kecen Li, Chen Gong, Zhixiang Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang
Then, this function assists in querying the semantic distribution of the sensitive dataset, facilitating the selection of data from the public dataset with analogous semantics for pre-training.
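A hedged sketch of semantics-based selection, assuming embeddings are available for the public data and a centroid has been (privately) queried from the sensitive data; the actual DP mechanism and noise calibration are omitted:

    import numpy as np

    def select_public_data(public_embs, private_centroid, k):
        """Pick the k public samples whose semantic embeddings are closest to a
        privately estimated centroid of the sensitive dataset."""
        dists = np.linalg.norm(public_embs - private_centroid, axis=1)
        return np.argsort(dists)[:k]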
no code implementations • 20 Sep 2023 • Tengchao Lv, Yupan Huang, Jingye Chen, Yuzhong Zhao, Yilin Jia, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
In this paper, we present KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images.
1 code implementation • NeurIPS 2023 • Weijia Wu, Yuzhong Zhao, Hao Chen, YuChao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen
To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation.
1 code implementation • ICCV 2023 • Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan
During training, GenPromp converts image category labels into learnable prompt embeddings, which are fed to a generative model to conditionally recover the noised input image and thereby learn representative embeddings (a minimal training sketch follows the leaderboard entry below).
Ranked #1 on Weakly-Supervised Object Localization on CUB-200-2011 (Top-1 Localization Accuracy metric, using extra training data)
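A toy PyTorch sketch of the label-to-prompt idea, assuming a frozen denoiser callable; the single-step noising below stands in for the full diffusion schedule:

    import torch
    import torch.nn as nn

    class LearnablePrompts(nn.Module):
        """Map each category label to a learnable prompt embedding that
        conditions a frozen generative denoiser, GenPromp-style."""
        def __init__(self, num_classes, embed_dim):
            super().__init__()
            self.prompts = nn.Embedding(num_classes, embed_dim)

        def denoising_loss(self, denoiser, images, labels):
            # Add noise to the input images and ask the denoiser, conditioned on
            # the class prompt, to predict that noise; only the prompts are trained.
            noise = torch.randn_like(images)
            noisy = images + noise             # single-step toy noising, not a full schedule
            cond = self.prompts(labels)        # (B, embed_dim) prompt embeddings
            pred = denoiser(noisy, cond)       # frozen generative model
            return nn.functional.mse_loss(pred, noise)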
1 code implementation • 5 May 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai
Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i.e., visual representation, while text is omnipresent in human environments and frequently critical to understanding video.
1 code implementation • 5 May 2023 • Yuzhong Zhao, Weijia Wu, Zhuang Li, Jiahong Li, Weiqiang Wang
This paper introduces a novel video text synthesis technique called FlowText, which utilizes optical flow estimation to synthesize a large amount of text video data at a low cost for training robust video text spotters.
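An illustrative sketch of flow-based propagation, assuming a forward optical-flow field and the corner coordinates of a synthetic text box; the actual FlowText pipeline handles occlusion and scene geometry more carefully:

    import numpy as np

    def propagate_text_box(corners, flow):
        """Move a synthetic text box from one frame to the next by sampling the
        optical flow at its corner points.

        corners: (4, 2) array of (x, y) pixel coordinates in frame t.
        flow:    (H, W, 2) forward flow from frame t to frame t+1.
        """
        h, w, _ = flow.shape
        moved = []
        for x, y in corners:
            xi = int(np.clip(round(x), 0, w - 1))
            yi = int(np.clip(round(y), 0, h - 1))
            dx, dy = flow[yi, xi]
            moved.append((x + dx, y + dy))
        return np.asarray(moved)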
no code implementations • 10 Apr 2023 • Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Mike Zheng Shou, Umapada Pal, Dimosthenis Karatzas, Xiang Bai
In this competition report, we establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in videos with various scenarios.
1 code implementation • ICCV 2023 • Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen
In contrast, synthetic data can be obtained freely from generative models (e.g., DALL-E, Stable Diffusion).
no code implementations • 4 Jul 2022 • Yuzhong Zhao, Yuanqiang Cai, Weijia Wu, Weiqiang Wang
Pre-training and long training schedules are generally necessary to obtain a well-performing text detector based on deep networks.