1 code implementation • 19 Jul 2024 • Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
In this work, we conduct the first systematic study on compositional text-to-video generation.
no code implementations • 11 Jul 2024 • Zidong Wang, Zeyu Lu, Di Huang, Tong He, Xihui Liu, Wanli Ouyang, Lei Bai
In this paper, we introduce PredBench, a benchmark tailored for the holistic evaluation of spatio-temporal prediction networks.
no code implementations • 8 Jul 2024 • Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu
For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification.
no code implementations • 8 Jul 2024 • Zhangyang Qi, Yunhan Yang, Mengchen Zhang, Long Xing, Xiaoyang Wu, Tong Wu, Dahua Lin, Xihui Liu, Jiaqi Wang, Hengshuang Zhao
Afterward, it proceeds to edit the back views.
no code implementations • 1 Jul 2024 • Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, Xihui Liu
Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason human intentions from implicit instructions.
1 code implementation • 14 Jun 2024 • Yuchen Ren, ZhiYuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, Dong Yuan, Wanli Ouyang, Xihui Liu
RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms.
no code implementations • 31 May 2024 • Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations.
1 code implementation • 23 May 2024 • Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
In addition, to further improve training efficiency for high-resolution image generation with DiM, we investigate a "weak-to-strong" training strategy that pretrains DiM on low-resolution images ($256\times 256$) and then finetunes it on high-resolution images ($512 \times 512$).
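The weak-to-strong idea — reuse weights pretrained at $256\times 256$ when finetuning at $512\times 512$ — requires carrying per-position parameters (e.g., positional embeddings) across resolutions. A minimal stdlib-only sketch of one common way to do this, bilinear resampling of a square parameter grid; this is an illustrative assumption, not the authors' DiM code:

```python
def resize_grid(grid, new_side):
    """Bilinearly resample a square grid (list of lists of floats) to
    new_side x new_side, e.g. to carry per-position parameters learned
    at a low resolution over to a higher-resolution finetuning stage."""
    old = len(grid)

    def sample(r, c):
        # locate the 2x2 neighborhood around fractional coords (r, c)
        r0, c0 = int(r), int(c)
        r1, c1 = min(r0 + 1, old - 1), min(c0 + 1, old - 1)
        fr, fc = r - r0, c - c0
        top = grid[r0][c0] * (1 - fc) + grid[r0][c1] * fc
        bot = grid[r1][c0] * (1 - fc) + grid[r1][c1] * fc
        return top * (1 - fr) + bot * fr

    # map target indices back into the source grid's coordinate frame
    step = (old - 1) / (new_side - 1) if new_side > 1 else 0.0
    return [[sample(i * step, j * step) for j in range(new_side)]
            for i in range(new_side)]
```

Corners of the source grid map exactly onto corners of the target grid, so boundary parameters survive the resize unchanged.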
no code implementations • 26 Mar 2024 • Sherwin Bahmani, Xian Liu, Yifan Wang, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, Andrea Tagliasacchi, David B. Lindell
We learn local deformations that conform to the global trajectory using supervision from a text-to-video model.
1 code implementation • 20 Mar 2024 • Tianwei Xiong, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu
We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics.
2 code implementations • 19 Feb 2024 • Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai
Nature is infinitely resolution-free.
no code implementations • 28 Jan 2024 • Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li
Given a complex text prompt containing multiple concepts including objects, attributes, and relationships, the LLM agent initially decomposes it, which entails the extraction of individual objects, their associated attributes, and the prediction of a coherent scene layout.
1 code implementation • CVPR 2024 • Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao
This paper is not motivated to seek innovation within the attention mechanism.
1 code implementation • CVPR 2024 • Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang
In the realm of computer vision and robotics, embodied agents are expected to explore their environment and carry out human instructions.
1 code implementation • 17 Dec 2023 • Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, Jifeng Dai, Ping Luo, Jingdong Wang, Ji-Rong Wen, Xipeng Qiu, Yike Guo, Hui Xiong, Qun Liu, Zhenguo Li
Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation.
3 code implementations • 15 Dec 2023 • Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao
This paper is not motivated to seek innovation within the attention mechanism.
Ranked #1 on 3D Semantic Segmentation on ScanNet++ (using extra training data)
1 code implementation • 11 Dec 2023 • Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu
The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs.
1 code implementation • NeurIPS 2023 • Yixing Lao, Xiaogang Xu, Zhipeng Cai, Xihui Liu, Hengshuang Zhao
We present CorresNeRF, a novel method that leverages image correspondence priors computed by off-the-shelf methods to supervise NeRF training.
no code implementations • CVPR 2024 • Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, Xihui Liu
However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views.
no code implementations • 5 Dec 2023 • Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu
In this paper, we propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video.
no code implementations • CVPR 2024 • Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu
In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance.
no code implementations • 12 Oct 2023 • Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov
Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where the branches complement each other with both structural awareness and textural richness.
1 code implementation • NeurIPS 2023 • Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, Jiangmiao Pang
Secondly, part segmentation introduces an open granularity challenge due to the diverse and often ambiguous definitions of parts in the open world.
no code implementations • 3 Oct 2023 • Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou, Wanli Ouyang
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
no code implementations • 18 Sep 2023 • Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen
Instead of leveraging 2D images, we propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection.
1 code implementation • CVPR 2024 • Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao
In contrast, such privilege has not yet fully benefited 3D deep learning, mainly due to the limited availability of large-scale 3D datasets.
Ranked #3 on 3D Semantic Segmentation on SemanticKITTI (val mIoU metric, using extra training data)
1 code implementation • NeurIPS 2023 • Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu
Despite the stunning ability to generate high-quality images by recent text-to-image models, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene.
no code implementations • 19 Jun 2023 • Qinghong Sun, Yangguang Li, Zexiang Liu, Xiaoshui Huang, Fenggang Liu, Xihui Liu, Wanli Ouyang, Jing Shao
However, the quality and diversity of existing 3D object generation methods are constrained by the shortcomings of existing 3D object datasets, including poor text quality, incomplete multi-modal representations spanning 2D rendered images and 3D assets, and limited dataset size.
1 code implementation • 6 Jun 2023 • Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, Xihui Liu
In this work, we propose SAM3D, a novel framework that is able to predict masks in 3D point clouds by leveraging the Segment-Anything Model (SAM) in RGB images without further training or finetuning.
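The SAM3D recipe — 2D masks predicted by SAM lifted onto 3D point clouds without extra training — hinges on projecting each 3D point into an RGB frame and reading off the mask label there. A toy stdlib-only sketch under assumed pinhole intrinsics (`fx`, `fy`, `cx`, `cy` are hypothetical parameters; the cross-frame mask merging central to the real method is omitted):

```python
def lift_mask_to_points(points, mask, fx, fy, cx, cy):
    """Assign each 3D point (camera coordinates, z > 0) the 2D mask label
    at its pinhole projection; points that fall outside the image or sit
    behind the camera get label -1."""
    h, w = len(mask), len(mask[0])
    labels = []
    for x, y, z in points:
        if z <= 0:  # behind the camera: no valid projection
            labels.append(-1)
            continue
        u = int(round(fx * x / z + cx))  # column in the image
        v = int(round(fy * y / z + cy))  # row in the image
        labels.append(mask[v][u] if 0 <= u < w and 0 <= v < h else -1)
    return labels
```

Running this over many posed frames and reconciling the per-frame labels is where the actual SAM3D contribution lies; this sketch only shows the single-frame projection step.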
1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.
no code implementations • 24 Apr 2023 • Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu
To mitigate those limitations, we propose Hierarchical Diffusion Autoencoders (HDAE) that exploit the fine-grained-to-abstract and low-level-to-high-level feature hierarchy for the latent space of diffusion models.
2 code implementations • 12 Apr 2023 • Jiahao Wang, Songyang Zhang, Yong Liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin
Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy.
1 code implementation • ICCV 2023 • Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline.
Ranked #3 on Monocular Depth Estimation on SUN-RGBD
1 code implementation • CVPR 2023 • Xiaoyang Wu, Xin Wen, Xihui Liu, Hengshuang Zhao
As a pioneering work, PointContrast conducts unsupervised 3D representation learning via leveraging contrastive learning over raw RGB-D frames and proves its effectiveness on various downstream tasks.
Ranked #13 on Semantic Segmentation on ScanNet (using extra training data)
no code implementations • CVPR 2023 • Jiahao Wang, Songyang Zhang, Yong Liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin
Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy.
no code implementations • CVPR 2023 • Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan Shelhamer, Dequan Wang
We update the target data instead, and project all test inputs toward the source domain with a generative diffusion model.
1 code implementation • CVPR 2023 • Qingyan Bai, Ceyuan Yang, Yinghao Xu, Xihui Liu, Yujiu Yang, Yujun Shen
A generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D acts as the rule maker and hence tends to dominate the competition.
no code implementations • 1 Dec 2022 • Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
We introduce precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion.
2 code implementations • 11 Oct 2022 • Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, Hengshuang Zhao
In this work, we analyze the limitations of the Point Transformer and propose our powerful and efficient Point Transformer V2 model with novel designs that overcome the limitations of previous work.
Ranked #1 on 3D Semantic Segmentation on nuScenes
1 code implementation • CVPR 2023 • Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.
1 code implementation • 7 Jul 2022 • Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan Shelhamer, Dequan Wang
We instead update the target data by projecting all test inputs toward the source domain with a generative diffusion model.
1 code implementation • 22 Jun 2022 • Peiyuan Liao, Xiuyu Li, Xihui Liu, Kurt Keutzer
We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation.
1 code implementation • 26 Apr 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval: two separate encoders contrast global video and text representations, but detailed local semantics are ignored.
Ranked #7 on Zero-Shot Video Retrieval on MSVD
2 code implementations • CVPR 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Ranked #8 on Zero-Shot Video Retrieval on MSVD
no code implementations • 10 Dec 2021 • Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell
We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
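Guidance of this kind is often implemented in the classifier-guidance style: the gradient of a (language- or image-based) similarity score shifts the predicted noise at each denoising step. A toy one-dimensional sketch of that mechanism — assumed notation, not the paper's exact formulation:

```python
def guided_ddpm_mean(x, eps_pred, grad_log_sim, scale, alpha, alpha_bar):
    """One DDPM posterior-mean step with guidance: the predicted noise is
    shifted by -sqrt(1 - alpha_bar) * scale * grad_log_sim, steering the
    sample toward inputs that score higher under the guidance signal."""
    eps = eps_pred - scale * (1 - alpha_bar) ** 0.5 * grad_log_sim
    # standard DDPM mean update using the shifted noise estimate
    return (x - (1 - alpha) / (1 - alpha_bar) ** 0.5 * eps) / alpha ** 0.5
```

With a quadratic similarity score peaked at some target value (so its gradient points from `x` toward the target), the guided mean lands closer to that target than the unguided one, which is the whole point of the guidance term.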
1 code implementation • ECCV 2020 • Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, Hongsheng Li
We propose a novel algorithm, named Open-Edit, which is the first attempt at open-domain image manipulation with open-vocabulary instructions.
1 code implementation • NeurIPS 2019 • Xihui Liu, Guojun Yin, Jing Shao, Xiaogang Wang, Hongsheng Li
Semantic image synthesis aims at generating photorealistic images from semantic layouts.
1 code implementation • ICCV 2019 • Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, Jing Shao
Text-image cross-modal retrieval is a challenging task in the field of language and vision.
Ranked #9 on Image Retrieval on Flickr30K 1K test
no code implementations • CVPR 2019 • Xihui Liu, ZiHao Wang, Jing Shao, Xiaogang Wang, Hongsheng Li
Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, locations, and interactions with surrounding regions.
no code implementations • 28 Aug 2018 • Pengze Liu, Xihui Liu, Junjie Yan, Jing Shao
Pedestrian attribute recognition has attracted much attention due to its wide applications in scene understanding and person analysis from surveillance videos.
no code implementations • ECCV 2018 • Dapeng Chen, Hongsheng Li, Xihui Liu, Yantao Shen, Zejian yuan, Xiaogang Wang
Person re-identification is an important task that requires learning discriminative visual features for distinguishing different person identities.
Ranked #24 on Text based Person Retrieval on CUHK-PEDES
no code implementations • ECCV 2018 • Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, Xiaogang Wang
The aim of image captioning is to automatically generate captions that describe image contents.
no code implementations • ICCV 2017 • Zhongdao Wang, Luming Tang, Xihui Liu, Zhuliang Yao, Shuai Yi, Jing Shao, Junjie Yan, Shengjin Wang, Hongsheng Li, Xiaogang Wang
In our vehicle ReID framework, an orientation-invariant feature embedding module and a spatial-temporal regularization module are proposed.
2 code implementations • ICCV 2017 • Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan, Xiaogang Wang
Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component for security-centric computer vision systems.
Ranked #2 on Pedestrian Attribute Recognition on RAP
1 code implementation • CVPR 2017 • Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, Xiaogang Wang
Object detection in videos has drawn increasing attention recently with the introduction of the large-scale ImageNet VID dataset.