Search Results for author: Sijie Zhao

Found 17 papers, 11 papers with code

Transforming Weather Data from Pixel to Latent Space

no code implementations • 9 Mar 2025 • Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Tao Han, Junchao Gong, Ran Tao, Pengfeng Xiao, Lei Bai, Wanli Ouyang

The downstream task further demonstrates that task models operating in latent space can be applied to multiple PVS at low data cost and achieve superior performance compared to models in pixel space.
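
As a rough illustration of this pixel-to-latent workflow (a minimal sketch, not the paper's implementation), the snippet below freezes a hypothetical encoder, maps weather fields to a compact latent grid, and trains a small next-step prediction head entirely in latent space; the encoder architecture, tensor shapes, and task head are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical pretrained pixel-to-latent encoder (stand-in for the paper's model);
# it maps a (B, C, H, W) weather field to a much smaller latent grid.
class ToyEncoder(nn.Module):
    def __init__(self, in_ch=4, latent_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, latent_ch, 3, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

# Small downstream task model that works purely in latent space,
# e.g. predicting the next-step latent from the current one.
task_head = nn.Conv2d(8, 8, 3, padding=1)

encoder = ToyEncoder().eval()          # frozen, only used to produce latents
opt = torch.optim.Adam(task_head.parameters(), lr=1e-3)

x_t  = torch.randn(2, 4, 128, 128)     # current weather field (dummy data)
x_t1 = torch.randn(2, 4, 128, 128)     # next-step field (dummy target)

with torch.no_grad():                  # encode once; training happens in latent space
    z_t, z_t1 = encoder(x_t), encoder(x_t1)

loss = nn.functional.mse_loss(task_head(z_t), z_t1)
loss.backward()
opt.step()
```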

StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

no code implementations • 11 Sep 2024 • Sijie Zhao, WenBo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan

This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences.

Video Inpainting
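
The usual recipe behind this kind of 2D-to-stereo conversion, which the Video Inpainting tag hints at, is to warp each frame by a depth-derived disparity and then inpaint the disoccluded holes. Below is a minimal NumPy sketch of the forward-warping step only; the disparity scaling and hole handling are illustrative assumptions rather than the paper's pipeline.

```python
import numpy as np

def warp_to_right_view(frame, disparity):
    """Forward-warp a left-view frame into a right view by shifting each pixel
    horizontally by its (rounded) disparity; unfilled pixels become holes that a
    video-inpainting model would later fill. frame: (H, W, 3), disparity: (H, W)."""
    h, w = disparity.shape
    right = np.zeros_like(frame)
    hole_mask = np.ones((h, w), dtype=bool)
    xs = np.arange(w)
    for y in range(h):
        new_x = xs - np.round(disparity[y]).astype(int)   # shift toward the right view
        valid = (new_x >= 0) & (new_x < w)
        right[y, new_x[valid]] = frame[y, xs[valid]]      # naive splat (no z-buffering)
        hole_mask[y, new_x[valid]] = False
    return right, hole_mask   # hole_mask marks disoccluded regions to inpaint

# toy usage: disparity proportional to inverse depth (the scale is an assumption)
frame = np.random.rand(240, 320, 3).astype(np.float32)
depth = np.random.rand(240, 320).astype(np.float32) + 0.1
disparity = 8.0 / depth            # hypothetical baseline/focal scaling
right_view, holes = warp_to_right_view(frame, disparity)
```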

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

1 code implementation • 3 Sep 2024 • WenBo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan

Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets.

Diversity, Monocular Depth Estimation +2
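
For clips longer than the model's single-pass capacity, a common inference pattern for video depth estimators (shown generically here, not as DepthCrafter's own stitching scheme) is to run overlapping windows and blend the overlaps; `estimate_depth` is a hypothetical stand-in for the model call.

```python
import numpy as np

def depth_for_long_video(frames, estimate_depth, window=110, overlap=25):
    """Run a fixed-capacity video depth model over an arbitrarily long clip.
    frames: (T, H, W, 3); estimate_depth(chunk) -> (t, H, W) is assumed to exist."""
    t_total = len(frames)
    depth = np.zeros(frames.shape[:3], dtype=np.float32)
    weight = np.zeros(t_total, dtype=np.float32)
    start = 0
    while start < t_total:
        end = min(start + window, t_total)
        chunk_depth = estimate_depth(frames[start:end])           # (end-start, H, W)
        w = np.ones(end - start, dtype=np.float32)
        ramp = min(overlap, end - start)
        w[:ramp] = np.linspace(0.0, 1.0, ramp, endpoint=False) + 1e-3  # fade-in over overlap
        depth[start:end] += chunk_depth * w[:, None, None]
        weight[start:end] += w
        if end == t_total:
            break
        start = end - overlap                                     # step back to overlap windows
    return depth / weight[:, None, None]

# toy usage with a dummy estimator (a real model call would go here)
frames = np.random.rand(300, 64, 64, 3).astype(np.float32)
long_depth = depth_for_long_video(frames, lambda chunk: chunk.mean(axis=-1))
```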

VegeDiff: Latent Diffusion Model for Geospatial Vegetation Forecasting

no code implementations • 17 Jul 2024 • Sijie Zhao, Hao Chen, Xueliang Zhang, Pengfeng Xiao, Lei Bai, Wanli Ouyang

By capturing the uncertainties in vegetation changes and modeling the complex influence of relevant variables, VegeDiff outperforms existing deterministic methods, providing clear and accurate forecasting results of future vegetation states.


CV-VAE: A Compatible Video VAE for Latent Generative Video Models

1 code implementation • 30 May 2024 • Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, WenBo Hu, Ying Shan

Moreover, since current diffusion-based approaches are often built on pre-trained text-to-image (T2I) models, training a video VAE without considering compatibility with existing T2I models leaves a latent-space gap between the two, and bridging that gap demands huge computational resources even when the T2I models are used as initialization.

Quantization
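
One plausible way to keep a video VAE's latents compatible with an existing T2I latent space, sketched below as a generic alignment regularizer rather than the paper's exact objective, is to pull the video VAE's per-frame latents toward those produced by the frozen image VAE; `video_vae` and `image_vae` are hypothetical modules with the stated interfaces.

```python
import torch
import torch.nn.functional as F

# Hypothetical interfaces:
#   video_vae.encode(video)  -> latents of shape (B, C, T', H', W')
#   image_vae.encode(images) -> latents of shape (B, C, H', W')
def latent_alignment_loss(video, video_vae, image_vae):
    """Generic compatibility regularizer: push the video VAE's per-frame latents
    toward the frozen image (T2I) VAE's latents so T2I-initialized downstream
    models see a familiar latent distribution. Assumes the frame count is a
    multiple of the temporal compression factor."""
    z_video = video_vae.encode(video)                      # (B, C, T', H', W')
    b, c, t, h, w = z_video.shape
    frames = video[:, :, :: video.shape[2] // t]           # pick T' reference frames
    with torch.no_grad():                                  # image VAE stays frozen
        z_image = torch.stack(
            [image_vae.encode(frames[:, :, i]) for i in range(t)], dim=2
        )                                                  # (B, C, T', H', W')
    return F.mse_loss(z_video, z_image)
```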

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

1 code implementation • 7 May 2024 • Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language.

Image Manipulation, Language Modeling +3

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

1 code implementation • 22 Apr 2024 • Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.

Image Generation

RS-Mamba for Large Remote Sensing Image Dense Prediction

1 code implementation • 3 Apr 2024 • Sijie Zhao, Hao Chen, Xueliang Zhang, Pengfeng Xiao, Lei Bai, Wanli Ouyang

RSM is specifically designed to capture the global context of remote sensing images with linear complexity, facilitating the effective processing of large VHR images.

Building change detection for remote sensing images, Change Detection +3
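
The linear-complexity claim refers to state-space-style scanning: a very-high-resolution image is flattened into a token sequence and processed by a recurrence whose cost grows linearly with the number of tokens, unlike quadratic self-attention. The toy diagonal state-space scan below illustrates that complexity argument only; it is not the RSM architecture.

```python
import torch

def linear_ssm_scan(tokens, A, B, C):
    """Minimal linear state-space recurrence over a flattened patch sequence.
    tokens: (L, D) patch features; A, B, C: (D,) diagonal parameters.
    Cost is O(L * D), so doubling the image area doubles the cost."""
    h = torch.zeros_like(tokens[0])
    outputs = []
    for x_t in tokens:                 # one step per token -> linear in sequence length
        h = A * h + B * x_t            # state update (diagonal transition)
        outputs.append(C * h)          # readout
    return torch.stack(outputs)

# toy usage on a flattened 64x64 grid of 32-dim patch features
L, D = 64 * 64, 32
tokens = torch.randn(L, D)
A = torch.full((D,), 0.95)             # decay close to 1 keeps long-range context
B = torch.ones(D)
C = torch.ones(D)
y = linear_ssm_scan(tokens, A, B, C)   # (L, D)
```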

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

1 code implementation • CVPR 2024 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep.

Time Series Time Series Forecasting

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

1 code implementation • 14 Dec 2023 • Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.

Image Captioning, In-Context Learning +4
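
To make the interleaving concrete, the sketch below packs text tokens and image embeddings into one sequence with begin/end-of-image markers, which is the general pattern the snippet describes; the special token ids, tokenizer calls, and embedding interfaces are all assumptions rather than VL-GPT's actual API.

```python
import torch

# Assumed special token ids marking where an image's continuous embeddings sit.
BOI_ID, EOI_ID = 50000, 50001   # "begin/end of image" (hypothetical values)

def build_multimodal_sequence(segments, text_tokenizer, image_tokenizer, embed_text):
    """segments: list of ("text", str) or ("image", tensor) items in document order.
    Returns one (L, D) sequence of embeddings ready for a decoder-only transformer.
    text_tokenizer (str -> list of ids), image_tokenizer (image -> (N, D) embeddings)
    and embed_text (ids -> (n, D)) are hypothetical callables."""
    parts = []
    for kind, payload in segments:
        if kind == "text":
            ids = torch.tensor(text_tokenizer(payload))
            parts.append(embed_text(ids))                          # (n, D) text embeddings
        else:  # "image"
            parts.append(embed_text(torch.tensor([BOI_ID])))       # open marker
            parts.append(image_tokenizer(payload))                 # (N, D) visual embeddings
            parts.append(embed_text(torch.tensor([EOI_ID])))       # close marker
    return torch.cat(parts, dim=0)                                 # interleaved sequence
```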

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

3 code implementations • 27 Nov 2023 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep.

 Ranked #1 on Object Detection on COCO 2017 (mAP metric)

Image Classification, Object Detection +3
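
"See wide without going deep" means a single depthwise convolution with a very large kernel already covers a broad receptive field, so fewer stacked layers are needed. The minimal block below uses a 13x13 depthwise kernel purely for illustration; it is not one of the paper's prescribed configurations.

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Toy large-kernel block: a 13x13 depthwise conv provides the wide receptive
    field, and 1x1 convs mix channels; one such block "sees" a 13x13 neighborhood
    without stacking many small-kernel layers."""
    def __init__(self, channels=64, kernel_size=13):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, 4 * channels, 1), nn.GELU(),
            nn.Conv2d(4 * channels, channels, 1),
        )

    def forward(self, x):
        return x + self.pw(self.norm(self.dw(x)))   # residual keeps training stable

x = torch.randn(1, 64, 56, 56)
y = LargeKernelBlock()(x)        # same spatial size, wide receptive field in one block
```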

Exchanging Dual Encoder-Decoder: A New Strategy for Change Detection with Semantic Guidance and Spatial Localization

1 code implementation • 19 Nov 2023 • Sijie Zhao, Xueliang Zhang, Pengfeng Xiao, Guangjun He

We build a binary change detection model based on this strategy, and then validate and compare it with 18 state-of-the-art change detection methods on six datasets in three scenarios, including intra-class change detection datasets (CDD, SYSU), single-view building change detection datasets (WHU, LEVIR-CD, LEVIR-CD+), and a multi-view building change detection dataset (NJDS).

Change Detection, Decoder +1

Making LLaMA SEE and Draw with SEED Tokenizer

1 code implementation • 2 Oct 2023 • Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.

multimodal generation
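
Turning continuous visual features into discrete tokens that an LLM can predict left-to-right typically comes down to a nearest-neighbour codebook lookup; the toy quantizer below illustrates that step in general and is not SEED's actual tokenizer, whose feature extractor and codebook are its own.

```python
import torch

def quantize_to_tokens(features, codebook):
    """Nearest-neighbour vector quantization: map each continuous feature vector to
    the index of its closest codebook entry, yielding discrete "image tokens" that
    an LLM can predict autoregressively like text.
    features: (N, D) 1D causal sequence of visual features; codebook: (K, D)."""
    dists = torch.cdist(features, codebook)      # (N, K) pairwise distances
    return dists.argmin(dim=1)                   # (N,) token ids in [0, K)

# toy usage: 32 visual features quantized against an 8192-entry codebook
features = torch.randn(32, 256)
codebook = torch.randn(8192, 256)
image_token_ids = quantize_to_tokens(features, codebook)
```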

Fisher Information Guidance for Learned Time-of-Flight Imaging

no code implementations • CVPR 2022 • Jiaqu Li, Tao Yue, Sijie Zhao, Xuemei Hu

Indirect Time-of-Flight (ToF) imaging is widely applied in practice for its advantages in cost and spatial resolution.

Distribution-Aware Adaptive Multi-Bit Quantization

no code implementations • CVPR 2021 • Sijie Zhao, Tao Yue, Xuemei Hu

In this paper, we explore the compression of deep neural networks by quantizing the weights and activations into multi-bit binary networks (MBNs).

Image Classification, Quantization
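
A standard construction for a multi-bit binary network, shown here as the common greedy residual scheme rather than the paper's distribution-aware method, approximates a weight tensor as a sum of scaled sign tensors, fitting each extra bit to the residual left by the previous ones.

```python
import torch

def greedy_multibit_binarize(w, num_bits=3):
    """Approximate w as sum_i alpha_i * b_i with b_i in {-1, +1}: each round fits one
    binary base to the remaining residual (alpha_i = mean |residual|, b_i = sign)."""
    residual = w.clone()
    alphas, bases = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0                        # avoid zero signs
        alpha = residual.abs().mean()          # optimal per-tensor scale given b = sign(residual)
        alphas.append(alpha)
        bases.append(b)
        residual = residual - alpha * b
    approx = sum(a * b for a, b in zip(alphas, bases))
    return approx, alphas, bases

w = torch.randn(256, 256)
w_q, alphas, bases = greedy_multibit_binarize(w, num_bits=3)
print((w - w_q).abs().mean())                  # reconstruction error shrinks with more bits
```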
