VRT: A Video Restoration Transformer

Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Unlike single-image restoration, video restoration generally requires exploiting temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this with either a sliding-window strategy or a recurrent architecture; the former is restricted to frame-by-frame restoration and the latter lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted in every other layer. In addition, parallel warping further fuses information from neighboring frames by warping their features in parallel. Experimental results on five tasks (video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution) demonstrate that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets.
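The clip partitioning, the shift in alternating layers, and mutual attention described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`partition_clips`, `mutual_attention`), the clip size of 2, the cyclic wrap-around of the shift, and the use of identity projections in place of learned query/key/value weights are all assumptions made for illustration.

```python
import numpy as np


def partition_clips(frames, clip_size=2, shift=False):
    """Split a (T, N, C) token sequence into (T // clip_size, clip_size, N, C) clips.

    With shift=True the frames are first rolled by clip_size // 2 along the
    temporal axis (cyclic wrap-around is an assumption), so clip boundaries
    fall between different frames than in the unshifted layer.
    """
    t = frames.shape[0]
    if t % clip_size != 0:
        raise ValueError("number of frames must be divisible by clip_size")
    if shift:
        frames = np.roll(frames, -(clip_size // 2), axis=0)
    return frames.reshape(t // clip_size, clip_size, *frames.shape[1:])


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mutual_attention(reference, support):
    """Attend from reference-frame tokens (queries) to support-frame tokens.

    reference, support: arrays of shape (N, C). Hypothetical simplification:
    queries, keys and values are the raw features (no learned projections).
    """
    d = reference.shape[-1]
    weights = softmax(reference @ support.T / np.sqrt(d))  # (N, N)
    return weights @ support  # support features gathered per reference token


# Toy sequence: 6 frames, 4 tokens per frame, 8 channels; every value in a
# frame equals its frame index so the grouping is easy to read.
T, N, C = 6, 4, 8
frames = np.arange(T, dtype=float).reshape(T, 1, 1) * np.ones((T, N, C))

plain = partition_clips(frames, clip_size=2)
shifted = partition_clips(frames, clip_size=2, shift=True)
print(plain[:, :, 0, 0])    # frame indices grouped per clip, unshifted layer
print(shifted[:, :, 0, 0])  # frame indices grouped per clip, shifted layer

# Within one clip, each frame attends to the other one.
fused = mutual_attention(plain[0, 0], plain[0, 1])
print(fused.shape)
```

In the unshifted layer, frames 0 and 1 share a clip; after the shift, frame 1 is grouped with frame 2. Stacking both kinds of layers is what lets information propagate across clip boundaries.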

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Denoising | DAVIS sigma10 | VRT | PSNR | 40.82 | #1 |
| Video Denoising | DAVIS sigma20 | VRT | PSNR | 38.15 | #1 |
| Video Denoising | DAVIS sigma30 | VRT | PSNR | 36.52 | #1 |
| Video Denoising | DAVIS sigma40 | VRT | PSNR | 35.32 | #1 |
| Video Denoising | DAVIS sigma50 | VRT | PSNR | 34.36 | #1 |
| Deblurring | DVD | VRT | PSNR | 34.27 | #1 |
| Deblurring | GoPro | VRT | PSNR | 34.81 | #1 |
| | | | SSIM | 0.9724 | #1 |
| Video Super-Resolution | MSU Super-Resolution for Video Compression | VRT + uavs3e | BSQ-rate over ERQA | 6.619 | #30 |
| | | | BSQ-rate over Subjective Score | 2.511 | #33 |
| | | | BSQ-rate over VMAF | 1.425 | #29 |
| | | | BSQ-rate over PSNR | 5.862 | #29 |
| | | | BSQ-rate over MS-SSIM | 1.982 | #37 |
| | | | BSQ-rate over LPIPS | 4.003 | #31 |
| Video Super-Resolution | MSU Super-Resolution for Video Compression | VRT + x265 | BSQ-rate over ERQA | 8.92 | #40 |
| | | | BSQ-rate over Subjective Score | 2.023 | #28 |
| | | | BSQ-rate over VMAF | 1.217 | #21 |
| | | | BSQ-rate over PSNR | 6.634 | #37 |
| | | | BSQ-rate over MS-SSIM | 1.257 | #24 |
| | | | BSQ-rate over LPIPS | 11.329 | #60 |
| Video Super-Resolution | MSU Super-Resolution for Video Compression | VRT + vvenc | BSQ-rate over ERQA | 18.333 | #74 |
| | | | BSQ-rate over Subjective Score | 2.235 | #29 |
| | | | BSQ-rate over VMAF | 0.652 | #2 |
| | | | BSQ-rate over PSNR | 5.777 | #25 |
| | | | BSQ-rate over MS-SSIM | 0.836 | #13 |
| | | | BSQ-rate over LPIPS | 11.496 | #63 |
| Video Super-Resolution | MSU Super-Resolution for Video Compression | VRT + aomenc | BSQ-rate over ERQA | 12.289 | #50 |
| | | | BSQ-rate over Subjective Score | 2.631 | #34 |
| | | | BSQ-rate over VMAF | 1.733 | #39 |
| | | | BSQ-rate over PSNR | 10.075 | #53 |
| | | | BSQ-rate over MS-SSIM | 2.797 | #43 |
| | | | BSQ-rate over LPIPS | 4.429 | #37 |
| Video Super-Resolution | MSU Super-Resolution for Video Compression | VRT + x264 | BSQ-rate over ERQA | 1.578 | #8 |
| | | | BSQ-rate over Subjective Score | 1.245 | #17 |
| | | | BSQ-rate over VMAF | 0.7 | #7 |
| | | | BSQ-rate over PSNR | 1.09 | #7 |
| | | | BSQ-rate over MS-SSIM | 0.662 | #3 |
| | | | BSQ-rate over LPIPS | 1.259 | #12 |
| Video Super-Resolution | MSU Video Super Resolution Benchmark: Detail Restoration | VRT | Subjective score | 7.628 | #1 |
| | | | ERQAv1.0 | 0.758 | #1 |
| | | | QRCRv1.0 | 0.722 | #1 |
| | | | SSIM | 0.902 | #1 |
| | | | PSNR | 31.669 | #1 |
| | | | FPS | 2.778 | #4 |
| | | | 1 - LPIPS | 0.929 | #4 |
| Deblurring | REDS | VRT | Average PSNR | 36.79 | #1 |
| Video Denoising | Set8 sigma10 | VRT | PSNR | 37.88 | #1 |
| Video Denoising | Set8 sigma20 | VRT | PSNR | 35.02 | #1 |
| Video Denoising | Set8 sigma30 | VRT | PSNR | 33.35 | #1 |
| Video Denoising | Set8 sigma40 | VRT | PSNR | 32.15 | #1 |
| Video Denoising | Set8 sigma50 | VRT | PSNR | 31.22 | #1 |
| Video Super-Resolution | UDM10 - 4x upscaling | VRT | PSNR | 41.05 | #1 |
| | | | SSIM | 0.9737 | #1 |
| Video Super-Resolution | Vid4 - 4x upscaling | VRT | PSNR | 27.93 | #1 |
| | | | SSIM | 0.8400 | #1 |
| Video Frame Interpolation | Vid4 - 4x upscaling | VRT | PSNR | 27.46 | #1 |
| | | | SSIM | 0.8392 | #1 |
| | | | Parameters | 4450000 | #5 |
| Space-time Video Super-resolution | Vimeo90K-Fast | VRT | PSNR | 36.98 | #1 |
| | | | SSIM | 0.9439 | #1 |
| Space-time Video Super-resolution | Vimeo90K-Medium | VRT | PSNR | 36.01 | #1 |
| | | | SSIM | 0.9434 | #4 |

Methods