In recent years, several high-performance conversational systems have been proposed based on the Transformer encoder-decoder model.
We introduce a robust, real-time, high-resolution human video matting method that achieves new state-of-the-art performance.
Ranked #1 on Video Matting on VideoMatte240K
The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications.
We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds.
Ranked #5 on 3D Object Detection on ScanNetV2
Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images.
Annotating a qualitative large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators.
Large pre-trained language models for textual data have an unconstrained output space; at each decoding step, they can produce any of 10, 000s of sub-word tokens.
Ranked #1 on Semantic Parsing on spider
Furthermore, based on LibFewShot, we provide comprehensive evaluations on multiple benchmark datasets with multiple backbone architectures to evaluate common pitfalls and effects of different training tricks.