To cover language, image, and video simultaneously across different scenarios, a 3D transformer encoder-decoder framework is designed that handles videos as 3D data and also adapts to texts and images as 1D and 2D data, respectively.
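One way to picture treating text, images, and video uniformly is to lift each input into a common 3D grid of tokens. The sketch below is only an illustration under assumptions: the function names `lift_to_3d` and `grid_shape` are hypothetical, and the axis chosen for the text sequence (1 x s x 1) is an assumed layout, not something the excerpt specifies.

```python
from typing import Any, List, Tuple


def lift_to_3d(tokens: List[Any], ndim: int) -> List[List[List[Any]]]:
    """Wrap a 1D, 2D, or 3D nested list of tokens into a 3D grid.

    The axis layout here is an assumption made for illustration.
    """
    if ndim == 1:
        # Text: a sequence of s tokens becomes a 1 x s x 1 grid.
        return [[[t] for t in tokens]]
    if ndim == 2:
        # Image: an h x w grid of patches becomes h x w x 1.
        return [[[t] for t in row] for row in tokens]
    if ndim == 3:
        # Video: already an h x w x f grid, returned unchanged.
        return tokens
    raise ValueError("ndim must be 1, 2, or 3")


def grid_shape(grid: List[List[List[Any]]]) -> Tuple[int, int, int]:
    """Return the (height, width, frames) shape of a non-empty 3D grid."""
    return (len(grid), len(grid[0]), len(grid[0][0]))


text = ["a", "b", "c"]
image = [[1, 2], [3, 4]]
video = [[[1, 2], [3, 4]]]

text_shape = grid_shape(lift_to_3d(text, 1))
image_shape = grid_shape(lift_to_3d(image, 2))
video_shape = grid_shape(lift_to_3d(video, 3))
```

Once every modality shares one grid shape, a single encoder-decoder can in principle consume all three without modality-specific input code.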
Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance.
Ranked #41 on Semantic Segmentation on ADE20K
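The hypothesis above, that the general architecture matters more than the specific token mixer, can be pictured as a block whose residual skeleton stays fixed while the mixer is swapped. The following is a toy sketch under assumptions, not the authors' implementation: `metaformer_block`, `mean_pool_mixer`, `identity_mixer`, and `channel_mlp` are hypothetical names, and the "MLP" is a parameter-free ReLU standing in for a learned two-layer network.

```python
from typing import Callable, List

Tokens = List[List[float]]  # a sequence of tokens, each a feature vector


def layer_norm(x: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token to zero mean and unit variance over its features."""
    out = []
    for tok in x:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in tok])
    return out


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum, used for the residual connections."""
    return [[u + v for u, v in zip(ta, tb)] for ta, tb in zip(a, b)]


def mean_pool_mixer(x: Tokens) -> Tokens:
    """Hypothetical mixer: every token becomes the average of all tokens."""
    n = len(x)
    avg = [sum(col) / n for col in zip(*x)]
    return [avg[:] for _ in x]


def identity_mixer(x: Tokens) -> Tokens:
    """Hypothetical mixer that does no mixing across tokens."""
    return [tok[:] for tok in x]


def channel_mlp(x: Tokens) -> Tokens:
    """Toy per-token ReLU; a stand-in for a learned channel MLP."""
    return [[max(0.0, v) for v in tok] for tok in x]


def metaformer_block(x: Tokens, token_mixer: Callable[[Tokens], Tokens]) -> Tokens:
    """Fixed skeleton: norm -> mixer -> residual, then norm -> MLP -> residual."""
    x = add(x, token_mixer(layer_norm(x)))
    x = add(x, channel_mlp(layer_norm(x)))
    return x


tokens = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0]]
pooled = metaformer_block(tokens, mean_pool_mixer)
unmixed = metaformer_block(tokens, identity_mixer)
```

Only the `token_mixer` argument changes between the two calls; the normalization, residual paths, and channel MLP are shared, which is the part of the design the hypothesis credits for performance.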
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.
Ranked #1 on Multi-Object Tracking and Segmentation on BDD100K
We introduce a robust, real-time, high-resolution human video matting method that achieves new state-of-the-art performance.
Ranked #1 on Video Matting on VideoMatte240K
We find that one of the main reasons for that is the lack of an effective receptive field in both the inpainting network and the loss function.
Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput.
We investigate the applicability of the anchor-free strategy on lightweight object detection models.
Ranked #1 on Object Detection on MSCOCO