This paper deals with the problem of audio source separation.
We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera.
Ranked #1 on Image Generation on ARKitScenes
By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP.
Ranked #2 on Video Instance Segmentation on OVIS validation
Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios.
Ranked #1 on Zero-Shot Action Recognition on Kinetics
Neural Radiance Fields (NeRFs) have demonstrated a remarkable ability to synthesize images of 3D scenes from novel views.
Ranked #1 on Novel View Synthesis on Mip-NeRF 360
This paper builds on NAFNet, a strong and simple image restoration model, using it for single-view feature extraction and extending it with cross-attention modules that fuse features between views so that it handles binocular (stereo) inputs.
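The cross-view fusion idea can be illustrated with generic scaled dot-product cross-attention. The sketch below is a minimal illustration under stated assumptions, not the paper's module: it assumes each view's features are already flattened into a list of feature vectors, it omits the learned query/key/value projections and normalization a real module would use, and the names `cross_attention` and `fuse_views` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query row attends over all key rows."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = [[s * scale for s in row] for row in matmul(queries, transpose(keys))]
    return matmul(softmax_rows(scores), values)


def fuse_views(left: Matrix, right: Matrix) -> Tuple[Matrix, Matrix]:
    """Hypothetical fusion step: each view attends to the other view,
    and the attended context is added back to it as a residual."""
    left_ctx = cross_attention(left, right, right)
    right_ctx = cross_attention(right, left, left)
    new_left = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(left, left_ctx)]
    new_right = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(right, right_ctx)]
    return new_left, new_right


if __name__ == "__main__":
    left = [[1.0, 0.0], [0.0, 1.0]]                 # 2 feature vectors from the left view
    right = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # 3 feature vectors from the right view
    fused_left, fused_right = fuse_views(left, right)
    print(len(fused_left), len(fused_right))  # 2 3
```

The residual addition keeps each view's own features intact while mixing in information from the other view, and the output of each view keeps that view's original shape.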
Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.
It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video.
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS, and achieves the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU.
Ranked #1 on Real-Time Object Detection on COCO