Semantic segmentation models classify pixels into a set of known ("in-distribution") visual classes.
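As a minimal illustration of this pixel-wise classification (a toy sketch, not any specific model): each pixel gets the class whose logit is highest.

```python
import numpy as np

# Toy example: per-pixel class logits for a tiny 4x4 image with
# 3 in-distribution classes (all values random, standing in for a
# real segmentation network's output).
rng = np.random.default_rng(0)
num_classes, h, w = 3, 4, 4
logits = rng.normal(size=(num_classes, h, w))

# Semantic segmentation = argmax over the class axis per pixel.
label_map = logits.argmax(axis=0)
```

The resulting `label_map` is an `(h, w)` array of class indices, one per pixel.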
Ranked #1 on Anomaly Detection on Road Anomaly (using extra training data)
On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating that Tutel is ready for end-to-end real-world model training and inference.
Furthermore, we propose a latent-mapping algorithm that converts the amateur vocal tone to a professional one in the latent space.
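To make the idea of latent mapping concrete, here is a hypothetical sketch: a small learned transform (here with random, untrained weights) that maps an "amateur" latent vector toward a "professional" one. The function name, dimensions, and weights are all assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim = 16

# Stand-ins for learned mapping parameters (random here; in the real
# system these would be trained on paired amateur/professional data).
W = rng.normal(scale=0.1, size=(latent_dim, latent_dim))
b = np.zeros(latent_dim)

def map_latent(z_amateur):
    """Map an amateur-tone latent vector into the professional-tone
    region of the latent space (toy affine map + nonlinearity)."""
    return np.tanh(z_amateur @ W + b)

z_amateur = rng.normal(size=latent_dim)
z_pro = map_latent(z_amateur)
```

The mapped latent `z_pro` would then be decoded back to audio by the model's generator.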
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining.
Ranked #1 on Zero-shot Image Retrieval on MUGE Retrieval
Instead of processing more frames at once, as most existing methods do, we propose to process videos in an online fashion and cache a "memory" at each iteration.
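A minimal sketch of this online-processing pattern (interface and feature extractor are assumptions, not the paper's architecture): the model consumes one frame at a time and keeps a bounded cache of recent features rather than a full clip.

```python
from collections import deque

class OnlineVideoModel:
    """Toy online model: per-frame inference with a cached memory."""

    def __init__(self, memory_size=2):
        # Bounded cache of past frame features, updated each iteration.
        self.memory = deque(maxlen=memory_size)

    def step(self, frame):
        feature = sum(frame) / len(frame)  # stand-in for a real encoder
        self.memory.append(feature)
        # Prediction uses the current feature plus cached memory.
        return sum(self.memory) / len(self.memory)

model = OnlineVideoModel(memory_size=2)
outputs = [model.step(f) for f in ([1, 1], [3, 3], [5, 5])]
```

The `deque(maxlen=...)` keeps memory use constant regardless of video length, which is the point of the online formulation.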
Ranked #2 on Action Anticipation on EPIC-KITCHENS-100 (using extra training data)
BotSIM adopts a layered design comprising the infrastructure layer, the adaptor layer and the application layer.
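The layered decomposition can be illustrated with a hypothetical sketch; the class names and methods below are illustrative assumptions, not the actual BotSIM API.

```python
class InfrastructureLayer:
    """Bottom layer: low-level communication with the bot platform."""
    def send(self, payload):
        return {"status": "ok", "echo": payload}

class AdaptorLayer:
    """Middle layer: translates platform-specific formats so the
    application layer stays platform-agnostic."""
    def __init__(self, infra):
        self.infra = infra
    def query(self, text):
        return self.infra.send({"utterance": text})

class ApplicationLayer:
    """Top layer: user-facing simulation features built on the adaptor."""
    def __init__(self, adaptor):
        self.adaptor = adaptor
    def simulate_turn(self, text):
        resp = self.adaptor.query(text)
        return resp["status"] == "ok"

app = ApplicationLayer(AdaptorLayer(InfrastructureLayer()))
result = app.simulate_turn("hello")
```

Each layer depends only on the one below it, so swapping a bot platform only requires a new adaptor.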
In this paper we introduce a tool called Principal Image Sections Mapping (PRISM), designed for PyTorch but easily portable to other deep learning frameworks.
Our goal is to develop a fast sampling method for DMs that uses far fewer steps while retaining high sample quality.
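The speed/quality trade-off behind few-step sampling can be illustrated with a toy analogue (an assumption for illustration, not the paper's sampler): integrating a simple "denoising" ODE dx/dt = -x with large Euler steps approximates the many-step trajectory at a fraction of the cost.

```python
import math

def sample(x0, num_steps, t_end=1.0):
    """Toy few-step sampler: Euler integration of dx/dt = -x."""
    x, dt = x0, t_end / num_steps
    for _ in range(num_steps):
        x = x + dt * (-x)  # one Euler step
    return x

fast = sample(1.0, num_steps=10)    # few, large steps
slow = sample(1.0, num_steps=1000)  # many, small steps
exact = math.exp(-1.0)              # true solution at t = 1
```

With only 10 steps the result already lands close to the exact value, mirroring how few-step DM samplers trade a small accuracy loss for a large speedup.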