The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining.
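The contrastive objective behind CLIP can be illustrated with a short sketch. This is a minimal NumPy version of a symmetric InfoNCE loss over paired image/text embeddings, not CLIP's actual implementation (which uses learned encoders and a learnable temperature); the function name and fixed `temperature` are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Minimal sketch of CLIP-style contrastive learning: matching pairs sit
    on the diagonal of the similarity matrix, and each row/column is
    treated as a softmax classification over the batch.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise logits: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(logits.shape[0])

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetric loss: image-to-text plus text-to-image.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

With perfectly aligned embeddings the loss is near zero; shuffling the text side drives it up, which is the signal the encoders are trained against.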
Ranked #1 on Zero-shot Image Retrieval on MUGE Retrieval
Language-guided image editing has recently achieved great success.
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
Ranked #17 on Action Classification on Kinetics-600 (using extra training data)
MIC significantly improves state-of-the-art performance across different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA.
Ranked #1 on Domain Adaptation on GTA5 to Cityscapes
Finally, we show promising results on generalization to different PDE parameters and time-scales with a single surrogate model.
In this paper, a data-driven and local-verification (D$^2$LV) approach is proposed to compete in the Image Similarity Challenge: Matching Track at NeurIPS'21.
In this paper, we propose Progressive Volume Distillation (PVD), a systematic distillation method that allows any-to-any conversions between different architectures, including MLP, sparse or low-rank tensors, hashtables and their compositions.
Ranked #1 on Novel View Synthesis on NeRF (Average PSNR metric)
Relative positional embeddings (RPE) have received considerable attention because they effectively model the relative distances among tokens and enable length extrapolation.
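The mechanism can be sketched as an additive bias on attention scores. The following is a hypothetical minimal NumPy example of T5-style relative-position bias (a learned scalar looked up by the clipped distance j - i), not the specific RPE variant any one paper proposes; the function name and clipping scheme are assumptions for illustration.

```python
import numpy as np

def attention_with_rpe(q, k, v, rel_bias):
    """Scaled dot-product attention with an additive relative-position bias.

    `rel_bias` is a 1-D table of length 2 * max_dist + 1; entry for the
    clipped relative distance j - i is added to the attention score.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)

    # Relative distance between query i and key j, clipped to the table.
    max_dist = (len(rel_bias) - 1) // 2
    idx = np.arange(n)
    rel = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    scores = scores + rel_bias[rel]

    # Softmax over keys, then weighted sum of values.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v
```

Because the bias depends only on the (clipped) relative distance, the same table applies unchanged to sequences longer than those seen in training, which is what makes length extrapolation possible.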
In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm.
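The simplest masking strategy such frameworks build on is uniform random masking. Below is a minimal NumPy sketch of MAE-style random masking, assuming a generic `tokens` array; it is an illustration of the general idea, not the specific paradigm the paper proposes.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Uniform random masking as used in MAE-style pretraining.

    Shuffle token indices, keep the first (1 - mask_ratio) fraction,
    and return the visible tokens plus the boolean mask needed to
    restore order for reconstruction.
    """
    rng = rng or np.random.default_rng()
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))

    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of visible tokens

    mask = np.ones(n, dtype=bool)       # True = masked out
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask
```

The encoder sees only the visible tokens; the decoder (or a simpler head) is then asked to infer the masked positions recorded in `mask`.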