With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning.
It is therefore interesting to study how these two tasks can be coupled to benefit each other.
In this paper, we investigate pretraining with adversarial networks, with the objective of uncovering the relationship between network depth and robustness.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve performance competitive with models trained on existing, manually curated datasets.
The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model.
A common way to speed up the computation is to downsample the feature volume, but this loses high-frequency details.
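As a minimal sketch of this trade-off (assuming a PyTorch-style 4D feature volume; the tensor shapes and factors below are illustrative, not taken from the original):

import torch
import torch.nn.functional as F

# Illustrative feature volume of shape (batch, channels, height, width).
feat = torch.randn(1, 256, 128, 128)

# Downsample spatially by 2x with average pooling; per-layer compute on the
# feature map drops by roughly 4x, but high-frequency detail is discarded.
feat_ds = F.avg_pool2d(feat, kernel_size=2)   # -> (1, 256, 64, 64)

# The coarse features can later be upsampled back (e.g. bilinearly),
# but the detail lost in pooling is not recovered.
feat_up = F.interpolate(feat_ds, scale_factor=2, mode="bilinear", align_corners=False)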
Identifying the underlying directional relations from observational time series with nonlinear interactions and complex relational structures is key to a wide range of applications, yet remains a hard problem.
Learning causal relations from observational time series with nonlinear interactions and complex causal structures is a key component of human intelligence, and has a wide range of applications.
Deep residual networks (ResNets) have recently enabled a breakthrough in deep learning.
Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains.
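A brief formal sketch of this setup (the notation here is assumed for illustration, not taken from the original): let $x_1 \sim P_{X_1}$ and $x_2 \sim P_{X_2}$ be images drawn from the marginal distributions of the two domains. The goal is to recover a joint distribution $P_{X_1 X_2}$ consistent with those marginals,
\[
\int P_{X_1 X_2}(x_1, x_2)\,\mathrm{d}x_2 = P_{X_1}(x_1),
\qquad
\int P_{X_1 X_2}(x_1, x_2)\,\mathrm{d}x_1 = P_{X_2}(x_2),
\]
and since infinitely many joint distributions satisfy these constraints, the problem is ill-posed without additional assumptions.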
Extensive past experimentation has made it widely accepted that the value of k has a significant impact on the performance of this method.
State-of-the-art results in semantic segmentation are achieved by Fully Convolutional Networks (FCNs).