Semi- and weakly-supervised learning have recently attracted considerable attention in the object detection literature since they can alleviate the cost of annotation needed to successfully train deep learning models.
1 code implementation • 28 Mar 2022 • Gnana Praveen Rajasekar, Wheidima Carneiro de Melo, Nasib Ullah, Haseeb Aslam, Osama Zeeshan, Théo Denorme, Marco Pedersoli, Alessandro Koerich, Simon Bacon, Patrick Cardinal, Eric Granger
Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal.
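The cross-attention fusion described above can be sketched minimally as follows. This is a hypothetical illustration, not the paper's exact architecture: each modality attends to the other through a joint correlation matrix of their features (`joint_cross_attention`, `Xa`, `Xv` are illustrative names).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(Xa, Xv):
    """Cross-attend audio features Xa (Ta, d) and visual features Xv (Tv, d).

    Hedged sketch: a joint correlation matrix lets each modality weight
    the salient features of the other, as in cross-attention A-V fusion.
    """
    d = Xa.shape[1]
    C = Xa @ Xv.T / np.sqrt(d)       # joint correlation matrix (Ta, Tv)
    Ha = softmax(C, axis=1) @ Xv     # audio attends to visual features
    Hv = softmax(C.T, axis=1) @ Xa   # visual attends to audio features
    return Ha, Hv
```

The attended features `Ha` and `Hv` would then be combined (e.g., concatenated) before regressing valence and arousal.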
Unsupervised pre-training has been proven as an effective approach to boost various downstream tasks given limited labeled data.
The CNN is exploited to collect both positive and negative evidence at the pixel level to train the decoder.
This work considers semi-supervised segmentation as a dense prediction problem based on prototype vector correlation and proposes a simple way to represent each segmentation class with multiple prototypes.
Interpolation is required to restore full-size CAMs, yet it does not consider statistical properties of objects, such as color and texture, leading to activations with inconsistent boundaries and inaccurate localizations.

Pre-training a recognition model with contrastive learning on a large dataset of unlabeled data has shown great potential to boost the performance of a downstream task, e.g., image classification.
Despite their outstanding accuracy, semi-supervised segmentation methods based on deep neural networks can still yield predictions that are considered anatomically impossible by clinicians, for instance, containing holes or disconnected regions.
Recent advances in unsupervised domain adaptation have significantly improved the recognition accuracy of CNNs by alleviating the domain shift between (labeled) source and (unlabeled) target data distributions.
In this method, we maximize the MI for intermediate feature embeddings that are taken from both the encoder and decoder of a segmentation network.
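A common way to maximize MI between paired feature embeddings is through the InfoNCE lower bound. The sketch below is an assumption-laden illustration (function name and the orthonormal-embedding test are mine, and this is not necessarily the paper's exact estimator): matched encoder/decoder embeddings of the same image form positive pairs, all other pairs are negatives.

```python
import numpy as np

def infonce_mi_lower_bound(f, g, tau=0.1):
    """f, g: (N, d) L2-normalized embeddings of paired views, e.g.
    encoder and decoder features of the same image.

    InfoNCE gives a lower bound on I(f; g); maximizing it during
    training therefore maximizes mutual information. Hedged sketch.
    """
    sim = f @ g.T / tau                                   # (N, N) similarities
    logits = sim - sim.max(axis=1, keepdims=True)         # stabilize
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries are the positive (matched) pairs.
    return float(np.mean(np.diag(logp)) + np.log(f.shape[0]))
```

With perfectly matched, well-separated embeddings the bound approaches log N, its maximum for a batch of size N.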
Our attack enjoys the generality of penalty methods and the computational efficiency of distance-customized algorithms, and can be readily used for a wide set of distances.
The proposed softmax strategy offers several advantages: reduced computational complexity due to efficient clip sampling, and improved accuracy, since temporal weighting focuses on the more relevant clips during both training and inference.
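Softmax-based temporal weighting of clips can be illustrated as follows. This is a minimal sketch under my own assumptions (using each clip's maximum class score as its relevance; names like `weighted_video_score` are hypothetical), not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_video_score(clip_scores, temperature=1.0):
    """clip_scores: (K, C) per-clip class scores for one video.

    Aggregate clips with softmax weights over a per-clip relevance
    score, so the more relevant clips dominate the video-level
    prediction. Illustrative sketch.
    """
    relevance = clip_scores.max(axis=1)        # (K,) per-clip relevance
    w = softmax(relevance / temperature)       # (K,) attention weights
    return w @ clip_scores                     # (C,) video-level scores
```

Lowering `temperature` sharpens the weighting toward the single most relevant clip; raising it approaches uniform averaging.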
Moreover, to encourage predictions from different networks to be both consistent and confident, we enhance this generalized JSD loss with an uncertainty regularizer based on entropy.
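The generalized JSD between the predictions of several networks, combined with an entropy regularizer, can be sketched as below. This is a hedged illustration (the weighting `alpha` and the exact regularizer form are my assumptions): the JSD term pulls the networks' predictions toward agreement, while the entropy term pushes each prediction toward confidence.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy of probability vectors along the given axis.
    return -(p * np.log(p + eps)).sum(axis=axis)

def jsd_with_entropy_reg(preds, alpha=0.1):
    """preds: list of (N, C) probability arrays from different networks.

    Generalized JSD = H(mean of predictions) - mean of H(predictions);
    the entropy regularizer (weighted by alpha) penalizes uncertain
    predictions. Illustrative sketch, not the paper's exact loss.
    """
    m = np.mean(preds, axis=0)                                  # (N, C) mixture
    mean_ent = np.mean([entropy(p) for p in preds], axis=0)     # (N,)
    jsd = entropy(m) - mean_ent                                 # (N,) agreement term
    return float((jsd + alpha * mean_ent).mean())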
Data augmentation is a key practice in machine learning for improving generalization performance.
Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the mutual information, to which we connect several well-known pairwise losses.
To handle this new learning paradigm, we propose to include surrogate tasks that can leverage very powerful supervisory signals (derived from the data itself) for semantic feature learning.
Despite the initial belief that Convolutional Neural Networks (CNNs) are driven by shapes to perform visual recognition tasks, recent evidence suggests that texture bias in CNNs provides higher performing models when learning on large labeled training datasets.
The scarcity of labeled data often limits the application of deep learning to medical image segmentation.
Weakly supervised object localization is a challenging task in which the object of interest should be localized while learning its appearance.
Deep neural networks have now achieved state-of-the-art results in a wide range of tasks, including image classification and object detection.
The second, named Invariant Information Clustering (IIC), maximizes the mutual information between the clustering of a sample and its geometrically transformed version.
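The IIC objective can be computed directly from soft cluster assignments, as sketched below. The implementation follows the standard formulation of the IIC mutual-information objective (the function name and the symmetrization detail are illustrative).

```python
import numpy as np

def iic_mutual_info(z, zt, eps=1e-12):
    """z, zt: (N, C) soft cluster assignments of N samples and their
    geometrically transformed versions.

    Builds the empirical joint distribution over cluster pairs and
    returns the mutual information I(cluster; cluster'), which IIC
    maximizes so that transformed samples keep the same cluster.
    """
    P = z.T @ zt / z.shape[0]        # empirical joint distribution (C, C)
    P = (P + P.T) / 2.0              # symmetrize
    Pi = P.sum(axis=1, keepdims=True)   # marginal over rows
    Pj = P.sum(axis=0, keepdims=True)   # marginal over columns
    mi = P * (np.log(P + eps) - np.log(Pi + eps) - np.log(Pj + eps))
    return float(mi.sum())
```

When assignments of a sample and its transformed version always agree and clusters are balanced, the MI reaches its maximum of log C.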
Video-based emotion recognition is a challenging task because it requires distinguishing the small deformations of the human face that express emotions, while remaining invariant to the stronger visual differences between identities.
Data augmentation (DA) is fundamental against overfitting in large convolutional neural networks, especially with a limited training dataset.
An efficient strategy for weakly-supervised segmentation is to impose constraints or regularization priors on target regions.
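One widely used constraint prior on target regions is a bound on the predicted region's size. The sketch below is illustrative (the quadratic penalty form and the bound names `a`, `b` are my assumptions), showing how such a prior turns into a differentiable loss term on soft segmentation outputs.

```python
import numpy as np

def size_constraint_penalty(probs, a, b):
    """probs: (H, W) foreground probabilities from a segmentation net.

    Penalize the predicted region size (sum of probabilities) when it
    falls outside the prior bounds [a, b] in pixels; zero inside the
    bounds. A common form of constraint in weakly-supervised
    segmentation; this quadratic version is a hedged sketch.
    """
    s = probs.sum()                 # soft estimate of region size
    if s < a:
        return float((s - a) ** 2)  # region too small
    if s > b:
        return float((s - b) ** 2)  # region too large
    return 0.0
```

Added to a partial-label loss, this penalty steers predictions toward anatomically or physically plausible region sizes without pixel-level supervision.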
This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition.
Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remains an issue, especially for applications on platforms with limited resources and requiring real-time processing.
In this paper, we aim to improve the performance of semantic image segmentation in a semi-supervised setting in which training is performed with a reduced set of annotated images and additional non-annotated images.
Typically, they use multinomial logistic regression posteriors and parameter regularization, as is very common in supervised learning.
In this paper we propose a new approach for classifying the global emotion of images containing groups of people.
This paper proposes a principled information-theoretic analysis of classification for deep neural network structures, e.g., convolutional neural networks (CNNs).
In this paper, we propose a new method for generating object and action proposals in images and videos.
Learning the distribution of images in order to generate new samples is a challenging task due to the high dimensionality of the data and the highly non-linear relations that are involved.
We generate hypotheses in a sliding-window fashion over different activation layers and show that the final convolutional layers can find the object of interest with high recall but poor localization due to the coarseness of the feature maps.
However, as learning appearance and localization are two interconnected tasks, the optimization is non-convex and can easily get stuck in a poor local minimum in which the algorithm "misses" the object in some images.
In object classification, substantial work has gone into improving the low-level representation of an image by considering various aspects, such as different features, feature pooling and coding techniques, and different kernels.
Additionally, without any facial point annotation at the level of individual training images, our method can localize facial points with an accuracy similar to fully supervised approaches.