A few-shot semantic segmentation model is typically composed of a CNN encoder, a CNN decoder and a simple classifier (separating foreground and background pixels).
Given a target age code, the generated face image is expected to be age-sensitive, as reflected by biologically plausible transformations of shape and texture, while preserving identity.
Modelling long-range contextual relationships is critical for pixel-wise prediction tasks such as semantic segmentation.
In this work, we address domain generalization with MixStyle, a plug-and-play, parameter-free module that is simply inserted into shallow CNN layers and requires no modification to training objectives.
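The statistic-mixing idea behind MixStyle can be illustrated with a minimal NumPy sketch (the function and argument names below are ours, not the paper's): instance-wise channel means and standard deviations of each sample are interpolated with those of a randomly paired sample, using a Beta-sampled weight, so style is perturbed while content is kept.

```python
import numpy as np

def mixstyle(x, alpha=0.1, rng=None):
    """Hedged sketch of MixStyle-like feature-statistic mixing.

    x: feature maps of shape (B, C, H, W). Channel-wise instance statistics
    are mixed with those of a shuffled batch, synthesising "new" styles.
    """
    rng = np.random.default_rng() if rng is None else rng
    b = x.shape[0]
    mu = x.mean(axis=(2, 3), keepdims=True)            # (B, C, 1, 1) means
    sig = x.std(axis=(2, 3), keepdims=True) + 1e-6     # (B, C, 1, 1) stds
    x_norm = (x - mu) / sig                            # style-normalised content
    lam = rng.beta(alpha, alpha, size=(b, 1, 1, 1))    # per-sample mixing weight
    perm = rng.permutation(b)                          # pair each sample with another
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix
```

Because only first- and second-order statistics are touched, the module adds no learnable parameters, matching the plug-and-play claim.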
To overcome these limitations, we propose a novel latent space factorization model, called L2M-GAN, which is learned end-to-end and effective for editing both local and global attributes.
With this meta-learning framework, our model can not only disentangle the cross-modal shared semantic content for SBIR, but can adapt the disentanglement to any unseen user style as well, making the SBIR model truly style-agnostic.
Analysis of human sketches in deep learning has advanced immensely through the use of waypoint-sequences rather than raster-graphic representations.
A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) models is the data scarcity -- model performances are largely bottlenecked by the lack of sketch-photo pairs.
This data is uniquely characterised by its existence in dual modalities of rasterized images and vector coordinate sequences.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
Ranked #1 on Layout-to-Image Generation on COCO-Stuff 128x128
11 Mar 2021 • Xingyu Jiang, Mingyang Qin, Xinjian Wei, Zhongpei Feng, Jiezun Ke, Haipeng Zhu, Fucong Chen, Liping Zhang, Li Xu, Xu Zhang, Ruozhou Zhang, Zhongxu Wei, Peiyu Xiong, Qimei Liang, Chuanying Xi, Zhaosheng Wang, Jie Yuan, Beiyi Zhu, Kun Jiang, Ming Yang, Junfeng Wang, Jiangping Hu, Tao Xiang, Brigitte Leridon, Rong Yu, Qihong Chen, Kui Jin, Zhongxian Zhao
Iron selenide (FeSe), the structurally simplest iron-based superconductor, has attracted tremendous interest in the past years.
In particular, intensive research on this topic has led to a broad spectrum of methodologies, e.g., those based on domain alignment, meta-learning, data augmentation, or ensemble learning, to name a few; and has covered various vision applications such as object recognition, segmentation, action recognition, and person re-identification.
First, data augmentations are introduced to both the support and query sets with each sample now being represented as an augmented embedding (AE) composed of concatenated embeddings of both the original and augmented versions.
Extensive experiments on four standard few-shot action benchmarks show that our method clearly outperforms previous state-of-the-art methods, with the improvement particularly significant (over 10%) on the most challenging fine-grained action recognition benchmark.
Based on this property, we can easily identify the discriminative areas of a given clean example for local perturbations.
Most recent few-shot learning (FSL) approaches are based on episodic training whereby each episode samples few training instances (shots) per class to imitate the test condition.
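The episodic sampling described above can be sketched as follows; this is an illustrative N-way K-shot sampler under our own naming, not the code of any particular FSL paper.

```python
import random

def sample_episode(labels_to_indices, n_way, k_shot, q_query, rng=None):
    """Hedged sketch of episodic training: draw n_way classes, then k_shot
    support and q_query query sample indices per class, imitating the
    few-shot test condition at training time.

    labels_to_indices: dict mapping each class label to its sample indices.
    Returns (support, query), each a list of (index, label) pairs.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(labels_to_indices), n_way)   # pick the episode's classes
    support, query = [], []
    for c in classes:
        idx = rng.sample(labels_to_indices[c], k_shot + q_query)
        support += [(i, c) for i in idx[:k_shot]]            # the "shots"
        query += [(i, c) for i in idx[k_shot:]]              # evaluation set of the episode
    return support, query
```

At each training step one such episode plays the role of a miniature few-shot task.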
Importantly, at the episode-level, two SSL-FSL hybrid learning objectives are devised: (1) The consistency across the predictions of an FSL classifier from different extended episodes is maximized as an episode-level pretext task.
With the constrained jigsaw puzzles, instead of solving them directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable but meanwhile still ensure that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels.
In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task.
Ranked #1 on Semantic Segmentation on FoodSeg103 (using extra training data)
By transferring knowledge learned from seen/previous tasks, meta learning aims to generalize well to unseen/future tasks.
However, most existing models developed for these tasks are pre-trained on general video action classification tasks.
Ranked #6 on Temporal Action Localization on ActivityNet-1.3
In this paper, we propose a novel watermark removal attack from a different perspective.
In this paper, we study a further trait of sketches that has been overlooked to date, namely that they are hierarchical in terms of levels of detail -- a person typically sketches to varying extents of detail to depict an object.
Specifically, we use our dual-branch architecture as a universal representation framework to design two sketch-specific deep models: (i) We propose a deep hashing model for sketch retrieval, where a novel hashing loss is specifically designed to accommodate both the abstract and messy traits of sketches.
The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label given an input trimmed video clip.
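The verb/noun formulation amounts to two classification heads sharing one clip representation. A minimal sketch (weights and names hypothetical):

```python
import numpy as np

def predict_verb_noun(clip_feat, W_verb, W_noun):
    """Hedged sketch of the two-head formulation: a shared clip feature
    feeds two independent linear classifiers, one over verb classes and
    one over noun classes; each head's argmax is the predicted label."""
    verb = int(np.argmax(clip_feat @ W_verb))   # (D,) @ (D, n_verbs)
    noun = int(np.argmax(clip_feat @ W_noun))   # (D,) @ (D, n_nouns)
    return verb, noun
```

The action label for the clip is then the predicted (verb, noun) pair.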
In this paper, we design Top-DP, a novel solution to optimize the differential privacy protection of decentralized image classification systems.
Specifically, we consider that under clothing changes, soft biometrics such as body shape would be more reliable.
Departing from existing alternatives, our W3 module models all three facets of video attention jointly.
Ranked #1 on Action Recognition on EPIC-KITCHENS-55
Existing few-shot learning (FSL) methods make the implicit assumption that the few target class samples are from the same domain as the source class samples.
This is achieved by having a learning objective formulated to ensure that the generated data can be correctly classified by the label classifier while fooling the domain classifier.
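The objective just described can be written as a label-classification loss minus a weighted domain-classification loss, so that minimising it both classifies correctly and fools the domain classifier. Below is a hedged numerical sketch with linear classifiers (all names and shapes are ours, for illustration only):

```python
import numpy as np

def softmax_xent(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def generator_objective(feat, W_label, y, W_domain, d, lam=1.0):
    """Sketch of the adversarial objective: minimise the label loss while
    maximising the domain classifier's loss (hence the minus sign), so the
    generated feature is class-discriminative but domain-confusing."""
    l_label = softmax_xent(feat @ W_label, y)    # want generated data classified correctly
    l_domain = softmax_xent(feat @ W_domain, d)  # want the domain classifier fooled
    return l_label - lam * l_domain
```

In practice the same effect is often implemented with a gradient reversal layer rather than an explicit negated term.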
To this end we propose OpeN-ended Centre nEt (ONCE), a detector designed for incrementally learning to detect novel class objects with few examples.
To address this problem, we propose a graph convolutional network (GCN)-based label denoising (LDN) method to remove the irrelevant images.
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user's query sketch.
Existing sketch-analysis work studies sketches depicting static objects or scenes.
However, there are currently no satisfactory solutions with strong efficiency and security in decentralized systems.
In this paper, we argue that inter-meta-task relationships should be exploited and that tasks should be sampled strategically to assist meta-learning.
Specifically, armed with a set transformer based attention module, we construct each episode with two sub-episodes without class overlap on the seen classes to simulate the domain shift between the seen and unseen classes.
The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets.
Free-hand sketches are highly illustrative, and have been widely used by humans to depict objects or stories from ancient times to the present.
Person re-identification (re-ID), which aims to re-identify people across different camera views, has been significantly advanced by deep learning in recent years, particularly with convolutional neural networks (CNNs).
An effective person re-identification (re-ID) model should learn feature representations that are both discriminative, for distinguishing similar-looking people, and generalisable, for deployment across datasets without any adaptation.
In this paper, we propose to tackle the challenging few-shot learning (FSL) problem by learning global class representations using both base and novel class training samples.
In the former, one asks whether a machine can `understand' enough about the meaning of input data to produce a meaningful but more compact abstraction.
As an instance-level recognition problem, person re-identification (ReID) relies on discriminative features, which not only capture different spatial scales but also encapsulate an arbitrary combination of multiple scales.
Ranked #7 on Person Re-Identification on CUHK03
A deep neural network is a parametrization of a multilayer mapping of signals in terms of many alternately arranged linear and nonlinear transformations.
Differentiable programming is an emerging programming paradigm which composes parameterized algorithmic components and trains them using automatic differentiation (AD).
Matrix product states (MPS), a tensor network designed for one-dimensional quantum systems, have recently been proposed for generative modeling of natural data (such as images) in terms of a `Born machine'.
To address the training data scarcity problem, our FFCSN model is trained with both meta learning and adversarial learning.
The standard approach to ZSL requires a set of training images annotated with seen class labels and a semantic descriptor for seen/unseen classes (attribute vectors are the most widely used).
In this paper, a unified approach to transfer learning is presented that addresses several source and target domain label-space and annotation assumptions with a single model.
Ranked #17 on Unsupervised Domain Adaptation on Market to Duke
This is made possible by learning a projection between a feature space and a semantic space (e.g., attribute space).
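Once such a projection is learned, zero-shot prediction reduces to a nearest-neighbour search in the semantic space. A hedged sketch (the projection matrix and class attributes below are illustrative placeholders):

```python
import numpy as np

def zsl_predict(x, W, class_attrs):
    """Sketch of projection-based ZSL inference: map a visual feature x
    into the semantic (attribute) space via a learned matrix W, then
    assign the class whose attribute vector is nearest."""
    a = W @ x                                    # projected semantic embedding
    dists = {c: np.linalg.norm(a - v) for c, v in class_attrs.items()}
    return min(dists, key=dists.get)             # nearest class attribute wins
```

Unseen classes are handled simply by including their attribute vectors in `class_attrs`, since no class-specific training images are needed at test time.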
Specifically, we assume that each synthesised data point can belong to any unseen class; and the most likely two class candidates are exploited to learn a robust projection function in a competitive fashion.
Inspired by the fact that an unseen class is not exactly `unseen' if it belongs to the same superclass as a seen class, we propose a novel inductive ZSL model that leverages superclasses as the bridge between seen and unseen classes to narrow the domain gap.
We contribute the first large-scale dataset of scene sketches, SketchyScene, with the goal of advancing research on sketch understanding at both the object and scene level.
Instead, there is a fundamental process of abstraction and iconic rendering, where overall geometry is warped and salient details are selectively included.
Most existing person re-identification (re-id) methods are unsuitable for real-world deployment for two reasons: unscalability to large population sizes, and inadaptability over time.
In this paper, we present a novel approach for translating an object photo to a sketch, mimicking the human sketching process.
Contemporary deep learning techniques have made image recognition a reasonably reliable technology.
Human free-hand sketches have been studied in various contexts including sketch recognition, synthesis and fine-grained sketch-based image retrieval (FG-SBIR).
Key to our network design is the embedding of unique characteristics of human sketch, where (i) a two-branch CNN-RNN architecture is adapted to explore the temporal ordering of strokes, and (ii) a novel hashing loss is specifically designed to accommodate both the temporal and abstract traits of sketches.
Key to effective person re-identification (Re-ID) is modelling discriminative and view-invariant factors of person appearance at both high and low semantic levels.
The iVQA task is to generate a question that corresponds to a given image and answer pair.
Video summarization aims to facilitate large-scale video browsing by producing short, concise summaries that are diverse and representative of original videos.
Ranked #3 on Supervised Video Summarization on SumMe
Person Re-identification (re-id) faces two major challenges: the lack of cross-view paired training data and learning discriminative identity-sensitive and view-invariant features in the presence of large pose variations.
Many vision problems require matching images of object instances across different domains.
Once trained, a RN is able to classify images of new classes by computing relation scores between query images and the few examples of each new class without further updating the network.
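The inference step just described can be sketched as follows; here `relation_fn` stands in for the learned relation module, and all names are ours rather than the paper's:

```python
import numpy as np

def relation_classify(query, support_by_class, relation_fn):
    """Hedged sketch of Relation Network-style inference: concatenate the
    query embedding with each class's prototype (mean of its few support
    embeddings) and score the pair with the relation module; the class
    with the highest relation score is predicted."""
    scores = {}
    for c, feats in support_by_class.items():
        proto = np.mean(feats, axis=0)                     # class prototype
        scores[c] = relation_fn(np.concatenate([query, proto]))
    return max(scores, key=scores.get)
```

Because only the support embeddings of the new classes are needed, no network update is required at test time, matching the claim above.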
With the recent renaissance of deep convolutional neural networks, encouraging breakthroughs have been achieved on supervised recognition tasks, where each class has sufficient, fully annotated training data.
Human sketches are unique in being able to capture both the spatial topology of a visual object, as well as its subtle appearance details.
Ranked #2 on Sketch-Based Image Retrieval on Chairs
Our model is able to learn deep discriminative feature representations at different scales and automatically determine the most suitable scales for matching.
In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts, ranging from objects and scenes to abstract concepts; 2) how to annotate an image with the optimal number of class labels.
We propose to model complex visual scenes using a non-parametric Bayesian model learned from weakly labelled images abundant on media sharing sites such as Flickr.
Specifically, exact decorrelation is replaced by soft decorrelation via a mini-batch based Stochastic Decorrelation Loss (SDL) to be optimised jointly with the other training objectives.
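A soft decorrelation penalty of this kind can be illustrated by penalising the off-diagonal entries of the mini-batch feature covariance; this is a hedged sketch in the spirit of SDL, not the paper's exact formulation:

```python
import numpy as np

def soft_decorrelation_loss(F):
    """Sketch of a mini-batch soft decorrelation penalty: compute the
    covariance of the feature dimensions over the batch and sum the
    squared off-diagonal entries, encouraging (but not strictly
    enforcing) decorrelated feature dimensions.

    F: (batch_size, feature_dim) feature matrix.
    """
    Fc = F - F.mean(axis=0, keepdims=True)          # centre each dimension
    cov = Fc.T @ Fc / max(F.shape[0] - 1, 1)        # batch covariance estimate
    off = cov - np.diag(np.diag(cov))               # zero out the diagonal
    return float(np.sum(off ** 2))
```

The penalty is differentiable, so it can simply be added to the other training objectives as the sentence above describes.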
In this paper, to address the two issues, we propose a two-phase framework for recognizing images from unseen fine-grained classes, i.e., zero-shot fine-grained classification.
We propose a novel and flexible approach to meta-learning for learning-to-learn from only a few examples.
Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing.
We address the problem of localisation of objects as bounding boxes in images and videos with weak labels.
Learning semantic attributes for person re-identification and description-based person search has gained increasing interest due to attributes' great potential as a pose and view-invariant representation.
Sketch-based image retrieval (SBIR) is challenging due to the inherent domain-gap between sketch and photo.
Ranked #5 on Sketch-Based Image Retrieval on Chairs
(3) Our model can be learned with a mixture of weakly labelled and unlabelled data, allowing the large volume of unlabelled images on the Internet to be exploited for learning.
Most existing approaches to training object detectors rely on fully supervised learning, which requires the tedious manual annotation of object location in a training set.
We show that with this additional reconstruction constraint, the learned projection function from the seen classes is able to generalise better to the new unseen classes.
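An objective with such a reconstruction constraint can be sketched as below (a semantic-autoencoder-style illustration under our own names, not necessarily the paper's exact loss): the projection maps features to semantics, while its transpose is required to map semantics back to features.

```python
import numpy as np

def projection_with_reconstruction(X, S, W, lam=0.5):
    """Hedged sketch of a projection objective with a reconstruction
    constraint: W projects visual features X (D x N) to semantic vectors
    S (K x N), and W.T must reconstruct X from S; lam trades off the two.
    """
    proj = np.sum((W @ X - S) ** 2)     # feature -> semantic projection error
    recon = np.sum((W.T @ S - X) ** 2)  # semantic -> feature reconstruction error
    return proj + lam * recon
```

Tying the decoder to the transpose of the encoder is what forces the projection to retain feature information, which is the intuition behind the better generalisation to unseen classes.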
Conversely, we give sufficient and necessary conditions to determine whether a TNS can be transformed into an RBM of given architectures.
Current person re-identification (re-id) methods assume that (1) pre-labelled training data are available for every camera pair, (2) the gallery size for re-identification is moderate.
Existing person re-identification models are poor for scaling up to large data required in real-world applications due to: (1) Complexity: They employ complex models for optimal performance resulting in high computational cost for training at a large scale; (2) Inadaptability: Once trained, they are unsuitable for incremental update to incorporate any new data available.
Second, a two-stepped fine-tuning strategy is developed to transfer knowledge from auxiliary datasets.
We propose a simple modification to the design pattern that makes learning more effective and efficient.
In this paper we argue that the key to making deep ZSL models succeed is to choose the right embedding space.
Ranked #4 on Zero-Shot Action Recognition on Kinetics
We investigate the problem of fine-grained sketch-based image retrieval (SBIR), where free-hand human sketches are used as queries to perform instance-level retrieval of images.
Ranked #3 on Sketch-Based Image Retrieval on Chairs
Most existing person re-identification (Re-ID) approaches follow a supervised learning framework, in which a large number of labelled matching pairs are required for training.
Most existing person re-identification (re-id) methods focus on learning the optimal distance metrics across camera views.
Ranked #87 on Person Re-Identification on Market-1501
We address a new partial person re-identification (re-id) problem, where only a partial observation of a person is available for matching across different non-overlapping camera views.
Zero-shot learning (ZSL) can be considered as a special case of transfer learning where the source and target domains have different tasks/label spaces and the target domain is unlabelled, providing little guidance for the knowledge transfer.
In real-world person re-identification (re-id), images of people captured at very different resolutions from different locations need to be matched.
The semantic manifold structure is used to redefine the distance metric in the semantic embedding space for more effective ZSL.
We propose a perceptual grouping framework that organizes image edges into meaningful structures and demonstrate its usefulness on various computer vision tasks.
When humans describe images they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively.
Recently, zero-shot learning (ZSL) has received increasing interest.
Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems.
We propose a multi-scale multi-channel deep neural network framework that, for the first time, yields sketch recognition performance surpassing that of humans.
In this paper, we propose a more principled way to identify annotation outliers by formulating the subjective visual property prediction task as a unified robust learning to rank problem, tackling both the outlier detection and learning to rank jointly.
A projection from a low-level feature space to the semantic representation space is learned from the auxiliary dataset and is applied without adaptation to the target dataset.
Specifically, in contrast to previous work, which ignores the semantic relationships between seen classes and focuses merely on those between seen and unseen classes, in this paper a novel approach based on a semantic graph is proposed to represent the relationships between all the seen and unseen classes in a semantic word space.
By oversegmenting all the images into regions, we formulate noisily tagged image parsing as a weakly supervised sparse learning problem over all the regions, where the initial labels of each region are inferred from image-level labels.
A number of computer vision problems such as human age estimation, crowd density estimation and body/face pose (view angle) estimation can be formulated as a regression problem by learning a mapping function between a high-dimensional vector-formed feature input and a scalar-valued output.