We present an investigation into how representational losses can affect the drawings produced by artificial agents playing a communication game.
The optimisation of neural networks can be sped up by orthogonalising the gradients before the optimisation step, ensuring the diversification of the learned representations.
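One common way to orthogonalise a gradient matrix is to project it onto the nearest orthogonal matrix via the SVD; the sketch below illustrates this idea with NumPy. It is a minimal illustration of the general technique, not the paper's implementation, and the function name is hypothetical.

```python
import numpy as np

def orthogonalise(grad):
    """Hypothetical sketch: replace a gradient matrix with the nearest
    orthogonal matrix by dropping the singular values from its SVD."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

g = np.random.randn(4, 3)       # a raw gradient for a 4x3 weight matrix
q = orthogonalise(g)            # columns of q are now orthonormal
```

After the projection, `q.T @ q` is the identity, so the update directions for the different output units no longer overlap.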
When compared to GhostNet, inference latency on the Jetson Nano is improved by 1.3x and 2x on the GPU and CPU respectively.
We find that contextual representations in language models outperform static word embeddings when the compositional chain of the object is short.
Although different paradigms of visual semantic embedding models are designed to align visual features and distributed word representations, it is unclear to what extent current ZSL models encode semantic information from distributed word representations.
The new reduced design space results in a BLEU score increase of approximately 1% for sub-optimal models from the original design space, with a wide range for performance scaling between 0.356s and 1.526s for the GPU and 2.9s and 7.31s for the CPU.
In this paper, we propose temporal early exits to reduce the computational complexity of per-frame video object detection.
Evidence that visual communication preceded written language and provided a basis for it goes back to prehistory, in forms such as cave and rock paintings depicting traces of our distant ancestors.
However, the training process of such dynamic DNNs can be costly, since platform-aware models of different deployment scenarios must be retrained to become dynamic.
The majority of work has focused on using fixed, pretrained image feature extraction networks which potentially bias the information the agents learn to communicate.
In this work we empirically show that linear disentangled representations are not generally present in standard VAE models and that they instead require altering the loss landscape to induce them.
Finally, we show that a consequence of the difference between interpolating MSDA such as MixUp and masking MSDA such as FMix is that the two can be combined to improve performance even further.
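The distinction between the two MSDA families can be made concrete with a small sketch: interpolating MSDA (MixUp) takes a convex combination of two inputs, while masking MSDA (FMix) composes them with a binary mask. The mask below is random for brevity; FMix itself samples masks from thresholded low-frequency Fourier noise, and these function names are illustrative.

```python
import numpy as np

def mixup(x1, x2, lam):
    """Interpolating MSDA: pixel-wise convex combination of two inputs."""
    return lam * x1 + (1 - lam) * x2

def mask_mix(x1, x2, mask):
    """Masking MSDA: a binary mask selects regions from x1 and the
    complementary regions from x2 (FMix samples this mask from
    low-frequency Fourier noise; here it is arbitrary)."""
    return mask * x1 + (1 - mask) * x2

x1 = np.ones((4, 4))
x2 = np.zeros((4, 4))
interp = mixup(x1, x2, lam=0.5)          # every pixel is 0.5
mask = np.zeros((4, 4)); mask[:, :2] = 1
masked = mask_mix(x1, x2, mask)          # left half 1.0, right half 0.0
```

The interpolated image blurs the two sources everywhere, whereas the masked image keeps each region unaltered, which is why the two augmentations can complement each other.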
There has been an increasing interest in the area of emergent communication between agents which learn to play referential signalling games with realistic images.
Colour vision has long fascinated scientists, who have sought to understand both the physiology of the mechanics of colour vision and the psychophysics of colour perception.
We present an extension of a variational auto-encoder that creates semantically rich, coupled probabilistic latent representations that capture the semantics of multiple modalities of data.
Representations of sets are challenging to learn because operations on sets should be permutation-invariant.
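Permutation invariance can be obtained by pooling element embeddings with a symmetric operation such as summation, as in the sketch below. This is a minimal illustration of the property the sentence describes, not any particular model from the work; the function name is hypothetical.

```python
import numpy as np

def set_embed(elements):
    """Sum-pool per-element features: summation is symmetric, so the
    result is identical for every ordering of the input set."""
    return np.sum(elements, axis=0)

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
shuffled = x[[2, 0, 1]]
# The embedding is the same regardless of element order.
assert np.allclose(set_embed(x), set_embed(shuffled))
```

Max pooling or averaging would work equally well here; the key requirement is that the pooling operation commutes with any permutation of its inputs.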
We introduce torchbearer, a model fitting library for PyTorch aimed at researchers working on deep learning or differentiable programming.
While Wikipedia exists in 287 languages, its content is unevenly distributed among them.
Visual Question Answering (VQA) models have struggled with counting objects in natural images so far.
We explore the problem of generating natural language summaries for Semantic Web data.
Our model is based on a Recurrent Neural Network (RNN) that is trained over concatenated sequences of comments, a Convolutional Neural Network (CNN) that is trained over Wikipedia sentences, and a formulation that couples the two trained embeddings in a multimodal space.