4 code implementations • Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Ranked #1 on Action Recognition on RareAct
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error in discretising the continuous dynamics.
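A toy numerical illustration of that hypothesis (not the paper's actual method or models): on the bilinear min-max game V(x, y) = xy, plain simultaneous gradient descent/ascent is the explicit Euler discretisation of the continuous dynamics and spirals outward, while a higher-order integrator with smaller discretisation error stays near the continuous trajectory. The step size and step count below are illustrative assumptions.

```python
import numpy as np

# Continuous dynamics of the toy bilinear game V(x, y) = x*y:
# x descends on V, y ascends, giving the vector field f(x, y) = (-y, x),
# a pure rotation whose exact solution stays on a circle.
def f(z):
    x, y = z
    return np.array([-y, x])

def euler_step(z, h):
    # Simultaneous gradient descent/ascent = explicit Euler discretisation.
    return z + h * f(z)

def rk4_step(z, h):
    # Classic 4th-order Runge-Kutta step: much smaller integration error.
    k1 = f(z)
    k2 = f(z + 0.5 * h * k1)
    k3 = f(z + 0.5 * h * k2)
    k4 = f(z + h * k3)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

h, steps = 0.5, 100  # illustrative step size and horizon
z_euler = z_rk4 = np.array([1.0, 0.0])
for _ in range(steps):
    z_euler = euler_step(z_euler, h)
    z_rk4 = rk4_step(z_rk4, h)

# Euler's iterate diverges from the unit circle; RK4's stays close to it.
print(np.linalg.norm(z_euler), np.linalg.norm(z_rk4))
```

The divergence of the Euler iterate here is purely an artifact of discretisation: the underlying continuous dynamics are perfectly stable.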
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest.
Training generative adversarial networks requires a careful balancing of delicate adversarial dynamics.
However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech.
Generative models of natural images have progressed towards high-fidelity samples largely by leveraging scale.
Ranked #1 on Video Generation on Kinetics-600 (12 frames, 128×128)
We extensively evaluate the representation learning and generation capabilities of these BigBiGAN models, demonstrating that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation.
Ranked #10 on Contrastive Learning on imagenet-1k
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal.
Ranked #3 on Conditional Image Generation on ArtBench-10 (32×32)
9 code implementations • 27 Nov 2017 • Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, Koray Kavukcuoglu
Neural networks dominate the modern machine learning landscape, but their training and success still suffer from sensitivity to empirical choices of hyperparameters such as model architecture, loss function, and optimisation algorithm.
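In the spirit of the population-based approach this paper proposes, the following is a heavily simplified sketch of an exploit/explore loop on a toy one-parameter problem. The objective, population size, training interval, and perturbation factors are all illustrative assumptions, not the paper's configuration.

```python
import random

# Toy objective: minimise loss(w) = (w - 3)^2 with SGD; each worker's
# single hyperparameter is its learning rate.
def step(w, lr):
    grad = 2 * (w - 3.0)
    return w - lr * grad

def loss(w):
    return (w - 3.0) ** 2

random.seed(0)
# Population of (weight, learning-rate) workers with random hypers.
pop = [{"w": 0.0, "lr": 10 ** random.uniform(-3, 0)} for _ in range(8)]

for generation in range(20):
    # Train: each worker takes a few optimisation steps with its own hypers.
    for m in pop:
        for _ in range(5):
            m["w"] = step(m["w"], m["lr"])
    # Exploit: the bottom half copies weights AND hypers from the top half.
    pop.sort(key=lambda m: loss(m["w"]))
    half = len(pop) // 2
    for bad, good in zip(pop[half:], pop[:half]):
        bad["w"], bad["lr"] = good["w"], good["lr"]
        # Explore: perturb the copied hyperparameter.
        bad["lr"] *= random.choice([0.8, 1.2])

best = min(pop, key=lambda m: loss(m["w"]))
print(best["w"], best["lr"])
```

The key design point is that hyperparameters are adapted online, during a single training run, rather than fixed per run as in grid or random search.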
Over the past three years Pinterest has experimented with several visual search and recommendation services, including Related Pins (2014), Similar Looks (2015), Flashlight (2016) and Lens (2017).
The ability of the Generative Adversarial Networks (GANs) framework to learn generative models mapping from simple latent distributions to arbitrarily complex data distributions has been demonstrated empirically, with compelling results showing that the latent space of such generators captures semantic variation in the data distribution.
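One common way to probe the semantic structure of such a latent space is to interpolate between latent vectors and inspect the generator's outputs along the path. Below is a minimal sketch of spherical interpolation (slerp), a widely used heuristic for Gaussian latents; the trained generator itself is omitted, and the latent dimensionality is an illustrative assumption.

```python
import numpy as np

def slerp(z1, z2, t):
    # Spherical interpolation between latent vectors: follows the arc
    # between them rather than the straight chord, which better respects
    # the geometry of a high-dimensional Gaussian prior.
    z1n = z1 / np.linalg.norm(z1)
    z2n = z2 / np.linalg.norm(z2)
    omega = np.arccos(np.clip(np.dot(z1n, z2n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * z1 + t * z2  # fall back to lerp for parallel vectors
    return (np.sin((1 - t) * omega) * z1 + np.sin(t * omega) * z2) / np.sin(omega)

rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal(128), rng.standard_normal(128)

# Eight latents along the arc; feeding each to a trained generator G
# would produce a sequence of semantically smooth samples.
path = [slerp(z1, z2, t) for t in np.linspace(0.0, 1.0, 8)]
```

By construction the path recovers the original latents at its endpoints.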
In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s).
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself.
Convolutional Neural Networks spread through computer vision like wildfire, impacting almost all visual tasks imaginable.
We demonstrate that, given the availability of distributed computation platforms such as Amazon Web Services and open-source tools, it is possible for a small engineering team to build, launch and maintain a cost-effective, large-scale visual search system using widely available tools.
Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.
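A structural sketch of such a frames-to-words pipeline: an encoder LSTM consumes per-frame feature vectors, and a decoder LSTM then emits words one at a time. The cell here is untrained with random weights, and the vocabulary, dimensions, and feature inputs are all illustrative assumptions, so this shows only the data flow, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

class LSTMCell:
    """Minimal LSTM cell: gates computed from [input; hidden]."""
    def __init__(self, in_dim, hid):
        self.W = rng.standard_normal((4 * hid, in_dim + hid)) * 0.1
        self.b = np.zeros(4 * hid)

    def __call__(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        sig = lambda a: 1.0 / (1.0 + np.exp(-a))
        c = sig(f) * c + sig(i) * np.tanh(g)   # update cell state
        h = sig(o) * np.tanh(c)                # new hidden state
        return h, c

frame_feats = rng.standard_normal((10, 64))   # stand-in for 10 frames of CNN features
vocab = ["<bos>", "<eos>", "a", "man", "plays", "guitar"]  # toy vocabulary
embed = rng.standard_normal((len(vocab), 32)) * 0.1
W_out = rng.standard_normal((len(vocab), 128)) * 0.1

enc = LSTMCell(64, 128)
dec = LSTMCell(32, 128)

h = c = np.zeros(128)
for x in frame_feats:              # encode the frame sequence
    h, c = enc(x, h, c)

words, tok = [], 0                 # decode greedily, starting from <bos>
for _ in range(8):
    h, c = dec(embed[tok], h, c)
    tok = int(np.argmax(W_out @ h))
    if vocab[tok] == "<eos>":
        break
    words.append(vocab[tok])
print(" ".join(words))
```

In a trained system the encoder's final state summarises the clip, and the decoder is conditioned on it to produce the sentence.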
Solving the visual symbol grounding problem has long been a goal of artificial intelligence.
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.
Ranked #3 on Human Interaction Recognition on BIT
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories.
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts.
Ranked #62 on Fine-Grained Image Classification on CUB-200-2011
The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
In other words, are deep CNNs trained on large amounts of labeled data as susceptible to dataset bias as previous methods have been shown to be?
We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset.
Ranked #27 on Object Detection on PASCAL VOC 2007 (using extra training data)
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks.
Images seen during test time are often not from the same distribution as images used for learning.
Most successful object classification and detection methods rely on classifiers trained on large labeled datasets.
We present an algorithm that learns representations which explicitly compensate for domain mismatch and which can be efficiently realized as linear classifiers.
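The snippet above does not spell out the algorithm, so as an illustration of the general idea (compensating for domain mismatch with a linear correction) here is a sketch using a different, simple technique: CORAL-style second-order moment alignment followed by a linear nearest-class-mean classifier. This is not the paper's method; the synthetic data and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class source data; the target domain is the same data under
# an affine covariate shift (scaled and translated).
Xs = rng.standard_normal((200, 2)) + np.repeat([[0, 0], [3, 3]], 100, axis=0)
ys = np.repeat([0, 1], 100)
Xt = Xs * 1.5 + 2.0

def coral(Xs, Xt, eps=1e-5):
    # Whiten source covariance, then re-colour with target covariance,
    # so the aligned source matches the target's first two moments.
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    def msqrt(C, inv=False):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** (-0.5 if inv else 0.5)) @ vecs.T
    Xs_c = Xs - Xs.mean(0)
    return (Xs_c @ msqrt(Cs, inv=True) @ msqrt(Ct)) + Xt.mean(0)

Xs_aligned = coral(Xs, Xt)

# Linear classifier (nearest class mean) fit on the aligned source features,
# then applied directly to target-domain points.
means = np.array([Xs_aligned[ys == k].mean(0) for k in (0, 1)])
pred = np.argmin(((Xt[:, None, :] - means) ** 2).sum(-1), axis=1)
acc = (pred == ys).mean()
print(acc)
```

Without the alignment step, the source class means sit in the wrong part of target feature space and the same linear classifier fails on one of the classes.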