Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Ranked #3 on Fine-Grained Image Recognition on OVEN
Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems.
Ranked #1 on Image Classification on WebVision-1000 (using extra training data)
REVEAL consists of four key components: the memory, the encoder, the retriever and the generator.
Ranked #1 on Visual Question Answering (VQA) on A-OKVQA (Accuracy metric)
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models.
An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively.
Ranked #10 on Long-tail Learning on iNaturalist 2018
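The decoupled recipe above can be illustrated with the sampling distributions alone. The following is a minimal numpy sketch (toy label counts are invented for illustration): instance-balanced sampling, used for representation learning, lets head classes dominate, while class-balanced sampling, used when re-training the classifier, gives each class equal expected weight.

```python
import numpy as np

# Hypothetical long-tailed label set: class 0 dominates, class 2 is the tail.
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
n_classes = 3
counts = np.bincount(labels, minlength=n_classes)

# Stage 1 (representation learning): instance-balanced sampling --
# every example equally likely, so batches mirror the long tail.
p_instance = np.ones(len(labels)) / len(labels)

# Stage 2 (classifier re-training): class-balanced sampling --
# each class contributes 1/n_classes probability mass in expectation.
p_class = 1.0 / (n_classes * counts[labels])

print(p_instance[labels == 2].sum())  # 0.01  -> tail class almost never sampled
print(p_class[labels == 2].sum())     # ~0.333 -> tail class sampled equally
```

The two arrays can be passed as `weights` to any weighted sampler; only the second stage's sampling changes, not the features.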
We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images.
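One simple way to adapt stored vectors without the original images is to estimate a map between the old and new feature spaces from currently available data. This numpy sketch assumes, purely for illustration, that the drift between backbones is linear and fits it by least squares; the actual method in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: pretend the new backbone's features are an unknown linear
# transform A_true of the old backbone's features.
A_true = rng.normal(size=(8, 8))
old_feats_current = rng.normal(size=(100, 8))   # current images, old model
new_feats_current = old_feats_current @ A_true  # current images, new model

# Stored exemplars: only their OLD feature vectors survive (no images).
stored_old = rng.normal(size=(5, 8))

# Fit a linear map old -> new on the current data, then transport the
# stored vectors into the updated feature space.
A_hat, *_ = np.linalg.lstsq(old_feats_current, new_feats_current, rcond=None)
stored_new = stored_old @ A_hat
```

With enough current samples the least-squares fit recovers the transform, and the transported exemplars can be compared against new-model features directly.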
In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given.
In this work, we employ a transductive label propagation method based on the manifold assumption: we make predictions on the entire dataset, use these predictions to generate pseudo-labels for the unlabeled data, and train a deep neural network on them.
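The propagation step can be sketched with the classic graph diffusion F ← αSF + (1−α)Y on a normalized affinity matrix. This numpy toy (RBF affinity, bandwidth and α are assumed values, two Gaussian clusters with one clean label each) shows pseudo-labels spreading along the manifold:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight 2-D clusters; only one labeled point per cluster.
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
y = -np.ones(20, dtype=int)   # -1 = unlabeled
y[0], y[10] = 0, 1            # one clean label per cluster

# RBF affinity matrix (bandwidth chosen for the toy data), zero diagonal.
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-d2 / 0.5)
np.fill_diagonal(W, 0)

# Symmetric normalization S = D^{-1/2} W D^{-1/2}.
Dinv = 1.0 / np.sqrt(W.sum(1))
S = W * Dinv[:, None] * Dinv[None]

# Diffusion: F <- alpha * S @ F + (1 - alpha) * Y, from one-hot labels Y.
Y = np.zeros((20, 2))
Y[y >= 0, y[y >= 0]] = 1
F, alpha = Y.copy(), 0.9
for _ in range(50):
    F = alpha * S @ F + (1 - alpha) * Y

pseudo = F.argmax(1)  # pseudo-labels for every point, labeled or not
```

Each unlabeled point inherits the label of its cluster because within-cluster affinities dwarf the cross-cluster ones; the pseudo-labels (optionally weighted by the confidence in F) then supervise the network.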
State-of-the-art image retrieval performance is achieved with CNN features and manifold ranking using a k-NN similarity graph that is precomputed offline.
In particular, annotation errors, the size of the dataset, and the level of challenge are addressed: new annotation for both datasets is created with particular attention to the reliability of the ground truth.
Positive examples are distant points on a single manifold, while negative examples are nearby points on different manifolds.
Eliminating the impact of clutter on the image descriptor increases the chance of retrieving relevant images and, in the case of query expansion, prevents the topic drift caused by retrieving the clutter itself.
The diffusion is carried out on descriptors of overlapping image regions rather than on a global image descriptor as in previous approaches.
Experiments with standard image search benchmarks, including the Yahoo100M dataset comprising 100 million images, show that our method gives comparable (and sometimes superior) accuracy compared to exhaustive search while requiring only 10% of the vector operations and memory.
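The "fraction of vector operations" trade-off can be illustrated with a coarse inverted-file scheme, which is one standard way to avoid exhaustive search (the paper's indexing details may differ; centroids here are random database points rather than k-means, to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(1000, 16)).astype(np.float32)
query = db[42].copy()  # toy query: an exact database point

# Offline: coarse-quantize the database into K cells.
K = 20
centroids = db[rng.choice(len(db), K, replace=False)]
assign = ((db[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)

# Online: visit only the 3 cells closest to the query (non-exhaustive),
# then score exactly within that short-list.
cell_order = ((centroids - query) ** 2).sum(-1).argsort()
candidates = np.where(np.isin(assign, cell_order[:3]))[0]
best = candidates[((db[candidates] - query) ** 2).sum(-1).argmin()]

print(best)                          # 42: the true nearest neighbor
print(len(candidates) / len(db))     # fraction of the database scanned
```

Scanning 3 of 20 cells touches roughly 15% of the vectors in expectation; tuning the number of visited cells trades recall against the operation and memory budget.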
We study an indexing architecture to store and search in a database of high-dimensional vectors from the perspective of statistical signal processing and decision theory.
Our results show that the regular dense detector is outperformed by other methods in most situations, leading us to improve the state of the art in comparable setups on standard retrieval and fine-grained benchmarks.
We introduce ConceptVision, a method that aims for high accuracy in categorizing a large number of scenes while keeping the model relatively simple and efficient for scalability.