Recent document question answering models consist of two key components: the vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM) that helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers.
While text line recognition models are generally trained on large corpora of real and synthetic data, such models can still make frequent mistakes if the handwriting is inscrutable or the image acquisition process adds corruptions such as noise, blur, or compression.
In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data.
Inspired by this, we identify and formulate a new, pragmatic problem setting of NCDwF: Novel Class Discovery without Forgetting, which tasks a machine learning model to incrementally discover novel categories of instances from unlabeled data, while maintaining its performance on the previously seen categories.
Novel Class Discovery (NCD) is a learning paradigm in which a machine learning model is tasked to semantically group instances from unlabeled data by utilizing labeled instances from a disjoint set of classes.
In our work, we find evidence that these losses are insufficient for the task of scene decomposition, without also considering architectural inductive biases.
In Test-time Adaptation (TTA), given a source model, the goal is to adapt it to make better predictions for test instances from a different distribution than the source.
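One widely used TTA strategy is to adapt the source model by minimizing the entropy of its own predictions on unlabeled test inputs (in the spirit of Tent). The sketch below is a minimal illustration of that idea under assumed simplifications: a linear softmax classifier, full-batch updates, and an illustrative learning rate; it is not the specific method of any paper listed here.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def tta_entropy_step(W, X, lr=0.05):
    """One gradient-descent step on the mean prediction entropy of
    softmax(X @ W) over unlabeled test inputs X (no labels used)."""
    P = softmax(X @ W)                                   # (n, c) predictions
    logp = np.log(P + 1e-12)
    # Gradient of entropy H = -sum_j p_j log p_j w.r.t. the logits z:
    # dH/dz_k = -p_k * (log p_k - sum_j p_j log p_j)
    g_z = -P * (logp - (P * logp).sum(axis=1, keepdims=True))
    grad_W = X.T @ g_z / len(X)                          # chain rule, z = X @ W
    return W - lr * grad_W

def mean_entropy(W, X):
    P = softmax(X @ W)
    return -(P * np.log(P + 1e-12)).sum(axis=1).mean()

# Tiny demo: one adaptation step on synthetic "test" data.
rng = np.random.RandomState(0)
X_test = rng.randn(50, 4)
W_source = rng.randn(4, 3) * 0.1
W_adapted = tta_entropy_step(W_source, X_test)
```

In practice such updates are restricted to a small set of parameters (e.g. normalization statistics) to avoid collapsing the classifier, but the entropy objective itself is as above.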
Unsupervised domain adaptation aims to adapt a model learned on labeled source data to a new, unlabeled target dataset.
However, it is a critical task in many applications like environmental monitoring, where the number of labeled examples for each class is limited.
Once this correspondence is found, we can directly transfer demonstrations from one domain to the other and use them for imitation.
A recent line of work addressed this problem and proposed an algorithm that transfers knowledge to the unlabeled target domain from a single source model without requiring access to the source data.
While machine learning approaches to visual recognition offer great promise, most of the existing methods rely heavily on the availability of large quantities of labeled training data.
In this work, we propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain.
In order to cope with this issue, we introduce the problem of learning person re-identification models from videos with weak supervision.
Learning to solve complex goal-oriented tasks with sparse terminal-only rewards often requires an enormous number of samples.
We formulate a conditional random field model that encodes the context and devise an information-theoretic approach that utilizes entropy and mutual information of the nodes to compute the set of most informative queries, which are labeled by a human.
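A minimal way to see the query-selection idea is to score each unlabeled node by the Shannon entropy of the model's class-probability estimate and query the highest-entropy nodes. The sketch below shows only this informativeness scoring; the conditional random field that encodes context, and the mutual-information terms of the full method, are omitted, and the function names and budget are illustrative assumptions.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of a class-probability matrix p."""
    return -(p * np.log(p + eps)).sum(axis=1)

def select_queries(probs, budget):
    """Return indices of the `budget` nodes with highest predictive
    entropy, i.e. the nodes the model is least certain about, which
    are then labeled by a human."""
    return np.argsort(-entropy(probs))[:budget]

probs = np.array([[0.50, 0.50],    # maximally uncertain
                  [0.90, 0.10],
                  [0.99, 0.01]])   # nearly certain
queries = select_queries(probs, budget=1)
# The most uncertain node (index 0) is selected for labeling.
```

The full information-theoretic criterion additionally accounts for how much labeling one node reduces uncertainty about its neighbors, which pure per-node entropy ignores.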
The supervision is weak because, during training, we have access only to video-text pairs rather than the temporal extent of the video to which each text description relates.
We show that, despite using no human-generated trajectories and only a limited number of trajectories generated with the simulator as a model, we can achieve a speed-up of about 2-3x in the learning process.
Most activity localization methods in the literature suffer from the burden of requiring frame-wise annotations.
We exploit recent advances in generative adversarial network (GAN) architectures to account for temporal correlations and generate adversarial samples that can cause misclassification rates of over 80% for targeted activities.
Minimizing labeling effort for person re-identification in camera networks is an important problem, as most of the existing popular methods are supervised and require a large amount of manual annotation, which is tedious to acquire.
In computer vision, selecting the most informative samples from a huge pool of training data in order to learn a good recognition model is an active research problem.
We construct a graph from the unlabeled data to represent the underlying structure, such that each node represents a data point, and edges represent the inter-relationships between them.
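The graph construction described above can be sketched concretely. The snippet below builds a k-nearest-neighbor similarity graph over feature vectors, with each node a data point and edge weights given by a Gaussian kernel on Euclidean distance; the choice of k, the kernel, and its bandwidth are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def build_knn_graph(features, k=5, sigma=1.0):
    """Return a symmetric adjacency matrix in which each data point is a
    node and edges connect each point to its k nearest neighbors,
    weighted by a Gaussian kernel on squared Euclidean distance."""
    n = len(features)
    # Pairwise squared Euclidean distances via broadcasting.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    adj = np.zeros((n, n))
    for i in range(n):
        # Nearest neighbors of point i, excluding the point itself
        # (its own distance of zero sorts first).
        nbrs = np.argsort(d2[i])[1:k + 1]
        adj[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    # Symmetrize so the edges are undirected.
    return np.maximum(adj, adj.T)

# Tiny demo on synthetic features.
X = np.random.RandomState(0).randn(20, 8)
A = build_knn_graph(X, k=3)
```

The resulting adjacency matrix then represents the underlying structure of the unlabeled data, and downstream steps (e.g. label propagation or sample selection) operate on it.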