Identifying the worries of individuals and societies plays a crucial role in providing social support and enhancing policy decision-making.
We propose a novel approach to infer the network structure for DQN models operating with high-dimensional continuous actions.
The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases.
Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system.
Existing CL methods usually reduce forgetting with task priors, i.e., using task identity or a subset of previously seen samples for model training.
To mitigate this issue, we propose a low-rank (LoRA) branch that disentangles RFT into two distinct components: optimizing natural objectives via the LoRA branch and adversarial objectives via the FE.
Our work takes a novel approach to address these challenges in graph unlearning through knowledge distillation, as it distills to delete in GNN (D2DGN).
Learning a versatile language-image model is computationally prohibitive under a limited computing budget.
The teacher model first extracts rich modality features from the generic modality feature by considering both the semantic information of items and the complementary information of multiple modalities.
The proposed method, named Mixture of Depth and Point cloud video experts (DPMix), achieved first place in the 4D Action Segmentation Track of the HOI4D Challenge 2023.
Training an effective video action recognition model poses significant computational challenges, particularly under limited resource budgets.
The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions.
The existing deepfake detection methods have reached a bottleneck in generalizing to unseen forgeries and manipulation approaches.
Human-Object Interaction Detection is a crucial aspect of human-centric scene understanding, with important applications in various domains.
Our research focuses on the novel application of a differentiable logic loss during training to leverage the co-occurrence relations between verbs and nouns, and on using pre-trained Large Language Models (LLMs) to generate logic rules for adaptation to unseen action labels.
Experience Replay (ER) is a simple and effective rehearsal-based strategy, which optimizes the model with current training data and a subset of old samples stored in a memory buffer.
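The ER strategy above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the buffer uses reservoir sampling (a common choice for ER, assumed here), and `er_batch` simply concatenates the current batch with replayed samples.

```python
# Minimal sketch of Experience Replay: keep a bounded memory buffer of old
# samples and mix a subset of them into each training batch.
import random

class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling keeps a uniform random subset of the stream.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))

def er_batch(buffer, current_batch, replay_size):
    # ER optimizes the model on current data plus a subset of stored old samples.
    return list(current_batch) + buffer.sample(replay_size)
```

In an actual continual-learning loop, `er_batch` would feed the optimizer at every step while `add` keeps the buffer updated with incoming samples.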
In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) Fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset.
Thanks to the proposed fusion module, our method is robust not only to occlusion and large pitch and roll view angles, which is the benefit of our image space approach, but also to noise and large yaw angles, which is the benefit of our model space method.
Ranked #1 on 3D Face Reconstruction on AFLW2000-3D (Mean NME metric)
To improve transferability, the existing work introduced the standard invariant regularization (SIR) to impose the style-independence property on SCL, which can exempt the impact of nuisance style factors in the standard representation.
Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks.
Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning.
To overcome these two challenges, we propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability for both point cloud and natural language queries.
In the last few years, there have been notable developments in machine unlearning to remove the information of certain training data efficiently and effectively from ML models.
At the item level, a synthetic data generation module is proposed to generate a synthetic item corresponding to the selected item based on the user's preferences.
To quantitatively study the object bias problem, we advocate a new protocol for evaluating model performance.
2) An insufficient number of distant interactions in benchmark datasets results in under-fitting on these instances.
On the other hand, multi-modal implicit knowledge for knowledge-based VQA still remains largely unexplored.
As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven's Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence.
Then, a spatial convolution is employed to capture the local structure of points in the 3D space, and a temporal convolution is used to model the dynamics of the spatial regions along the time dimension.
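The spatial-then-temporal scheme above can be sketched concretely. This is an illustrative toy, assuming per-frame point features are already available: `spatial_conv` aggregates each point's k nearest neighbours in 3D (a simple average stands in for a learned kernel), and `temporal_conv` applies a 1D convolution over the time axis of the resulting per-frame features.

```python
# Minimal sketch: spatial aggregation over 3D neighbourhoods, followed by
# a 1D convolution over time. Shapes and the averaging kernel are illustrative.
import numpy as np

def spatial_conv(points, feats, k=4):
    """points: (N, 3), feats: (N, C) -> (N, C) locally aggregated features."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours (incl. self)
    return feats[idx].mean(axis=1)          # simple average stands in for the conv

def temporal_conv(frame_feats, kernel):
    """frame_feats: (T, C), kernel: (K,) -> (T-K+1, C) temporal responses."""
    T, C = frame_feats.shape
    K = len(kernel)
    return np.stack([frame_feats[t:t + K].T @ kernel for t in range(T - K + 1)])
```

In a trained model both steps would use learned weights; here the fixed kernels only show where spatial and temporal modeling enter the pipeline.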
It enables the removal of a certain set or class of data from an already trained ML model without requiring retraining from scratch.
However, existing methods ignore the fact that different modalities contribute differently towards a user's preference on various factors of an item.
Given that our framework is model-agnostic, we apply it to the existing popular baselines and validate its effectiveness on the benchmark dataset.
On four datasets covering the above three tasks, our method yields remarkable performance improvements over the baselines, demonstrating its superiority in reducing the modality bias problem.
Furthermore, we theoretically find that the adversary can also degrade the lower bound of a TST's test power, which enables us to iteratively minimize the test criterion in order to search for adversarial pairs.
To explore these issues, we formulate a new semi-supervised continual learning method, which can be generically applied to existing continual learning models.
In the case of machine learning (ML) applications, this necessitates deletion of data not only from storage archives but also from ML models.
In this paper, we propose an unsupervised domain adaptation method for deep point cloud representation learning.
Raven's Progressive Matrices (RPM) is highly correlated with human intelligence, and it has been widely used to measure the abstract reasoning ability of humans.
Motivated by scenarios where data is used for diverse prediction tasks, we study whether fair representation can be used to guarantee fairness for unknown tasks and for multiple fairness notions simultaneously.
The concept module generates semantically meaningful features for primitive concepts, whereas the visual module extracts visual features for attributes and objects from input images.
A recent adversarial training (AT) study showed that the number of projected gradient descent (PGD) steps to successfully attack a point (i.e., find an adversarial example in its proximity) is an effective measure of the robustness of this point.
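The step-count measure above can be sketched as follows. The model, gradient oracle, step size, and budget are all assumptions for illustration; in practice this runs PGD against a trained network's loss.

```python
# Minimal sketch: score a point's robustness by the number of PGD steps
# needed to flip its prediction (fewer steps = less robust).
import numpy as np

def pgd_steps_to_flip(x, label, grad_fn, predict_fn,
                      eps=0.5, alpha=0.1, max_steps=20):
    """Return the first step at which the prediction flips, or
    max_steps + 1 if the attack fails within the budget."""
    x_adv = x.copy()
    for step in range(1, max_steps + 1):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, label))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)                # project to the eps-ball
        if predict_fn(x_adv) != label:
            return step
    return max_steps + 1
```

With a toy linear classifier, a point close to the decision boundary flips in a few steps while a point deep inside its class survives the whole budget, which is exactly the robustness ordering the measure is meant to capture.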
In this work, we take a step towards training robust models for cross-domain pose estimation task, which brings together ideas from causal representation learning and generative adversarial networks.
Hyperspectral compressive imaging takes advantage of compressive sensing theory to achieve coded aperture snapshot measurement without temporal scanning, and the entire three-dimensional spatial-spectral data is captured by a two-dimensional projection during a single integration period.
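The snapshot measurement described above can be sketched numerically. This is a simplified CASSI-style forward model under stated assumptions: the coded aperture is a 2D mask, and the disperser shifts each spectral band by one pixel before the bands are summed into a single 2D projection.

```python
# Minimal sketch of a coded-aperture snapshot measurement: mask the 3D
# spatial-spectral cube, shift each band by its dispersion offset, and
# sum all bands into one 2D projection.
import numpy as np

def cassi_measure(cube, mask):
    """cube: (H, W, B) spatial-spectral data, mask: (H, W) coded aperture.
    Returns a single (H, W + B - 1) 2D snapshot."""
    H, W, B = cube.shape
    y = np.zeros((H, W + B - 1))
    for b in range(B):
        y[:, b:b + W] += cube[:, :, b] * mask   # mask, shift by b, accumulate
    return y
```

The reconstruction problem is the inverse of this map: recover the full cube from the single 2D snapshot given the known mask.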
A new model is trained with these labels to generalize reliably despite the label noise.
no code implementations • 22 Sep 2020 • Konstantinos Nikolaidis, Stein Kristiansen, Thomas Plagemann, Vera Goebel, Knut Liestøl, Mohan Kankanhalli, Gunn Marit Traaen, Britt Øverland, Harriet Akre, Lars Aakerøy, Sigurd Steinshamn
In this work, we present an approach for unsupervised domain adaptation (DA) with the constraint that the labeled source data are not directly available, and instead only access to a classifier trained on the source data is provided.
1 code implementation • 21 Sep 2020 • Konstantinos Nikolaidis, Stein Kristiansen, Thomas Plagemann, Vera Goebel, Knut Liestøl, Mohan Kankanhalli, Gunn Marit Traaen, Britt Øverland, Harriet Akre, Lars Aakerøy, Sigurd Steinshamn
We use sleep monitoring data from both an open and a large closed clinical study and evaluate whether (1) end-users can create and successfully use customized classification models for sleep apnea detection, and (2) the identity of participants in the study is protected.
We argue that the key to Byzantine detection is monitoring of gradients of the model parameters of clients.
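The gradient-monitoring idea above can be sketched with a simple outlier rule. The flagging criterion here (distance to the coordinate-wise median, scored with a robust z-score) is an illustrative stand-in, not the paper's exact detection test.

```python
# Minimal sketch of Byzantine detection by monitoring client gradients:
# clients whose reported update lies far from the coordinate-wise median
# of all updates get flagged.
import numpy as np

def flag_byzantine(client_grads, z_thresh=2.5):
    """client_grads: (n_clients, dim) reported gradients.
    Returns indices of flagged clients."""
    g = np.asarray(client_grads, dtype=float)
    center = np.median(g, axis=0)                        # robust center
    dists = np.linalg.norm(g - center, axis=1)           # distance per client
    mad = np.median(np.abs(dists - np.median(dists))) + 1e-12
    scores = (dists - np.median(dists)) / mad            # robust z-score
    return [i for i, s in enumerate(scores) if s > z_thresh]
```

Because the median tolerates a minority of arbitrary values, a few malicious clients cannot shift the center enough to hide their own deviation.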
When the federated learning is adopted among competitive agents with siloed datasets, agents are self-interested and participate only if they are fairly rewarded.
We demonstrate the effectiveness of the proposed model on two different large-scale and publicly available datasets, YFCC100M and NUS-WIDE.
In this paper, based on three subjective experiments on a novel image dataset, we corroborate that objects in natural images are inherently perceived to have varying levels of importance.
Adversarial training based on the minimax formulation is necessary for obtaining adversarial robustness of trained models.
Therefore, it is important to develop algorithms that can leverage off-the-shelf labeled dataset to learn useful knowledge for the target task.
To enable research in this direction, we introduce 360Action, the first omnidirectional video dataset for multi-person action recognition.
To overcome this limitation, we propose a novel mask transfer network (MTN), which can greatly boost the processing speed of VOS and also achieve a reasonable accuracy.
Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how those actions are executed with prior knowledge, just like the logical manner of thinking by humans.
To tackle this problem, in this paper, we propose a novel Multimodal Attentive Metric Learning (MAML) method to model user diverse preferences for various items.
Supervised machine learning applications in the health domain often face the problem of insufficient training datasets.
Benefiting from the advancement of computer vision, natural language processing and information retrieval techniques, visual question answering (VQA), which aims to answer questions about an image or a video, has received a lot of attention over the past few years.
In addition, analysis of the intra-class compactness and inter-class separability demonstrates the advantages of the proposed function over the softmax function, which is consistent with the performance improvement.
Advertisements (ads) often contain strong affective content to capture viewer attention and convey an effective message to the audience.
Our analytical studies reveal that the step factor h in the Euler method is able to control the robustness of ResNet in both its training and generalization.
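The Euler view of ResNet behind this claim can be written out directly: a residual block computes x_{t+1} = x_t + h·f(x_t), so the step factor h scales the residual branch. The sketch below uses a toy smooth map f rather than a trained layer.

```python
# Minimal sketch of the ResNet-as-Euler-discretization view: each
# residual block is one Euler step x <- x + h * f(x) with step factor h.
import numpy as np

def euler_resnet_forward(x0, f, h, n_blocks):
    """Stack n_blocks residual blocks, each applying one Euler step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_blocks):
        x = x + h * f(x)    # one residual block == one Euler step
    return x
```

With f(x) = -x this reproduces the exact discretization (1 - h)^n of a decaying ODE, which makes it easy to see how shrinking h damps the change contributed by each block.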
Despite the success of deep neural networks (DNNs) in image classification tasks, the human-level performance relies on massive training data with high-quality manual annotations, which are expensive and time-consuming to collect.
Ranked #24 on Image Classification on Clothing1M (using extra training data)
Then the aspect importance is integrated into a novel aspect-aware latent factor model (ALFM), which learns users' and items' latent factors based on ratings.
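An aspect-aware rating predictor in the spirit of ALFM can be sketched as below. The shapes and the exact combination rule (a weighted sum of per-aspect user-item affinities) are assumptions for illustration, not the paper's formulation.

```python
# Minimal sketch: predicted rating = sum over aspects of
# importance(aspect) * <user factor, item factor> for that aspect.
import numpy as np

def predict_rating(user_aspect_vecs, item_aspect_vecs, aspect_importance):
    """user/item_aspect_vecs: (A, k) latent factors per aspect;
    aspect_importance: (A,) non-negative weights summing to 1."""
    per_aspect = np.sum(user_aspect_vecs * item_aspect_vecs, axis=1)  # (A,)
    return float(np.dot(aspect_importance, per_aspect))
```

In training, the latent factors and the importance weights would both be fit to observed ratings; the weights then expose which aspects drive each prediction.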
Moreover, our method achieves better performance than the best unsupervised offline algorithm on the DAVIS-2016 benchmark dataset.
Contrary to the popular notion that ad affect hinges on the narrative and the clever use of linguistic and social cues, we find that actively attended objects and the coarse scene structure better encode affective information as compared to individual scene objects or conspicuous background elements.
However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed to videos.
To enable the study of this problem, a vast number of action datasets exist, recorded under controlled laboratory settings or in real-world surveillance environments, or crawled from the Internet.
This paper addresses the problem of active learning of a multi-output Gaussian process (MOGP) model representing multiple types of coexisting correlated environmental phenomena.