For the response generator, we use grounding span prediction as an auxiliary task to be jointly trained with the main task of response generation.
To alleviate the ambiguity in estimating 3D garments from monocular videos, we design a multi-hypothesis deformation module that learns spatial representations of multiple plausible deformations.
To explore the potential spatio-temporal relationship, we propose spatio-temporal transformers to simultaneously extract trajectory information and fuse inter-person features in a hierarchical manner.
Neural network models are vulnerable to adversarial examples, and adversarial transferability further increases the risk of adversarial attacks.
We study model extraction attacks in natural language processing (NLP) where attackers aim to steal victim models by repeatedly querying the open Application Programming Interfaces (APIs).
We observe that the current change detection methods struggle with the multitask conflicts between semantic and height change detection tasks.
The main drawback of these approaches is that, in general, they do not use the information in the treatment indicator beyond the construction of the transformed outcome, and they are usually not efficient.
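For context, a standard transformed-outcome construction that such approaches typically rely on (shown here in a generic textbook form, not necessarily the exact variant used in this work) is

    Y^{*} \;=\; Y \cdot \frac{W - e(X)}{e(X)\bigl(1 - e(X)\bigr)},
    \qquad
    \mathbb{E}\bigl[Y^{*} \mid X = x\bigr] \;=\; \tau(x) \quad \text{(under unconfoundedness)},

where W is the binary treatment indicator, e(X) the propensity score, and \tau(x) the conditional treatment effect; once Y^{*} is formed, W enters the learner only through e(X), which is the inefficiency noted above.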
In this paper, we propose a zero-shot learning solution for the DRP task in preclinical drug screening.
Compared with the general video grounding task, MTVG focuses on meticulous actions and changes on the face.
no code implementations • 11 Sep 2023 • Chunyong Hu, Hang Zheng, Kun Li, Jianyun Xu, Weibo Mao, Maochun Luo, Lingxuan Wang, Mingxia Chen, Qihao Peng, Kaixuan Liu, Yiru Zhao, Peihan Hao, Minzhe Liu, Kaicheng Yu
Multi-sensor modal fusion has demonstrated strong advantages in 3D object detection tasks.
Specifically, we achieve the results of 0.8492 and 0.8439 for MuSe-Personalisation in terms of arousal and valence CCC.
Specifically, the proposed Dual-TL uses a Spatial TokenLearner (S-TL) to explore associations among different facial ROIs, which keeps the rPPG prediction robust to disturbances from noisy ROIs.
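A minimal PyTorch sketch of a spatial TokenLearner of this kind is shown below; the class name, the number of tokens, and the 1x1-convolution design are illustrative assumptions, not the authors' S-TL implementation.

    import torch
    import torch.nn as nn

    class SpatialTokenLearner(nn.Module):
        """Learns a few spatial attention maps over an ROI feature map and
        pools each map into one token, so later stages can down-weight
        noisy ROIs.  Hypothetical sketch."""
        def __init__(self, in_channels: int, num_tokens: int = 4):
            super().__init__()
            self.attn = nn.Conv2d(in_channels, num_tokens, kernel_size=1)

        def forward(self, x):                               # x: (B, C, H, W)
            a = self.attn(x).flatten(2).softmax(dim=-1)     # (B, S, H*W) attention maps
            v = x.flatten(2)                                # (B, C, H*W)
            return torch.einsum("bsn,bcn->bsc", a, v)       # (B, S, C) learned tokens

    feat = torch.randn(2, 64, 8, 8)
    print(SpatialTokenLearner(64)(feat).shape)              # torch.Size([2, 4, 64])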
First, we employed a shared feature encoder to project both video and query into a joint feature space before performing cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality.
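The co-attention step can be sketched as two multi-head attention calls over the jointly encoded features; this is a minimal PyTorch illustration of the idea, with hypothetical names and dimensions rather than the authors' code.

    import torch
    import torch.nn as nn

    class CrossModalCoAttention(nn.Module):
        """Video-to-query and query-to-video attention over features already
        projected into a joint space by a shared encoder.  Hedged sketch."""
        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            self.v2q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.q2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, video, query):
            # video: (B, Tv, D) clip features; query: (B, Tq, D) word features
            video_attended, _ = self.v2q(video, query, query)   # video-to-query attention
            query_attended, _ = self.q2v(query, video, video)   # query-to-video attention
            return video_attended, query_attended

    video, query = torch.randn(2, 32, 256), torch.randn(2, 12, 256)
    v_out, q_out = CrossModalCoAttention(256)(video, query)
    print(v_out.shape, q_out.shape)   # (2, 32, 256) (2, 12, 256)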
In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023.
1 code implementation • 2 Aug 2023 • Tengju Ye, Wei Jing, Chunyong Hu, Shikun Huang, Lingping Gao, Fangzhen Li, Jingke Wang, Ke Guo, Wencong Xiao, Weibo Mao, Hang Zheng, Kun Li, Junbo Chen, Kaicheng Yu
Building a multi-modality multi-task neural network toward accurate and robust performance is a de facto standard in the perception task of autonomous driving.
In this paper, we briefly introduce the solution of our team HFUT-VUT for the Micro-gesture Classification track in the MiGA challenge at IJCAI 2023.
To address these issues, we propose a novel Deformable Motion Modulation (DMM) that utilizes geometric kernel offset with adaptive weight modulation to simultaneously perform feature alignment and style transfer.
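The general mechanism of predicting kernel offsets plus modulation weights and applying a modulated deformable convolution can be sketched as follows; this is an illustrative sketch built on torchvision's deform_conv2d under assumed channel sizes, not the proposed DMM layer itself.

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class DeformableMotionModulation(nn.Module):
        """Predicts per-position kernel offsets and modulation weights from a
        guidance feature, then applies a modulated deformable convolution to
        the source feature, so alignment and re-weighting happen jointly.
        Hedged sketch of the general idea."""
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            self.k = kernel_size
            self.weight = nn.Parameter(
                torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)
            self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size, 3, padding=1)
            self.mask = nn.Conv2d(channels, kernel_size * kernel_size, 3, padding=1)

        def forward(self, source, guidance):
            offset = self.offset(guidance)                # geometric kernel offsets
            mask = torch.sigmoid(self.mask(guidance))     # adaptive weight modulation
            return deform_conv2d(source, offset, self.weight,
                                 padding=self.k // 2, mask=mask)

    src, guide = torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)
    print(DeformableMotionModulation(32)(src, guide).shape)   # torch.Size([1, 32, 16, 16])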
In order to defend against malware attacks, researchers have proposed many Windows malware detection models based on deep learning.
In this way, RGB images are no longer required during inference, since the 2D knowledge branch provides 2D information according to the 3D LiDAR input.
Interactive image segmentation aims to segment the target from the background with manual guidance, taking as input multimodal data such as images, clicks, scribbles, and bounding boxes.
This paper presents our 2nd place solution for the NuPlan Challenge 2023.
Recent learning-based approaches have achieved significant progress in light field (LF) image super-resolution (SR) by exploring convolution-based or transformer-based network structures.
Building end-to-end task bots and maintaining their integration with new functionalities using minimal human efforts is a long-standing challenge in dialog research.
To address this issue, in this paper, the limited available data on the incident power density and resultant maximum temperature rise on the skin surface considering various steady-state exposure scenarios at 10–90 GHz have been statistically modeled.
First, the point cloud is divided into small patches, and a matching patch set is selected based on global descriptors and spatial distribution, which constitutes the coarse matching process.
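A minimal NumPy sketch of such a coarse matching step is given below, using nearest-neighbour search on per-patch global descriptors with a mutual-consistency filter as a crude stand-in for the spatial-distribution check; function names and the filtering rule are assumptions, not the paper's implementation.

    import numpy as np

    def coarse_match(src_desc, tgt_desc, k=3):
        """Match patches by their global descriptors, keeping only mutually
        consistent pairs.  Illustrative sketch only."""
        sim = src_desc @ tgt_desc.T                   # (Ns, Nt) cosine similarities
        topk = np.argsort(-sim, axis=1)[:, :k]        # top-k target patches per source
        best_src_for_tgt = np.argmax(sim, axis=0)     # best source patch for each target
        pairs = []
        for i in range(sim.shape[0]):
            for j in topk[i]:
                if best_src_for_tgt[j] == i:          # mutual-consistency filter
                    pairs.append((i, int(j)))
        return pairs

    src = np.random.randn(50, 128); src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt = np.random.randn(60, 128); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    print(len(coarse_match(src, tgt)))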
Also, we model global and local spatial relationships in a 3D scene and a textual description, respectively, based on the scene graph, and introduce a part-level action mechanism to represent interactions as atomic body part states.
Angle-constrained formation control has attracted much attention from the control community due to the advantage that inter-edge angles are invariant under uniform translations, rotations, and scalings of the whole formation.
In this manuscript, we propose causal inference based single-branch ensemble trees for uplift modeling, namely CIET.
Visual question answering (VQA) is an important and challenging multimodal task in computer vision.
However, existing methods cannot deal with large scenes containing hundreds of people, where they encounter the challenges of a large number of people, large variations in human scale, and complex spatial distribution.
3D human body representation learning has received increasing attention in recent years.
This is particularly challenging in the context of expanding systems, because i) the range of the EVs is limited while charging time is typically long, which together constrain the viable rebalancing operations; and ii) the EV stations in the system are dynamically changing, i.e., the legitimate targets for rebalancing operations can vary over time.
Specifically, we propose a modulation based transformer as the upsampler, which modulates the pixel features in discrete space via a periodic nonlinear function to generate features for continuous pixels.
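A hedged PyTorch sketch of the core idea, modulating sampled discrete-grid features with a periodic (sine) function of the continuous-coordinate offset, is given below; the layer names, the offset computation, and the linear projections are illustrative assumptions rather than the proposed upsampler.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PeriodicModulationUpsampler(nn.Module):
        """Produces a feature for an arbitrary continuous pixel location by
        modulating the sampled discrete features with a periodic function
        of the fractional coordinate offset.  Illustrative sketch."""
        def __init__(self, dim: int):
            super().__init__()
            self.freq = nn.Linear(2, dim)     # maps fractional offsets to phases
            self.proj = nn.Linear(dim, dim)

        def forward(self, feat, coords):
            # feat: (B, C, H, W) discrete features; coords: (B, N, 2) in [-1, 1]
            sampled = F.grid_sample(feat, coords.unsqueeze(1), align_corners=False)
            sampled = sampled.squeeze(2).transpose(1, 2)        # (B, N, C)
            frac = (coords * feat.shape[-1]) % 1.0              # rough fractional offsets
            modulation = torch.sin(self.freq(frac))             # periodic modulation
            return self.proj(sampled * modulation)              # (B, N, C)

    feat = torch.randn(1, 64, 16, 16)
    coords = torch.rand(1, 100, 2) * 2 - 1                      # arbitrary continuous pixels
    print(PeriodicModulationUpsampler(64)(feat, coords).shape)  # torch.Size([1, 100, 64])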
To this end, we propose HDhuman, which uses a human reconstruction network with a pixel-aligned spatial transformer and a rendering network with geometry-guided pixel-wise feature integration to achieve high-quality human reconstruction and rendering.
Emotion recognition is a challenging and actively-studied research area that plays a critical role in emotion-aware human-computer interaction systems.
To cope with the complexity of textures and generate photo-realistic results, we propose a reference-based neural rendering network and exploit a bottom-up sharpening-guided fine-tuning strategy to obtain detailed textures.
From this observation, we have devised a single-stage anchor-free network that can fulfill these requirements.
Nowadays, there is an explosive growth of screen contents due to the wide application of screen sharing, remote cooperation, and online education.
Specifically, we exploit a symmetric twin neural network, comprising a projection head whose dimensionality equals the number of clusters, to conduct dual contrastive learning from a spectral-spatial augmentation pool.
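A minimal PyTorch sketch of such a twin projection head and a cluster-level contrastive term follows; the backbone, head design, and loss are generic assumptions in the spirit of contrastive clustering, not the authors' network.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwinProjection(nn.Module):
        """Symmetric twin branches sharing a backbone, with a projection head
        whose output dimensionality equals the cluster number, so softmax
        outputs act as soft cluster assignments.  Hedged sketch."""
        def __init__(self, backbone: nn.Module, feat_dim: int, num_clusters: int):
            super().__init__()
            self.backbone = backbone
            self.head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, num_clusters))

        def forward(self, view_a, view_b):
            za = F.softmax(self.head(self.backbone(view_a)), dim=1)   # (B, K)
            zb = F.softmax(self.head(self.backbone(view_b)), dim=1)   # (B, K)
            return za, zb

    def cluster_contrastive_loss(za, zb, temperature=0.5):
        """Contrast the K cluster-assignment columns of the two views so the
        same cluster across views is pulled together."""
        pa, pb = F.normalize(za.t(), dim=1), F.normalize(zb.t(), dim=1)   # (K, B)
        logits = pa @ pb.t() / temperature                                # (K, K)
        labels = torch.arange(logits.size(0))
        return F.cross_entropy(logits, labels)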
Extended from our last year's award-winning model AFDet, we have made a handful of modifications to the base model to improve accuracy while greatly reducing latency.
User queries for a real-world dialog system may sometimes fall outside the scope of the system's capabilities, but appropriate system responses will enable smooth processing throughout the human-computer interaction.
This paper presents an unsupervised two-stage approach to discover intents and generate meaningful intent labels automatically from a collection of unlabeled utterances in a domain.
As the data size in Machine Learning fields grows exponentially, it is inevitable to accelerate the computation by utilizing the ever-growing large number of available cores provided by high-performance computing hardware.
Nanopapers based on graphene and related materials were recently proposed for heat-spreader applications.
Existing methods cannot effectively utilize the input information and thus often fail to preserve the style and shape of hair and clothes.
In each block, we propose a pose-guided non-local attention (PoNA) mechanism with a long-range dependency scheme to select more important regions of image features to transfer.
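The flavour of such a pose-guided non-local block can be sketched as dot-product attention whose queries come from pose features and whose keys/values come from image features; this is a generic illustration with assumed layer names, not the paper's PoNA block.

    import torch
    import torch.nn as nn

    class PoseGuidedNonLocalAttention(nn.Module):
        """Non-local attention where pose features select the most relevant
        image regions via long-range dependencies.  Hedged sketch."""
        def __init__(self, channels: int):
            super().__init__()
            self.q = nn.Conv2d(channels, channels // 2, 1)
            self.k = nn.Conv2d(channels, channels // 2, 1)
            self.v = nn.Conv2d(channels, channels, 1)

        def forward(self, img_feat, pose_feat):               # both (B, C, H, W)
            B, C, H, W = img_feat.shape
            q = self.q(pose_feat).flatten(2).transpose(1, 2)  # (B, HW, C/2) pose queries
            k = self.k(img_feat).flatten(2)                   # (B, C/2, HW) image keys
            v = self.v(img_feat).flatten(2).transpose(1, 2)   # (B, HW, C)   image values
            attn = torch.softmax(q @ k / (C // 2) ** 0.5, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
            return img_feat + out                             # residual connection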
One of the remaining challenges for aspect term extraction in sentiment analysis resides in the extraction of phrase-level aspect terms, for which it is non-trivial to determine the term boundaries.
For all-pixel operation, we propose the Normal Regression Network to make efficient use of the intra-image spatial information for predicting a surface normal map with rich details.
Various combinations of cameras enrich computational photography, among which reference-based super-resolution (RefSR) plays a critical role in multiscale imaging systems.
We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering.
With the success of big data and artificial intelligence in many fields, big-data-driven models are expected to be applied in financial risk management, especially credit scoring and rating.
To provide more discriminative feedback for the second language (L2) learners to better identify their mispronunciation, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT).
3D face reconstruction from a single image is a challenging problem, especially under partial occlusions and extreme poses.
In this manuscript, we propose a federated F-score based ensemble tree model for automatic rule extraction, namely Fed-FEARE.
In this paper, we formulate the data augmentation as a conditional generation task: generating a new sentence while preserving the original opinion targets and labels.
Our method enables a real-time online motion capture system running at 30 fps using 5 cameras on a 5-person scene.
Ranked #8 on 3D Multi-Person Pose Estimation on Shelf
This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes.
Driven by applications such as Micro Aerial Vehicles (MAVs) and driverless cars, localization has become an active research topic in the past decade.
Observing that each demonstrator has an inherent reward for each state and the task-specific behaviors mainly depend on a small number of key states, we propose a meta IRL algorithm that first models the reward function for each task as a distribution conditioned on a baseline reward function shared by all tasks and dependent only on the demonstrator, and then finds the most likely reward function in the distribution that explains the task-specific behaviors.
This work handles the inverse reinforcement learning problem in high-dimensional state spaces, which relies on an efficient solution of model-based high-dimensional reinforcement learning problems.
We also show that the proposed method can extend many existing methods to high-dimensional state spaces.
We introduce a strategy to flexibly handle different types of actions with two approximations of the Bellman Optimality Equation, and a Bellman Gradient Iteration method to compute the gradient of the Q-value with respect to the reward function.
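A tabular NumPy sketch of this kind of Bellman Gradient Iteration is given below, using a Boltzmann (soft-max) weighting as one smooth approximation of the max and iterating Q together with dQ/dr; the exact approximations and update rules in the paper may differ, and the gradient here ignores the derivative of the soft-max weights.

    import numpy as np

    def bellman_gradient_iteration(P, r, gamma=0.9, k=10.0, iters=200):
        """Smooth the Bellman optimality equation with soft-max weights over
        actions and iterate both Q and dQ/dr.  Tabular, hedged sketch.
        P: (A, S, S) transition probabilities, r: (S,) state rewards."""
        A, S, _ = P.shape
        Q = np.zeros((S, A))
        dQ = np.zeros((S, A, S))                        # dQ[s, a, s0] = dQ(s,a)/dr(s0)
        for _ in range(iters):
            w = np.exp(k * (Q - Q.max(axis=1, keepdims=True)))
            w = w / w.sum(axis=1, keepdims=True)        # soft-max weights over actions
            V = (w * Q).sum(axis=1)                     # smooth approximation of max_a Q
            dV = np.einsum("sa,sab->sb", w, dQ)         # dV/dr, ignoring dw/dr terms
            Q = r[:, None] + gamma * np.einsum("ast,t->sa", P, V)
            dQ = np.eye(S)[:, None, :] + gamma * np.einsum("ast,tb->sab", P, dV)
        return Q, dQ

    # toy 2-state, 2-action MDP
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.6, 0.4]]])
    Q, dQ = bellman_gradient_iteration(P, np.array([0.0, 1.0]))
    print(Q.shape, dQ.shape)                            # (2, 2) (2, 2, 2)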
This paper develops a method to use RGB-D cameras to track the motions of a human spinal cord injury patient undergoing spinal stimulation and physical rehabilitation.
Non-rigid registration is challenging because it is ill-posed with high degrees of freedom and is thus sensitive to noise and outliers.
In a robot-centered smart home, the robot observes the home states with its own sensors, and then it can change certain object states according to an operator's commands for remote operations, or imitate the operator's behaviors in the house for autonomous operations.