Existing video frame interpolation (VFI) methods blindly predict where each object will be at a specific timestep t ("time indexing"), an approach that struggles to capture precise object movements.
First, the Adaptive Threshold Learning module generates two thresholds, namely the clean and noisy thresholds, for each category.
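How such per-category dual thresholds might partition samples is sketched below. This is a hypothetical illustration only: the paper's module learns the thresholds, and every name and shape here is invented.

```python
import numpy as np

def split_by_dual_thresholds(conf, labels, clean_thr, noisy_thr):
    """Partition samples per category with two thresholds (hypothetical sketch).

    conf:      (N,) model confidence for each sample's assigned label
    labels:    (N,) category index per sample
    clean_thr: (C,) per-category clean threshold
    noisy_thr: (C,) per-category noisy threshold (noisy_thr <= clean_thr)
    """
    clean = conf >= clean_thr[labels]    # confidently clean samples
    noisy = conf < noisy_thr[labels]     # confidently noisy samples
    ambiguous = ~clean & ~noisy          # in between: handled separately
    return clean, noisy, ambiguous
```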
However, it is difficult to restore lost details in dark areas by relying only on the RGB domain.
Moreover, to enhance the stability of prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and benefits both feedback and weight calculation in boosting.
A key component of retrieval is to model (user, item) similarity, which is commonly represented as the dot product of two learned embeddings.
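As a minimal illustration of this dot-product similarity (a generic sketch, not this paper's architecture; the embedding sizes and top-k value are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64                                   # embedding dimension (illustrative)
user_emb = rng.normal(size=(1000, dim))    # one learned vector per user
item_emb = rng.normal(size=(5000, dim))    # one learned vector per item

def score(user_id: int, item_ids: np.ndarray) -> np.ndarray:
    """(user, item) similarity as the dot product of the two embeddings."""
    return item_emb[item_ids] @ user_emb[user_id]

# Retrieval: rank all items by dot-product score and keep the top k.
scores = score(42, np.arange(len(item_emb)))
top_k = np.argsort(-scores)[:10]
```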
Quantization of vision transformers (ViTs) offers a promising way to deploy large pre-trained networks on resource-limited devices.
The continuous improvement of human-computer interaction technology has made emotion computing possible.
In this paper, we present our solutions for the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW), which includes four sub-challenges of Valence-Arousal (VA) Estimation, Expression (Expr) Classification, Action Unit (AU) Detection and Emotional Reaction Intensity (ERI) Estimation.
Low-light scenes captured with ordinary cameras are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF).
Self-supervised representation learning follows a paradigm of withholding some part of the data and tasking the network to predict it from the remaining part.
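As a generic illustration of this withhold-and-predict paradigm (a sketch only; the masking ratio, tensor layout, and squared-error loss are assumptions, not any specific paper's recipe):

```python
import torch

def masked_prediction_loss(model, x, mask_ratio=0.75):
    """Withhold a random subset of tokens and ask the model to predict them.

    x: (batch, num_tokens, dim). `model` maps the partially masked input
    back to the full sequence (architecture-agnostic sketch).
    """
    b, n, d = x.shape
    mask = torch.rand(b, n) < mask_ratio              # True = withheld
    x_visible = x.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = model(x_visible)                           # predict all tokens
    # Compute the loss only on the withheld positions.
    return ((pred - x)[mask]).pow(2).mean()
```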
no code implementations • 7 Nov 2022 • Andrey Ignatov, Radu Timofte, Cheng-Ming Chiang, Hsien-Kai Kuo, Yu-Syuan Xu, Man-Yu Lee, Allen Lu, Chia-Ming Cheng, Chih-Cheng Chen, Jia-Ying Yong, Hong-Han Shuai, Wen-Huang Cheng, Zhuang Jia, Tianyu Xu, Yijian Zhang, Long Bao, Heng Sun, Diankai Zhang, Si Gao, Shaoli Liu, Biao Wu, Xiaofeng Zhang, Chengjian Zheng, Kaidi Lu, Ning Wang, Xiao Sun, HaoDong Wu, Xuncheng Liu, Weizhan Zhang, Caixia Yan, Haipeng Du, Qinghua Zheng, Qi Wang, Wangdu Chen, Ran Duan, Mengdi Sun, Dan Zhu, Guannan Chen, Hojin Cho, Steve Kim, Shijie Yue, Chenghua Li, Zhengyang Zhuge, Wei Chen, Wenxu Wang, Yufeng Zhou, Xiaochen Cai, Hengxing Cai, Kele Xu, Li Liu, Zehua Cheng, Wenyi Lian, Wenjing Lian
While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices.
Our experiments demonstrate the effectiveness of the proposed model and hybrid fusion strategy for multimodal fusion; the model achieves an AUC of 0.8972 on the test set.
In this paper, we present our solutions for the Multimodal Sentiment Analysis Challenge (MuSe) 2022, which includes MuSe-Humor, MuSe-Reaction and MuSe-Stress Sub-challenges.
Existing solutions to this problem estimate a single image sequence without considering the motion ambiguity for each region.
The paper presents a scalable approach for learning spatially distributed visual representations over individual tokens and a holistic instance representation simultaneously.
We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects, in which no image-pose pairs or foreground masks are used for training.
In this paper, instead of two consecutive frames, we propose to exploit a pair of images captured by dual RS cameras with reversed RS directions for this highly challenging task.
Concretely, we pretrain the sign-to-gloss visual network on the general domain of human actions and the within-domain of a sign-to-gloss dataset, and pretrain the gloss-to-text translation network on the general domain of a multilingual corpus and the within-domain of a gloss-to-text corpus.
Ranked #2 on Sign Language Translation on CSL-Daily
In the proposed method, (1) the topological structure information and texture features of regions of interest (ROIs) are modeled as graphs and processed with a graph convolutional network (GCN) to retain the topological features.
Typically in recent work, the pseudo-labels are obtained by training a model on the labeled data, and then using confident predictions from the model to teach itself.
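A minimal sketch of this confidence-based pseudo-labeling step (the 0.95 threshold and the data-loader interface are assumptions for illustration, not values from the paper):

```python
import torch

def pseudo_label(model, unlabeled_loader, threshold=0.95):
    """Self-training: keep only confident predictions as pseudo-labels."""
    model.eval()
    pseudo_x, pseudo_y = [], []
    with torch.no_grad():
        for x in unlabeled_loader:
            probs = torch.softmax(model(x), dim=-1)
            conf, labels = probs.max(dim=-1)
            keep = conf >= threshold        # confidence filter
            pseudo_x.append(x[keep])
            pseudo_y.append(labels[keep])
    # The kept (input, pseudo-label) pairs are added to the training set.
    return torch.cat(pseudo_x), torch.cat(pseudo_y)
```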
For human action understanding, a popular research direction is to analyze short video clips with unambiguous semantic content, such as jumping and drinking.
However, methods for understanding short semantic actions cannot be directly translated to long kinematic sequences such as dancing, where it becomes challenging even to semantically label the human movements.
A common problem in the task of human-object interaction (HOI) detection is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution.
Ranked #40 on Human-Object Interaction Detection on HICO-DET
no code implementations • 27 Aug 2021 • Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei zhang, Zoltán Tüske, Kailash Gopalakrishnan
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts).
While the average prediction accuracy has been improved significantly over the years, the performance on hard poses with depth ambiguity, self-occlusion, and complex or rare poses is still far from satisfactory.
Ranked #20 on Skeleton Based Action Recognition on NTU RGB+D 120
This paper presents a computational framework that generates ensemble predictive mechanics models with uncertainty quantification (UQ).
Large-scale distributed training of Deep Neural Networks (DNNs) on state-of-the-art platforms is expected to be severely communication constrained.
no code implementations • Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, Kailash Gopalakrishnan
In this paper, we propose a number of novel techniques and numerical representation formats that enable, for the very first time, the precision of training systems to be aggressively scaled from 8-bits to 4-bits.
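The paper's specific 4-bit numerical formats are not reproduced here; the sketch below only illustrates the general idea of simulated ("fake") quantization with a configurable bit width:

```python
import torch

def fake_quantize(t: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate low precision by rounding to a symmetric integer grid.

    A generic illustration of reduced-precision training, not the
    numerical representation formats proposed in the paper.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = t.abs().max().clamp(min=1e-8) / qmax    # per-tensor scale
    return (t / scale).round().clamp(-qmax, qmax) * scale

w = torch.randn(128, 128)
w_q = fake_quantize(w, bits=4)   # weights as "seen" at 4-bit precision
```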
With the reduced dimensionality of less relevant body areas, the training set distribution within network branches more closely reflects the statistics of local poses instead of global body poses, without sacrificing information important for joint inference.
Ranked #19 on Monocular 3D Human Pose Estimation on Human3.6M
A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution.
A recent approach for object detection and human pose estimation is to regress bounding boxes or human keypoints from a central point on the object or person.
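A minimal sketch of this center-point decoding (CenterNet-style; the names, shapes, and top-k choice are illustrative assumptions, not this paper's exact method):

```python
import torch

def decode_center_points(heatmap, wh, k=100):
    """Decode boxes from a center-point heatmap (center-regression sketch).

    heatmap: (C, H, W) per-class center scores in [0, 1]
    wh:      (2, H, W) box width/height regressed at each center
    """
    c, h, w = heatmap.shape
    scores, idx = heatmap.view(-1).topk(k)   # top-k candidate centers
    cls = idx // (h * w)                     # class of each candidate
    ys = (idx % (h * w)) // w                # center row
    xs = idx % w                             # center column
    bw, bh = wh[0, ys, xs], wh[1, ys, xs]
    boxes = torch.stack([xs - bw / 2, ys - bh / 2,
                         xs + bw / 2, ys + bh / 2], dim=1)
    return boxes, scores, cls
```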
Reducing the numerical precision of data and computation is extremely effective in accelerating deep learning training workloads.
Point cloud analysis has drawn broad attention due to increasing demand in various fields.
A multiscale shared convolution structure is adopted in the discriminator network to further supervise the training of the generator.
In recent years, the generation of conversation content based on deep neural networks has attracted many researchers.
We present a method for human pose tracking that is based on learning spatiotemporal relationships among joints.
For the ECCV 2018 PoseTrack Challenge, we present a 3D human pose estimation system based mainly on the integral human pose regression method.
Ranked #1 on 3D Human Pose Estimation on CHALL H80K
State-of-the-art human pose estimation methods are based on heat map representation.
Ranked #23 on Pose Estimation on MPII Human Pose
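For reference, the integral regression idea in the two entries above replaces the non-differentiable argmax over a heat map with an expectation (soft-argmax). A minimal sketch (function name and shapes are illustrative, not taken from the papers):

```python
import torch

def soft_argmax_2d(heatmaps):
    """Integral (soft-argmax) decoding: joint = expectation over the heat map.

    heatmaps: (num_joints, H, W) unnormalized scores.
    Returns (num_joints, 2) sub-pixel (x, y) coordinates.
    """
    j, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.view(j, -1), dim=-1).view(j, h, w)
    ys = torch.arange(h, dtype=probs.dtype)
    xs = torch.arange(w, dtype=probs.dtype)
    x = (probs.sum(dim=1) * xs).sum(dim=-1)   # E[x]: marginal over rows
    y = (probs.sum(dim=2) * ys).sum(dim=-1)   # E[y]: marginal over columns
    return torch.stack([x, y], dim=-1)
```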
We propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels in a unified deep neural network with a two-stage cascaded structure.
A central problem is that the structural information in the pose is not well exploited in the previous regression methods.
Ranked #36 on Pose Estimation on MPII Human Pose
In this work, we propose to directly embed a kinematic object model into deep neural network learning for general articulated object pose estimation.
Ranked #302 on 3D Human Pose Estimation on Human3.6M
We extend the previous 2D cascaded object pose regression work in two aspects so that it works better for 3D articulated objects.