To address these issues, we introduce a simple yet effective retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
It models the uncertainty propagation relationship of the geometry projection during training, improving the stability and efficiency of the end-to-end model learning.
In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
We propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several disentangled and noise-suppressed single-source semantics.
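The product-quantization idea referenced above can be illustrated generically: a vector is split into sub-vectors, and each sub-vector is quantized against its own small codebook, so each sub-space contributes one independent, disentangled code. This is a minimal sketch of standard product quantization, not the paper's decomposition method; the function names and codebook layout are illustrative assumptions.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Product quantization: split x into M sub-vectors and quantize each
    against its own codebook, yielding M independent codes."""
    M = len(codebooks)
    sub = np.split(x, M)
    codes = []
    for m, c in enumerate(codebooks):
        # nearest centroid in this sub-space
        dists = np.linalg.norm(c - sub[m], axis=1)
        codes.append(int(np.argmin(dists)))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximation of x from the per-sub-space codes."""
    return np.concatenate([codebooks[m][k] for m, k in enumerate(codes)])
```

In practice the codebooks are learned (e.g., by k-means per sub-space); here they are random only to keep the sketch self-contained.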
To further understand the in-context learning mechanism and the importance of the in-weights component, we prove by construction that a simple Transformer, which uses pattern matching and a copy-paste mechanism to perform in-context learning, can match the in-context learning performance of a more complex, best-tuned Transformer under the perfect in-weights component assumption.
The framework consists of pre-transfer, transfer, and post-transfer steps to accomplish knowledge transfer.
Specifically, we use a small network similar to NeRF while preserving the rendering speed of NeLF, with a single network forward pass per pixel.
In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects.
1 code implementation • 31 Jul 2023 • Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
Recently, integrating video foundation models and large language models to build video understanding systems has shown promise in overcoming the limitations of specific pre-defined vision tasks.
Generating realistic human motion from given action descriptions has seen significant advances, driven by the emerging demand for digital humans.
Transformers are popular in recent 3D human pose estimation, where long-term modeling is used to lift 2D keypoints into 3D space.
Ranked #84 on 3D Human Pose Estimation on Human3.6M
Specifically, we present Responsible Task Automation (ResponsibleTA) as a fundamental framework to facilitate responsible collaboration between LLM-based coordinators and executors for task automation, with three empowered capabilities: 1) predicting the feasibility of commands for executors; 2) verifying the completeness of executors; 3) enhancing security (e.g., the protection of users' privacy).
In this paper, we study several typical disentangled representation learning works in terms of both disentanglement and compositional generalization abilities, and we provide an important insight: vector-based representation (using a vector instead of a scalar to represent a concept) is the key to empower both good disentanglement and strong compositional generalization.
To further enhance the reliability of our noise decision results, ReSup uses two networks to jointly achieve noise suppression.
To begin with, we are among the first to comprehensively investigate mainstream KD techniques on DNS models to resolve the two challenges.
Clothes-invariant feature extraction is critical to clothes-changing person re-identification (CC-ReID).
Our method leverages both self-supervised learned landmarks and 3D face model-based landmarks to model the motion.
Most Neural Radiance Fields (NeRFs) have poor generalization ability, limiting their application when representing multiple scenes by a single model.
The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data.
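The pseudo-labeling idea described above is commonly implemented by keeping only confident model predictions on unlabeled data and treating them as labels. A minimal, generic sketch — the function name and threshold are illustrative assumptions, not details from the paper:

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    """Keep only confident predictions as pseudo labels.
    probs: (N, C) softmax outputs on unlabeled frames.
    Returns (indices, labels) of the samples whose maximum class
    probability reaches the confidence threshold."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)
```

The selected (index, label) pairs would then be mixed with the labeled data for the next training round.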
The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs.
Since different attributes have their individual semantics and characteristics, we propose to decouple the diffusion processes for them to improve the diversity of training samples and learn the reverse process jointly to exploit global-scope contexts for facilitating generation.
Better yet, our codec surpasses ECM, the next-generation traditional codec still under development, in both RGB and YUV420 color spaces in terms of PSNR.
For real-time speech enhancement (SE) including noise suppression, dereverberation and acoustic echo cancellation, the time-variance of the audio signals becomes a severe challenge.
Both mask decay and residual representation learning greatly improve the RD performance of our scalable encoder.
In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions.
In this paper, we introduce a new setting called Domain Generalization for Image Captioning (DGIC), where the data from the target domain is unseen in the learning process.
Our model achieves state-of-the-art performance on the R-VOS benchmarks Ref-DAVIS17 and Ref-Youtube-VOS, as well as on our RRYTVOS dataset.
Meanwhile, besides assisting frame coding at the current time step, the feature from context generation will be propagated as a motion condition when coding the subsequent motion latent.
Existing Siamese tracking methods, which are built on pair-wise matching between two single frames, rely heavily on additional sophisticated mechanisms to exploit temporal information among successive video frames, hindering their efficiency and industrial deployment.
Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional signal-analysis-based audio codecs.
A high-quality NeRF decomposition relies on good geometry information extraction as well as good prior terms to properly resolve ambiguities between different components.
This second paper presents a literature review of key enabling technologies of digital twins, with an emphasis on uncertainty quantification, optimization methods, open source datasets and tools, major findings, challenges, and future directions.
In part two of this review, the role of uncertainty quantification and optimization are discussed, a battery digital twin is demonstrated, and more perspectives on the future of digital twin are shared.
We present a novel paradigm of building an animatable 3D human representation from a monocular video input, such that it can be rendered in any unseen poses and views.
But we find that existing graph-based methods in the visible-infrared person re-identification (VI-ReID) task suffer from poor generalization because of two issues: 1) the train-test modality balance gap, which is a property of the VI-ReID task.
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods.
Since the current frame is not available in video frame synthesis, NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel.
Ranked #2 on Video Frame Interpolation on X4K1000FPS
Besides estimating the probability distribution, our entropy model also generates quantization steps in a spatial-channel-wise manner.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement.
In this paper, we propose a multi-modal multi-correlation learning framework targeting the task of audio-visual speech separation.
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks evolve to be discriminative iteratively.
Ranked #1 on Visual Object Tracking on YouTube-VOS
Deep neural networks often suffer from data distribution shift between training and testing, and the batch statistics are observed to reflect the shift.
We propose a method for self-supervised image representation learning under the guidance of 3D geometric consistency.
Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical use and has been a longstanding challenge.
The temporal features usually contain noisy and uncorrelated information, which may interfere with the restoration of the current frame.
It significantly improves the performance of several classic contrastive learning models in downstream tasks.
In this work, we propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate flexible contextual information distributed across different channels from other tokens into the given query token.
Ranked #62 on Object Detection on COCO minival
Distribution forecast can quantify forecast uncertainty and provide various forecast scenarios with their corresponding estimated probabilities.
For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance.
Deep-learning-based methods have shown their advantages over traditional ones in audio coding, but limited attention has been paid to real-time communications (RTC).
In this paper, we propose a simple yet effective recursive least-squares estimator-aided online learning approach for few-shot online adaptation without requiring offline training.
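A recursive least-squares estimator of the kind mentioned above updates a linear model one sample at a time without revisiting past data, which is what makes it attractive for online adaptation. The sketch below shows the textbook RLS recursion for linear regression; the class name and parameter defaults are illustrative choices, not the paper's design.

```python
import numpy as np

class RLS:
    """Minimal recursive least-squares estimator for online linear
    regression: updates the weights per sample without storing history."""
    def __init__(self, dim, lam=0.99, delta=1e2):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) * delta   # inverse correlation matrix estimate
        self.lam = lam                 # forgetting factor (1.0 = no forgetting)

    def update(self, x, y):
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)   # gain vector
        err = y - self.w @ x           # a priori prediction error
        self.w += k * err
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return err
```

A forgetting factor below 1.0 discounts old samples, which is what lets the estimator track time-varying targets online.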
Human action detection is a popular topic, widely used in video surveillance, human-machine interfaces, healthcare monitoring, gaming, dance training, and musical instrument teaching.
We introduce two modulators, propagation and correction modulators, to separately perform channel-wise re-calibration on the target frame embeddings according to local temporal correlations and reliable references respectively.
Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames.
From the stored propagated features, we propose to learn multi-scale temporal contexts, and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder.
Instance segmentation is a challenging task aiming at classifying and segmenting all object instances of specific classes.
Therefore, we assume that the task-relevant information that is not shared between views cannot be ignored, and we theoretically prove that the minimal sufficient representation in contrastive learning is not sufficient for the downstream tasks, which causes performance degradation.
Deep-learning-based video compression is a challenging task; many previous state-of-the-art learned video codecs use optical flow to exploit the temporal correlation between successive frames and then compress the residual error.
By inserting the proposed cross-stage mechanism in existing spatial and temporal transformer blocks, we build a separable transformer network for video learning based on ViT structure, in which self-attentions and features are progressively aggregated from one block to the next.
Our method contains two training stages based on model-agnostic meta learning (MAML), each of which consists of a contrastive branch and a meta branch.
Ranked #27 on Self-Supervised Action Recognition on UCF101
In this paper, we propose a Geometry Uncertainty Projection Network (GUP Net) to tackle the error amplification problem at both inference and training stages.
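As a hedged illustration of the kind of geometry projection such uncertainty propagates through: under a simple pinhole model, depth follows d = f·H/h from focal length f, 3D object height H, and 2D box height h, and first-order propagation of a height uncertainty σ_h gives σ_d = f·H/h²·σ_h. The function below is a generic sketch of this relation, not the GUP Net formulation.

```python
def depth_from_height(f, H3d, h2d, sigma_h):
    """Pinhole projection: depth d = f * H / h.
    First-order (delta-method) uncertainty propagation through the
    projection gives sigma_d = f * H / h**2 * sigma_h, so a small
    2D-height error is amplified at large depths."""
    d = f * H3d / h2d
    sigma_d = f * H3d / h2d ** 2 * sigma_h
    return d, sigma_d
```

This makes the error-amplification problem concrete: the depth uncertainty grows with d/h, so distant (small) objects inherit large depth errors from small 2D estimation errors.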
However, spatial and temporal correlations capture different contextual information: scene structure and temporal reasoning, respectively.
Detecting and localizing objects in the real 3D space, which plays a crucial role in scene understanding, is particularly challenging given only a monocular image due to the geometric information loss during imagery projection.
Specifically, we propose a phoneme-based distribution regularization (PbDr) for speech enhancement, which incorporates frame-wise phoneme information into the speech enhancement network in a conditional manner.
This paper proposes MCSSL, a self-supervised learning approach for building custom object detection models in multi-camera networks.
We develop a conceptually simple, flexible, and effective framework (named T-Net) for two-view correspondence learning.
In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net.
Ranked #1 on Speech Enhancement on Deep Noise Suppression (DNS) Challenge (SI-SDR-NB metric)
A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes.
Experimental results show that our uncertainty modeling is effective at alleviating the interference of background frames and brings a large performance gain without bells and whistles.
In this paper, we consider the problem of the scattering of in-plane waves at an interface between a homogeneous medium and a metamaterial.
In this paper, we tackle the above limitation by proposing a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics to boost the re-identification performance.
no code implementations • 4 Dec 2019 • Joyce Fang, Martin Ellis, Bin Li, Siyao Liu, Yasaman Hosseinkashi, Michael Revow, Albert Sadovnikov, Ziyuan Liu, Peng Cheng, Sachin Ashok, David Zhao, Ross Cutler, Yan Lu, Johannes Gehrke
Bandwidth estimation and congestion control for real-time communications (i.e., audio and video conferencing) remain a difficult problem, despite many years of research.
In this paper, we study the problem of 3D object detection from stereo images, in which the key challenge is how to effectively utilize stereo information.
Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller.
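Knowledge distillation as described above is typically driven by a KL divergence between temperature-softened teacher and student distributions (Hinton-style soft targets). A minimal NumPy sketch; the temperature default is an illustrative choice, and real training code would combine this term with the usual hard-label loss.

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target distillation loss: KL(teacher || student) on
    temperature-softened distributions, scaled by T^2 so gradient
    magnitudes stay comparable across temperatures."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits / T)  # soft teacher targets
    q = softmax(student_logits / T)  # student predictions
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))) / len(p))
```

The loss is zero when the student reproduces the teacher's softened distribution exactly, and positive otherwise.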
Specifically, for each training image, we first generate attention maps to represent the object's discriminative parts by weakly supervised learning.
Ranked #12 on Fine-Grained Image Classification on CUB-200-2011
We present an instance segmentation scheme based on pixel affinity information, which is the relationship of two pixels belonging to the same instance.
We propose MonoGRNet for amodal 3D object detection from a monocular RGB image via geometric reasoning in both the observed 2D projection and the unobserved depth dimension.
Ranked #24 on Monocular 3D Object Detection on KITTI Cars Moderate
In this paper, we address the problem of reconstructing an object's surface from a single image using generative networks.
Besides, we propose attention regularization and attention dropout to weakly supervise the generating process of attention maps.
In this paper, we improve the learning of local feature descriptors by optimizing the performance of descriptor matching, a common stage that follows descriptor extraction in local-feature-based pipelines and can be formulated as nearest-neighbor retrieval.
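Nearest-neighbor descriptor retrieval of the kind mentioned above is often made more robust with a mutual nearest-neighbor check. A generic sketch, assuming L2-normalized descriptors so dot products act as cosine similarity; the function name and the mutual-check design are illustrative assumptions, not the paper's matcher.

```python
import numpy as np

def mutual_nn_match(desc_a, desc_b):
    """Match two sets of L2-normalized descriptors by mutual nearest
    neighbors: (i, j) is kept iff j is i's best match in desc_b AND
    i is j's best match in desc_a."""
    sim = desc_a @ desc_b.T             # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)          # best b-index for each a
    nn_ba = sim.argmax(axis=0)          # best a-index for each b
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

The mutual check discards one-sided matches, which is a cheap way to suppress ambiguous correspondences before geometric verification.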
The RoI-based sub-region attention map and aspect ratio attention map are selectively pooled from the banks, and then used to refine the original RoI features for RoI classification.
This paper proposes an efficient content-adaptive screen image scaling scheme for real-time screen applications such as remote desktop and screen sharing.