Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data.
Ranked #16 on Visual Question Answering on MM-Vet
Recognizing and generating object-state compositions has been a challenging task, especially when generalizing to unseen compositions.
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e. g., NeRV, E-NeRV).
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Ranked #3 on Supervised Video Summarization on SumMe
We match the performance of NeRV, a state-of-the-art implicit neural representation, on the reconstruction task for frames seen during training while far surpassing for frames that are skipped during training (unseen images).
In this paper, we aim to address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations.
It outperforms the lightgbm2 by up to 34 pcp ROC-AUC in a cold start case when a new seller sells a new product .
Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i. e., video snippets) are supervised by classifying labeled bags (i. e., untrimmed videos).
In contrast, with NeRV, we can use any neural network compression method as a proxy for video compression, and achieve comparable performance to traditional frame-based video compression approaches (H. 264, HEVC \etc).
Ranked #6 on Video Reconstruction on UVG
With rapidly evolving internet technologies and emerging tools, sports related videos generated online are increasing at an unprecedentedly fast pace.
Deep reinforcement learning (DRL) has achieved great successes in many simulated tasks.
To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
In this paper, we proposed a deep interactive reinforcement learning method for path following of AUV by combining the advantages of deep reinforcement learning and interactive RL.
Our results show that learning from demonstration can allow a TAMER agent to learn a roughly optimal policy up to the deepest search and encourage the agent to explore along the optimal path.
By employing hierarchical feature selection, we can compress the scale and dimension of global dictionary, which directly contributes to the decrease of computational cost in sparse representation that our approach is strongly rooted in.
Second, an adaptive selective ensemble framework for online learning is designed to balance the robustness and complexity of the algorithm.
Extreme learning machine (ELM) as a neural network algorithm has shown its good performance, such as fast speed, simple structure etc, but also, weak robustness is an unavoidable defect in original ELM for blended data.
It becomes a potential framework to solve robustness issue of ELM for high-dimensional blended data in the future.