Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i. e., generating content that deviates from facts seen during pretraining.
PSMT incorporates two layers where the first sparse coding layer represents the input sequence as sparse coefficients over an overcomplete dictionary and the second manifold learning layer learns a geometric embedding space that captures topological similarity and dynamic temporal linearity in sparse coefficients.
One natural approach is to use caption models to describe each photo in the album, and then use LLMs to summarize and rewrite the generated captions into an engaging story.
Tailoring outputs of large language models, such as ChatGPT, to specific user needs remains a challenge despite their impressive generation quality.
Large language models (LLMs), such as ChatGPT, are able to generate human-like, fluent responses for many downstream tasks, e. g., task-oriented dialog and question answering.
Unsupervised domain adaption has been widely adopted in tasks with scarce annotated data.
Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
Ranked #1 on Semi-Supervised Video Object Segmentation on Long Video Dataset (using extra training data)
Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective.
This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
Ranked #4 on Cross-Modal Retrieval on Flickr30k (using extra training data)
Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e. g., image tags, object attributes / locations, captions) as a structured textual prompt, called visual clues, using a vision foundation model.
Specifically, we observe that in most state-of-the-art knowledge-based VQA methods: 1) visual features are extracted either from the whole image or in a sliding window manner for retrieving knowledge, and the important relationship within/among object regions is neglected; 2) visual features are not well utilized in the final answering model, which is counter-intuitive to some extent.
Ranked #8 on Visual Question Answering (VQA) on OK-VQA
no code implementations • 3 May 2022 • ZiYi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang
Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview.
2 code implementations • 20 Apr 2022 • Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao
We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.
Finding the k largest or smallest elements from a collection of scores, i. e., top-k operation, is an important model component widely used in information retrieval, machine learning, and data mining.
Due to the combinatorial nature of the problem, most existing methods are only applicable when the sample size is small, and limited to linear regression models.
The top-$k$ operation, i. e., finding the $k$ largest or smallest elements from a collection of scores, is an important model component, which is widely used in information retrieval, machine learning, and data mining.
The top-k operation, i. e., finding the k largest or smallest elements from a collection of scores, is an important model component, which is widely used in information retrieval, machine learning, and data mining.
This paper proposes a new meta-learning method -- named HARMLESS (HAwkes Relational Meta LEarning method for Short Sequences) for learning heterogeneous point process models from short event sequence data along with a relational network.
Optimal Transport (OT) naturally arises in many machine learning applications, yet the heavy computational burden limits its wide-spread uses.
The great success achieved by deep neural networks attracts increasing attention from the manufacturing and healthcare communities.
Machine learning methods play increasingly important roles in pre-procedural planning for complex surgeries and interventions.
However, as we will demonstrate, regularized variations with large regularization parameter will degradate the performance in several important machine learning applications, and small regularization parameter will fail due to numerical stability issues with existing algorithms.