Large-scale visual-linguistic pre-training aims to learn generic representations from multimodal features, which are essential for downstream vision-language tasks.
However, we find that fine-grained visual and textual information, e.g., keywords in a sentence and objects in an image, can be highly informative for semantic understanding.
Specifically, IdealGPT uses an LLM to generate sub-questions, a VLM to provide the corresponding sub-answers, and another LLM to reason over those sub-answers and reach the final answer.
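To make the pipeline concrete, here is a minimal Python sketch of such a decompose-answer-reason loop. The `llm`/`vlm` objects, their method names, and the confidence-based stopping rule are illustrative assumptions, not IdealGPT's actual implementation.

```python
# Minimal sketch of an IdealGPT-style decompose-answer-reason loop.
# The llm/vlm objects, their method names, and the stopping rule are
# hypothetical interfaces assumed for illustration.

def ideal_gpt(image, question, llm, vlm, max_rounds=3):
    evidence = []  # accumulated (sub-question, sub-answer) pairs
    answer = None
    for _ in range(max_rounds):
        # An LLM decomposes the main question into sub-questions,
        # conditioned on the evidence gathered so far.
        for sub_q in llm.generate_sub_questions(question, evidence):
            # A VLM grounds each sub-question in the image.
            evidence.append((sub_q, vlm.answer(image, sub_q)))
        # Another LLM pass reasons over all sub-answers.
        answer, confident = llm.reason(question, evidence)
        if confident:  # stop iterating once the reasoner is confident
            break
    return answer
```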
The field of vision and language has witnessed a proliferation of pre-trained foundation models.
We present a new commonsense task, Human-centric Commonsense Grounding, which tests a model's ability to ground individuals given context descriptions of what happened before and of their mental/physical states or intentions.
Visual commonsense understanding requires Vision-Language (VL) models not only to understand images and text but also to cross-reference between them in order to fully integrate and comprehend the described visual scene.
Large-scale multi-modal contrastive pre-training has demonstrated great utility in learning transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space.
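As a concrete illustration of the shared-embedding-space objective, below is a minimal PyTorch sketch of a symmetric InfoNCE (CLIP-style) contrastive loss; the temperature value and the exact formulation are assumptions, and any given paper's loss may differ in detail.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    A minimal sketch of the shared-embedding-space objective; not the
    exact loss of any particular paper."""
    image_emb = F.normalize(image_emb, dim=-1)  # project onto the unit sphere
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # all-pairs cosine similarity
    targets = torch.arange(len(logits), device=logits.device)  # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +        # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2  # text -> image direction
```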
Experiments demonstrate that MAD leads to consistent gains under low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
We observe that detailed local geometric information is probably not the key to point cloud analysis, so we introduce a pure residual MLP network, called PointMLP, which integrates no sophisticated local geometric extractors but still performs very competitively.
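To illustrate the idea, here is a minimal PyTorch sketch of a PointMLP-style residual block built only from shared per-point MLPs (1x1 convolutions) and a skip connection; the layer sizes and normalization choices are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualPointBlock(nn.Module):
    """PointMLP-flavored residual block: shared per-point MLPs plus a
    skip connection, with no hand-crafted geometric extractor."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),  # 1x1 conv == per-point MLP
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, num_points)
        return self.act(x + self.net(x))  # residual connection
```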
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains under low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions on VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models pretrained with image-text data only.
For pre-training, a scene-graph-aware method is proposed to leverage the structural knowledge extracted from the visual scene graph.
Large-scale multimodal contrastive pretraining has demonstrated great utility in supporting high performance on a range of downstream tasks by mapping multiple modalities into a shared embedding space.
Graph Neural Networks (GNNs) have demonstrated their effectiveness in dealing with non-Euclidean structured data.
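For readers unfamiliar with the mechanism, below is a minimal PyTorch sketch of one message-passing layer with GCN-style mean aggregation over a dense adjacency matrix; real GNNs add depth, more careful normalization, and task-specific heads.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One message-passing layer: average neighbor features, then apply
    a learned linear transform and nonlinearity."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes), with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
        h = adj @ x / deg                                # mean over neighbors
        return torch.relu(self.linear(h))                # transform + activation
```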
Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks.
Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild.
Specifically, most general-purpose DA methods, which strive for global feature alignment and ignore local geometric information, are not suitable for 3D domain alignment.
The proposed module learns cross-modality relationships between latent visual and language summarizations, which compress the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations.
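A plausible reading of such a summarization module is cross-attention from a small set of learned latent queries, sketched below in PyTorch; the dimensions, the number of latents, and the use of `nn.MultiheadAttention` are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentSummarizer(nn.Module):
    """A few learned latent queries attend over region (or word) features,
    compressing them so cross-modal interaction happens latent-to-latent
    rather than over every region-word pair."""
    def __init__(self, dim=512, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features):  # features: (batch, seq_len, dim)
        queries = self.latents.unsqueeze(0).expand(features.size(0), -1, -1)
        summary, _ = self.attn(queries, features, features)  # latents attend to inputs
        return summary  # (batch, num_latents, dim)
```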
In the attribute-building stage, we address the problem of unordered point cloud data with a space-partitioning procedure and develop a robust descriptor that characterizes the relationship between a point and its one-hop neighbors in a PointHop unit.
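The following NumPy sketch conveys the flavor of such an octant-based descriptor: each point's k nearest neighbors are partitioned into the eight octants of a local frame centered at that point and averaged per octant. It is a simplification under stated assumptions (brute-force neighbor search, no Saab transform), not the PointHop implementation.

```python
import numpy as np

def octant_descriptor(points, k=16):
    """For each point: take its k nearest neighbors, split them into the
    eight octants of a local frame centered at the point, and average the
    centered coordinates per octant."""
    n = len(points)
    desc = np.zeros((n, 8, 3))
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        neighbors = points[np.argsort(dists)[1:k + 1]] - p  # centered k-NN
        octant = (neighbors > 0).astype(int) @ np.array([4, 2, 1])  # octant index
        for o in range(8):
            mask = octant == o
            if mask.any():
                desc[i, o] = neighbors[mask].mean(axis=0)  # per-octant average
    return desc.reshape(n, -1)  # one 24-dim attribute vector per point
```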
It can robustly capture the high-level interactions between the language and vision domains, thus significantly improving the performance of visual question answering.
More specifically, based on the relation score module, the point-single-view fusion feature is first extracted by fusing the point cloud feature with each single-view feature according to their point-single-view relation; the point-multi-view fusion feature is then extracted by fusing the point cloud feature with the features of different numbers of views according to the point-multi-view relation.
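One way to read this two-stage fusion is as a learned relation score that gates each view's contribution, as in the PyTorch sketch below; the gating form, the dimensions, and the normalized sum for the multi-view stage are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class RelationFusion(nn.Module):
    """A relation score gates how much each view contributes when fused
    with the point cloud feature; gated single-view fusions are then
    combined into a multi-view fusion feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, point_feat, view_feats):
        # point_feat: (batch, dim); view_feats: (batch, num_views, dim)
        p = point_feat.unsqueeze(1).expand_as(view_feats)
        pairs = torch.cat([p, view_feats], dim=-1)
        r = self.score(pairs)          # point-single-view relation scores
        single = self.fuse(pairs) * r  # gated point-single-view fusion features
        multi = single.sum(dim=1) / r.sum(dim=1).clamp(min=1e-6)  # multi-view fusion
        return multi                   # (batch, dim)
```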
However, there has been little effort to use mesh data in recent years, owing to its complexity and irregularity.
In this paper, we present a hypergraph neural network (HGNN) framework for data representation learning, which can encode high-order data correlations in a hypergraph structure.
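A minimal PyTorch sketch of such a hypergraph convolution follows, propagating features from nodes to hyperedges and back with degree normalization (hyperedge weights omitted for simplicity); it mirrors the standard HGNN layer form but is not the authors' implementation.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One hypergraph convolution: gather node features onto hyperedges,
    scatter them back to nodes, normalize by node/edge degrees, then
    apply a learned transform (edge weights assumed uniform)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x, H):
        # x: (num_nodes, in_dim); H: (num_nodes, num_edges) incidence matrix
        dv = H.sum(dim=1).clamp(min=1).rsqrt()       # D_v^{-1/2}
        de = H.sum(dim=0).clamp(min=1).reciprocal()  # D_e^{-1}
        agg = dv[:, None] * (H @ (de[:, None] * (H.t() @ (dv[:, None] * x))))
        return torch.relu(self.theta(agg))           # node -> edge -> node pass
```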
With the recent proliferation of deep learning, various deep models with different representations have achieved state-of-the-art performance.