The "splice" in our method is two-fold: 1) Each mixed image is a splice of several downsampled images in the form of a grid, where the semantics of images attending to mixing are blended without object deficiencies for alleviating co-occurred bias; 2) We splice mixed images and the original mini-batch to form a new SpliceMixed mini-batch, which allows an image with different scales to contribute to training together.
Graph pooling has been increasingly recognized as crucial for Graph Neural Networks (GNNs) to facilitate hierarchical graph representation learning.
We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model, outperforming standard adapter tuning and task arithmetic alone.
Zero-shot translation (ZST), which is generally based on a multilingual neural machine translation model, aims to translate between unseen language pairs in training data.
In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization.
We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.
A pooling operation is essential for effective graph-level representation learning, where the node drop pooling has become one mainstream graph pooling technology.
Zero-shot transfer learning for Dialogue State Tracking (DST) helps to handle a variety of task-oriented dialogue domains without the cost of collecting in-domain data.
To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding.
Secondly, to further differentiate the low-confidence regions in CAM, we devised a Class Token Contrast module (CTC) inspired by the fact that class tokens in ViT can capture high-level semantics.
no code implementations • 1 Mar 2023 • Chao Xue, Wei Liu, Shuai Xie, Zhenfang Wang, Jiaxing Li, Xuyang Peng, Liang Ding, Shanshan Zhao, Qiong Cao, Yibo Yang, Fengxiang He, Bohua Cai, Rongcheng Bian, Yiyan Zhao, Heliang Zheng, Xiangyang Liu, Dongkai Liu, Daqing Liu, Li Shen, Chang Li, Shijin Zhang, Yukang Zhang, Guanpu Chen, Shixiang Chen, Yibing Zhan, Jing Zhang, Chaoyue Wang, DaCheng Tao
Automated machine learning (AutoML) seeks to build ML models with minimal human effort.
This technical report briefly describes our JDExplore d-team's submission Vega v1 on the General Language Understanding Evaluation (GLUE) leaderboard, where GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
Although cross-view multi-object tracking has received increased attention in recent years, existing datasets still have several issues, including 1) missing real-world scenarios, 2) lacking diverse scenes, 3) owning a limited number of tracks, 4) comprising only static cameras, and 5) lacking standard benchmarks, which hinder the investigation and comparison of cross-view tracking methods.
As selected data have high discrepancies in probabilities, the divergence of two networks can be maintained by training on such data.
Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references.
no code implementations • 4 Dec 2022 • Qihuang Zhong, Liang Ding, Yibing Zhan, Yu Qiao, Yonggang Wen, Li Shen, Juhua Liu, Baosheng Yu, Bo Du, Yixin Chen, Xinbo Gao, Chunyan Miao, Xiaoou Tang, DaCheng Tao
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard.
Self-supervised facial representation has recently attracted increasing attention due to its ability to perform face understanding without relying on large-scale annotated datasets heavily.
We present Twin Answer Sentences Attack (TASA), an adversarial attack method for question answering (QA) models that produces fluent and grammatical adversarial contexts while maintaining gold answers.
As for model sizes, we scale the Transformer-Big up to the extremely large model that owns nearly 4. 7 Billion parameters, to fully enhance the model capacity for our Vega-MT.
Ranked #1 on Machine Translation on WMT 2022 Chinese-English
To address the above issues, we propose a novel metric-based meta-learning framework termed instance-adaptive class representation learning network (ICRL-Net) for few-shot visual recognition.
To alleviate this negative effect, in this paper, we investigate the dependence between outputs of the target model and input adversarial samples from the perspective of information theory, and propose an adversarial defense method.
To this end, we develop an elaborated GNN model with content and style local patches as the graph vertices.
Anomaly detection aims at identifying deviant samples from the normal data distribution.
However, the LTH-based methods suffer from two major drawbacks: 1) they require exhaustive and iterative training of dense models, resulting in an extremely large training computation cost, and 2) they only trim graph structures and model parameters but ignore the node feature dimension, where significant redundancy exists.
Experimental results on real-world networks demonstrate that CLNode is a general framework that can be combined with various GNNs to improve their accuracy and robustness.
Second, the constraints for diversity are designed to be task-agnostic, which causes the constraints to not work well.
Despite their effectiveness, however, current SGG methods only assume scene graph homophily while ignoring heterophily.
Scene graph generation (SGG) aims to detect objects and predict the relationships between each pair of objects.
Graph neural networks have emerged as a leading architecture for many graph-level tasks, such as graph classification and graph generation.
Specifically, when applying the proposed module, it employs a two-stream pipeline during training, i. e., either with or without a BatchFormerV2 module, where the batchformer stream can be removed for testing.
In this paper, we analyze the adversarial attack on graphs from the perspective of feature smoothness which further contributes to an efficient new adversarial defensive algorithm for GNNs.
Point cloud segmentation is fundamental in understanding 3D environments.
Ranked #14 on Semantic Segmentation on S3DIS Area5
In contrast to previous works, we focus on the reproducibility of the approaches and the examination of the elements that lead to improved performance by pretrained and nonpretrained models in retrieving images and text.
Motivated by the inherent consistency between the self-attention in Transformers and the semantic affinity, we propose an Affinity from Attention (AFA) module to learn semantic affinity from the multi-head self-attention (MHSA) in Transformers.
Ranked #20 on Weakly-Supervised Semantic Segmentation on COCO 2014 val
As a result, HLN significantly improves the performance of scene graph generation by integrating and reasoning from object interactions, relationship interactions, and transitive inference of hyper-relationships.
In the visual words learning module, we counter the first problem by enforcing the classification network to learn fine-grained visual word labels so that more object extents could be discovered.
To address this problem, we propose Resistance Training using Prior Bias (RTPB) for the scene graph generation.
In this paper, we conduct theoretical and experimental analysis to explore the fundamental causes of performance degradation in deep GCNs: over-smoothing and gradient vanishing have a mutually reinforcing effect that causes the performance to deteriorate more quickly in deep GCNs.
Essentially, our CGPN can enhance the learning performance of GNNs under extremely limited labels by contrastively propagating the limited labels to the entire graph.
The sample selection approach is popular in learning with noisy labels, which tends to select potentially clean data out of noisy data for robust training.
To this end, we propose to decompose each video into a series of expression snippets, each of which contains a small number of facial movements, and attempt to augment the Transformer's ability for modeling intra-snippet and inter-snippet visual relations, respectively, obtaining the Expression snippet Transformer (EST).
Ranked #7 on Dynamic Facial Expression Recognition on DFEW
In this paper, we propose a Graph Pattern Loss based Diversified Attention Network(GPLDAN) for unsupervised cross-modal retrieval to deeply analyze correlations among representations.
Despite their efficiency, current graph-based predictors treat all operations equally, resulting in biased topological knowledge of cell architectures.
Essentially, DGCPN addresses the inaccurate similarity problem by exploring and exploiting the data's intrinsic relationships in a graph.
In this paper, we explore the beneficial effect of undetermined relationships on visual relationship detection.