Search Results for author: Shentong Mo

Found 45 papers, 16 papers with code

Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

no code implementations • 7 Jun 2024 • Shentong Mo

To address this challenge, we introduce a novel diffusion architecture tailored for 3D point cloud generation: Diffusion Mamba (DiM-3D).

3D Shape Generation 3D Shape Modeling +1

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

1 code implementation • 7 Jun 2024 • Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

Furthermore, to suppress the background features in each modality relative to the foreground-matched audio-visual features, we introduce a robust discriminative foreground mining scheme.

audio-visual learning Contrastive Learning

DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

1 code implementation • 28 May 2024 • Shentong Mo, Sukmin Yun

To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets.
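As a rough illustration of the two steps described above, here is a toy numpy sketch that scores the neighbors of a masked patch by cosine similarity and aggregates the most similar ones; the softmax-weighted average stands in for the paper's lightweight cross-attention head, and all shapes, names, and the neighbor list are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def select_and_aggregate(patch_feats, masked_idx, neighbor_idx, k=4):
    """Toy stand-in for DMT-JEPA target construction: score the spatial
    neighbors of a masked patch by cosine similarity, keep the top-k,
    and aggregate them with softmax weights (a crude proxy for the
    paper's lightweight cross-attention head)."""
    q = patch_feats[masked_idx]
    nbrs = patch_feats[neighbor_idx]                       # (M, D)
    sims = nbrs @ q / (np.linalg.norm(nbrs, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(sims)[-k:]                            # most similar neighbors
    w = np.exp(sims[top]) / np.exp(sims[top]).sum()        # softmax weights
    return w @ nbrs[top]                                   # aggregated masked target

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 8))                       # 16 patches, 8-dim features
target = select_and_aggregate(feats, masked_idx=5, neighbor_idx=[1, 4, 6, 9, 2, 8])
print(target.shape)                                        # (8,)
```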

Image Classification object-detection +2

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

no code implementations • 24 May 2024 • Shentong Mo, Yapeng Tian

Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images.
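The quadratic-versus-linear scaling argument can be made concrete with a back-of-envelope cost model; the FLOP formulas, state size, and token counts below are illustrative assumptions, not measurements from the paper.

```python
# Illustrative cost model (not measured numbers): self-attention scales as
# O(L^2 * d) in sequence length L, while a linear-time SSM scan is O(L * d).
def attention_flops(L, d):
    return 2 * L * L * d      # QK^T scores plus attention-weighted values

def ssm_flops(L, d, state=16):
    return 2 * L * d * state  # one recurrent scan over the sequence

d = 1024
for L in (256, 4096):         # e.g. low- vs high-resolution token counts
    ratio = attention_flops(L, d) / ssm_flops(L, d)
    print(f"L={L}: attention/SSM cost ratio = {ratio:.0f}x")
```

Under this model the gap grows linearly with L (the ratio is L/16 here), which is the intuition behind replacing attention with bidirectional SSM blocks for long sequences.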

Image Generation Video Generation

Unified Video-Language Pre-training with Synchronized Audio

no code implementations • 12 May 2024 • Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way.

A Large-scale Medical Visual Task Adaptation Benchmark

no code implementations • 19 Apr 2024 • Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li

Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens.

DailyMAE: Towards Pretraining Masked Autoencoders in One Day

1 code implementation • 31 Mar 2024 • Jiantao Wu, Shentong Mo, Sara Atito, ZhenHua Feng, Josef Kittler, Muhammad Awais

Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data.

Self-Supervised Learning

Text-to-Audio Generation Synchronized with Videos

no code implementations • 8 Mar 2024 • Shentong Mo, Jing Shi, Yapeng Tian

Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency.

AudioCaps Audio Generation +1

Audio-Synchronized Visual Animation

no code implementations • 8 Mar 2024 • Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado

We hope our established benchmark can open new avenues for controllable visual generation.

LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning

no code implementations • 27 Feb 2024 • Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li

Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed as prompts.

Representation Learning Visual Prompt Tuning

We Choose to Go to Space: Agent-driven Human and Multi-Robot Collaboration in Microgravity

no code implementations • 22 Feb 2024 • Miao Xin, Zhongrui You, Zihan Zhang, Taoran Jiang, Tingjia Xu, Haotian Liang, Guojing Ge, Yuchen Ji, Shentong Mo, Jian Cheng

We present SpaceAgents-1, a system for learning human and multi-robot collaboration (HMRC) strategies under microgravity conditions.

Decision Making

Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

no code implementations • 12 Dec 2023 • Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li

Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs.

3D Generation Denoising +1

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

1 code implementation • CVPR 2024 • Shentong Mo, Pedro Morgado

Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions.

Linker-Tuning: Optimizing Continuous Prompts for Heterodimeric Protein Prediction

no code implementations • 2 Dec 2023 • Shuxian Zou, Hui Li, Shentong Mo, Xingyi Cheng, Eric Xing, Le Song

Predicting the structure of interacting chains is crucial for understanding biological systems and developing new drugs.

Protein Structure Prediction

MultiIoT: Towards Large-scale Multisensory Learning for the Internet of Things

no code implementations • 10 Nov 2023 • Shentong Mo, Paul Pu Liang, Russ Salakhutdinov, Louis-Philippe Morency

The Internet of Things (IoT), the network integrating billions of smart physical devices embedded with sensors, software, and communication technologies for the purpose of connecting and exchanging data with other devices and systems, is a critical and rapidly expanding component of our modern world.

Representation Learning

Exploring Data Augmentations on Self-/Semi-/Fully- Supervised Pre-trained Models

no code implementations • 28 Oct 2023 • Shentong Mo, Zhun Sun, Chao Li

Data augmentation has become a standard component of vision pre-trained models to capture the invariance between augmented views.

Data Augmentation Image Classification +4

Tree of Uncertain Thoughts Reasoning for Large Language Models

no code implementations • 14 Sep 2023 • Shentong Mo, Miao Xin

These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process.

Decision Making Response Generation +1

Class-Incremental Grouping Network for Continual Audio-Visual Learning

1 code implementation • ICCV 2023 • Shentong Mo, Weiguo Pian, Yapeng Tian

Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features.

audio-visual learning Class Incremental Learning +2

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

no code implementations • 22 Aug 2023 • Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, ZhenHua Feng, Josef Kittler

Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data.

Contrastive Learning Object +6

Audio-Visual Class-Incremental Learning

1 code implementation • ICCV 2023 • Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian

We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows.

Class Incremental Learning Incremental Learning +3

A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

1 code implementation • 30 May 2023 • Shentong Mo, Pedro Morgado

The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task.

audio-visual learning

DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment

no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian

In this work, we propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA, that can simply fine-tune lightweight visual-text alignment modules with frozen modality-specific encoders to update visual-aligned text embeddings as the condition.
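The frozen-encoder/trainable-adapter recipe mentioned above can be sketched in a few lines of numpy; the random "encoders", dimensions, learning rate, and plain least-squares objective below are illustrative stand-ins for DiffAVA's actual pretrained modules and diffusion training, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen modality-specific "encoders" (random stand-ins for the real
# pretrained backbones, which stay fixed during fine-tuning).
text_enc = rng.standard_normal((32, 16)) / np.sqrt(32)
vis_enc = rng.standard_normal((48, 16)) / np.sqrt(48)
W = np.zeros((16, 16))                       # lightweight trainable alignment module

t = rng.standard_normal(32) @ text_enc       # frozen text feature
v = rng.standard_normal(48) @ vis_enc        # frozen visual feature (target)

lr = 0.03
for _ in range(200):                         # only W receives gradient updates
    err = t @ W - v                          # visual-aligned text embedding vs target
    W -= lr * np.outer(t, err)               # grad of 0.5*||tW - v||^2 w.r.t. W

print(np.linalg.norm(t @ W - v))             # alignment error driven toward zero
```

The point of the sketch is the parameter split: the encoder matrices are never touched by the update loop, so only the small alignment module is trained.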

AudioCaps Audio Generation +1

AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

no code implementations • 3 May 2023 • Shentong Mo, Yapeng Tian

In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio.

Decoder Object Localization +2

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

no code implementations • 10 Apr 2023 • Shentong Mo, Jingfei Xia, Ihor Markevych

Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.

Image Retrieval Phrase Grounding +6

Audio-Visual Grouping Network for Sound Localization from Mixtures

1 code implementation • CVPR 2023 • Shentong Mo, Yapeng Tian

Sound source localization is a typical and challenging task that predicts the location of sound sources in a video.

Object Localization

Variational Autoencoder with Decremental Information Bottleneck for Disentanglement

1 code implementation • 22 Mar 2023 • Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Xingshen Zhang, Lin Wang, Xiang Yang

One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity.


Rethinking Prototypical Contrastive Learning through Alignment, Uniformity and Correlation

no code implementations • 18 Oct 2022 • Shentong Mo, Zhun Sun, Chao Li

Particularly, in classification downstream tasks with linear probes, our proposed method outperforms the state-of-the-art instance-wise and prototypical contrastive learning methods on the ImageNet-100 dataset by 2.96% and on the ImageNet-1K dataset by 2.46% under the same settings of batch size and epochs.
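A linear probe of the kind referenced above is simply a linear classifier trained on frozen features; the synthetic features and logistic-regression loop below are an illustrative sketch under assumed shapes, not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen "pretrained" features for a 2-class toy problem (random stand-ins
# for real backbone features); only the linear probe below is trained.
feats = rng.standard_normal((200, 32))
labels = (feats[:, 0] + 0.1 * rng.standard_normal(200) > 0).astype(float)

w, b, lr = np.zeros(32), 0.0, 0.1
for _ in range(500):                        # logistic-regression probe
    p = 1 / (1 + np.exp(-(feats @ w + b)))  # predicted class-1 probability
    g = p - labels                          # gradient of binary cross-entropy
    w -= lr * feats.T @ g / len(labels)
    b -= lr * g.mean()

acc = (((feats @ w + b) > 0) == (labels > 0.5)).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```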

Contrastive Learning Self-Supervised Learning

A Closer Look at Weakly-Supervised Audio-Visual Source Localization

1 code implementation • 30 Aug 2022 • Shentong Mo, Pedro Morgado

We also propose a new approach for visual sound source localization that addresses both these problems.

Siamese Prototypical Contrastive Learning

no code implementations • 18 Aug 2022 • Shentong Mo, Zhun Sun, Chao Li

One of the drawbacks of CSL is that the loss term ideally requires a large number of negative samples to provide a tighter mutual information bound.
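The remark about negative samples reflects the standard InfoNCE analysis, where the loss yields a mutual-information lower bound of roughly log(N) minus the loss, so the achievable bound grows with the number of negatives N. The toy numpy demo below is illustrative (random features, assumed temperature), not the paper's loss implementation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temp=0.1):
    """InfoNCE loss for one anchor: cross-entropy of the positive against
    the negatives. The MI lower bound it yields, log(N) - loss, can only
    grow with the number of candidates N, which is why large N helps."""
    cands = np.vstack([positive, negatives])               # (N, D), positive first
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    a = anchor / np.linalg.norm(anchor)
    logits = cands @ a / temp                              # cosine similarities
    return -logits[0] + np.log(np.exp(logits).sum())       # -log softmax(positive)

rng = np.random.default_rng(0)
a = rng.standard_normal(16)
pos = a + 0.1 * rng.standard_normal(16)                    # correlated positive view
for n_neg in (7, 255):
    loss = info_nce(a, pos, rng.standard_normal((n_neg, 16)))
    print(f"N={n_neg + 1}: MI bound >= {np.log(n_neg + 1) - loss:.2f}")
```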

Contrastive Learning Self-Supervised Learning

Object-wise Masked Autoencoders for Fast Pre-training

no code implementations • 28 May 2022 • Jiantao Wu, Shentong Mo

Furthermore, we investigate the inter-object and intra-object relationship and find that the latter is crucial for self-supervised pre-training.

Image Classification Object

Unitail: Detecting, Reading, and Matching in Retail Scene

no code implementations • 1 Apr 2022 • Fangyi Chen, Han Zhang, Zaiwang Li, Jiachen Dou, Shentong Mo, Hao Chen, Yongxin Zhang, Uzair Ahmed, Chenchen Zhu, Marios Savvides

To make full use of computer vision technology in stores, one must consider the actual needs that fit the characteristics of the retail scene.

Benchmarking Dense Object Detection +2

Point3D: tracking actions as moving points with 3D CNNs

no code implementations • 20 Mar 2022 • Shentong Mo, Jingfei Xia, Xiaoqing Tan, Bhiksha Raj

Our Point3D consists of a Point Head for action localization and a 3D Head for action classification.

Action Classification Action Localization +1

Localizing Visual Sounds the Easy Way

1 code implementation • 17 Mar 2022 • Shentong Mo, Pedro Morgado

Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training.

Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding

no code implementations • 8 Mar 2022 • Shentong Mo, Daizong Liu, Wei Hu

Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames.

Contrastive Learning Sentence +2

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

1 code implementation • 2 Mar 2022 • Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Ruslan Salakhutdinov

Many real-world problems are inherently multimodal, from spoken language, gestures, and paralinguistics humans use to communicate, to force, proprioception, and visual sensors on robots.

Representation Learning Time Series Analysis +2

Context Autoencoder for Self-Supervised Representation Learning

6 code implementations • 7 Feb 2022 • Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

The pretraining includes two tasks: masked representation prediction (predicting the representations of the masked patches) and masked patch reconstruction (reconstructing the raw masked patches).
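The random masking underlying both pretraining tasks can be sketched with numpy; the 224x224 image, 16x16 patches, and 75% mask ratio below are common masked-image-modeling choices assumed for illustration, not taken from the paper.

```python
import numpy as np

def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into visible and masked sets, as in masked
    image modeling pipelines (the ratio here is illustrative)."""
    perm = rng.permutation(num_patches)
    n_masked = int(num_patches * mask_ratio)
    return perm[n_masked:], perm[:n_masked]    # visible, masked

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))       # toy image
# 14x14 grid of 16x16x3 patches, flattened to (196, 768)
patches = img.reshape(14, 16, 14, 16, 3).transpose(0, 2, 1, 3, 4).reshape(196, -1)

visible, masked = random_masking(len(patches), mask_ratio=0.75, rng=rng)
print(len(visible), len(masked))               # 49 147
# The two pretraining targets are then stated on the masked set:
#  - masked representation prediction: regress latent features of patches[masked]
#  - masked patch reconstruction: regress the raw pixels in patches[masked]
```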

Decoder Instance Segmentation +6

Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types

no code implementations • 11 Oct 2021 • Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Zhiqiang Shen, Eric P Xing, Yanyan Lan

The core problem is to model how regulatory elements interact with each other and how their interactions vary across different cell types.

Piecing and Chipping: An effective solution for the information-erasing view generation in Self-supervised Learning

no code implementations • 29 Sep 2021 • Jingwei Liu, Yi Gu, Shentong Mo, Zhun Sun, Shumin Han, Jiafeng Guo, Xueqi Cheng

In self-supervised learning frameworks, deep networks are optimized to align different views of an instance that contains the similar visual semantic information.

Data Augmentation Self-Supervised Learning

Representation Disentanglement in Generative Models with Contrastive Learning

no code implementations • 29 Sep 2021 • Shentong Mo, Zhun Sun, Shumin Han

Recent works apply contrastive learning to the discriminator of Generative Adversarial Networks, but little work has explored whether contrastive learning can be applied to encoders to learn disentangled representations.

Contrastive Learning Disentanglement +1

Multi-modal Self-supervised Pre-training for Large-scale Genome Data

no code implementations • NeurIPS Workshop AI4Science 2021 • Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Yanyan Lan, Zhiqiang Shen, Eric Xing

In this work, we propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.

Learning by Examples Based on Multi-level Optimization

no code implementations • 22 Sep 2021 • Shentong Mo, Pengtao Xie

Learning by examples, which learns to solve a new problem by looking into how similar problems are solved, is an effective learning method in human learning.

Few-Shot Learning

Automatic Speech Verification Spoofing Detection

1 code implementation • 15 Dec 2020 • Shentong Mo, Haofan Wang, Pinxu Ren, Ta-Chung Chi

Automatic speech verification (ASV) is the technology for determining the identity of a person based on their voice.
