Search Results for author: Hassan Akbari

Found 8 papers, 5 papers with code

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

no code implementations • NeurIPS 2023 • Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam

We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model.

 Ranked #1 on Zero-Shot Action Recognition on Kinetics (using extra training data)

Classification • Image Classification • +7
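
The alternating-update recipe in the snippet above can be illustrated with a short training-loop sketch. This is a minimal, hypothetical illustration (the task list, loss functions, resolutions, and the task-conditioned `forward()` are placeholders, not the paper's actual configuration): each step picks one (task, loss, resolution) combination and applies a single gradient update for it.

```python
import torch
import torch.nn.functional as F

# Hypothetical task list: (task name, loss function, input resolution).
# Placeholders for illustration, not the configurations used in the paper.
TASKS = [
    ("image_classification", torch.nn.CrossEntropyLoss(), 224),
    ("video_classification", torch.nn.CrossEntropyLoss(), 160),
    ("audio_event_classification", torch.nn.BCEWithLogitsLoss(), 128),
]

def alternating_gradient_descent(model, loaders, num_steps, lr=1e-4):
    """Alternate single gradient steps over tasks, modalities, and resolutions."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    iterators = {name: iter(loaders[name]) for name, _, _ in TASKS}
    for step in range(num_steps):
        # Cycle deterministically over tasks; each step updates on one task only.
        name, loss_fn, resolution = TASKS[step % len(TASKS)]
        inputs, targets = next(iterators[name])
        # Resize to the task-specific resolution (4-D image-like inputs assumed).
        inputs = F.interpolate(inputs, size=(resolution, resolution))
        optimizer.zero_grad()
        outputs = model(inputs, task=name)  # assumes a task-conditioned forward()
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
```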

Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

no code implementations • 3 Nov 2022 • Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu

For example, even in commonly used instructional videos, a speaker can refer to something that is not visually present in the current frame; the semantic misalignment is even more unpredictable for raw videos from the internet.

Contrastive Learning
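
The title refers to harmonizing gradients across modalities. As a generic illustration only, a PCGrad-style projection is one common way to reduce conflict between per-modality gradients; this is not necessarily the harmonization scheme proposed in this paper.

```python
import torch

def harmonize(grad_a: torch.Tensor, grad_b: torch.Tensor) -> torch.Tensor:
    """If grad_a conflicts with grad_b (negative dot product), remove from
    grad_a its component along grad_b. Generic PCGrad-style projection,
    shown only to illustrate gradient de-conflicting across modalities;
    not claimed to be the rule used in the paper above."""
    dot = torch.dot(grad_a.flatten(), grad_b.flatten())
    if dot < 0:
        grad_a = grad_a - (dot / grad_b.norm().pow(2).clamp_min(1e-12)) * grad_b
    return grad_a
```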

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

2 code implementations • NeurIPS 2021 • Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

Ranked #3 on Zero-Shot Video Retrieval on YouCook2 (text-to-video Mean Rank metric)

Action Classification • Action Recognition In Videos • +9
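
The multimodal contrastive training mentioned in the VATT snippet can be sketched as a symmetric InfoNCE loss between paired embeddings from two modalities (e.g. video and audio). This is a minimal, generic sketch assuming clip-level embeddings; it does not reproduce VATT's exact objectives.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    z_a, z_b: (batch, dim) embeddings from two modalities; row i of z_a is
    paired with row i of z_b. A generic sketch, not VATT's exact loss."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Matching pairs sit on the diagonal; treat each row/column as a classification.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```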

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

1 code implementation • CVPR 2019 • Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang

After applying dedicated non-linear mappings to the visual features at each level and to the word and sentence embeddings, we obtain multiple instantiations of our common semantic space, in which comparisons between any target text and the visual content are performed with cosine similarity.

Language Modelling • Phrase Grounding • +2
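
The snippet describes projecting visual features from several levels and word/sentence embeddings into a shared space and comparing them with cosine similarity. A minimal sketch of that idea follows; the dimensions and two-layer mappings are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Project per-level visual features and text embeddings into one space
    and score them with cosine similarity. Illustrative sketch only."""

    def __init__(self, visual_dims, text_dim, common_dim=512):
        super().__init__()
        # One dedicated non-linear mapping per visual level.
        self.visual_maps = nn.ModuleList(
            nn.Sequential(nn.Linear(d, common_dim), nn.ReLU(), nn.Linear(common_dim, common_dim))
            for d in visual_dims
        )
        self.text_map = nn.Sequential(
            nn.Linear(text_dim, common_dim), nn.ReLU(), nn.Linear(common_dim, common_dim)
        )

    def forward(self, visual_feats, text_emb):
        # visual_feats: list of (batch, locations, dim_l) tensors, one per level.
        # text_emb: (batch, text_dim) word or sentence embedding.
        t = F.normalize(self.text_map(text_emb), dim=-1)
        scores = []
        for feat, mapping in zip(visual_feats, self.visual_maps):
            v = F.normalize(mapping(feat), dim=-1)
            scores.append(torch.einsum("bnd,bd->bn", v, t))  # cosine sim per location
        return scores  # one text-to-visual similarity map per level
```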

Lip2AudSpec: Speech reconstruction from silent lip movements video

1 code implementation • 26 Oct 2017 • Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani

In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos.

Lip Reading
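
The input/output mapping in the Lip2AudSpec snippet (silent lip-region video frames in, a speech spectrogram estimate out) can be sketched with a small CNN-plus-RNN model. All layer choices and sizes here are assumptions for illustration; this is not the network described in the paper.

```python
import torch
import torch.nn as nn

class LipToSpectrogram(nn.Module):
    """Map a sequence of lip-region frames to a spectrogram-like output.
    Illustrative sketch; not the architecture from Lip2AudSpec."""

    def __init__(self, n_mels=128, hidden=256):
        super().__init__()
        # Per-frame visual encoder (grayscale lip crops assumed).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal model over the frame sequence.
        self.rnn = nn.GRU(64, hidden, batch_first=True)
        # Per-timestep projection to spectrogram bins.
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, frames):
        # frames: (batch, time, 1, height, width) silent lip video.
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out)  # (batch, time, n_mels) spectrogram estimate
```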
