Search Results for author: Po-Yao Huang

Found 26 papers, 16 papers with code

DINOv2: Learning Robust Visual Features without Supervision

11 code implementations • 14 Apr 2023 • Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision.

Ranked #1 on Image Classification on CIFAR-10

Domain Generalization Fine-Grained Image Classification +5

124,889

Paper
Code

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

2 code implementations • EMNLP 2021 • Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

Ranked #1 on Temporal Action Localization on CrossTask (using extra training data)

Action Segmentation Long Video Retrieval (Background Removed) +4

29,237

Paper
Code

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

1 code implementation • Findings (ACL) 2021 • Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.

Ranked #2 on Temporal Action Localization on CrossTask (using extra training data)

Action Segmentation Language Modelling +5

29,233

Paper
Code

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

2 code implementations • 25 Mar 2024 • Puyuan Peng, Po-Yao Huang, Abdelrahman Mohamed, David Harwath

We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.

Language Modelling

6,548

Paper
Code

Masked Autoencoders that Listen

4 code implementations • 13 Jul 2022 • Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.

Ranked #2 on Speaker Identification on VoxCeleb1 (using extra training data)

Audio Classification Representation Learning +1

1,290

Paper
Code

Demystifying CLIP Data

2 code implementations • 28 Sep 2023 • Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective.

993

Paper
Code

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

2 code implementations • 1 Jun 2023 • Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance.

Ranked #1 on Image Classification on iNaturalist 2019 (using extra training data)

Action Classification Action Recognition In Videos +4

691

Paper
Code

Argus: Efficient Activity Detection System for Extended Video Analysis

1 code implementation • Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops 2020 • Wenhe Liu, Guoliang Kang, Po-Yao Huang, Xiaojun Chang, Yijun Qian, Junwei Liang, Liangke Gui, Jing Wen, Peng Chen

We propose an Efficient Activity Detection System, Argus, for Extended Video Analysis in the surveillance scenario.

Action Detection Activity Detection +5

457

Paper
Code

CiT: Curation in Training for Effective Vision-Language Data

1 code implementation • ICCV 2023 • Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford.

Paper
Code

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

1 code implementation • NAACL 2021 • Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, Alexander Hauptmann

Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.

Image Retrieval Text-to-video search +1

Paper
Code

RWR-GAE: Random Walk Regularization for Graph Auto Encoders

1 code implementation • 12 Aug 2019 • Vaibhav, Po-Yao Huang, Robert Frederking

Node embeddings have become an ubiquitous technique for representing graph data in a low dimensional space.

Ranked #2 on Graph Clustering on Pubmed

Clustering Graph Clustering +2

Paper
Code

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

1 code implementation • 19 Sep 2023 • Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee

Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information.

audio-visual learning Representation Learning

Paper
Code

MAViL: Masked Audio-Video Learners

1 code implementation • NeurIPS 2023 • Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations.

Contrastive Learning Retrieval

Paper
Code

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

1 code implementation • CVPR 2023 • Xiaoyu Zhu, Po-Yao Huang, Junwei Liang, Celso M. de Melo, Alexander Hauptmann

The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention.

Action Recognition Temporal Action Localization

Paper
Code

Audio-Visual Event Recognition through the lens of Adversary

1 code implementation • 15 Nov 2020 • Juncheng B Li, Kaixin Ma, Shuhui Qu, Po-Yao Huang, Florian Metze

This work aims to study several key questions related to multimodal learning through the lens of adversarial noises: 1) The trade-off between early/middle/late fusion affecting its robustness and accuracy 2) How do different frequency/time domain features contribute to the robustness?

Paper
Code

Video Representation Learning and Latent Concept Mining for Large-scale Multi-label Video Classification

no code implementations • 5 Jul 2017 • Po-Yao Huang, Ye Yuan, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann

We report on CMU Informedia Lab's system used in Google's YouTube 8 Million Video Understanding Challenge.

Attribute General Classification +3

Paper
Add Code

Attention-based Multimodal Neural Machine Translation

no code implementations • WS 2016 • Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, Chris Dyer

Image Captioning Machine Translation +1

Paper
Add Code

RCAA: Relational Context-Aware Agents for Person Search

no code implementations • ECCV 2018 • Xiaojun Chang, Po-Yao Huang, Yi-Dong Shen, Xiaodan Liang, Yi Yang, Alexander G. Hauptmann

In this paper, we address this problem by training relational context-aware agents which learn the actions to localize the target person from the gallery of whole scene images.

Person Search

Paper
Add Code

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations

no code implementations • IJCNLP 2019 • Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations.

Image Retrieval object-detection +2

Paper
Add Code

Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting

no code implementations • ACL 2020 • Po-Yao Huang, Junjie Hu, Xiaojun Chang, Alexander Hauptmann

In this paper, we investigate how to utilize visual content for disambiguation and promoting latent space alignment in unsupervised MMT.

Translation Unsupervised Machine Translation

Paper
Add Code

A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions

no code implementations • 1 Jun 2020 • Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, Xin Wang

Neural Architecture Search (NAS) is just such a revolutionary algorithm, and the related research work is complicated and rich.

Neural Architecture Search

Paper
Add Code

A Survey of Deep Active Learning

1 code implementation • 30 Aug 2020 • Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, Xin Wang

Therefore, deep active learning (DAL) has emerged.

Active Learning speech-recognition +1

Paper
Code

Support-set bottlenecks for video-text representation learning

no code implementations • ICLR 2021 • Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs.

Contrastive Learning Representation Learning +3

Paper
Add Code

Diffusion Models as Masked Autoencoders

no code implementations • ICCV 2023 • Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer

There has been a longstanding belief that generation can facilitate a true understanding of visual data.

Denoising Image Inpainting

Paper
Add Code

FLAP: Fast Language-Audio Pre-training

no code implementations • 2 Nov 2023 • Ching-Feng Yeh, Po-Yao Huang, Vasu Sharma, Shang-Wen Li, Gargi Gosh

We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that efficiently and effectively learns aligned audio and language representations through masking, contrastive learning and reconstruction.

AudioCaps Contrastive Learning +2

Paper
Add Code

Adversarially Masked Video Consistency for Unsupervised Domain Adaptation

no code implementations • 24 Mar 2024 • Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann

The second is a Masked Consistency Learning module to learn class-discriminative representations.

Unsupervised Domain Adaptation

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.