Search Results for author: Chiori Hori

Found 22 papers, 5 papers with code

NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

1 code implementation • 27 Feb 2024 • Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

Existing neural field (NF)-based methods have focused on estimating the magnitude of the HRTF for a given sound source direction, and this magnitude is then converted to a finite impulse response (FIR) filter (a conversion sketched below).

Spatial Interpolation
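For context on the snippet above: one standard way to turn an HRTF magnitude response into an FIR filter is a minimum-phase reconstruction via the real cepstrum. The NumPy sketch below is a generic illustration of that conversion, not code from the paper; the function name and numerical floor are our own choices.

```python
import numpy as np

def magnitude_to_min_phase_fir(mag):
    """Convert a one-sided magnitude response (length n_fft // 2 + 1)
    into minimum-phase FIR taps via the real cepstrum."""
    full = np.concatenate([mag, mag[-2:0:-1]])   # two-sided spectrum
    # Real cepstrum of the log magnitude (floor avoids log(0)).
    cep = np.fft.ifft(np.log(np.maximum(full, 1e-8))).real
    n = len(cep)
    win = np.zeros(n)                            # cepstral folding window
    win[0] = 1.0
    win[1:n // 2] = 2.0
    win[n // 2] = 1.0                            # n is even by construction
    spec = np.exp(np.fft.fft(cep * win))
    return np.fft.ifft(spec).real                # causal FIR taps

# A flat magnitude response yields an (approximately) unit impulse.
taps = magnitude_to_min_phase_fir(np.ones(129))
```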

Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

no code implementations • 30 Oct 2023 • Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux

Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers.

Speaker Separation • Speech Enhancement +1
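To make the task definition concrete, here is a toy cue-conditioned extractor in PyTorch: a time-frequency mask is estimated from the mixture spectrogram given a cue embedding (e.g., an enrollment or visual embedding). This is a generic sketch, not TF-GridNet; all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CueConditionedExtractor(nn.Module):
    """Toy target-speech extractor: estimate a time-frequency mask for
    the target from the mixture spectrogram, conditioned on a cue
    embedding broadcast over time."""
    def __init__(self, n_freq=257, d_cue=128, d_hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq + d_cue, d_hidden, batch_first=True)
        self.mask = nn.Linear(d_hidden, n_freq)

    def forward(self, mix_spec, cue):
        # Tile the cue along the time axis and fuse it with the mixture.
        cue_t = cue.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_spec, cue_t], dim=-1))
        return torch.sigmoid(self.mask(h)) * mix_spec  # masked target

est = CueConditionedExtractor()(torch.randn(2, 100, 257).abs(),
                                torch.randn(2, 128))
```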

Generation or Replication: Auscultating Audio Latent Diffusion Models

no code implementations • 16 Oct 2023 • Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

The introduction of audio latent diffusion models, which can generate realistic sound clips on demand from a text description, has the potential to revolutionize how we work with audio.

AudioCaps • Memorization +1

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

no code implementations • 27 Jun 2023 • Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.

Multi-Task Learning • Scene Understanding +3
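Since the method above generates sequences of dynamic movement primitives, a minimal 1-D DMP rollout may help ground the term. The formulation below is the standard one (a canonical phase variable driving a critically damped transformation system); it is not the paper's code, and all parameter values are placeholders.

```python
import numpy as np

def rollout_dmp(x0, g, weights, centers, widths,
                tau=1.0, dt=0.01, alpha_s=4.0, K=100.0):
    """Integrate a 1-D dynamic movement primitive: a canonical phase s
    drives an RBF forcing term that perturbs a critically damped
    spring-damper pulling the state x toward the goal g."""
    D = 2.0 * np.sqrt(K)              # critical damping
    x, v, s = float(x0), 0.0, 1.0     # position, velocity, phase
    traj = [x]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (s - centers) ** 2)        # RBF activations
        f = s * (g - x0) * (psi @ weights) / (psi.sum() + 1e-10)
        v += dt / tau * (K * (g - x) - D * v + f)         # transformation system
        x += dt / tau * v
        s += dt / tau * (-alpha_s * s)                    # canonical system
        traj.append(x)
    return np.array(traj)

# With zero weights the DMP reduces to a smooth point-to-point reach 0 -> 1.
path = rollout_dmp(0.0, 1.0, np.zeros(10),
                   np.linspace(0.0, 1.0, 10), np.full(10, 25.0))
```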

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

no code implementations • 18 Feb 2022 • Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux

Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.

Question Answering • Spatio-temporal Scene Graphs +1

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

no code implementations • 13 Oct 2021 • Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8).

Region Proposal

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

no code implementations • 4 Aug 2021 • Chiori Hori, Takaaki Hori, Jonathan Le Roux

A CNN-based timing detector is also trained to detect the proper output timing, at which the captions generated by the two Transformers become sufficiently close to each other.

Video Captioning
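The paper's detector is a learned CNN, but the criterion it is trained toward, emitting once the online and offline captions are sufficiently close, can be illustrated with a toy heuristic. Everything below (function name, overlap measure, threshold) is a hypothetical stand-in, not the paper's method.

```python
def ready_to_emit(online_tokens, offline_tokens, threshold=0.8):
    """Hypothetical trigger: fire once the online (partial-observation)
    caption covers enough of the offline (full-observation) caption."""
    if not offline_tokens:
        return False
    overlap = len(set(online_tokens) & set(offline_tokens))
    return overlap / len(set(offline_tokens)) >= threshold

# True: the streaming hypothesis already covers most of the offline caption.
print(ready_to_emit("a man is cooking in a".split(),
                    "a man is cooking in a kitchen".split()))
```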

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

no code implementations • 19 Apr 2021 • Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1
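As a rough illustration of the "context-expanded" idea, independent of the Conformer, activation-recycling, and triggered-attention details: each utterance is processed together with a window of preceding utterances. The helper below is purely illustrative.

```python
def expand_context(utterances, max_ctx=2):
    """Illustrative context expansion: each utterance is paired with the
    frames of up to `max_ctx` preceding utterances before decoding."""
    expanded = []
    for i, utt in enumerate(utterances):
        ctx = utterances[max(0, i - max_ctx):i]
        expanded.append(sum(ctx, []) + utt)   # concatenate feature frames
    return expanded

# [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
print(expand_context([[1, 2], [3], [4, 5]]))
```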

Multi-Pass Transformer for Machine Translation

no code implementations • 23 Sep 2020 • Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux

In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers.

Machine Translation • Neural Architecture Search +1
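A minimal PyTorch sketch of the multi-pass idea: run the encoder stack twice and let every layer in the second pass also see the first pass's top-layer output. The fusion rule here (concatenation plus a linear map) is our assumption; the paper explores connection patterns via neural architecture search rather than fixing one.

```python
import torch
import torch.nn as nn

class MultiPassEncoder(nn.Module):
    """Two passes over the same encoder stack; in the second pass each
    layer's input is fused with the first pass's top-layer output, so
    earlier layers can exploit information computed by later layers."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumed fusion rule

    def forward(self, x):
        h = x
        for layer in self.layers:        # pass 1: ordinary bottom-up flow
            h = layer(h)
        top = h                          # later-layer information
        h = x
        for layer in self.layers:        # pass 2: conditioned on pass-1 top
            h = layer(self.fuse(torch.cat([h, top], dim=-1)))
        return h

out = MultiPassEncoder()(torch.randn(2, 10, 256))  # (batch, time, d_model)
```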

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

no code implementations • 8 Jul 2020 • Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content.

Answer Generation • Graph Representation Learning

Spatio-Temporal Ranked-Attention Networks for Video Captioning

no code implementations • 17 Jan 2020 • Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame.

Video Captioning
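The two attention orders can be written compactly. Below is a toy PyTorch rendering of the ST and TS sub-models described above; the query construction and pooling choices are simplifications, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def st_attention(feats, q_s, q_t):
    """Spatio-temporal (ST) order: attend over regions within each frame,
    then pool the attended regional features over time.
    feats: (T, R, D) region features; q_s, q_t: (D,) query vectors."""
    a_s = F.softmax(feats @ q_s, dim=1)              # (T, R) spatial weights
    per_frame = (a_s.unsqueeze(-1) * feats).sum(1)   # (T, D)
    a_t = F.softmax(per_frame @ q_t, dim=0)          # (T,) temporal weights
    return (a_t.unsqueeze(-1) * per_frame).sum(0)    # (D,)

def ts_attention(feats, q_s, q_t):
    """Temporo-spatial (TS) order: softly select a frame first, then
    attend spatially within that frame."""
    a_t = F.softmax(feats.mean(1) @ q_t, dim=0)      # (T,) frame weights
    frame = (a_t.view(-1, 1, 1) * feats).sum(0)      # (R, D) selected frame
    a_s = F.softmax(frame @ q_s, dim=0)              # (R,) spatial weights
    return (a_s.unsqueeze(-1) * frame).sum(0)        # (D,)

v = torch.randn(8, 16, 64)   # 8 frames, 16 regions, 64-dim features
q = torch.randn(64)          # stand-in for the language-state query
fused = st_attention(v, q, q) + ts_attention(v, q, q)
```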

Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering

no code implementations • 3 Jan 2020 • Lei Shi, Shijie Geng, Kai Shuang, Chiori Hori, Songxiang Liu, Peng Gao, Sen Su

To solve the issue for the intermediate layers, we propose an efficient Quaternion Block Network (QBN) to learn interaction not only for the last layer but also for all intermediate layers simultaneously.

Question Answering • Video Description +1
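The building block behind quaternion layers is the Hamilton product, which mixes the four components of each quaternion-structured feature. The product itself is standard mathematics; how it is applied below (one element-wise interaction between two feature tensors) is only an illustration, not the QBN architecture.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of quaternion-structured features: the last axis
    is split into four equal blocks (r, x, y, z)."""
    r1, x1, y1, z1 = np.split(p, 4, axis=-1)
    r2, x2, y2, z2 = np.split(q, 4, axis=-1)
    return np.concatenate([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,   # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,   # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,   # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,   # k component
    ], axis=-1)

# Two batches of 128-dim features, read as 4 blocks of 32 components each.
a, b = np.random.randn(2, 8, 128)
interaction = hamilton_product(a, b)   # same shape, components now mixed
```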

End-to-end Conversation Modeling Track in DSTC6

1 code implementation • 22 Jun 2017 • Chiori Hori, Takaaki Hori

For example, Ghazvininejad et al. proposed a knowledge-grounded neural conversation model [3], which aims to combine conversational dialogs with task-oriented knowledge using unstructured data, such as Twitter data for conversation and Foursquare data for external knowledge. However, the task is still limited to a restaurant information service and has not yet been tested with a wide variety of dialog tasks.

Attention-Based Multimodal Fusion for Video Description

no code implementations • ICCV 2017 • Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks

Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs).

Sentence • Video Description
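As a sketch of attention-based multimodal fusion in this spirit: the decoder state scores each modality, and the fused representation is the attention-weighted sum of per-modality projections (rather than a naive concatenation). Module names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Attention-based multimodal fusion sketch: the decoder state scores
    each modality, and the fused vector is the attention-weighted sum of
    per-modality projections (instead of a plain concatenation)."""
    def __init__(self, dims, d_state=512, d_fused=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_fused) for d in dims)
        self.score = nn.ModuleList(
            nn.Linear(d_state + d_fused, 1) for _ in dims)

    def forward(self, state, feats):
        projected = [p(f) for p, f in zip(self.proj, feats)]
        logits = torch.cat([s(torch.cat([state, f], dim=-1))
                            for s, f in zip(self.score, projected)], dim=-1)
        alpha = torch.softmax(logits, dim=-1)          # (B, n_modalities)
        stacked = torch.stack(projected, dim=1)        # (B, M, d_fused)
        return (alpha.unsqueeze(-1) * stacked).sum(1)  # fused (B, d_fused)

fusion = ModalityAttentionFusion(dims=[2048, 128])     # e.g. video, audio
fused = fusion(torch.randn(4, 512),
               [torch.randn(4, 2048), torch.randn(4, 128)])
```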
