Search Results for author: Chiori Hori

Found 22 papers, 5 papers with code

NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

1 code implementation • 27 Feb 2024 • Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

Existing neural field (NF)-based methods have focused on estimating the magnitude of the HRTF for a given sound source direction, and this magnitude is then converted to a finite impulse response (FIR) filter (a conversion sketched below).

Spatial Interpolation
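For context on the snippet above: one standard way to turn an HRTF magnitude response into an FIR filter is a minimum-phase reconstruction via the real cepstrum. The NumPy sketch below is a generic illustration of that conversion, not code from the paper; the function name and numerical floor are our own choices.

```python
import numpy as np

def magnitude_to_min_phase_fir(mag):
    """Convert a one-sided magnitude response (length n_fft // 2 + 1)
    into minimum-phase FIR taps via the real cepstrum."""
    full = np.concatenate([mag, mag[-2:0:-1]])   # two-sided spectrum
    # Real cepstrum of the log magnitude (floor avoids log(0)).
    cep = np.fft.ifft(np.log(np.maximum(full, 1e-8))).real
    n = len(cep)
    win = np.zeros(n)                            # cepstral folding window
    win[0] = 1.0
    win[1:n // 2] = 2.0
    win[n // 2] = 1.0                            # n is even by construction
    spec = np.exp(np.fft.fft(cep * win))
    return np.fft.ifft(spec).real                # causal FIR taps

# A flat magnitude response yields an (approximately) unit impulse.
taps = magnitude_to_min_phase_fir(np.ones(129))
```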

Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

no code implementations • 30 Oct 2023 • Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux

Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers.

Speaker Separation • Speech Enhancement +1
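To make the task definition concrete, here is a toy cue-conditioned extractor in PyTorch: a time-frequency mask is estimated from the mixture spectrogram given a cue embedding (e.g., an enrollment or visual embedding). This is a generic sketch, not TF-GridNet; all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CueConditionedExtractor(nn.Module):
    """Toy target-speech extractor: estimate a time-frequency mask for
    the target from the mixture spectrogram, conditioned on a cue
    embedding broadcast over time."""
    def __init__(self, n_freq=257, d_cue=128, d_hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq + d_cue, d_hidden, batch_first=True)
        self.mask = nn.Linear(d_hidden, n_freq)

    def forward(self, mix_spec, cue):
        # Tile the cue along the time axis and fuse it with the mixture.
        cue_t = cue.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_spec, cue_t], dim=-1))
        return torch.sigmoid(self.mask(h)) * mix_spec  # masked target

est = CueConditionedExtractor()(torch.randn(2, 100, 257).abs(),
                                torch.randn(2, 128))
```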

Generation or Replication: Auscultating Audio Latent Diffusion Models

no code implementations • 16 Oct 2023 • Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

The introduction of audio latent diffusion models, which can generate realistic sound clips on demand from a text description, has the potential to revolutionize how we work with audio.

AudioCaps • Memorization +1

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

no code implementations • 27 Jun 2023 • Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.

Multi-Task Learning • Scene Understanding +3
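Since the method above generates sequences of dynamic movement primitives, a minimal 1-D DMP rollout may help ground the term. The formulation below is the standard one (a canonical phase variable driving a critically damped transformation system); it is not the paper's code, and all parameter values are placeholders.

```python
import numpy as np

def rollout_dmp(x0, g, weights, centers, widths,
                tau=1.0, dt=0.01, alpha_s=4.0, K=100.0):
    """Integrate a 1-D dynamic movement primitive: a canonical phase s
    drives an RBF forcing term that perturbs a critically damped
    spring-damper pulling the state x toward the goal g."""
    D = 2.0 * np.sqrt(K)              # critical damping
    x, v, s = float(x0), 0.0, 1.0     # position, velocity, phase
    traj = [x]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (s - centers) ** 2)        # RBF activations
        f = s * (g - x0) * (psi @ weights) / (psi.sum() + 1e-10)
        v += dt / tau * (K * (g - x) - D * v + f)         # transformation system
        x += dt / tau * v
        s += dt / tau * (-alpha_s * s)                    # canonical system
        traj.append(x)
    return np.array(traj)

# With zero weights the DMP reduces to a smooth point-to-point reach 0 -> 1.
path = rollout_dmp(0.0, 1.0, np.zeros(10),
                   np.linspace(0.0, 1.0, 10), np.full(10, 25.0))
```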

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

no code implementations • 18 Feb 2022 • Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux

Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame.

Question Answering • Spatio-temporal Scene Graphs +1

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

no code implementations • 13 Oct 2021 • Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8).

Region Proposal

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

no code implementations • 4 Aug 2021 • Chiori Hori, Takaaki Hori, Jonathan Le Roux

A CNN-based timing detector is also trained to detect the proper output timing, at which the captions generated by the two Transformers become sufficiently close to each other.

Video Captioning
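The paper's detector is a learned CNN, but the criterion it is trained toward, emitting once the online and offline captions are sufficiently close, can be illustrated with a toy heuristic. Everything below (function name, overlap measure, threshold) is a hypothetical stand-in, not the paper's method.

```python
def ready_to_emit(online_tokens, offline_tokens, threshold=0.8):
    """Hypothetical trigger: fire once the online (partial-observation)
    caption covers enough of the offline (full-observation) caption."""
    if not offline_tokens:
        return False
    overlap = len(set(online_tokens) & set(offline_tokens))
    return overlap / len(set(offline_tokens)) >= threshold

# True: the streaming hypothesis already covers most of the offline caption.
print(ready_to_emit("a man is cooking in a".split(),
                    "a man is cooking in a kitchen".split()))
```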

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

no code implementations • 19 Apr 2021 • Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention.

Automatic Speech Recognition • Automatic Speech Recognition (ASR) +1
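As a rough illustration of the "context-expanded" idea, independent of the Conformer, activation-recycling, and triggered-attention details: each utterance is processed together with a window of preceding utterances. The helper below is purely illustrative.

```python
def expand_context(utterances, max_ctx=2):
    """Illustrative context expansion: each utterance is paired with the
    frames of up to `max_ctx` preceding utterances before decoding."""
    expanded = []
    for i, utt in enumerate(utterances):
        ctx = utterances[max(0, i - max_ctx):i]
        expanded.append(sum(ctx, []) + utt)   # concatenate feature frames
    return expanded

# [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
print(expand_context([[1, 2], [3], [4, 5]]))
```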

Multi-Pass Transformer for Machine Translation

no code implementations • 23 Sep 2020 • Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux

In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers.

Machine Translation • Neural Architecture Search +1
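A minimal PyTorch sketch of the multi-pass idea: run the encoder stack twice and let every layer in the second pass also see the first pass's top-layer output. The fusion rule here (concatenation plus a linear map) is our assumption; the paper explores connection patterns via neural architecture search rather than fixing one.

```python
import torch
import torch.nn as nn

class MultiPassEncoder(nn.Module):
    """Two passes over the same encoder stack; in the second pass each
    layer's input is fused with the first pass's top-layer output, so
    earlier layers can exploit information computed by later layers."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.fuse = nn.Linear(2 * d_model, d_model)  # assumed fusion rule

    def forward(self, x):
        h = x
        for layer in self.layers:        # pass 1: ordinary bottom-up flow
            h = layer(h)
        top = h                          # later-layer information
        h = x
        for layer in self.layers:        # pass 2: conditioned on pass-1 top
            h = layer(self.fuse(torch.cat([h, top], dim=-1)))
        return h

out = MultiPassEncoder()(torch.randn(2, 10, 256))  # (batch, time, d_model)
```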

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

no code implementations • 8 Jul 2020 • Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content.

Answer Generation • Graph Representation Learning

Spatio-Temporal Ranked-Attention Networks for Video Captioning

no code implementations • 17 Jan 2020 • Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame.

Video Captioning
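The two attention orders can be written compactly. Below is a toy PyTorch rendering of the ST and TS sub-models described above; the query construction and pooling choices are simplifications, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def st_attention(feats, q_s, q_t):
    """Spatio-temporal (ST) order: attend over regions within each frame,
    then pool the attended regional features over time.
    feats: (T, R, D) region features; q_s, q_t: (D,) query vectors."""
    a_s = F.softmax(feats @ q_s, dim=1)              # (T, R) spatial weights
    per_frame = (a_s.unsqueeze(-1) * feats).sum(1)   # (T, D)
    a_t = F.softmax(per_frame @ q_t, dim=0)          # (T,) temporal weights
    return (a_t.unsqueeze(-1) * per_frame).sum(0)    # (D,)

def ts_attention(feats, q_s, q_t):
    """Temporo-spatial (TS) order: softly select a frame first, then
    attend spatially within that frame."""
    a_t = F.softmax(feats.mean(1) @ q_t, dim=0)      # (T,) frame weights
    frame = (a_t.view(-1, 1, 1) * feats).sum(0)      # (R, D) selected frame
    a_s = F.softmax(frame @ q_s, dim=0)              # (R,) spatial weights
    return (a_s.unsqueeze(-1) * frame).sum(0)        # (D,)

v = torch.randn(8, 16, 64)   # 8 frames, 16 regions, 64-dim features
q = torch.randn(64)          # stand-in for the language-state query
fused = st_attention(v, q, q) + ts_attention(v, q, q)
```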

Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering

no code implementations • 3 Jan 2020 • Lei Shi, Shijie Geng, Kai Shuang, Chiori Hori, Songxiang Liu, Peng Gao, Sen Su

To solve the issue for the intermediate layers, we propose an efficient Quaternion Block Network (QBN) to learn interaction not only for the last layer but also for all intermediate layers simultaneously.

Question Answering • Video Description +1
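The building block behind quaternion layers is the Hamilton product, which mixes the four components of each quaternion-structured feature. The product itself is standard mathematics; how it is applied below (one element-wise interaction between two feature tensors) is only an illustration, not the QBN architecture.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of quaternion-structured features: the last axis
    is split into four equal blocks (r, x, y, z)."""
    r1, x1, y1, z1 = np.split(p, 4, axis=-1)
    r2, x2, y2, z2 = np.split(q, 4, axis=-1)
    return np.concatenate([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,   # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,   # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,   # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,   # k component
    ], axis=-1)

# Two batches of 128-dim features, read as 4 blocks of 32 components each.
a, b = np.random.randn(2, 8, 128)
interaction = hamilton_product(a, b)   # same shape, components now mixed
```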

End-to-end Conversation Modeling Track in DSTC6

1 code implementation • 22 Jun 2017 • Chiori Hori, Takaaki Hori

For example, Ghazvininejad et al. proposed a knowledge-grounded neural conversation model [3], which aims to combine conversational dialogs with task-oriented knowledge using unstructured data, such as Twitter data for conversation and Foursquare data for external knowledge. However, the task is still limited to a restaurant information service and has not yet been tested with a wide variety of dialog tasks.

Attention-Based Multimodal Fusion for Video Description

no code implementations • ICCV 2017 • Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks

Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs).

Sentence • Video Description
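As a sketch of attention-based multimodal fusion in this spirit: the decoder state scores each modality, and the fused representation is the attention-weighted sum of per-modality projections (rather than a naive concatenation). Module names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Attention-based multimodal fusion sketch: the decoder state scores
    each modality, and the fused vector is the attention-weighted sum of
    per-modality projections (instead of a plain concatenation)."""
    def __init__(self, dims, d_state=512, d_fused=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_fused) for d in dims)
        self.score = nn.ModuleList(
            nn.Linear(d_state + d_fused, 1) for _ in dims)

    def forward(self, state, feats):
        projected = [p(f) for p, f in zip(self.proj, feats)]
        logits = torch.cat([s(torch.cat([state, f], dim=-1))
                            for s, f in zip(self.score, projected)], dim=-1)
        alpha = torch.softmax(logits, dim=-1)          # (B, n_modalities)
        stacked = torch.stack(projected, dim=1)        # (B, M, d_fused)
        return (alpha.unsqueeze(-1) * stacked).sum(1)  # fused (B, d_fused)

fusion = ModalityAttentionFusion(dims=[2048, 128])     # e.g. video, audio
fused = fusion(torch.randn(4, 512),
               [torch.randn(4, 2048), torch.randn(4, 128)])
```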
