1 code implementation • 6 Feb 2025 • Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj
Lastly, we conduct multiple ablation studies on the effects of cross-projection, language model parameters, position captioning, and third-stage fine-tuning, and present our findings.
1 code implementation • 24 Jan 2025 • Xiaohao Xu, Tianyi Zhang, Shibo Zhao, Xiang Li, Sibo Wang, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Sebastian Scherer, Xiaonan Huang
We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models.
no code implementations • 16 Jan 2025 • Dareen Alharthi, Mahsa Zamani, Bhiksha Raj, Rita Singh
Voice biometric tasks, such as age estimation, require modeling the often complex relationship between voice features and the biometric variable.
1 code implementation • 14 Dec 2024 • Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum
With its fully-differentiable design and semantic-rich latent space, our experiment demonstrates that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models.
1 code implementation • 2 Dec 2024 • Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj
Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality.
no code implementations • 27 Nov 2024 • Yichen Wang, Jie Wang, Fulin Wang, Xiang Li, Hao Yin, Bhiksha Raj
In recent years, graph representation learning has undergone a paradigm shift, driven by the emergence and proliferation of graph neural networks (GNNs) and their heterogeneous counterparts.
no code implementations • 25 Oct 2024 • Naga VS Raviteja Chappa, Page Daniel Dobbs, Bhiksha Raj, Khoa Luu
The proliferation of tobacco-related content on social media platforms poses significant challenges for public health monitoring and intervention.
no code implementations • 19 Oct 2024 • Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin
The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data.
no code implementations • 16 Oct 2024 • Abdul Waheed, Hanin Atwany, Bhiksha Raj, Rita Singh
The analysis of layer-wise features demonstrates that some models exhibit a convex relationship between the separability of the learned representations and model depth, with different layers capturing task-specific features.
no code implementations • 7 Oct 2024 • Ibrahim Aldarmaki, Thamar Solorio, Bhiksha Raj, Hanan Aldarmaki
Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential.
no code implementations • 4 Oct 2024 • Ksheeraja Raghavan, Samiran Gode, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh
The data produced using the framework serves as a benchmark for anomaly detection applications, potentially enhancing the performance of models trained on audio data, particularly in handling out-of-distribution cases.
1 code implementation • 2 Oct 2024 • Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin
Increasing token length is a common approach to improving image reconstruction quality.
1 code implementation • 24 Sep 2024 • Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharthi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe
Neural codecs have become crucial to recent speech and audio generation research.
no code implementations • 24 Sep 2024 • Muhammad A. Shah, Bhiksha Raj
Automatic Speech Recognition (ASR) systems must be robust to the myriad types of noise present in real-world environments, including environmental noise, room impulse response, and special effects, as well as attacks by malicious actors (adversarial attacks).
Automatic Speech Recognition (ASR)
no code implementations • 10 Sep 2024 • Kuang Yuan, Shuo Han, Swarun Kumar, Bhiksha Raj
The quality of audio recordings in outdoor environments is often degraded by the presence of wind.
no code implementations • 9 Sep 2024 • Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
This approach paves the way for more accurate and reliable identity authentication through voice.
1 code implementation • 16 Aug 2024 • Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj
Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models.
1 code implementation • 12 Aug 2024 • Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Rita Singh, Bhiksha Raj
Reference summaries for abstractive speech summarization require human annotation, which can be performed by listening to an audio recording or by reading textual transcripts of the recording.
no code implementations • 4 Jul 2024 • Yuxuan Wu, Ziyu Wang, Bhiksha Raj, Gus Xia
We contribute an unsupervised method that effectively learns from raw observation and disentangles its latent space into content and style representations.
no code implementations • 1 Jul 2024 • Abdul Waheed, Karima Kadaoui, Bhiksha Raj, Muhammad Abdul-Mageed
Our models are also 25-50% more compute- and memory-efficient while maintaining performance equal to or better than that of the teacher model.
1 code implementation • 24 Jun 2024 • Xiaohao Xu, Tianyi Zhang, Sibo Wang, Xiang Li, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Xiaonan Huang
Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Mapping (SLAM) models critical to embodied agent autonomy.
1 code implementation • 14 Jun 2024 • Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Zhe Lin, Rita Singh, Bhiksha Raj
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation.
no code implementations • 3 Jun 2024 • Thanh-Dat Truong, Utsav Prabhu, Dongyi Wang, Bhiksha Raj, Susan Gauch, Jeyamkondan Subbiah, Khoa Luu
To address this problem, we introduce a novel Unsupervised Cross-view Adaptation Learning approach to modeling the geometric structural change across views in Semantic Scene Understanding.
Open-Vocabulary Semantic Segmentation
no code implementations • 3 Jun 2024 • Thanh-Dat Truong, Xin Li, Bhiksha Raj, Jackson Cothren, Khoa Luu
This problem has limited the generalizability of the vision-language foundation model to unknown data distributions.
no code implementations • 30 May 2024 • Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj
They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs.
no code implementations • CVPR 2024 • Yizhou Zhao, Tuanfeng Y. Wang, Bhiksha Raj, Min Xu, Jimei Yang, Chun-Hao Paul Huang
Specifically, we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior, addressing depth, scale, and dynamic ambiguities.
no code implementations • 2 May 2024 • Francisco Teixeira, Karla Pizzi, Raphael Olivier, Alberto Abad, Bhiksha Raj, Isabel Trancoso
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems, while also offering an opportunity to audit these models with regard to user data.
Automatic Speech Recognition (ASR)
no code implementations • 11 Mar 2024 • Hao Chen, Jindong Wang, Zihan Wang, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj
Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning.
1 code implementation • 8 Mar 2024 • Muhammad A. Shah, David Solans Noguero, Mikko A. Heikkila, Bhiksha Raj, Nicolas Kourtellis
As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world.
2 code implementations • 7 Mar 2024 • Xiang Li, Kai Qiu, Jinglu Wang, Xiaohao Xu, Rita Singh, Kashu Yamazaki, Hao Chen, Xiaonan Huang, Bhiksha Raj
Referring perception, which aims at grounding visual objects with multimodal referring guidance, is essential for bridging the gap between humans, who provide instructions, and the environment where intelligent systems perceive.
no code implementations • 18 Feb 2024 • Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj, Huaxiu Yao
Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment.
no code implementations • 16 Feb 2024 • Muqiao Yang, Xiang Li, Umberto Cappellazzo, Shinji Watanabe, Bhiksha Raj
In this work, we propose an evaluation methodology that provides a unified evaluation on stability, plasticity, and generalizability in continual learning.
1 code implementation • 12 Feb 2024 • Xiaohao Xu, Tianyi Zhang, Sibo Wang, Xiang Li, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Xiaonan Huang
To this end, we propose a novel, customizable pipeline for noisy data synthesis, aimed at assessing the resilience of multi-modal SLAM models against various perturbations.
no code implementations • 2 Feb 2024 • Hao Chen, Bhiksha Raj, Xing Xie, Jindong Wang
Large foundation models (LFMs) are claiming incredible performances.
1 code implementation • 2 Feb 2024 • Hao Chen, Jindong Wang, Lei Feng, Xiang Li, Yidong Wang, Xing Xie, Masashi Sugiyama, Rita Singh, Bhiksha Raj
Weakly supervised learning generally faces challenges in applicability to various scenarios with diverse weak supervision and in scalability due to the complexity of existing algorithms, thereby hindering practical deployment.
1 code implementation • 1 Feb 2024 • Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang
Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks.
1 code implementation • 10 Jan 2024 • Jee-weon Jung, Roshan Sharma, William Chen, Bhiksha Raj, Shinji Watanabe
We tackle this challenge by proposing AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries for training and evaluation.
no code implementations • 27 Nov 2023 • Thanh-Dat Truong, Utsav Prabhu, Bhiksha Raj, Jackson Cothren, Khoa Luu
In particular, we first introduce a new Fairness Contrastive Clustering loss to address the problems of catastrophic forgetting and fairness.
1 code implementation • 15 Nov 2023 • Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, Bhiksha Raj
This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation.
2 code implementations • ICCV 2023 • Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf
In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL).
no code implementations • 11 Oct 2023 • Joseph Konan, Shikhar Agnihotri, Ojas Bhargave, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj
Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis.
no code implementations • 10 Oct 2023 • Francisco Teixeira, Alberto Abad, Bhiksha Raj, Isabel Trancoso
Speaker embeddings are ubiquitous, with applications ranging from speaker recognition and diarization to speech synthesis and voice anonymisation.
no code implementations • 4 Oct 2023 • Umberto Cappellazzo, Enrico Fini, Muqiao Yang, Daniele Falavigna, Alessio Brutti, Bhiksha Raj
In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding in a class-incremental learning (CIL) setting and we propose COCONUT, a CIL method that relies on the combination of experience replay and contrastive learning.
no code implementations • 3 Oct 2023 • Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs.
no code implementations • 2 Oct 2023 • Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu
Speech enhancement aims to improve speech signals in terms of quality and intelligibility, and speech editing refers to the process of editing the speech according to specific user needs.
no code implementations • 2 Oct 2023 • Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh
We hypothesize that for attacks to be transferable, it is sufficient if the proxy can approximate the target model in the neighborhood of the harmful query.
no code implementations • 1 Oct 2023 • Xiang Li, Yinpeng Chen, Chung-Ching Lin, Hao Chen, Kai Hu, Rita Singh, Bhiksha Raj, Lijuan Wang, Zicheng Liu
This paper presents a novel approach to object completion, with the primary goal of reconstructing a complete object from its partially visible components.
1 code implementation • 1 Oct 2023 • Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh
In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech.
3 code implementations • CVPR 2024 • Xiang Li, Jinglu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj
We propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several disentangled and noise-suppressed single-source semantics.
1 code implementation • 29 Sep 2023 • Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
no code implementations • 23 Sep 2023 • Ankit Shah, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj
Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known.
1 code implementation • 14 Sep 2023 • Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang
During inference, the text encoder is replaced with the pretrained CLAP audio encoder.
no code implementations • 7 Aug 2023 • Muhammad Ahmed Shah, Bhiksha Raj
The vulnerability to adversarial perturbations is a major flaw of Deep Neural Networks (DNNs) that raises questions about their reliability in real-world scenarios.
1 code implementation • 26 Jul 2023 • Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj
This work unveils the enigmatic link between phonemes and facial features.
no code implementations • 26 Jul 2023 • Xiang Li, Yandong Wen, Muqiao Yang, Jinglu Wang, Rita Singh, Bhiksha Raj
Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion.
no code implementations • 17 Jul 2023 • Roshan Sharma, Kenneth Zheng, Siddhant Arora, Shinji Watanabe, Rita Singh, Bhiksha Raj
End-to-end speech summarization has been shown to improve performance over cascade baselines.
no code implementations • 16 Jun 2023 • Pha Nguyen, Kha Gia Quach, John Gauch, Samee U. Khan, Bhiksha Raj, Khoa Luu
Then, a new cross-domain MOT adaptation from existing datasets is proposed without any pre-defined human knowledge in understanding and modeling objects.
1 code implementation • 30 May 2023 • Xiang Li, Chung-Ching Lin, Yinpeng Chen, Zicheng Liu, Jinglu Wang, Bhiksha Raj
The paper introduces PaintSeg, a new unsupervised method for segmenting objects without any training.
1 code implementation • 22 May 2023 • Hao Chen, Ankit Shah, Jindong Wang, Ran Tao, Yidong Wang, Xing Xie, Masashi Sugiyama, Rita Singh, Bhiksha Raj
In this paper, we introduce imprecise label learning (ILL), a framework for the unification of learning with various imprecise label configurations.
Ranked #1 on Learning with noisy labels on mini WebVision 1.0
2 code implementations • 13 May 2023 • Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, Bhiksha Raj
This paper presents a novel approach for detecting ChatGPT-generated vs. human-written text using language models.
1 code implementation • CVPR 2023 • Thanh-Dat Truong, Ngan Le, Bhiksha Raj, Jackson Cothren, Khoa Luu
Although Domain Adaptation in Semantic Scene Segmentation has shown impressive improvement in recent years, the fairness concerns in the domain adaptation have yet to be well defined and addressed.
Ranked #6 on Domain Adaptation on SYNTHIA-to-Cityscapes
no code implementations • 16 Mar 2023 • Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Hojeong Lee, Ankit Shah, Shuo Han, Yunyang Zeng, Amanda Shu, Haohui Liu, Xuankai Chang, Hamza Khalid, Minseon Gwak, Kawon Lee, Minjeong Kim, Bhiksha Raj
In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications.
no code implementations • 7 Mar 2023 • Ankit Shah, Shuyi Chen, Kejun Zhou, Yue Chen, Bhiksha Raj
Preliminary results show (1) the proposed BECR can incur a more dispersed embedding on the test set, (2) BECR improves the PaSST model without extra computation complexity, and (3) STFT preprocessing outperforms CQT in all tasks we tested.
no code implementations • 20 Feb 2023 • Laurie M. Heller, Benjamin Elizalde, Bhiksha Raj, Soham Deshmukh
Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human-performable, and performed by humans.
2 code implementations • 16 Feb 2023 • Yunyang Zeng, Joseph Konan, Shuo Han, David Bick, Muqiao Yang, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
We propose an objective for perceptual quality based on temporal acoustic parameters.
2 code implementations • 16 Feb 2023 • Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features.
4 code implementations • 26 Jan 2023 • Hao Chen, Ran Tao, Yue Fan, Yidong Wang, Jindong Wang, Bernt Schiele, Xing Xie, Bhiksha Raj, Marios Savvides
The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance.
no code implementations • 2 Jan 2023 • Samiran Gode, Supreeth Bare, Bhiksha Raj, Hyungon Yoo
To understand the polarization we begin by showing results from some classical language models in Word2Vec and Doc2Vec.
no code implementations • ICCV 2023 • Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, Yan Lu
Our model achieves state-of-the-art performance on R-VOS benchmarks, Ref-DAVIS17 and Ref-Youtube-VOS, and also our RRYTVOS dataset.
1 code implementation • 28 Nov 2022 • Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling.
Ranked #2 on Video Captioning on ActivityNet Captions
no code implementations • 26 Nov 2022 • Xiang Li, Haoyuan Cao, Shijie Zhao, Junlin Li, Li Zhang, Bhiksha Raj
In this paper, we aim to tackle the video salient object detection problem for panoramic videos, with their corresponding ambisonic audios.
no code implementations • 20 Nov 2022 • Hao Chen, Yue Fan, Yidong Wang, Jindong Wang, Bernt Schiele, Xing Xie, Marios Savvides, Bhiksha Raj
While standard SSL assumes uniform data distribution, we consider a more realistic and challenging setting called imbalanced SSL, where imbalanced class distributions occur in both labeled and unlabeled data.
no code implementations • 14 Nov 2022 • Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
We investigate how the model can learn to associate the audio with the descriptions, resulting in performance improvement of Speech Emotion Recognition and Speech Audio Retrieval.
no code implementations • 29 Oct 2022 • Roshan Sharma, Hira Dhamyal, Bhiksha Raj, Rita Singh
Accordingly, models that have been proposed for emotion detection use one or the other of these label types.
no code implementations • 29 Oct 2022 • Roshan Sharma, Bhiksha Raj
Transformers are among the state of the art for many tasks in speech, vision, and natural language processing, among others.
1 code implementation • 26 Oct 2022 • Raphael Olivier, Bhiksha Raj
Whisper is a recent Automatic Speech Recognition (ASR) model displaying impressive robustness to both out-of-distribution inputs and random noise.
Automatic Speech Recognition (ASR)
no code implementations • 26 Oct 2022 • Francisco Teixeira, Alberto Abad, Bhiksha Raj, Isabel Trancoso
Automatic Speaker Diarization (ASD) is an enabling technology with numerous applications, which deals with recordings of multiple speakers, raising special concerns in terms of privacy.
1 code implementation • 5 Oct 2022 • Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran, Ngan Le
The PMR module represents each video snippet by a visual-linguistic feature, in which main actors and the surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features through an image-text model.
1 code implementation • 17 Sep 2022 • Raphael Olivier, Hadi Abdullah, Bhiksha Raj
To exploit ASR models in real-world, black-box settings, an adversary can leverage the transferability property, i.e., that an adversarial sample produced for a proxy ASR can also fool a different remote ASR.
5 code implementations • 12 Aug 2022 • Yidong Wang, Hao Chen, Yue Fan, Wang Sun, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, Heli Qi, Zhen Wu, Yu-Feng Li, Satoshi Nakamura, Wei Ye, Marios Savvides, Bhiksha Raj, Takahiro Shinozaki, Bernt Schiele, Jindong Wang, Xing Xie, Yue Zhang
We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning.
no code implementations • 12 Jul 2022 • Xiang Li, Jinglu Wang, Xiaohao Xu, Bhiksha Raj, Yan Lu
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
1 code implementation • 8 Jul 2022 • Raphael Olivier, Bhiksha Raj
Finally, with sparsity we can measure increases in robustness that do not affect accuracy: we show for example that data augmentation can by itself increase adversarial robustness, without using adversarial training.
1 code implementation • 4 Jul 2022 • Xiang Li, Jinglu Wang, Xiaohao Xu, Xiao Li, Bhiksha Raj, Yan Lu
Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression.
Ranked #13 on Referring Video Object Segmentation on Refer-YouTube-VOS
Referring Expression Segmentation
Referring Video Object Segmentation
1 code implementation • 1 Jul 2022 • Muqiao Yang, Joseph Konan, David Bick, Anurag Kumar, Shinji Watanabe, Bhiksha Raj
We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g., jitter, shimmer, and spectral flux) and then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.
no code implementations • 25 Jun 2022 • Roshan Sharma, Tyler Vuong, Mark Lindsey, Hira Dhamyal, Rita Singh, Bhiksha Raj
This work presents a multitask approach to the simultaneous estimation of age, country of origin, and emotion given vocal burst audio for the 2022 ICML Expressive Vocalizations Challenge ExVo-MultiTask track.
no code implementations • 23 Jun 2022 • Francisco Teixeira, Alberto Abad, Bhiksha Raj, Isabel Trancoso
This poses two important issues: first, knowledge of the speaker embedding extraction model may create security and robustness liabilities for the authentication system, as this knowledge might help attackers in crafting adversarial examples able to mislead the system; second, from the point of view of a service provider the speaker embedding extraction model is arguably one of the most valuable components in the system and, as such, disclosing it would be highly undesirable.
no code implementations • 18 Jun 2022 • Chonghan Chen, Qi Jiang, Chih-Hao Wang, Noel Chen, Haohan Wang, Xiang Li, Bhiksha Raj
With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions.
5 code implementations • 15 May 2022 • Yidong Wang, Hao Chen, Qiang Heng, Wenxin Hou, Yue Fan, Zhen Wu, Jindong Wang, Marios Savvides, Takahiro Shinozaki, Bhiksha Raj, Bernt Schiele, Xing Xie
Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization.
no code implementations • 11 Apr 2022 • Ankit Shah, Hira Dhamyal, Yang Gao, Daniel Arancibia, Mario Arancibia, Bhiksha Raj, Rita Singh
Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice.
1 code implementation • 29 Mar 2022 • Raphael Olivier, Bhiksha Raj
Like many other tasks involving neural networks, Speech Recognition models are vulnerable to adversarial attacks.
no code implementations • 20 Mar 2022 • Shentong Mo, Jingfei Xia, Xiaoqing Tan, Bhiksha Raj
Our Point3D consists of a Point Head for action localization and a 3D Head for action classification.
3 code implementations • 6 Mar 2022 • Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk
The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios.
no code implementations • 4 Mar 2022 • Larry Tang, Po Hao Chou, Yi Yu Zheng, Ziqian Ge, Ankit Shah, Bhiksha Raj
We find that the baseline Siamese does not perform better by incorporating ontology information in the weak and multi-label scenario, but that the GCN does capture the ontology knowledge better for weak, multi-labeled data.
1 code implementation • EMNLP 2021 • Raphael Olivier, Bhiksha Raj
We apply adaptive versions of state-of-the-art attacks, such as the Imperceptible ASR attack, to our model, and show that our strongest defense is robust to all attacks that use inaudible noise, and can only be broken with very high distortion.
Automatic Speech Recognition (ASR)
no code implementations • ICCV 2021 • Yandong Wen, Weiyang Liu, Bhiksha Raj, Rita Singh
We present a conditional estimation (CEST) framework to learn 3D facial parameters from 2D single-view images by self-supervised training from videos.
Ranked #16 on 3D Face Reconstruction on REALY
1 code implementation • 12 Sep 2021 • Weiyang Liu, Yandong Wen, Bhiksha Raj, Rita Singh, Adrian Weller
As one of the earliest works in hyperspherical face recognition, SphereFace explicitly proposed to learn face embeddings with large inter-class angular margin.
1 code implementation • ICCV 2021 • Thanh-Dat Truong, Chi Nhan Duong, The De Vu, Hoang Anh Pham, Bhiksha Raj, Ngan Le, Khoa Luu
Therefore, this work introduces a new Audio-Visual Transformer approach to the problem of localizing and highlighting the main speaker in both audio and visual channels of a multi-speaker conversation video in the wild.
no code implementations • ICLR 2022 • Yandong Wen, Weiyang Liu, Adrian Weller, Bhiksha Raj, Rita Singh
In this paper, we start by identifying the discrepancy between training and evaluation in the existing multi-class classification framework and then discuss the potential limitations caused by the "competitive" nature of softmax normalization.
no code implementations • 16 Jul 2021 • Hao Liang, Lulan Yu, Guikang Xu, Bhiksha Raj, Rita Singh
With this in perspective, we propose in this paper a framework to morph a target face in response to a given voice, such that facial features are implicitly guided by the learned voice-face correlation.
1 code implementation • 12 Jun 2021 • Soham Deshmukh, Bhiksha Raj, Rita Singh
To that end, we propose a shared encoder architecture with sound event detection as a primary task and an additional secondary decoder for a self-supervised auxiliary task.
no code implementations • 19 Mar 2021 • Anxiang Zhang, Ankit Shah, Bhiksha Raj
Thus, this paper introduces a novel semi-weak label learning paradigm as a middle ground to mitigate the problem.
1 code implementation • 15 Mar 2021 • Bronya Roni Chernyak, Bhiksha Raj, Tamir Hazan, Joseph Keshet
This paper proposes an attack-independent (non-adversarial training) technique for improving adversarial robustness of neural network models, with minimal loss of standard accuracy.
no code implementations • ICCV 2021 • Kai Hu, Jie Shao, Yuan Liu, Bhiksha Raj, Marios Savvides, Zhiqiang Shen
To address this, we present a contrast-and-order representation (CORP) framework for learning self-supervised video representations that can automatically capture both the appearance information within each frame and temporal information across different frames.
Ranked #3 on Self-Supervised Action Recognition Linear on UCF101
Action Recognition
Self-Supervised Action Recognition Linear
1 code implementation • NeurIPS 2020 • Jie Shao, Kai Hu, Changhu Wang, Xiangyang Xue, Bhiksha Raj
In this paper, we study what would happen when normalization layers are removed from the network, and show how to train deep neural networks without normalization layers and without performance degradation.
2 code implementations • 17 Nov 2020 • Ali Shahin Shamsabadi, Francisco Sepúlveda Teixeira, Alberto Abad, Bhiksha Raj, Andrea Cavallaro, Isabel Trancoso
Speaker identification models are vulnerable to carefully designed adversarial perturbations of their input signals that induce misclassification.
1 code implementation • 9 Nov 2020 • Jiachen Lian, Aiswarya Vinod Kumar, Hira Dhamyal, Bhiksha Raj, Rita Singh
We further propose Multinomial Masked Proxy (MMP) loss to leverage the hardness of speaker pairs.
1 code implementation • 17 Aug 2020 • Soham Deshmukh, Bhiksha Raj, Rita Singh
Weakly labelled learning has garnered a lot of attention in recent years due to its potential to scale Sound Event Detection (SED), and it is formulated as a Multiple Instance Learning (MIL) problem.
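As a rough illustration of the MIL formulation mentioned above (not the method of this paper), a clip can be treated as a bag of frames that carries a label whenever at least one frame does. The sketch below uses max pooling, one common pooling choice; the function names and the 0.5 threshold are assumptions made for this example.

```python
def clip_probabilities(frame_probs):
    """Max-pool frame-level probabilities into clip-level (bag-level) ones.

    frame_probs: list of frames, each a list of per-class probabilities.
    Under the standard MIL assumption, a clip contains an event if at
    least one of its frames does, so the clip score is the frame maximum.
    """
    n_classes = len(frame_probs[0])
    return [max(frame[c] for frame in frame_probs) for c in range(n_classes)]


def clip_labels(frame_probs, threshold=0.5):
    """Turn clip-level probabilities into weak (clip-level) labels.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [p >= threshold for p in clip_probabilities(frame_probs)]


# Three frames, two event classes (hypothetical scores).
frames = [[0.1, 0.2], [0.9, 0.3], [0.2, 0.1]]
print(clip_probabilities(frames))
print(clip_labels(frames))
```

Only the clip-level labels are needed for training in this setting; the frame-level scores are a by-product that can be read off as a temporal localisation.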
no code implementations • 28 May 2020 • Muhammad A. Shah, Raphael Olivier, Bhiksha Raj
Deploying deep learning models, comprising non-linear combinations of millions, even billions, of parameters, is challenging given the memory, power and compute constraints of the real world.
no code implementations • LREC 2020 • Joana Correia, Isabel Trancoso, Bhiksha Raj
The automation of the diagnosis and monitoring of speech-affecting diseases in real-life situations, such as Depression or Parkinson's disease, depends on the existence of rich and large datasets that resemble real-life conditions, such as those collected from in-the-wild multimedia repositories like YouTube.
1 code implementation • NeurIPS 2019 • Yandong Wen, Bhiksha Raj, Rita Singh
The network learns to generate faces from voices by matching the identities of generated faces to those of the speakers, on a training set.
no code implementations • 13 Nov 2019 • Hira Dhamyal, Shahan Ali Memon, Bhiksha Raj, Rita Singh
Our tests show significant differences in the manner and choice of phonemes in acted and natural speech, suggesting that acted speech databases have moderate to low validity and value for emotion classification tasks.
no code implementations • 24 Oct 2019 • Shahan Ali Memon, Hira Dhamyal, Oren Wright, Daniel Justice, Vijaykumar Palat, William Boler, Bhiksha Raj, Rita Singh
While we limit ourselves to a single modality (i.e., speech), our framework is applicable to studies of emotion perception from all such loosely annotated data in general.
no code implementations • 26 May 2019 • Daanish Ali Khan, Linhong Li, Ninghao Sha, Zhuoran Liu, Abelino Jimenez, Bhiksha Raj, Rita Singh
Recent breakthroughs in the field of deep learning have led to advancements in a broad spectrum of tasks in computer vision, audio processing, natural language processing and other areas.
1 code implementation • 25 May 2019 • Yandong Wen, Rita Singh, Bhiksha Raj
Voice profiling aims at inferring various human parameters from their speech, e.g., gender, age, etc.
1 code implementation • 14 May 2019 • Chirag Nagpal, Rohan Sangave, Amit Chahar, Parth Shah, Artur Dubrawski, Bhiksha Raj
Semi-parametric survival analysis methods like the Cox Proportional Hazards (CPH) regression (Cox, 1972) are a popular approach for survival analysis.
no code implementations • 18 Mar 2019 • Wenbo Zhao, Yang Gao, Shahan Ali Memon, Bhiksha Raj, Rita Singh
Addressing these problems, we propose a binary tree-structured hierarchical routing mixture of experts (HRME) model that has classifiers as non-leaf node experts and simple regression models as leaf node experts.
1 code implementation • 7 Feb 2019 • Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, Joseph Keshet
Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier.
1 code implementation • 28th International Joint Conference on Artificial Intelligence 2019 • Anurag Kumar, Ankit Shah, Alex Hauptmann, Bhiksha Raj
In the last couple of years, weakly labeled learning for sound events has turned out to be an exciting approach for audio event detection.
no code implementations • 19 Nov 2018 • Kai Hu, Bhiksha Raj
Capturing spatiotemporal dynamics is an essential topic in video recognition.
no code implementations • 1 Oct 2018 • Shahan Ali Memon, Wenbo Zhao, Bhiksha Raj, Rita Singh
Regression-via-Classification (RvC) is the process of converting a regression problem to a classification one.
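To make the RvC idea concrete, here is a minimal sketch under stated assumptions: targets are discretised into equal-width bins, a toy nearest-centroid classifier picks a bin, and the prediction is the mean target of that bin. The binning scheme, the classifier, and all function names are illustrative choices for this example, not the approach of the paper.

```python
from statistics import mean


def make_bins(ys, n_bins):
    """Equal-width bin edges spanning the range of the targets."""
    lo, hi = min(ys), max(ys)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def to_class(value, edges):
    """Map a continuous target to the index of the bin containing it."""
    n_bins = len(edges) - 1
    for k in range(n_bins):
        if value < edges[k + 1]:
            return k
    return n_bins - 1  # the maximum target falls into the last bin


def fit_rvc(xs, ys, n_bins=4):
    """Discretise targets, then store one (feature, target) centroid per bin."""
    edges = make_bins(ys, n_bins)
    labels = [to_class(y, edges) for y in ys]
    model = {}
    for k in set(labels):
        members = [i for i, c in enumerate(labels) if c == k]
        model[k] = (mean(xs[i] for i in members), mean(ys[i] for i in members))
    return model


def predict_rvc(model, x):
    """Classify x into a bin (nearest feature centroid), return that bin's mean target."""
    k = min(model, key=lambda c: abs(model[c][0] - x))
    return model[k][1]


# Hypothetical 1-D data with y = 2x.
xs = list(range(8))
ys = [2 * x for x in xs]
model = fit_rvc(xs, ys, n_bins=4)
print(predict_rvc(model, 2.6))
```

The number of bins trades classification difficulty against quantisation error: more bins give finer predictions but fewer training points per class.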
no code implementations • 27 Sep 2018 • Wenbo Zhao, Shahan Ali Memon, Bhiksha Raj, Rita Singh
Regression-via-Classification (RvC) is the process of converting a regression problem to a classification one.
no code implementations • ICLR 2019 • Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, Rita Singh
We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces.
no code implementations • 12 Jul 2018 • Yandong Wen, Mahmoud Al Ismail, Bhiksha Raj, Rita Singh
In many retrieval problems, where we must retrieve one or more entries from a gallery in response to a probe, it is common practice to learn to do so by directly comparing the probe and gallery entries to one another.
1 code implementation • 24 Apr 2018 • Ankit Shah, Anurag Kumar, Alexander G. Hauptmann, Bhiksha Raj
In this work, we first describe a CNN-based approach for weakly supervised training of audio events.
no code implementations • 19 Feb 2018 • Yang Gao, Rita Singh, Bhiksha Raj
In voice impersonation, the resultant voice must convincingly convey the impression of having been naturally produced by the target speaker, mimicking not only the pitch and other perceivable signal qualities, but also the style of the target speaker.
no code implementations • 2 Nov 2017 • Rohan Badlani, Ankit Shah, Benjamin Elizalde, Anurag Kumar, Bhiksha Raj
The framework crawls videos using search queries corresponding to 78 sound event labels drawn from three datasets.
no code implementations • 13 Jul 2017 • Anders Oland, Aayush Bansal, Roger B. Dannenberg, Bhiksha Raj
To this end, we demonstrate faster convergence and better performance on diverse classification tasks: image classification using CIFAR-10 and ImageNet, and semantic segmentation using PASCAL VOC 2012.
no code implementations • 9 Jul 2017 • Anurag Kumar, Bhiksha Raj
We propose that learning algorithms that can exploit weak labels offer an effective method to learn from web data.
21 code implementations • CVPR 2017 • Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, Le Song
This paper addresses the deep face recognition (FR) problem under the open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space.
Ranked #1 on Face Verification on CK+
no code implementations • 24 Feb 2017 • Haohan Wang, Bhiksha Raj
This paper is a review of the evolutionary history of deep learning models.
no code implementations • 16 Jan 2017 • Aditya Sharma, Nikolas Wolfe, Bhiksha Raj
How much can pruning algorithms teach us about the fundamentals of learning representations in neural networks?
no code implementations • 12 Nov 2016 • Anurag Kumar, Bhiksha Raj
In this paper we propose a novel learning framework called Supervised and Weakly Supervised Learning where the goal is to learn simultaneously from weakly and strongly labeled data.
no code implementations • 23 Sep 2016 • Anurag Kumar, Bhiksha Raj, Ndapandula Nakashole
In this paper we describe approaches for discovering acoustic concepts and relations in text.
no code implementations • 20 Sep 2016 • Benjamin Elizalde, Ankit Shah, Siddharth Dalmia, Min Hun Lee, Rohan Badlani, Anurag Kumar, Bhiksha Raj, Ian Lane
The audio event detectors are trained on the labeled audio and run on the unlabeled audio downloaded from YouTube.
no code implementations • 19 Jul 2016 • Anurag Kumar, Bhiksha Raj
One of the most important problems in audio event detection research is the absence of benchmark results for comparison with any proposed method.
no code implementations • 13 Jul 2016 • Sebastian Sager, Benjamin Elizalde, Damian Borth, Christian Schulze, Bhiksha Raj, Ian Lane
One contribution is the previously unavailable documentation of the challenges and implications of collecting audio recordings with these types of labels.
no code implementations • 9 Jul 2016 • Anurag Kumar, Bhiksha Raj
In this paper we propose strategies for estimating performance of a classifier when labels cannot be obtained for the whole test set.
no code implementations • 12 Jun 2016 • Anurag Kumar, Bhiksha Raj
Audio Event Detection is an important task for content analysis of multimedia data.
no code implementations • 9 May 2016 • Anurag Kumar, Bhiksha Raj
This helps in obtaining a complete description of the recording and is notable since temporal information was never known in the first place in weakly labeled data.
no code implementations • 27 Feb 2016 • Rahul Radhakrishnan Iyer, Sanjeel Parekh, Vikas Mohandoss, Anush Ramsurat, Bhiksha Raj, Rita Singh
Existing video indexing and retrieval methods on popular web-based multimedia sharing websites are based on user-provided sparse tagging.
no code implementations • 11 Jan 2016 • Suyoun Kim, Bhiksha Raj, Ian Lane
We propose a novel deep neural network architecture for speech recognition that explicitly employs knowledge of the background environmental noise within a deep neural network acoustic model.
no code implementations • 16 Nov 2015 • Zhenzhong Lan, Shoou-I Yu, Ming Lin, Bhiksha Raj, Alexander G. Hauptmann
We approach this problem by first showing that local handcrafted features and Convolutional Neural Networks (CNNs) share the same convolution-pooling network structure.
no code implementations • 16 Oct 2015 • Haohan Wang, Bhiksha Raj
Further, we will also look into the development history of modelling time series data with neural networks.
no code implementations • 6 Aug 2015 • Luís Marujo, José Portêlo, Wang Ling, David Martins de Matos, João P. Neto, Anatole Gershman, Jaime Carbonell, Isabel Trancoso, Bhiksha Raj
State-of-the-art extractive multi-document summarization systems are usually designed without any concern about privacy issues, meaning that all documents are open to third parties.
no code implementations • 27 Feb 2015 • Soham De, Indradyumna Roy, Tarunima Prabhakar, Kriti Suneja, Sourish Chaudhuri, Rita Singh, Bhiksha Raj
Given the large number of new musical tracks released each year, automated approaches to plagiarism detection are essential to help us track potential violations of copyright.
no code implementations • 6 Feb 2015 • Anurag Kumar, Bhiksha Raj
We also introduce a novel metric for ranking instances based on an index which depends upon the rank of weighted scores of test points among the weighted scores of training points.
no code implementations • CVPR 2015 • Zhenzhong Lan, Ming Lin, Xuanchong Li, Alexander G. Hauptmann, Bhiksha Raj
MIFS compensates for information lost from using differential operators by recapturing information at coarse scales.
no code implementations • NeurIPS 2012 • Sourish Chaudhuri, Bhiksha Raj
Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models.
no code implementations • 7 Sep 2012 • Sohail Bahmani, Petros T. Boufounos, Bhiksha Raj
As an example, we elaborate on the application of the main results to estimation in Generalized Linear Models.
no code implementations • NeurIPS 2010 • Manas Pathak, Shantanu Rane, Bhiksha Raj
As increasing amounts of sensitive personal information find their way into data repositories, it is important to develop analysis mechanisms that can derive aggregate information from these repositories without revealing information about individual data instances.
no code implementations • NeurIPS 2009 • Paris Smaragdis, Madhusudana Shashanka, Bhiksha Raj
In this paper we present an algorithm for separating mixed sounds from a monophonic recording.
no code implementations • NeurIPS 2007 • Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
An important problem in many fields is the analysis of counts data to extract meaningful latent components.