Experimental results across a range of language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models in finetuning, zero-shot generalization, and few-shot learning settings.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling by attending to both the text context and the visual knowledge in images.
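As a rough illustration, this fusion step can be read as cross-attention from the token representations to the retrieved image embeddings. The sketch below is a hypothetical PyTorch rendering, not VaLM's released code: the class name, dimensions, and residual placement are assumptions.

```python
import torch
import torch.nn as nn

class VisualKnowledgeFusion(nn.Module):
    """Hypothetical fusion layer: each token attends jointly to the
    text context and to embeddings of retrieved images, so next-token
    prediction can be grounded in both modalities."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, d_model) token representations
        # image_feats: (batch, k, d_model) embeddings of k retrieved images
        memory = torch.cat([text_hidden, image_feats], dim=1)
        fused, _ = self.attn(query=text_hidden, key=memory, value=memory)
        # Residual connection keeps the layer a lightweight add-on to the LM.
        return self.norm(text_hidden + fused)
```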
We first evaluate CLIP's zero-shot performance on a typical visual question answering task, and then demonstrate its zero-shot cross-modality transfer capability on the visual entailment task.
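A common way to apply CLIP zero-shot to VQA is to rephrase each candidate answer as a caption and let CLIP score it against the image. Below is a minimal sketch using the Hugging Face `transformers` CLIP API; the checkpoint, prompt template, file name, and answer set are illustrative assumptions, not necessarily the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
question = "What animal is in the picture?"
candidate_answers = ["a dog", "a cat", "a horse"]

# Turn each candidate answer into a full-sentence prompt and let CLIP
# pick the prompt that best matches the image.
prompts = [f"{question} {ans}" for ans in candidate_answers]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_prompts); softmax over prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
print(candidate_answers[probs.argmax().item()])
```

Because no VQA supervision is involved, accuracy here hinges entirely on how the answers are verbalized into prompts, which is why prompt design matters for zero-shot transfer.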
Maintaining consistent personas is essential for dialogue agents.
Maintaining a consistent personality in conversations is quite natural for human beings, but is still a non-trivial task for machines.
Given conversational context with persona information, how a chatbot can exploit that information to generate diverse and sustainable conversations remains a non-trivial task.