no code implementations • 17 Sep 2014 • Xixi He, Xirong Li, Gang Yang, Jieping Xu, Qin Jin
The key insight is to divide the vocabulary into two disjoint subsets: a seen set of tags with ground truth available for optimizing their thresholds, and a novel set of tags without any ground truth.
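A minimal sketch of the seen-set half of this idea, assuming per-tag relevance scores, binary labels, and an F1 criterion over a threshold grid (the grid, criterion, and function name are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_seen_thresholds(scores, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, for each seen tag, the decision threshold that maximizes F1.

    scores: (n_images, n_seen_tags) predicted tag relevance scores
    labels: (n_images, n_seen_tags) binary ground truth
    """
    thresholds = np.zeros(scores.shape[1])
    for t in range(scores.shape[1]):
        f1s = [f1_score(labels[:, t], scores[:, t] >= th) for th in grid]
        thresholds[t] = grid[int(np.argmax(f1s))]
    return thresholds
```

How thresholds are then propagated to the novel set, which has no ground truth, is the paper's contribution and is not reproduced here.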
no code implementations • 27 Apr 2016 • Xirong Li, Yujia Huo, Jieping Xu, Qin Jin
We enrich the MediaEval 2015 violence dataset by manually labeling violent videos with respect to the subclasses.
no code implementations • 3 May 2016 • Xirong Li, Qin Jin
This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task.
no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin
In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from the training captions with an unsupervised topic mining model.
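As a rough illustration of the data-driven side, one could mine latent topics from training captions with an off-the-shelf topic model; LDA here is a stand-in assumption, not necessarily the paper's exact model:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

captions = [
    "a man is cooking pasta in a kitchen",
    "a dog runs after a ball in the park",
]  # training captions

# Bag-of-words over the captions, then an unsupervised topic model.
bow = CountVectorizer(stop_words="english").fit_transform(captions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(bow)  # per-caption topic distributions
```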
no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.
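A minimal sketch of the teacher-student step, assuming the mined topic distributions serve as soft targets for a KL-divergence loss (the loss choice is an assumption):

```python
import torch.nn.functional as F

def topic_distillation_loss(student_logits, teacher_topic_dist):
    """KL divergence between the student's predicted topic distribution
    and the mined (teacher) topic distribution used as soft targets."""
    log_p = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_p, teacher_topic_dist, reduction="batchmean")
```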
no code implementations • 4 Sep 2017 • Shizhe Chen, Qin Jin
Continuous dimensional emotion prediction is a challenging task in which fusing multiple modalities, whether by early fusion or late fusion, usually achieves state-of-the-art performance.
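For readers unfamiliar with the two fusion schemes, a minimal sketch (the model handle, weights, and feature shapes are illustrative assumptions):

```python
import torch

def early_fusion(audio_feat, visual_feat, model):
    # Fuse at the feature level: concatenate modalities before prediction.
    return model(torch.cat([audio_feat, visual_feat], dim=-1))

def late_fusion(audio_pred, visual_pred, w=0.5):
    # Fuse at the decision level: combine per-modality predictions.
    return w * audio_pred + (1 - w) * visual_pred
```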
no code implementations • 22 Jun 2018 • Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann
This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).
no code implementations • 2 Jun 2019 • Shizhe Chen, Qin Jin, Alexander Hauptmann
The linguistic feature is learned from sentence contexts with visual semantic constraints, which is beneficial for learning translations of words that are less visually relevant.
no code implementations • 3 Jun 2019 • Shizhe Chen, Qin Jin, Jianlong Fu
However, a picture tells a thousand words: multi-lingual sentences pivoted by the same image are noisy as mutual translations, which hinders the learning of the translation model.
no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.
no code implementations • 15 Aug 2019 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin
We employ self-supervision from a monolingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards.
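A schematic of how such rewards might be combined; every handle and weight below is a hypothetical placeholder, not the paper's API:

```python
def total_reward(caption, image, lm_score, sent_match, concept_match,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of a fluency reward from a target-language LM and
    sentence-/concept-level visual relevancy rewards (weights assumed)."""
    fluency = lm_score(caption)                # self-supervised fluency reward
    r_sent = sent_match(caption, image)        # sentence-level relevancy
    r_concept = concept_match(caption, image)  # concept-level relevancy
    return alpha * fluency + beta * r_sent + gamma * r_concept
```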
no code implementations • 15 Oct 2019 • Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu
This notebook paper presents our model in the VATEX video captioning challenge.
no code implementations • IJCNLP 2019 • Weiying Wang, Yongcheng Wang, Shi-Zhe Chen, Qin Jin
Multimodal semantic comprehension, such as visual question answering and caption generation, has recently attracted increasing research interest.
no code implementations • 24 Nov 2019 • Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou
A storyboard is a sequence of images that illustrates a story containing multiple sentences, and creating one has been a key step in producing different story products.
4 code implementations • CVPR 2020 • Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
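HGR matches at event, action, and entity levels; below is a minimal sketch of the level-wise aggregation only (the graph reasoning and attentive matching that produce the level embeddings are omitted, and the weights are assumptions):

```python
import torch.nn.functional as F

def hierarchical_score(video_levels, text_levels, weights=(1.0, 1.0, 1.0)):
    """Sum cosine similarities across global-to-local levels.

    video_levels / text_levels: lists of (d,) embeddings,
    e.g. [event, action, entity] representations.
    """
    score = 0.0
    for w, v, t in zip(weights, video_levels, text_levels):
        score = score + w * F.cosine_similarity(v, t, dim=0)
    return score
```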
1 code implementation • CVPR 2020 • Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.
no code implementations • CVPR 2020 • Jia Chen, Qin Jin
In this work, we show the limitation of the current sequence-level learning objective for captioning tasks from both theoretical and empirical perspectives.
1 code implementation • 12 Apr 2020 • Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos.
no code implementations • 14 Jun 2020 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin
Detecting meaningful events in an untrimmed video is essential for dense video captioning.
Ranked #3 on Dense Video Captioning on ActivityNet Captions
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
1 code implementation • 22 Oct 2020 • Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin
Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well but are prone to over-fitting due to data scarcity.
2 code implementations • 11 Mar 2021 • Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, ShiZhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen
We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
Ranked #1 on Image Retrieval on RUC-CAS-WenLan
1 code implementation • CVPR 2021 • Yuqing Song, ShiZhe Chen, Qin Jin
Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.
1 code implementation • 11 Jun 2021 • Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin
For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.
no code implementations • 14 Jun 2021 • Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan YAO, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, Jun Zhu
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI).
1 code implementation • ACL 2021 • Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin
Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses.
1 code implementation • ACL 2021 • Jinming Zhao, Ruichen Li, Qin Jin
However, in real-world applications we often encounter the problem of missing modalities, and it is uncertain which modalities will be missing.
1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin
To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).
1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin
In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.
1 code implementation • 25 Aug 2021 • Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
Firstly, there are many specialized jargon terms in the product description, which are ambiguous to translate without the product image.
no code implementations • 21 Sep 2021 • Ludan Ruan, Qin Jin
Inspired by the success of transformer-based pre-training methods on natural language tasks and, more recently, computer vision tasks, researchers have begun to apply transformers to video processing.
no code implementations • 27 Oct 2021 • Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li
Multimodal emotion recognition research is hindered by the lack of labelled corpora in terms of scale and diversity, due to the high annotation cost and label ambiguity.
no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin
Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.
1 code implementation • 9 Feb 2022 • Linli Yao, Weiying Wang, Qin Jin
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.
no code implementations • 24 Mar 2022 • Liyu Meng, Yuchen Liu, Xiaolong Liu, Zhaopei Huang, Yuan Cheng, Meng Wang, Chuanhe Liu, Qin Jin
In this paper, we briefly introduce our submission to the Valence-Arousal Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) competition.
no code implementations • 26 Mar 2022 • Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, Lingxiao Huang, Zheng Liang, HuaWei Shen, HUI ZHANG, Quanshi Zhang, Qingxiu Dong, Zhixing Tan, Mingxuan Wang, Shuo Wang, Long Zhou, Haoran Li, Junwei Bao, Yingwei Pan, Weinan Zhang, Zhou Yu, Rui Yan, Chence Shi, Minghao Xu, Zuobai Zhang, Guoqiang Wang, Xiang Pan, Mengjie Li, Xiaoyu Chu, Zijun Yao, Fangwei Zhu, Shulin Cao, Weicheng Xue, Zixuan Ma, Zhengyan Zhang, Shengding Hu, Yujia Qin, Chaojun Xiao, Zheni Zeng, Ganqu Cui, Weize Chen, Weilin Zhao, Yuan YAO, Peng Li, Wenzhao Zheng, Wenliang Zhao, Ziyi Wang, Borui Zhang, Nanyi Fei, Anwen Hu, Zenan Ling, Haoyang Li, Boxi Cao, Xianpei Han, Weidong Zhan, Baobao Chang, Hao Sun, Jiawen Deng, Chujie Zheng, Juanzi Li, Lei Hou, Xigang Cao, Jidong Zhai, Zhiyuan Liu, Maosong Sun, Jiwen Lu, Zhiwu Lu, Qin Jin, Ruihua Song, Ji-Rong Wen, Zhouchen Lin, LiWei Wang, Hang Su, Jun Zhu, Zhifang Sui, Jiajun Zhang, Yang Liu, Xiaodong He, Minlie Huang, Jian Tang, Jie Tang
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm.
no code implementations • 31 Mar 2022 • Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin
Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing of better quality than conventional statistical parametric methods.
no code implementations • 24 Apr 2022 • Yida Zhao, Yuqing Song, Qin Jin
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities.
1 code implementation • ACL 2022 • Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, Haizhou Li
In this work, we propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED, which contains 990 dyadic emotional dialogues from 56 different TV series, with a total of 9,082 turns and 24,449 utterances.
Cultural Vocal Bursts Intensity Prediction • Emotion Recognition
no code implementations • 29 May 2022 • Liang Zhang, Anwen Hu, Qin Jin
Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.
1 code implementation • 16 Jul 2022 • Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin
In this paper, we propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture, which dynamically adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples.
Ranked #8 on Video Retrieval on VATEX
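A simplified sketch of the token-shift half of the TS2-Net idea above, moving a few tokens between adjacent frames so each frame's sequence carries temporal context (the token-selection module and the exact shift pattern are not reproduced):

```python
import torch

def temporal_token_shift(x, n_shift):
    """x: (batch, frames, tokens, dim). Move the first n_shift tokens of
    each frame one step forward in time, and the next n_shift one step back."""
    out = x.clone()
    out[:, 1:, :n_shift] = x[:, :-1, :n_shift]
    out[:, :-1, n_shift:2 * n_shift] = x[:, 1:, n_shift:2 * n_shift]
    return out
```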
1 code implementation • 18 Jul 2022 • Qi Zhang, Yuqing Song, Qin Jin
Dense video captioning aims to generate corresponding text descriptions for a series of events in the untrimmed video, which can be divided into two sub-tasks, event detection and event captioning.
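The two sub-tasks compose into a simple pipeline; the handles below are hypothetical placeholders for whatever detector and captioner are used:

```python
def dense_caption(video, detect_events, caption_event):
    """Detect event segments, then caption each segment."""
    events = detect_events(video)  # [(start, end), ...]
    return [(s, e, caption_event(video, s, e)) for (s, e) in events]
```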
1 code implementation • 19 Jul 2022 • Tenggan Zhang, Chuanhe Liu, Xiaolong Liu, Yuchen Liu, Liyu Meng, Lei Sun, Wenqiang Jiang, Fengyuan Zhang, Jinming Zhao, Qin Jin
This paper presents our system for the Multi-Task Learning (MTL) Challenge in the 4th Affective Behavior Analysis in-the-wild (ABAW) competition.
no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
In this paper, we present the technical report for the Ego4D Natural Language Query challenge at CVPR 2022.
1 code implementation • 17 Nov 2022 • Linli Yao, Weijing Chen, Qin Jin
Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation.
1 code implementation • CVPR 2023 • Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders.
no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin
Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.
1 code implementation • 14 Jan 2023 • Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu
Experimental results indicate that models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall.
1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
1 code implementation • 19 Apr 2023 • Liang Zhang, Anwen Hu, Jing Zhang, Shuo Hu, Qin Jin
Taking into account the length of product manuals and the fact that a question always relates to a small number of pages, MPMQA naturally splits into two subtasks: retrieving the most relevant pages and then generating multimodal answers.
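A minimal sketch of that two-stage split, assuming precomputed page embeddings and a generation handle (both are illustrative assumptions, not the paper's API):

```python
import numpy as np

def answer_manual_question(q_emb, page_embs, pages, generate, top_k=3):
    """Retrieve the most relevant pages by cosine similarity,
    then generate a multimodal answer from them."""
    sims = page_embs @ q_emb
    sims = sims / (np.linalg.norm(page_embs, axis=1) * np.linalg.norm(q_emb))
    top = np.argsort(-sims)[:top_k]
    return generate(q_emb, [pages[i] for i in top])
```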
1 code implementation • 21 Apr 2023 • Weijing Chen, Linli Yao, Qin Jin
The reason is that a large proportion of the images and texts in the benchmarks are coarse-grained.
1 code implementation • 28 Apr 2023 • Jieting Chen, Junkai Ding, Wenping Chen, Qin Jin
Live video commenting is popular on video media platforms, as it can create a chatting atmosphere and provide supplementary information for users while watching videos.
1 code implementation • 10 May 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin
Existing metrics only provide a single score to measure caption quality, which is less explainable and informative.
no code implementations • 15 May 2023 • Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin
In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests.
1 code implementation • 20 May 2023 • Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin
Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.
no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
no code implementations • 31 Jul 2023 • Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin
To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training.
no code implementations • ICCV 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin
To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.
2 code implementations • 8 Oct 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, Fei Huang
Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs.
1 code implementation • 22 Feb 2024 • Zihao Yue, Liang Zhang, Qin Jin
In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, leading to continued outputs beyond visual perception limits.
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.
1 code implementation • 19 Mar 2024 • Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs.
1 code implementation • 20 Apr 2024 • Zihao Yue, Yepeng Zhang, Ziheng Wang, Qin Jin
Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences.
no code implementations • 23 Apr 2024 • Qingrong He, Kejun Lin, ShiZhe Chen, Anwen Hu, Qin Jin
The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules.
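Schematically, the Think-then-Program loop might look like the following; all handles are hypothetical placeholders rather than the paper's actual modules:

```python
def think_then_program(question, decompose, ground_step, modules):
    """Think: split a compositional question into steps.
    Program: ground each step to a call into a 3D perception module."""
    steps = decompose(question)          # e.g. ["find chairs", "count them"]
    result = None
    for step in steps:
        fn_name, args = ground_step(step, result)
        result = modules[fn_name](*args)  # run a 3D visual perception module
    return result
```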
1 code implementation • 25 Apr 2024 • Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang
Charts are important for presenting and explaining complex data relationships.
no code implementations • COLING 2022 • Yuchen Liu, Jinming Zhao, Jingwen Hu, Ruichen Li, Qin Jin
Emotion Recognition in Conversation (ERC) has attracted increasing attention in the affective computing research field.
no code implementations • Findings (EMNLP) 2021 • Jia Chen, Yike Wu, Shiwan Zhao, Qin Jin
Our analysis of caption models with SC loss shows that the performance degradation is caused by the increasingly noisy estimation of reward and baseline with fewer language resources.