Search Results for author: Qin Jin

Found 49 papers, 20 papers with code

Language Resource Efficient Learning for Captioning

no code implementations Findings (EMNLP) 2021 Jia Chen, Yike Wu, Shiwan Zhao, Qin Jin

Our analysis of caption models with SC loss shows that the performance degradation is caused by the increasingly noisy estimation of reward and baseline with fewer language resources.
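As background for the SC (self-critical) loss mentioned in this snippet: self-critical training weights the log-probability of each sampled caption by the sampled reward minus a baseline reward (commonly the greedy-decoded caption's score), so noise in either estimate feeds directly into the gradient. A minimal sketch of the advantage computation, with illustrative names and values not taken from the paper:

```python
import numpy as np

def sc_advantages(sampled_rewards, baseline_reward):
    """Advantage term of the self-critical (SC) loss: each sampled
    caption's log-probability is weighted by (reward - baseline).
    With fewer language resources, both the reward and the baseline
    become noisier estimates, which injects noise into this weight."""
    return np.asarray(sampled_rewards, dtype=float) - baseline_reward

# Example: CIDEr-like rewards for three sampled captions, with the
# greedy caption's reward (0.5) serving as the baseline.
advantages = sc_advantages([0.8, 0.2, 0.5], 0.5)
```

Captions scoring above the baseline receive a positive weight (their likelihood is pushed up), those below a negative one.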

TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat

1 code implementation 14 Jan 2023 Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu

Compared with previous image-based dialogue datasets, the richer sources of context in TikTalk lead to a greater diversity of conversations.

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

1 code implementation 19 Dec 2022 Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo

To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders.

Denoising FAD +1

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

no code implementations 17 Nov 2022 Linli Yao, Weijing Chen, Qin Jin

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation.

Retrieval

Exploring Anchor-based Detection for Ego4D Natural Language Query

no code implementations 10 Aug 2022 Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu

In this paper, we present the technical report for the Ego4D natural language query challenge at CVPR 2022.

Video Understanding

Multi-Task Learning Framework for Emotion Recognition in-the-wild

1 code implementation 19 Jul 2022 Tenggan Zhang, Chuanhe Liu, Xiaolong Liu, Yuchen Liu, Liyu Meng, Lei Sun, Wenqiang Jiang, Fengyuan Zhang, Jinming Zhao, Qin Jin

This paper presents our system for the Multi-Task Learning (MTL) Challenge in the 4th Affective Behavior Analysis in-the-wild (ABAW) competition.

Emotion Recognition Multi-Task Learning +1

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

1 code implementation 18 Jul 2022 Qi Zhang, Yuqing Song, Qin Jin

Dense video captioning aims to generate corresponding text descriptions for a series of events in the untrimmed video, which can be divided into two sub-tasks, event detection and event captioning.

Association Dense Video Captioning +1

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

1 code implementation 16 Jul 2022 Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

In this paper, we propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture, which dynamically adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples.

Retrieval Video Retrieval
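The token selection idea in the TS2-Net snippet above can be illustrated generically: score each token for informativeness and keep only the top-k, preserving their original order. This is a hedged sketch of top-k selection in general, not the paper's learned transformer module; the scores here are given rather than predicted:

```python
import numpy as np

def select_top_k_tokens(tokens, scores, k):
    """Generic top-k token selection: keep the k tokens with the
    highest informativeness scores, preserving original order.
    TS2-Net learns such scores end-to-end; here they are fixed
    inputs for illustration."""
    idx = np.argsort(scores)[-k:]   # indices of the k highest scores
    idx = np.sort(idx)              # restore temporal/spatial order
    return tokens[idx]

tokens = np.arange(6).reshape(6, 1)  # 6 dummy one-dim token embeddings
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
kept = select_top_k_tokens(tokens, scores, k=3)
```

Sorting the surviving indices matters: selection should prune tokens without scrambling the sequence order the downstream transformer relies on.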

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

no code implementations 29 May 2022 Liang Zhang, Anwen Hu, Qin Jin

Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.

Language Acquisition Retrieval +2

M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database

1 code implementation ACL 2022 Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, Haizhou Li

In this work, we propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED, which contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.

Emotion Recognition

Progressive Learning for Image Retrieval with Hybrid-Modality Queries

no code implementations 24 Apr 2022 Yida Zhao, Yuqing Song, Qin Jin

Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities.

Image Retrieval Retrieval +1

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy

no code implementations 31 Mar 2022 Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin

Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing with better quality than conventional statistical parametric methods.

Data Augmentation

Multi-modal Emotion Estimation for in-the-wild Videos

no code implementations 24 Mar 2022 Liyu Meng, Yuchen Liu, Xiaolong Liu, Zhaopei Huang, Yuan Cheng, Meng Wang, Chuanhe Liu, Qin Jin

In this paper, we briefly introduce our submission to the Valence-Arousal Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) competition.

Arousal Estimation

Image Difference Captioning with Pre-training and Contrastive Learning

1 code implementation 9 Feb 2022 Linli Yao, Weiying Wang, Qin Jin

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.

Association Contrastive Learning +1

VRDFormer: End-to-End Video Visual Relation Detection With Transformers

no code implementations CVPR 2022 Sipeng Zheng, ShiZhe Chen, Qin Jin

Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.

Relation Classification Video Understanding +1

MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition

no code implementations 27 Oct 2021 Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li

Multimodal emotion recognition study is hindered by the lack of labelled corpora in terms of scale and diversity, due to the high annotation cost and label ambiguity.

Emotion Classification Multimodal Emotion Recognition +1

Survey: Transformer based Video-Language Pre-training

no code implementations 21 Sep 2021 Ludan Ruan, Qin Jin

Inspired by the success of transformer-based pre-training methods on natural language tasks and, more recently, computer vision tasks, researchers have begun to apply transformers to video processing.

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

1 code implementation 25 Aug 2021 Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang

Firstly, there is much specialized jargon in product descriptions, which is ambiguous to translate without the product image.

Machine Translation Translation

ICECAP: Information Concentrated Entity-aware Image Captioning

1 code implementation 4 Aug 2021 Anwen Hu, ShiZhe Chen, Qin Jin

In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.

Image Captioning Retrieval

Question-controlled Text-aware Image Captioning

1 code implementation 4 Aug 2021 Anwen Hu, ShiZhe Chen, Qin Jin

To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).

Image Captioning Question Answering

Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities

1 code implementation ACL 2021 Jinming Zhao, Ruichen Li, Qin Jin

However, in real-world applications, we often encounter the problem of missing modalities, and it is uncertain which modalities will be missing.

Emotion Recognition

MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation

1 code implementation ACL 2021 Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin

Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses.

Emotion Recognition in Conversation

Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization

1 code implementation 11 Jun 2021 Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin

For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.

Object Localization

Towards Diverse Paragraph Captioning for Untrimmed Videos

1 code implementation CVPR 2021 Yuqing Song, ShiZhe Chen, Qin Jin

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.

Event Detection

Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

1 code implementation 22 Oct 2020 Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin

The neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting due to data scarcity.

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

1 code implementation 12 Apr 2020 Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin

The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos.

Action Understanding Question Answering +1

Better Captioning with Sequence-Level Exploration

no code implementations CVPR 2020 Jia Chen, Qin Jin

In this work, we show the limitation of the current sequence-level learning objective for captioning tasks from both theory and empirical result.

Image Captioning

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

4 code implementations CVPR 2020 Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

Cross-Modal Retrieval Retrieval +3

Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

1 code implementation CVPR 2020 Shizhe Chen, Qin Jin, Peng Wang, Qi Wu

From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.

Image Captioning

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

no code implementations 24 Nov 2019 Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou

A storyboard is a sequence of images illustrating a story of multiple sentences, and creating one has long been a key step in producing various story products.

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

no code implementations 15 Oct 2019 Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu

This notebook paper presents our model in the VATEX video captioning challenge.

Video Captioning

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

no code implementations 15 Aug 2019 Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin

We employ self-supervision from mono-lingual corpus in the target language to provide fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards.

Image Captioning Machine Translation +1

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations 11 Jul 2019 Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves state-of-the-art performance on the dense-captioning events in videos task with a 9.91 METEOR score on the challenge testing set.

Dense Captioning Dense Video Captioning

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

no code implementations 3 Jun 2019 Shizhe Chen, Qin Jin, Jianlong Fu

However, a picture tells a thousand words, which makes multilingual sentences pivoted by the same image noisy as mutual translations and thus hinders translation model learning.

Machine Translation Translation +1

Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data

no code implementations 2 Jun 2019 Shizhe Chen, Qin Jin, Alexander Hauptmann

The linguistic feature is learned from the sentence contexts with visual semantic constraints, which is beneficial to learn translation for words that are less visual-relevant.

Bilingual Lexicon Induction Translation +1

RUC+CMU: System Report for Dense Captioning Events in Videos

no code implementations 22 Jun 2018 Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann

This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).

Dense Captioning Dense Video Captioning

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

no code implementations 4 Sep 2017 Shizhe Chen, Qin Jin

Continuous dimensional emotion prediction is a challenging task in which fusing various modalities, via either early fusion or late fusion, usually achieves state-of-the-art performance.

Video Captioning with Guidance of Multimodal Latent Topics

no code implementations 31 Aug 2017 Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann

For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.

Multi-Task Learning Video Captioning

Generating Video Descriptions with Topic Guidance

no code implementations 31 Aug 2017 Shizhe Chen, Jia Chen, Qin Jin

In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from training captions using an unsupervised topic mining model.

Image Captioning Video Captioning

Improving Image Captioning by Concept-based Sentence Reranking

no code implementations 3 May 2016 Xirong Li, Qin Jin

This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task.

Image Captioning Language Modelling

Detecting Violence in Video using Subclasses

no code implementations 27 Apr 2016 Xirong Li, Yujia Huo, Jieping Xu, Qin Jin

We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses.

Adaptive Tag Selection for Image Annotation

no code implementations 17 Sep 2014 Xixi He, Xirong Li, Gang Yang, Jieping Xu, Qin Jin

The key insight is to divide the vocabulary into two disjoint subsets, namely a seen set consisting of tags having ground truth available for optimizing their thresholds and a novel set consisting of tags without any ground truth.

TAG
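The seen/novel vocabulary split described in this snippet can be sketched generically: tags with ground truth get per-tag thresholds tuned on that data, while novel tags fall back to a default threshold. The function, names, and fallback rule below are illustrative assumptions, not the paper's exact procedure:

```python
def select_tags(tag_scores, seen_thresholds, default_threshold=0.5):
    """Keep a tag if its score exceeds its tuned threshold (seen set,
    ground truth available) or a fallback threshold (novel set, no
    ground truth to optimize against). Fallback rule is an assumption."""
    selected = []
    for tag, score in tag_scores.items():
        threshold = seen_thresholds.get(tag, default_threshold)
        if score > threshold:
            selected.append(tag)
    return selected

scores = {"dog": 0.7, "beach": 0.4, "aurora": 0.6}
tuned = {"dog": 0.6, "beach": 0.3}   # seen tags with optimized thresholds
kept = select_tags(scores, tuned)    # "aurora" is novel, uses the default
```

The point of the split is that per-tag thresholds can only be optimized where labeled data exists; everything else needs a threshold transferred or defaulted from elsewhere.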
