Search Results for author: Qin Jin

Found 30 papers, 9 papers with code

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

1 code implementation 25 Aug 2021 Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang

First, product descriptions contain many specialized jargon terms that are ambiguous to translate without the product image.

Machine Translation

ICECAP: Information Concentrated Entity-aware Image Captioning

1 code implementation 4 Aug 2021 Anwen Hu, ShiZhe Chen, Qin Jin

In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.

Image Captioning

Question-controlled Text-aware Image Captioning

no code implementations4 Aug 2021 Anwen Hu, ShiZhe Chen, Qin Jin

To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).

Image Captioning Question Answering +1

Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities

1 code implementation ACL 2021 Jinming Zhao, Ruichen Li, Qin Jin

However, in real-world applications we often encounter the problem of missing modalities, and it is uncertain which modalities will be missing.

Emotion Recognition

MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation

no code implementations ACL 2021 Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin

Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses.

Emotion Recognition in Conversation
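
The snippet above only hints at the mechanism. As a rough illustration (not the paper's actual model), one graph-convolution step over utterance features in a conversation can be sketched as follows; the fully connected conversation graph and all names here are simplifying assumptions:

```python
import numpy as np

def gcn_layer(features, adjacency):
    """One propagation step: normalized neighborhood averaging.

    features:  (n_utterances, dim) array of fused multimodal features
    adjacency: (n, n) 0/1 matrix connecting utterances in the conversation
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)           # node degrees
    return (a_hat @ features) / deg                  # mean over neighbors

# Three utterances in one conversation, fully connected
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = np.ones((3, 3)) - np.eye(3)
out = gcn_layer(feats, adj)
```

With a fully connected graph, each utterance representation becomes the mean of all utterances after one step; stacking such layers (as deep GCNs do) spreads conversational context further.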

Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization

no code implementations 11 Jun 2021 Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin

For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.

Object Localization

Towards Diverse Paragraph Captioning for Untrimmed Videos

1 code implementation CVPR 2021 Yuqing Song, ShiZhe Chen, Qin Jin

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.

Event Detection

Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

no code implementations 22 Oct 2020 Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin

Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting due to data scarcity.

Singing Voice Synthesis

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

1 code implementation 12 Apr 2020 Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin

The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos.

Action Understanding Question Answering +1

Better Captioning with Sequence-Level Exploration

no code implementations CVPR 2020 Jia Chen, Qin Jin

In this work, we show the limitation of the current sequence-level learning objective for captioning tasks from both theory and empirical result.

Image Captioning

Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

1 code implementation CVPR 2020 Shizhe Chen, Qin Jin, Peng Wang, Qi Wu

From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.

Image Captioning
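
To make the idea concrete, a hypothetical minimal representation of an abstract scene graph (ASG) might look like the following sketch; the structure and field names are assumptions for illustration, not the paper's actual interface:

```python
# Hypothetical minimal ASG: users add object, attribute, and relationship
# nodes to control what the generated caption should mention.
asg = {
    "objects": {"o1": "dog", "o2": "ball"},   # object nodes
    "attributes": {"o1": ["brown"]},          # attribute nodes per object
    "relations": [("o1", "chases", "o2")],    # relationship nodes
}

def asg_node_counts(graph):
    """Summarize the user's intention: how many nodes of each type."""
    return {
        "objects": len(graph["objects"]),
        "attributes": sum(len(v) for v in graph["attributes"].values()),
        "relations": len(graph["relations"]),
    }

counts = asg_node_counts(asg)
```

The node types and their connections are what a graph-conditioned captioner such as ASG2Caption would consume to decide which objects, attributes, and relations the caption should cover.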

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

2 code implementations CVPR 2020 Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

Cross-Modal Retrieval Text Matching +1
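
As a loose sketch of the global-to-local idea (heavily simplified; the real HGR model uses attention-based graph reasoning, and every name below is an assumption), a hierarchical matching score could combine one similarity per level:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hierarchical_score(video_levels, text_levels, weights=(1/3, 1/3, 1/3)):
    """Combine per-level similarities (e.g. events / actions / entities)
    into one overall video-text matching score via a weighted sum."""
    sims = [cosine(v, t) for v, t in zip(video_levels, text_levels)]
    return sum(w * s for w, s in zip(weights, sims))

# Identical embeddings at all three levels -> maximal score
score = hierarchical_score(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

Decomposing the score this way is what lets a retrieval model credit fine-grained matches (a shared entity or action) even when the global descriptions differ.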

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

no code implementations 24 Nov 2019 Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou

A storyboard is a sequence of images illustrating a story of multiple sentences, and creating one has been a key step in producing various story products.

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

no code implementations 15 Oct 2019 Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu

This notebook paper presents our model in the VATEX video captioning challenge.

Video Captioning

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

no code implementations 15 Aug 2019 Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin

We employ self-supervision from mono-lingual corpus in the target language to provide fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards.

Image Captioning Machine Translation

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations 11 Jul 2019 Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.

Dense Video Captioning

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

no code implementations 3 Jun 2019 Shizhe Chen, Qin Jin, Jianlong Fu

However, a picture tells a thousand words, so multi-lingual sentences pivoted by the same image are noisy as mutual translations, which hinders translation model learning.

Machine Translation

Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data

no code implementations 2 Jun 2019 Shizhe Chen, Qin Jin, Alexander Hauptmann

The linguistic feature is learned from sentence contexts with visual semantic constraints, which is beneficial for learning translations of words that are less visually relevant.

Bilingual Lexicon Induction

RUC+CMU: System Report for Dense Captioning Events in Videos

no code implementations 22 Jun 2018 Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann

This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).

Dense Video Captioning

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

no code implementations 4 Sep 2017 Shizhe Chen, Qin Jin

Continuous dimensional emotion prediction is a challenging task in which fusing various modalities, such as by early fusion or late fusion, usually achieves state-of-the-art performance.

Video Captioning with Guidance of Multimodal Latent Topics

no code implementations 31 Aug 2017 Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann

For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.

Multi-Task Learning Video Captioning

Generating Video Descriptions with Topic Guidance

no code implementations 31 Aug 2017 Shizhe Chen, Jia Chen, Qin Jin

In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way based on training captions by an unsupervised topic mining model.

Image Captioning Video Captioning

Improving Image Captioning by Concept-based Sentence Reranking

no code implementations 3 May 2016 Xirong Li, Qin Jin

This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task.

Image Captioning Language Modelling

Detecting Violence in Video using Subclasses

no code implementations 27 Apr 2016 Xirong Li, Yujia Huo, Jieping Xu, Qin Jin

We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses.

Adaptive Tag Selection for Image Annotation

no code implementations 17 Sep 2014 Xixi He, Xirong Li, Gang Yang, Jieping Xu, Qin Jin

The key insight is to divide the vocabulary into two disjoint subsets, namely a seen set consisting of tags having ground truth available for optimizing their thresholds and a novel set consisting of tags without any ground truth.
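
The thresholding idea can be sketched roughly as below; function and parameter names, and the shared-default fallback for novel tags, are illustrative assumptions (the paper's actual optimization of seen-tag thresholds is not shown):

```python
def select_tags(scores, seen_thresholds, default_threshold=0.5):
    """Keep a tag if its score exceeds its per-tag threshold.

    scores:          {tag: confidence} from an annotation model
    seen_thresholds: per-tag thresholds tuned on held-out ground truth
                     (the 'seen' set); tags without one (the 'novel'
                     set) fall back to a shared default threshold.
    """
    return sorted(
        tag for tag, s in scores.items()
        if s > seen_thresholds.get(tag, default_threshold)
    )

# 'dog' has a tuned threshold; 'cat' and 'tree' fall back to the default
selected = select_tags({"dog": 0.6, "cat": 0.4, "tree": 0.7}, {"dog": 0.7})
```

The point of the split is that tags with ground truth get thresholds optimized per tag, while novel tags still get a sensible fallback instead of no decision rule at all.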
