Search Results for author: Qin Jin

Found 65 papers, 32 papers with code

Language Resource Efficient Learning for Captioning

no code implementations • Findings (EMNLP) 2021 • Jia Chen, Yike Wu, Shiwan Zhao, Qin Jin

Our analysis of caption models trained with the self-critical (SC) loss shows that the performance degradation is caused by the increasingly noisy estimation of the reward and baseline with fewer language resources.
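For readers unfamiliar with the self-critical (SC) objective referenced above, the sketch below illustrates the reward-minus-baseline computation that becomes noisy when only a few reference captions are available. It is a generic illustration using placeholder functions (a toy overlap metric and hard-coded captions), not the paper's implementation.

```python
# A minimal sketch of the self-critical (SC) reward/baseline computation
# (self-critical sequence training style). The metric and caption strings
# below are hypothetical placeholders, not the paper's code.

def sc_advantage(metric, sampled_caption, greedy_caption, references):
    """Reward of a sampled caption minus the greedy-decoding baseline.

    With few references (low-resource settings), both terms become noisy
    estimates, which is the degradation the paper analyses.
    """
    reward = metric(sampled_caption, references)    # e.g. CIDEr of the sample
    baseline = metric(greedy_caption, references)   # greedy decode as baseline
    return reward - baseline                        # advantage weighting the log-probs


def sc_loss(log_prob_of_sample, advantage):
    """REINFORCE-style surrogate loss: -advantage * log p(sampled caption)."""
    return -advantage * log_prob_of_sample


if __name__ == "__main__":
    # Toy metric: unigram overlap with the (possibly tiny) reference set.
    def overlap(hyp, refs):
        hyp_tokens = set(hyp.split())
        return max(len(hyp_tokens & set(r.split())) / max(len(hyp_tokens), 1) for r in refs)

    refs = ["a dog runs on the grass"]              # only one reference: noisy reward
    adv = sc_advantage(overlap, "a dog runs fast", "a cat sits", refs)
    print(sc_loss(log_prob_of_sample=-2.3, advantage=adv))
```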

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

1 code implementation • 19 Mar 2024 • Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

In this work, we emphasize the importance of structure information in Visual Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs.

document understanding • Optical Character Recognition (OCR)

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.

SPAFormer: Sequential 3D Part Assembly with Transformers

1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin

We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.

Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective

1 code implementation • 22 Feb 2024 • Zihao Yue, Liang Zhang, Qin Jin

In this paper, we explore a new angle of this issue: overly detailed training data hinders the model's ability to timely terminate generation, leading to continued outputs beyond visual perception limits.

Hallucination • Sentence

Explore and Tell: Embodied Visual Captioning in 3D Environments

no code implementations • ICCV 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.

Image Captioning • Navigate • +1

Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

no code implementations • 31 Jul 2023 • Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin

To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training.

Image Captioning • Language Modelling

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin

Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.

Boundary Detection • Video Grounding

Movie101: A New Movie Understanding Benchmark

1 code implementation • 20 May 2023 • Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin

Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.

Video Captioning

Edit As You Wish: Video Description Editing with Multi-grained Commands

no code implementations • 15 May 2023 • Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin

In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests.

Attribute • Position • +3

InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

1 code implementation • 10 May 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin

Existing metrics provide only a single score to measure caption quality, which is less explainable and informative.

Benchmarking • Image Captioning

Knowledge Enhanced Model for Live Video Comment Generation

1 code implementation • 28 Apr 2023 • Jieting Chen, Junkai Ding, Wenping Chen, Qin Jin

Live video commenting is popular on video media platforms, as it can create a chatting atmosphere and provide supplementary information for users while watching videos.

Comment Generation

Rethinking Benchmarks for Cross-modal Image-text Retrieval

1 code implementation • 21 Apr 2023 • Weijing Chen, Linli Yao, Qin Jin

The reason is that a large number of the images and texts in these benchmarks are coarse-grained.

Cross-Modal Retrieval • Image-to-Text Retrieval • +3

MPMQA: Multimodal Question Answering on Product Manuals

1 code implementation • 19 Apr 2023 • Liang Zhang, Anwen Hu, Jing Zhang, Shuo Hu, Qin Jin

Taking into account the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving the most relevant pages and then generating multimodal answers.

Question Answering • Sentence

Accommodating Audio Modality in CLIP for Multimodal Processing

1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin

In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.

AudioCaps • Contrastive Learning • +4
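As a rough illustration of how an audio branch can be aligned to a CLIP-style joint embedding space with a contrastive objective, the hedged sketch below uses toy placeholder encoders and random features; it is not the paper's actual architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch: align a new audio encoder to a frozen CLIP-style embedding space
# with a symmetric InfoNCE loss. The encoder below is a toy placeholder, not the
# paper's module.

class ToyAudioEncoder(nn.Module):
    def __init__(self, in_dim=64, embed_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, x):                 # x: (batch, in_dim) pooled audio features
        return F.normalize(self.proj(x), dim=-1)


def clip_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over audio-text similarities, as in CLIP-style training."""
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    batch, audio_dim, embed_dim = 8, 64, 512
    audio_encoder = ToyAudioEncoder(audio_dim, embed_dim)
    audio_emb = audio_encoder(torch.randn(batch, audio_dim))
    # Stand-in for frozen CLIP text embeddings of the paired captions.
    text_emb = F.normalize(torch.randn(batch, embed_dim), dim=-1)
    print(clip_style_contrastive_loss(audio_emb, text_emb).item())
```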

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

1 code implementation • 14 Jan 2023 • Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu

Experimental results indicate that the models incorporating large language models (LLM) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall.

Knowledge Graphs

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin

Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.

Human-Object Interaction Detection • Language Modelling

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

1 code implementation • CVPR 2023 • Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo

To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders.

Denoising • FAD • +1

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

1 code implementation • 17 Nov 2022 • Linli Yao, Weijing Chen, Qin Jin

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation.

Concept Alignment • Retrieval

Exploring Anchor-based Detection for Ego4D Natural Language Query

no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu

In this paper, we present our technical report for the Ego4D Natural Language Query challenge at CVPR 2022.

Video Understanding

Multi-Task Learning Framework for Emotion Recognition in-the-wild

1 code implementation • 19 Jul 2022 • Tenggan Zhang, Chuanhe Liu, Xiaolong Liu, Yuchen Liu, Liyu Meng, Lei Sun, Wenqiang Jiang, Fengyuan Zhang, Jinming Zhao, Qin Jin

This paper presents our system for the Multi-Task Learning (MTL) Challenge in the 4th Affective Behavior Analysis in-the-wild (ABAW) competition.

Emotion Recognition • Multi-Task Learning • +1

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

1 code implementation • 18 Jul 2022 • Qi Zhang, Yuqing Song, Qin Jin

Dense video captioning aims to generate corresponding text descriptions for a series of events in the untrimmed video, which can be divided into two sub-tasks, event detection and event captioning.

Dense Video Captioning • Event Detection
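The sketch below illustrates the conventional two-stage decomposition mentioned in the snippet (detect events, then caption each segment); the paper itself unifies the two sub-tasks as sequence generation, which is not reproduced here. The stub detector and captioner are hypothetical placeholders.

```python
# Hedged sketch of the two sub-tasks of dense video captioning as a pipeline:
# event detection followed by per-segment event captioning. Illustrative stubs only.

from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def dense_video_captioning(
    video_features,
    detect_events: Callable[[object], List[Segment]],
    caption_event: Callable[[object, Segment], str],
) -> List[Tuple[Segment, str]]:
    """Run event detection, then caption each detected segment."""
    segments = detect_events(video_features)
    return [(seg, caption_event(video_features, seg)) for seg in segments]


if __name__ == "__main__":
    # Toy stand-ins for the two sub-task models.
    detect = lambda feats: [(0.0, 4.5), (4.5, 10.0)]
    caption = lambda feats, seg: f"an event between {seg[0]:.1f}s and {seg[1]:.1f}s"
    for seg, text in dense_video_captioning(None, detect, caption):
        print(seg, "->", text)
```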

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

1 code implementation • 16 Jul 2022 • Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

In this paper, we propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture, which dynamically adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples.

Retrieval • Video Retrieval
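To make the idea of shifting tokens along the temporal dimension concrete, here is a minimal, hedged sketch of a temporal token-shift operation in the spirit of TS2-Net's token shift module; the shift ratio and tensor layout are assumptions, not the paper's exact design.

```python
import torch

# Hedged sketch of a temporal token shift: a fraction of tokens is shifted
# forward/backward in time so that each frame mixes information from its
# neighbours. Generic illustration, not the paper's exact implementation.

def temporal_token_shift(tokens: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (batch, frames, num_tokens, dim). Returns a shifted copy."""
    b, t, n, d = tokens.shape
    n_shift = max(1, int(n * shift_ratio))
    out = tokens.clone()
    # First chunk of tokens: take them from the previous frame (shift forward in time).
    out[:, 1:, :n_shift] = tokens[:, :-1, :n_shift]
    # Second chunk: take them from the next frame (shift backward in time).
    out[:, :-1, n_shift:2 * n_shift] = tokens[:, 1:, n_shift:2 * n_shift]
    # Remaining tokens stay untouched.
    return out


if __name__ == "__main__":
    x = torch.randn(2, 8, 50, 64)         # 2 clips, 8 frames, 50 patch tokens, dim 64
    print(temporal_token_shift(x).shape)   # torch.Size([2, 8, 50, 64])
```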

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

no code implementations • 29 May 2022 • Liang Zhang, Anwen Hu, Qin Jin

Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.

Language Acquisition • Retrieval • +2

M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database

1 code implementation • ACL 2022 • Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, Haizhou Li

In this work, we propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED, which contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.

Cultural Vocal Bursts Intensity Prediction • Emotion Recognition

Progressive Learning for Image Retrieval with Hybrid-Modality Queries

no code implementations • 24 Apr 2022 • Yida Zhao, Yuqing Song, Qin Jin

Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities.

Image Retrieval • Retrieval • +1

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy

no code implementations • 31 Mar 2022 • Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin

Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing with better quality than conventional statistical parametric methods.

Data Augmentation • Singing Voice Synthesis

Multi-modal Emotion Estimation for in-the-wild Videos

no code implementations • 24 Mar 2022 • Liyu Meng, Yuchen Liu, Xiaolong Liu, Zhaopei Huang, Yuan Cheng, Meng Wang, Chuanhe Liu, Qin Jin

In this paper, we briefly introduce our submission to the Valence-Arousal Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) competition.

Arousal Estimation

Image Difference Captioning with Pre-training and Contrastive Learning

1 code implementation • 9 Feb 2022 • Linli Yao, Weiying Wang, Qin Jin

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.

Contrastive Learning • Fine-Grained Image Classification

VRDFormer: End-to-End Video Visual Relation Detection With Transformers

no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin

Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.

Object • Relation • +3

MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition

no code implementations • 27 Oct 2021 • Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li

Research on multimodal emotion recognition is hindered by the lack of labelled corpora of sufficient scale and diversity, due to the high annotation cost and label ambiguity.

Emotion Classification • Multimodal Emotion Recognition • +1

Survey: Transformer based Video-Language Pre-training

no code implementations • 21 Sep 2021 • Ludan Ruan, Qin Jin

Inspired by the success of transformer-based pre-training methods on natural language tasks and, more recently, computer vision tasks, researchers have begun to apply transformers to video processing.

Position

Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training

1 code implementation • 25 Aug 2021 • Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang

First, product descriptions contain many specialized jargon terms, which are ambiguous to translate without the product image.

Machine Translation • Translation

ICECAP: Information Concentrated Entity-aware Image Captioning

1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin

In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.

Image Captioning • Retrieval • +1

Question-controlled Text-aware Image Captioning

1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin

To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).

Image Captioning • Question Answering

Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities

1 code implementation • ACL 2021 • Jinming Zhao, Ruichen Li, Qin Jin

However, in real-world applications, we often encounter the missing modality problem, and it is uncertain which modalities will be missing.

Emotion Recognition

MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation

1 code implementation • ACL 2021 • Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin

Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses.

Emotion Recognition in Conversation

Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization

1 code implementation • 11 Jun 2021 • Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin

For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.

Caption Generation • Object • +1

Towards Diverse Paragraph Captioning for Untrimmed Videos

1 code implementation • CVPR 2021 • Yuqing Song, ShiZhe Chen, Qin Jin

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.

Descriptive • Event Detection

Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

1 code implementation • 22 Oct 2020 • Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin

Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting due to data scarcity.

Singing Voice Synthesis

YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos

1 code implementation • 12 Apr 2020 • Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin

The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos.

Action Understanding • Question Answering • +2

Better Captioning with Sequence-Level Exploration

no code implementations • CVPR 2020 • Jia Chen, Qin Jin

In this work, we show the limitation of the current sequence-level learning objective for captioning tasks both theoretically and empirically.

Image Captioning

Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

1 code implementation • CVPR 2020 • Shizhe Chen, Qin Jin, Peng Wang, Qi Wu

From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.

Attribute • Caption Generation • +1

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

4 code implementations • CVPR 2020 • Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu

To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.

Cross-Modal Retrieval • Retrieval • +3
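The hedged sketch below shows one way global-to-local matching can be aggregated: per-level similarities (event, action, entity) are computed and combined. The toy encodings and uniform level weights are assumptions; the paper's hierarchical graph reasoning components are not reproduced.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of global-to-local video-text matching: similarities are computed
# at an event (global), action, and entity (local) level and then aggregated.
# Encoders and weights are placeholders, not the HGR model's actual components.

def level_similarity(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity matrix between a batch of videos and a batch of texts."""
    return F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()


def hierarchical_similarity(video_levels, text_levels, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-level similarity matrices (event, action, entity)."""
    sims = [w * level_similarity(v, t) for w, v, t in zip(weights, video_levels, text_levels)]
    return sum(sims) / sum(weights)


if __name__ == "__main__":
    batch, dim = 4, 128
    video_levels = [torch.randn(batch, dim) for _ in range(3)]   # event / action / entity
    text_levels = [torch.randn(batch, dim) for _ in range(3)]
    sim = hierarchical_similarity(video_levels, text_levels)
    print(sim.shape)   # (4, 4): diagonal entries should rank highest after training
```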

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

no code implementations • 24 Nov 2019 • Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou

A storyboard is a sequence of images to illustrate a story containing multiple sentences, which has been a key process to create different story products.

Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

no code implementations • 15 Oct 2019 • Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu

This notebook paper presents our model in the VATEX video captioning challenge.

Video Captioning

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

no code implementations • 15 Aug 2019 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin

We employ self-supervision from a monolingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards.

Caption Generation • Image Captioning • +3

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann

The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.

Dense Captioning • Dense Video Captioning

From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

no code implementations • 3 Jun 2019 • Shizhe Chen, Qin Jin, Jianlong Fu

However, a picture tells a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders the learning of the translation model.

Machine Translation • Sentence • +2

Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data

no code implementations • 2 Jun 2019 • Shizhe Chen, Qin Jin, Alexander Hauptmann

The linguistic feature is learned from the sentence contexts with visual semantic constraints, which is beneficial to learn translation for words that are less visual-relevant.

Bilingual Lexicon Induction • Sentence • +2

RUC+CMU: System Report for Dense Captioning Events in Videos

no code implementations • 22 Jun 2018 • Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann

This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).

Caption Generation • Dense Captioning • +1

Multi-modal Conditional Attention Fusion for Dimensional Emotion Prediction

no code implementations • 4 Sep 2017 • Shizhe Chen, Qin Jin

Continuous dimensional emotion prediction is a challenging task where fusing various modalities, such as via early fusion or late fusion, usually achieves state-of-the-art performance.
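For context, the sketch below contrasts the two baseline strategies named in the snippet: early fusion (concatenate modality features, then predict) and late fusion (predict per modality, then combine). The tiny linear predictors are placeholders; the paper's conditional attention fusion is not shown.

```python
import numpy as np

# Hedged sketch contrasting early fusion and late fusion for dimensional
# emotion prediction. The linear "models" are stand-ins for trained predictors.

def early_fusion(audio_feat, visual_feat, weights):
    """Concatenate modality features first, then predict from the joint vector."""
    joint = np.concatenate([audio_feat, visual_feat])
    return float(weights @ joint)


def late_fusion(audio_feat, visual_feat, audio_weights, visual_weights, alpha=0.5):
    """Predict per modality first, then combine the predictions."""
    audio_pred = float(audio_weights @ audio_feat)
    visual_pred = float(visual_weights @ visual_feat)
    return alpha * audio_pred + (1.0 - alpha) * visual_pred


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, v = rng.normal(size=16), rng.normal(size=32)
    print(early_fusion(a, v, rng.normal(size=48)))
    print(late_fusion(a, v, rng.normal(size=16), rng.normal(size=32)))
```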

Video Captioning with Guidance of Multimodal Latent Topics

no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann

For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.

Caption Generation • Multi-Task Learning • +1
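A minimal, hedged sketch of the teacher-student topic prediction idea described above: a student predicts a topic distribution from stand-in multimodal features and is trained towards the mined teacher distribution with a KL-divergence loss. The dimensions and the linear student are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: distill mined (teacher) topic distributions into a student
# topic predictor operating on multimodal video features. Placeholder shapes.

def topic_distillation_loss(student_logits: torch.Tensor, teacher_topics: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher topic distributions and student predictions."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_topics, reduction="batchmean")


if __name__ == "__main__":
    batch, feat_dim, num_topics = 4, 256, 20
    student = torch.nn.Linear(feat_dim, num_topics)                      # toy student predictor
    video_feats = torch.randn(batch, feat_dim)                           # stand-in multimodal features
    teacher_topics = F.softmax(torch.randn(batch, num_topics), dim=-1)   # mined topic distribution
    loss = topic_distillation_loss(student(video_feats), teacher_topics)
    loss.backward()
    print(loss.item())
```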

Generating Video Descriptions with Topic Guidance

no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin

In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way based on training captions by an unsupervised topic mining model.

Image Captioning • Video Captioning

Improving Image Captioning by Concept-based Sentence Reranking

no code implementations • 3 May 2016 • Xirong Li, Qin Jin

This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task.

Image Captioning • Language Modelling • +1

Detecting Violence in Video using Subclasses

no code implementations • 27 Apr 2016 • Xirong Li, Yujia Huo, Jieping Xu, Qin Jin

We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses.

Adaptive Tag Selection for Image Annotation

no code implementations • 17 Sep 2014 • Xixi He, Xirong Li, Gang Yang, Jieping Xu, Qin Jin

The key insight is to divide the vocabulary into two disjoint subsets, namely a seen set consisting of tags having ground truth available for optimizing their thresholds and a novel set consisting of tags without any ground truth.

TAG
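To make the seen/novel split concrete, the hedged sketch below tunes a per-tag score threshold for seen tags against their ground truth and falls back to a global default for novel tags; the fallback rule is an assumption for illustration only, not the paper's adaptive strategy for novel tags.

```python
import numpy as np

# Hedged sketch of per-tag threshold selection over a seen / novel vocabulary
# split. Seen tags get thresholds tuned on their ground truth; novel tags
# (no ground truth) simply receive a default here.

def best_threshold(scores: np.ndarray, labels: np.ndarray, grid=np.linspace(0.05, 0.95, 19)) -> float:
    """Pick the score threshold that maximises F1 for one seen tag."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((labels == 1).sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-8)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


def select_thresholds(seen_data, novel_tags, default=0.5):
    """seen_data: {tag: (scores, labels)}; novel tags fall back to the default threshold."""
    thresholds = {tag: best_threshold(s, y) for tag, (s, y) in seen_data.items()}
    thresholds.update({tag: default for tag in novel_tags})
    return thresholds


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seen = {"dog": (rng.random(100), (rng.random(100) > 0.7).astype(int))}
    print(select_thresholds(seen, novel_tags=["axolotl"]))
```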
