no code implementations • 17 Sep 2014 • Xixi He, Xirong Li, Gang Yang, Jieping Xu, Qin Jin
The key insight is to divide the vocabulary into two disjoint subsets: a seen set of tags with ground truth available for optimizing their thresholds, and a novel set of tags without any ground truth.
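A minimal sketch of the seen-set half of this idea, assuming per-tag relevance scores, binary labels, and an F1 criterion over a threshold grid (the grid, criterion, and function name are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_seen_thresholds(scores, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, for each seen tag, the decision threshold that maximizes F1.

    scores: (n_images, n_seen_tags) predicted tag relevance scores
    labels: (n_images, n_seen_tags) binary ground truth
    """
    thresholds = np.zeros(scores.shape[1])
    for t in range(scores.shape[1]):
        f1s = [f1_score(labels[:, t], scores[:, t] >= th) for th in grid]
        thresholds[t] = grid[int(np.argmax(f1s))]
    return thresholds
```

How thresholds are then propagated to the novel set, which has no ground truth, is the paper's contribution and is not reproduced here.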
no code implementations • 27 Apr 2016 • Xirong Li, Yujia Huo, Jieping Xu, Qin Jin
We enrich the MediaEval 2015 violence dataset by manually labeling violent videos with respect to the subclasses.
no code implementations • 3 May 2016 • Xirong Li, Qin Jin
This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task.
no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin
In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from the training captions with an unsupervised topic mining model.
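As a rough illustration of the data-driven side, one could mine latent topics from training captions with an off-the-shelf topic model; LDA here is a stand-in assumption, not necessarily the paper's exact model:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

captions = [
    "a man is cooking pasta in a kitchen",
    "a dog runs after a ball in the park",
]  # training captions

# Bag-of-words over the captions, then an unsupervised topic model.
bow = CountVectorizer(stop_words="english").fit_transform(captions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(bow)  # per-caption topic distributions
```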
no code implementations • 31 Aug 2017 • Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from multimodal contents of videos.
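A minimal sketch of the teacher-student step, assuming the mined topic distributions serve as soft targets for a KL-divergence loss (the loss choice is an assumption):

```python
import torch.nn.functional as F

def topic_distillation_loss(student_logits, teacher_topic_dist):
    """KL divergence between the student's predicted topic distribution
    and the mined (teacher) topic distribution used as soft targets."""
    log_p = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_p, teacher_topic_dist, reduction="batchmean")
```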
no code implementations • 4 Sep 2017 • Shizhe Chen, Qin Jin
Continuous dimensional emotion prediction is a challenging task in which fusing multiple modalities, whether by early fusion or late fusion, usually achieves state-of-the-art performance.
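For readers unfamiliar with the two fusion schemes, a minimal sketch (the model handle, weights, and feature shapes are illustrative assumptions):

```python
import torch

def early_fusion(audio_feat, visual_feat, model):
    # Fuse at the feature level: concatenate modalities before prediction.
    return model(torch.cat([audio_feat, visual_feat], dim=-1))

def late_fusion(audio_pred, visual_pred, w=0.5):
    # Fuse at the decision level: combine per-modality predictions.
    return w * audio_pred + (1 - w) * visual_pred
```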
no code implementations • 22 Jun 2018 • Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann
This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).
no code implementations • 2 Jun 2019 • Shizhe Chen, Qin Jin, Alexander Hauptmann
The linguistic feature is learned from sentence contexts with visual semantic constraints, which is beneficial for learning translations of words that are less visually relevant.
no code implementations • 3 Jun 2019 • Shizhe Chen, Qin Jin, Jianlong Fu
However, a picture tells a thousand words: multi-lingual sentences pivoted by the same image are noisy as mutual translations, which hinders the learning of the translation model.
no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.
no code implementations • 15 Aug 2019 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin
We employ self-supervision from a monolingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards.
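A schematic of how such rewards might be combined; every handle and weight below is a hypothetical placeholder, not the paper's API:

```python
def total_reward(caption, image, lm_score, sent_match, concept_match,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of a fluency reward from a target-language LM and
    sentence-/concept-level visual relevancy rewards (weights assumed)."""
    fluency = lm_score(caption)                # self-supervised fluency reward
    r_sent = sent_match(caption, image)        # sentence-level relevancy
    r_concept = concept_match(caption, image)  # concept-level relevancy
    return alpha * fluency + beta * r_sent + gamma * r_concept
```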
no code implementations • 15 Oct 2019 • Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu
This notebook paper presents our model in the VATEX video captioning challenge.
no code implementations • IJCNLP 2019 • Weiying Wang, Yongcheng Wang, Shi-Zhe Chen, Qin Jin
Multimodal semantic comprehension, such as visual question answering and caption generation, has recently attracted increasing research interest.
no code implementations • 24 Nov 2019 • Shizhe Chen, Bei Liu, Jianlong Fu, Ruihua Song, Qin Jin, Pingping Lin, Xiaoyu Qi, Chunting Wang, Jin Zhou
A storyboard is a sequence of images that illustrates a story containing multiple sentences, and creating one has been a key step in producing different story products.
4 code implementations • CVPR 2020 • Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels.
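HGR matches at event, action, and entity levels; below is a minimal sketch of the level-wise aggregation only (the graph reasoning and attentive matching that produce the level embeddings are omitted, and the weights are assumptions):

```python
import torch.nn.functional as F

def hierarchical_score(video_levels, text_levels, weights=(1.0, 1.0, 1.0)):
    """Sum cosine similarities across global-to-local levels.

    video_levels / text_levels: lists of (d,) embeddings,
    e.g. [event, action, entity] representations.
    """
    score = 0.0
    for w, v, t in zip(weights, video_levels, text_levels):
        score = score + w * F.cosine_similarity(v, t, dim=0)
    return score
```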
1 code implementation • CVPR 2020 • Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure.
no code implementations • CVPR 2020 • Jia Chen, Qin Jin
In this work, we show the limitation of the current sequence-level learning objective for captioning tasks from both theoretical and empirical perspectives.
1 code implementation • 12 Apr 2020 • Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin
The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos.
no code implementations • 14 Jun 2020 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin
Detecting meaningful events in an untrimmed video is essential for dense video captioning.
Ranked #3 on Dense Video Captioning on ActivityNet Captions
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
1 code implementation • 22 Oct 2020 • Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin
Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well but are prone to over-fitting due to data scarcity.
2 code implementations • 11 Mar 2021 • Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, ShiZhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen
We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
Ranked #1 on Image Retrieval on RUC-CAS-WenLan
1 code implementation • CVPR 2021 • Yuqing Song, ShiZhe Chen, Qin Jin
Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.
1 code implementation • 11 Jun 2021 • Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin
For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.
no code implementations • 14 Jun 2021 • Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan YAO, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, Jun Zhu
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI).
1 code implementation • ACL 2021 • Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin
Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses.
1 code implementation • ACL 2021 • Jinming Zhao, Ruichen Li, Qin Jin
However, in real-world applications we often encounter the problem of missing modalities, and it is uncertain which modalities will be missing.
1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin
To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap).
1 code implementation • 4 Aug 2021 • Anwen Hu, ShiZhe Chen, Qin Jin
In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image.
1 code implementation • 25 Aug 2021 • Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
Firstly, there are many specialized jargon terms in the product description, which are ambiguous to translate without the product image.
no code implementations • 21 Sep 2021 • Ludan Ruan, Qin Jin
Inspired by the success of transformer-based pre-training methods on natural language tasks and, more recently, computer vision tasks, researchers have begun to apply transformers to video processing.
no code implementations • 27 Oct 2021 • Jinming Zhao, Ruichen Li, Qin Jin, Xinchao Wang, Haizhou Li
Multimodal emotion recognition research is hindered by the lack of labelled corpora in terms of scale and diversity, due to the high annotation cost and label ambiguity.
no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin
Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.
1 code implementation • 9 Feb 2022 • Linli Yao, Weiying Wang, Qin Jin
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.
no code implementations • 24 Mar 2022 • Liyu Meng, Yuchen Liu, Xiaolong Liu, Zhaopei Huang, Yuan Cheng, Meng Wang, Chuanhe Liu, Qin Jin
In this paper, we briefly introduce our submission to the Valence-Arousal Estimation Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) competition.
no code implementations • 26 Mar 2022 • Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, Lingxiao Huang, Zheng Liang, HuaWei Shen, HUI ZHANG, Quanshi Zhang, Qingxiu Dong, Zhixing Tan, Mingxuan Wang, Shuo Wang, Long Zhou, Haoran Li, Junwei Bao, Yingwei Pan, Weinan Zhang, Zhou Yu, Rui Yan, Chence Shi, Minghao Xu, Zuobai Zhang, Guoqiang Wang, Xiang Pan, Mengjie Li, Xiaoyu Chu, Zijun Yao, Fangwei Zhu, Shulin Cao, Weicheng Xue, Zixuan Ma, Zhengyan Zhang, Shengding Hu, Yujia Qin, Chaojun Xiao, Zheni Zeng, Ganqu Cui, Weize Chen, Weilin Zhao, Yuan YAO, Peng Li, Wenzhao Zheng, Wenliang Zhao, Ziyi Wang, Borui Zhang, Nanyi Fei, Anwen Hu, Zenan Ling, Haoyang Li, Boxi Cao, Xianpei Han, Weidong Zhan, Baobao Chang, Hao Sun, Jiawen Deng, Chujie Zheng, Juanzi Li, Lei Hou, Xigang Cao, Jidong Zhai, Zhiyuan Liu, Maosong Sun, Jiwen Lu, Zhiwu Lu, Qin Jin, Ruihua Song, Ji-Rong Wen, Zhouchen Lin, LiWei Wang, Hang Su, Jun Zhu, Zhifang Sui, Jiajun Zhang, Yang Liu, Xiaodong He, Minlie Huang, Jian Tang, Jie Tang
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm.
no code implementations • 31 Mar 2022 • Shuai Guo, Jiatong Shi, Tao Qian, Shinji Watanabe, Qin Jin
Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing of better quality than conventional statistical parametric methods.
no code implementations • 24 Apr 2022 • Yida Zhao, Yuqing Song, Qin Jin
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities.
1 code implementation • ACL 2022 • Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, Haizhou Li
In this work, we propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED, which contains 990 dyadic emotional dialogues from 56 different TV series, with a total of 9,082 turns and 24,449 utterances.
Cultural Vocal Bursts Intensity Prediction • Emotion Recognition
no code implementations • 29 May 2022 • Liang Zhang, Anwen Hu, Qin Jin
Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models.
1 code implementation • 16 Jul 2022 • Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin
In this paper, we propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture, which dynamically adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples.
Ranked #8 on Video Retrieval on VATEX
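A simplified sketch of the token-shift half of the TS2-Net idea above, moving a few tokens between adjacent frames so each frame's sequence carries temporal context (the token-selection module and the exact shift pattern are not reproduced):

```python
import torch

def temporal_token_shift(x, n_shift):
    """x: (batch, frames, tokens, dim). Move the first n_shift tokens of
    each frame one step forward in time, and the next n_shift one step back."""
    out = x.clone()
    out[:, 1:, :n_shift] = x[:, :-1, :n_shift]
    out[:, :-1, n_shift:2 * n_shift] = x[:, 1:, n_shift:2 * n_shift]
    return out
```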
1 code implementation • 18 Jul 2022 • Qi Zhang, Yuqing Song, Qin Jin
Dense video captioning aims to generate corresponding text descriptions for a series of events in the untrimmed video, which can be divided into two sub-tasks, event detection and event captioning.
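The two sub-tasks compose into a simple pipeline; the handles below are hypothetical placeholders for whatever detector and captioner are used:

```python
def dense_caption(video, detect_events, caption_event):
    """Detect event segments, then caption each segment."""
    events = detect_events(video)  # [(start, end), ...]
    return [(s, e, caption_event(video, s, e)) for (s, e) in events]
```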
1 code implementation • 19 Jul 2022 • Tenggan Zhang, Chuanhe Liu, Xiaolong Liu, Yuchen Liu, Liyu Meng, Lei Sun, Wenqiang Jiang, Fengyuan Zhang, Jinming Zhao, Qin Jin
This paper presents our system for the Multi-Task Learning (MTL) Challenge in the 4th Affective Behavior Analysis in-the-wild (ABAW) competition.
no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
In this paper, we present the technical report for the Ego4D Natural Language Query challenge at CVPR 2022.
1 code implementation • 17 Nov 2022 • Linli Yao, Weijing Chen, Qin Jin
Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g., multimodal retrieval and recommendation.
1 code implementation • CVPR 2023 • Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo
To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders.
no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin
Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.
1 code implementation • 14 Jan 2023 • Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin, Zhiwu Lu
Experimental results indicate that models incorporating large language models (LLMs) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall.
1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
1 code implementation • 19 Apr 2023 • Liang Zhang, Anwen Hu, Jing Zhang, Shuo Hu, Qin Jin
Taking into account the length of product manuals and the fact that a question always relates to a small number of pages, MPMQA naturally splits into two subtasks: retrieving the most relevant pages and then generating multimodal answers.
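A minimal sketch of that two-stage split, assuming precomputed page embeddings and a generation handle (both are illustrative assumptions, not the paper's API):

```python
import numpy as np

def answer_manual_question(q_emb, page_embs, pages, generate, top_k=3):
    """Retrieve the most relevant pages by cosine similarity,
    then generate a multimodal answer from them."""
    sims = page_embs @ q_emb
    sims = sims / (np.linalg.norm(page_embs, axis=1) * np.linalg.norm(q_emb))
    top = np.argsort(-sims)[:top_k]
    return generate(q_emb, [pages[i] for i in top])
```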
1 code implementation • 21 Apr 2023 • Weijing Chen, Linli Yao, Qin Jin
The reason is that a large proportion of the images and texts in the benchmarks are coarse-grained.
1 code implementation • 28 Apr 2023 • Jieting Chen, Junkai Ding, Wenping Chen, Qin Jin
Live video commenting is popular on video media platforms, as it can create a chatting atmosphere and provide supplementary information for users while watching videos.
1 code implementation • 10 May 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin
Existing metrics only provide a single score to measure caption quality, which is less explainable and informative.
no code implementations • 15 May 2023 • Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin
In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests.
1 code implementation • 20 May 2023 • Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin
Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking.
no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
no code implementations • 31 Jul 2023 • Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin
To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training.
no code implementations • ICCV 2023 • Anwen Hu, ShiZhe Chen, Liang Zhang, Qin Jin
To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints.
2 code implementations • 8 Oct 2023 • Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, Fei Huang
Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs.
1 code implementation • 22 Feb 2024 • Zihao Yue, Liang Zhang, Qin Jin
In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner, leading to continued outputs beyond visual perception limits.
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.
1 code implementation • 19 Mar 2024 • Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs.
1 code implementation • 20 Apr 2024 • Zihao Yue, Yepeng Zhang, Ziheng Wang, Qin Jin
Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences.
no code implementations • 23 Apr 2024 • Qingrong He, Kejun Lin, ShiZhe Chen, Anwen Hu, Qin Jin
The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules.
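Schematically, the Think-then-Program loop might look like the following; all handles are hypothetical placeholders rather than the paper's actual modules:

```python
def think_then_program(question, decompose, ground_step, modules):
    """Think: split a compositional question into steps.
    Program: ground each step to a call into a 3D perception module."""
    steps = decompose(question)          # e.g. ["find chairs", "count them"]
    result = None
    for step in steps:
        fn_name, args = ground_step(step, result)
        result = modules[fn_name](*args)  # run a 3D visual perception module
    return result
```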
1 code implementation • 25 Apr 2024 • Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang
Charts are important for presenting and explaining complex data relationships.
no code implementations • COLING 2022 • Yuchen Liu, Jinming Zhao, Jingwen Hu, Ruichen Li, Qin Jin
Emotion Recognition in Conversation (ERC) has attracted increasing attention in the affective computing research field.
no code implementations • Findings (EMNLP) 2021 • Jia Chen, Yike Wu, Shiwan Zhao, Qin Jin
Our analysis of caption models with SC loss shows that the performance degradation is caused by the increasingly noisy estimation of reward and baseline with fewer language resources.