1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
1 code implementation • 18 Jul 2022 • Qi Zhang, Yuqing Song, Qin Jin
Dense video captioning aims to generate corresponding text descriptions for a series of events in the untrimmed video, which can be divided into two sub-tasks, event detection and event captioning.
no code implementations • 24 Jun 2022 • Yuqing Song
Contours are defined on a continuous scalar field.
no code implementations • 24 Apr 2022 • Yida Zhao, Yuqing Song, Qin Jin
Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities.
1 code implementation • 25 Aug 2021 • Yuqing Song, ShiZhe Chen, Qin Jin, Wei Luo, Jun Xie, Fei Huang
Firstly, there are many specialized jargon terms in the product description, which are ambiguous to translate without the product image.
1 code implementation • 11 Jun 2021 • Ludan Ruan, Jieting Chen, Yuqing Song, ShiZhe Chen, Qin Jin
For the object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful.
1 code implementation • CVPR 2021 • Yuqing Song, ShiZhe Chen, Qin Jin
Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs.
2 code implementations • 11 Mar 2021 • Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li, Junyi Li, Peiyu Liu, Zheng Gong, Chuhao Jin, Yuchong Sun, ShiZhe Chen, Zhiwu Lu, Zhicheng Dou, Qin Jin, Yanyan Lan, Wayne Xin Zhao, Ruihua Song, Ji-Rong Wen
We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
Ranked #1 on Image Retrieval on RUC-CAS-WenLan
no code implementations • 14 Jun 2020 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin
Detecting meaningful events in an untrimmed video is essential for dense video captioning.
Ranked #3 on Dense Video Captioning on ActivityNet Captions
no code implementations • 15 Oct 2019 • Shizhe Chen, Yida Zhao, Yuqing Song, Qin Jin, Qi Wu
This notebook paper presents our model in the VATEX video captioning challenge.
no code implementations • 15 Aug 2019 • Yuqing Song, Shi-Zhe Chen, Yida Zhao, Qin Jin
We employ self-supervision from a monolingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards.
no code implementations • 11 Jul 2019 • Shizhe Chen, Yuqing Song, Yida Zhao, Qin Jin, Zhaoyang Zeng, Bei Liu, Jianlong Fu, Alexander Hauptmann
The overall system achieves the state-of-the-art performance on the dense-captioning events in video task with a 9.91 METEOR score on the challenge testing set.
no code implementations • 22 Jun 2018 • Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, Alexander Hauptmann
This notebook paper presents our system in the ActivityNet Dense Captioning in Video task (task 3).