no code implementations • NAACL (SIGTYP) 2022 • Sihan Chen, Richard Futrell, Kyle Mahowald
Using data from Nintemann et al. (2020), we explore the variability in complexity and informativity across spatial demonstrative systems, drawing on spatial deictic lexicons from 223 languages.
no code implementations • 18 Aug 2023 • Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu
In previous approaches, fused vision-language features are fed directly into a decoder and passed through a convolution with a fixed kernel to obtain the result, following a pattern similar to traditional image segmentation.
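As a rough illustration of that baseline pattern (not the authors' model), the minimal PyTorch sketch below decodes fused vision-language features and applies a fixed, input-independent convolution kernel to produce the segmentation mask; all module names, dimensions, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FixedKernelDecoder(nn.Module):
    """Sketch: decode fused vision-language features with a fixed
    (input-independent) convolution kernel, as in conventional
    referring-segmentation baselines. Dims/shapes are assumptions."""
    def __init__(self, fused_dim=512, hidden_dim=256):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(fused_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Fixed 1x1 kernel: the same learned weights are applied to every
        # image-text pair, regardless of the referring expression.
        self.classifier = nn.Conv2d(hidden_dim, 1, kernel_size=1)

    def forward(self, fused_feats):           # (B, fused_dim, H, W)
        x = self.decode(fused_feats)
        return self.classifier(x)             # (B, 1, H, W) mask logits

mask_logits = FixedKernelDecoder()(torch.randn(2, 512, 30, 30))
```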
2 code implementations • 15 Jun 2023 • Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu
Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and focus primarily on modeling visual semantic representations while disregarding temporal semantic representations and correlations.
1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
We show that language-paired two-modality data alone is sufficient to connect all modalities.
no code implementations • 22 May 2023 • Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng
Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks (a rough sketch of the feature-adapting idea follows this entry).
Ranked #1 on TGIF-Frame on TGIF-QA (using extra training data)
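A minimal sketch of the general idea of adapting frozen per-frame (CLIP-style) image features to video by adding a lightweight temporal module; this illustrates feature adapting under assumed dimensions and layer choices, not VLAB's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Sketch: adapt per-frame image features to video by adding
    lightweight temporal self-attention over the frame axis.
    Dimensions and layers are assumptions, not VLAB's design."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):            # (B, T, dim) per-frame features
        attended, _ = self.temporal_attn(frame_feats, frame_feats, frame_feats)
        video_feats = self.norm(frame_feats + attended)   # residual adaptation
        return video_feats.mean(dim=1)         # (B, dim) pooled video embedding

video_emb = TemporalAdapter()(torch.randn(2, 8, 768))     # 8 sampled frames
```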
1 code implementation • 19 May 2023 • Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu
In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
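The sketch below illustrates a generic generate-and-filter loop for building VQA data from crawled image-text pairs; `generate_qa` and `score_consistency` are hypothetical stand-ins for a pretrained multimodal model's interfaces, not JADE's actual API.

```python
def build_synthetic_vqa(image_text_pairs, generate_qa, score_consistency,
                        threshold=0.8):
    """Sketch: generate candidate QA pairs from image-caption pairs and keep
    only those the model judges consistent with the image (names/threshold
    are illustrative assumptions)."""
    dataset = []
    for image, caption in image_text_pairs:
        # Generate candidate question-answer pairs conditioned on the image
        # and its crawled caption.
        for question, answer in generate_qa(image, caption):
            # Filter: keep pairs scored as consistent with the image.
            if score_consistency(image, question, answer) >= threshold:
                dataset.append({"image": image, "question": question,
                                "answer": answer})
    return dataset

# Toy usage with trivial stand-ins:
pairs = [("img_001.jpg", "a dog playing with a ball")]
toy_gen = lambda img, cap: [("What is the dog playing with?", "a ball")]
toy_score = lambda img, q, a: 0.9
print(build_synthetic_vqa(pairs, toy_gen, toy_score))
```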
1 code implementation • 17 Apr 2023 • Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu
Unlike widely studied vision-language pretraining models, VALOR jointly models relationships among vision, audio, and language in an end-to-end manner (a rough sketch of such tri-modal fusion follows this entry).
Ranked #1 on Video Captioning on VATEX (using extra training data)
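A minimal sketch of what jointly modeling three modalities end-to-end can look like: encode each modality, then let a shared Transformer attend across the concatenated sequences. Encoders, feature sizes, and layer counts here are placeholders, not VALOR's design.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Sketch: project vision, audio, and text features to a shared width
    and fuse them with one Transformer so all three modalities interact
    end-to-end. Sizes are illustrative assumptions."""
    def __init__(self, dim=512):
        super().__init__()
        self.vision_proj = nn.Linear(1024, dim)   # e.g. video-frame features
        self.audio_proj = nn.Linear(128, dim)     # e.g. spectrogram features
        self.text_proj = nn.Linear(768, dim)      # e.g. token embeddings
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vision, audio, text):       # (B, T_v/T_a/T_t, feat_dim)
        tokens = torch.cat([self.vision_proj(vision),
                            self.audio_proj(audio),
                            self.text_proj(text)], dim=1)
        return self.fusion(tokens)                # joint multimodal sequence

out = TriModalFusion()(torch.randn(2, 16, 1024),
                       torch.randn(2, 32, 128),
                       torch.randn(2, 20, 768))
```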
1 code implementation • 29 Mar 2023 • Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, Jing Liu
In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals.
1 code implementation • 28 Apr 2022 • Lianqing Zheng, Zhixiong Ma, Xichan Zhu, Bin Tan, Sen Li, Kai Long, Weiqi Sun, Sihan Chen, Lu Zhang, Mengyue Wan, Libo Huang, Jie Bai
The next-generation high-resolution automotive radar (4D radar) can provide additional elevation measurements and denser point clouds, which have great potential for 3D sensing in autonomous driving.
no code implementations • 26 Jan 2021 • Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
In addition, thanks to the full Transformer architecture, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder.
no code implementations • 26 Jan 2021 • Sihan Chen, Xinxin Zhu, Wei Liu, Xingjian He, Jing Liu
Depth information matters in the RGB-D semantic segmentation task because it provides additional geometric information that complements color images.
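As a simple illustration of using depth as extra geometric input (not the paper's model), the sketch below fuses RGB and depth feature maps additively before a per-pixel classification head; backbones and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class RGBDFusion(nn.Module):
    """Sketch: inject depth-derived features into RGB features for semantic
    segmentation via simple additive fusion. Stems and channel sizes are
    illustrative assumptions, not the paper's architecture."""
    def __init__(self, channels=64, num_classes=40):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.depth_stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, rgb, depth):                # (B,3,H,W), (B,1,H,W)
        fused = torch.relu(self.rgb_stem(rgb) + self.depth_stem(depth))
        return self.head(fused)                   # per-pixel class logits

logits = RGBDFusion()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
```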