no code implementations • NAACL (SIGTYP) 2022 • Sihan Chen, Richard Futrell, Kyle Mahowald
Using data from Nintemann et al. (2020), we explore the variability in complexity and informativity across spatial demonstrative systems, drawing on spatial deictic lexicons from 223 languages.
1 code implementation • 11 Dec 2024 • Sihan Chen, Zhuangzhuang Qian, Wingchun Siu, Xingcan Hu, Jiaqi Li, Shawn Li, Yuehan Qin, Tiankai Yang, Zhuo Xiao, Wanghao Ye, Yichi Zhang, Yushun Dong, Yue Zhao
Outlier detection (OD), also known as anomaly detection, is a critical machine learning (ML) task with applications in fraud detection, network intrusion detection, clickstream analysis, recommendation systems, and social network moderation.
no code implementations • 18 Jun 2024 • Wuxinlin Cheng, Cheng Wan, Yupeng Cao, Sihan Chen
RITA is a high-quality real-time interactive framework built upon generative models and designed with practical applications in mind.
no code implementations • 18 May 2024 • Yichen Yan, Xingjian He, Sihan Chen, Shichen Lu, Jing Liu
Referring Image Segmentation (RIS) aims to segment an object described in natural language from an image, the main challenge being to establish text-to-pixel correlation.
1 code implementation • 15 Apr 2024 • Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Hongyu An, Xinfeng Zhang, Zhiyuan Song, Ziyue Dong, Qing Zhao, Xiaogang Xu, Pengxu Wei, Zhi-chao Dou, Gui-ling Wang, Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, Cansu Korkmaz, A. Murat Tekalp, Yubin Wei, Xiaole Yan, Binren Li, Haonan Chen, Siqi Zhang, Sihan Chen, Amogh Joshi, Nikhil Akalwadi, Sampada Malagi, Palani Yashaswini, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, Uma Mudenagudi, Anjali Sarvaiya, Pooja Choksy, Jagrit Joshi, Shubh Kawa, Kishor Upla, Sushrut Patwardhan, Raghavendra Ramachandra, Sadat Hossain, Geongi Park, S. M. Nadim Uddin, Hao Xu, Yanhui Guo, Aman Urumbekov, Xingzhuo Yan, Wei Hao, Minghan Fu, Isaac Orais, Samuel Smith, Ying Liu, Wangwang Jia, Qisheng Xu, Kele Xu, Weijun Yuan, Zhan Li, Wenqin Kuang, Ruijin Guan, Ruting Deng, Zhao Zhang, Bo wang, Suiyi Zhao, Yan Luo, Yanyan Wei, Asif Hussain Khan, Christian Micheloni, Niki Martinel
This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained.
no code implementations • 12 Apr 2024 • Yichen Yan, Xingjian He, Sihan Chen, Jing Liu
In this paper, we introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder.
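As a rough, hypothetical sketch (layer count, dimensions, and module names are assumptions, not CRFormer's released code), iterative calibration of this kind can be pictured as a stack of decoder layers in which query features alternately cross-attend to visual and linguistic tokens:

```python
import torch
import torch.nn as nn

class CalibrationLayer(nn.Module):
    """One decoder layer: queries attend to visual tokens, then to text tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, vis_tokens, txt_tokens):
        q = self.norm1(queries + self.vis_attn(queries, vis_tokens, vis_tokens)[0])
        q = self.norm2(q + self.txt_attn(q, txt_tokens, txt_tokens)[0])
        return self.norm3(q + self.ffn(q))

layers = nn.ModuleList(CalibrationLayer() for _ in range(3))
queries = torch.randn(2, 16, 256)                        # learnable queries in practice
vis, txt = torch.randn(2, 400, 256), torch.randn(2, 20, 256)
for layer in layers:                                      # iterative calibration across layers
    queries = layer(queries, vis, txt)
print(queries.shape)                                      # torch.Size([2, 16, 256])
```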
no code implementations • 20 Mar 2024 • Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu
Extensive experiments on diverse multimodal benchmarks show the effectiveness of the proposed VL-Mamba, with competitive performance, and demonstrate the great potential of applying state space models to multimodal learning tasks; see the sketch after the leaderboard entry below.
Ranked #175 on Visual Question Answering on MM-Vet
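A minimal sketch of the underlying idea, under loose assumptions (the recurrence below is a plain linear state-space layer, not the selective Mamba scan, and all dimensions are illustrative): visual tokens are projected into the language embedding space, concatenated with text tokens, and processed by a state-space backbone.

```python
import torch
import torch.nn as nn

class LinearSSM(nn.Module):
    """Per-channel linear state-space recurrence (illustrative, not Mamba)."""
    def __init__(self, dim, state=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, state) * 0.01)
        self.B = nn.Parameter(torch.randn(dim, state) * 0.01)
        self.C = nn.Parameter(torch.randn(dim, state) * 0.01)

    def forward(self, u):                                  # u: (batch, length, dim)
        batch, length, dim = u.shape
        h = torch.zeros(batch, dim, self.A.shape[1], device=u.device)
        ys = []
        for t in range(length):                            # sequential scan for clarity
            h = h * torch.sigmoid(self.A) + u[:, t, :, None] * self.B
            ys.append((h * self.C).sum(-1))
        return torch.stack(ys, dim=1)                      # (batch, length, dim)

embed_dim = 512
vision_proj = nn.Linear(768, embed_dim)                    # maps vision-encoder features
backbone = LinearSSM(embed_dim)
vis_tokens = vision_proj(torch.randn(1, 196, 768))         # projected image patches
txt_tokens = torch.randn(1, 32, embed_dim)                 # text token embeddings
out = backbone(torch.cat([vis_tokens, txt_tokens], dim=1))
print(out.shape)                                           # torch.Size([1, 228, 512])
```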
1 code implementation • NeurIPS 2023 • Mingzhen Sun, Weining Wang, Zihan Qin, Jiahui Sun, Sihan Chen, Jing Liu
Specifically, we propose a video auto-encoder in which a video encoder encodes videos into global features and a video decoder, built on a diffusion model, decodes those features and synthesizes video frames in a non-autoregressive manner.
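A structural sketch under stated assumptions (toy module sizes and simplified conditioning; not the paper's implementation): the encoder compresses a clip into a single global feature, and a diffusion-style decoder predicts the noise for all frames jointly, so frames are generated non-autoregressively.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Compress a clip (B, 3, T, H, W) into one global feature vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(3, 64, 3, stride=2, padding=1),
                                 nn.GELU(),
                                 nn.Conv3d(64, dim, 3, stride=2, padding=1),
                                 nn.AdaptiveAvgPool3d(1))
    def forward(self, video):
        return self.net(video).flatten(1)                  # (B, dim)

class DenoisingDecoder(nn.Module):
    """Predict noise for all frames at once, conditioned on the global feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.cond = nn.Linear(dim + 1, 64)                 # global feature + timestep
        self.net = nn.Sequential(nn.Conv3d(3 + 64, 64, 3, padding=1),
                                 nn.GELU(),
                                 nn.Conv3d(64, 3, 3, padding=1))
    def forward(self, noisy_video, global_feat, t):
        B, _, T, H, W = noisy_video.shape
        c = self.cond(torch.cat([global_feat, t[:, None]], dim=1))
        c = c[:, :, None, None, None].expand(B, -1, T, H, W)
        return self.net(torch.cat([noisy_video, c], dim=1))  # predicted noise

enc, dec = VideoEncoder(), DenoisingDecoder()
video = torch.randn(1, 3, 8, 32, 32)
z = enc(video)
noise_pred = dec(torch.randn_like(video), z, torch.rand(1))
print(noise_pred.shape)                                    # all frames denoised in parallel
```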
no code implementations • 18 Aug 2023 • Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu
Our method harnesses the potential of the multi-modal features in the segmentation stage and aligns language features of different emphases with image features to achieve fine-grained text-to-pixel correlation.
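A minimal sketch of the general text-to-pixel correlation idea (the dimensions and the single sentence embedding are simplifying assumptions; the paper's alignment of language features with different emphases is not reproduced here): project a sentence embedding and per-pixel visual features into a shared space and take their cosine similarity as a response map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPixelCorrelation(nn.Module):
    """Cosine similarity between a sentence embedding and every pixel feature."""
    def __init__(self, vis_dim=256, txt_dim=512, shared_dim=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, shared_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, C_v, H, W), txt_feat: (B, C_t)
        v = F.normalize(self.vis_proj(vis_feat), dim=1)    # (B, D, H, W)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)   # (B, D)
        corr = torch.einsum("bdhw,bd->bhw", v, t)          # per-pixel cosine score
        return corr.unsqueeze(1)                           # (B, 1, H, W) response map

model = TextToPixelCorrelation()
out = model(torch.randn(2, 256, 30, 30), torch.randn(2, 512))
print(out.shape)                                           # torch.Size([2, 1, 30, 30])
```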
1 code implementation • 15 Jun 2023 • Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu
Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visual semantic representations while disregarding temporal semantic representations and correlations.
Ranked #1 on TGIF-Frame on TGIF-QA (using extra training data)
2 code implementations • NeurIPS 2023 • Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).
Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)
1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
We show that only language-paired two-modality data is sufficient to connect all modalities.
no code implementations • IEEE Transactions on Instrumentation and Measurement 2023 • Lianqing Zheng, Sen Li, Bin Tan, Long Yan, Sihan Chen, Libo Huang, Jie Bai, Xichan Zhu, Zhixiong Ma
Meanwhile, in the 4-D radar stream, a newly designed component named radar PillarNet efficiently encodes the radar features to generate radar pseudo-images, which are fed into the point cloud backbone to create radar BEV features; see the pillar-encoding sketch after the leaderboard entry below.
Ranked #6 on 3D Object Detection (RoI) on View-of-Delft (val)
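A rough sketch of the pillar-style radar encoding mentioned above (grid size, feature width, and the pooling choice are assumptions, not the released PillarNet code): points are binned into BEV pillars, features are pooled per pillar, and the result is scattered into a pseudo-image that a standard 2D backbone can consume.

```python
import torch
import torch.nn as nn

def radar_to_pseudo_image(points, feat_dim=32, grid=(128, 128), extent=50.0):
    """points: (N, C) radar points, columns like [x, y, z, doppler, rcs, ...]."""
    mlp = nn.Sequential(nn.Linear(points.shape[1], feat_dim), nn.ReLU())
    with torch.no_grad():                                    # sketch: no training loop here
        feats = mlp(points)                                  # per-point features
        # assign each point to a BEV pillar
        ix = ((points[:, 0] + extent) / (2 * extent) * grid[0]).long().clamp(0, grid[0] - 1)
        iy = ((points[:, 1] + extent) / (2 * extent) * grid[1]).long().clamp(0, grid[1] - 1)
        flat = ix * grid[1] + iy
        canvas = torch.zeros(grid[0] * grid[1], feat_dim)
        canvas.index_reduce_(0, flat, feats, reduce="amax")  # max-pool features per pillar
    return canvas.view(*grid, feat_dim).permute(2, 0, 1)     # (C, H, W) pseudo-image

pseudo_image = radar_to_pseudo_image(torch.randn(1024, 6) * 10.0)
print(pseudo_image.shape)      # torch.Size([32, 128, 128]); feed into a 2D BEV backbone
```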
no code implementations • 22 May 2023 • Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng
Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
Ranked #1 on Visual Question Answering (VQA) on MSVD-QA (using extra training data)
1 code implementation • 19 May 2023 • Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu
In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
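One way to picture such a generate-and-filter pipeline, as a hypothetical sketch only (the model interface and the consistency filter below are placeholders, not the released JADE code):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    image_id: str
    question: str
    answer: str

def generate_candidates(pairs, vlm):
    """pairs: iterable of (image_id, caption). `vlm` is any pretrained multimodal
    model exposing .generate_qa(caption) and .answer(image_id, question) -- both
    hypothetical method names used for illustration."""
    for image_id, caption in pairs:
        for question, answer in vlm.generate_qa(caption):
            yield Candidate(image_id, question, answer)

def filter_candidates(candidates, vlm):
    """Keep only QA pairs the model answers consistently (a round-trip check)."""
    kept = []
    for c in candidates:
        if vlm.answer(c.image_id, c.question).strip().lower() == c.answer.strip().lower():
            kept.append(c)
    return kept
```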
1 code implementation • 17 Apr 2023 • Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang
Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
Ranked #1 on Video Captioning on VATEX (using extra training data)
1 code implementation • 29 Mar 2023 • Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, Jing Liu
In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals.
1 code implementation • 28 Apr 2022 • Lianqing Zheng, Zhixiong Ma, Xichan Zhu, Bin Tan, Sen Li, Kai Long, Weiqi Sun, Sihan Chen, Lu Zhang, Mengyue Wan, Libo Huang, Jie Bai
The next-generation high-resolution automotive radar (4D radar) can provide additional elevation measurement and denser point clouds, which have great potential for 3D sensing in autonomous driving.
no code implementations • 26 Jan 2021 • Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
In addition, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder, made possible by the full Transformer architecture.
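A hedged illustration of how such maps can be extracted (module sizes and the 14x14 patch grid are assumptions, not the paper's code): encoder self-attention gives patch-to-patch weights, while decoder cross-attention with caption tokens as queries gives the "words-to-patches" maps.

```python
import torch
import torch.nn as nn

dim, heads, grid = 256, 8, 14                  # 14x14 = 196 patches (assumed)
patches = torch.randn(1, grid * grid, dim)     # patch embeddings from the encoder
words = torch.randn(1, 12, dim)                # caption token embeddings

self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

# patch-to-patch self-attention (weights averaged over heads by default)
_, patch_attn = self_attn(patches, patches, patches, need_weights=True)
# "words-to-patches" attention: queries are words, keys/values are patches
_, word_attn = cross_attn(words, patches, patches, need_weights=True)

print(patch_attn.shape)                        # torch.Size([1, 196, 196])
print(word_attn.view(1, 12, grid, grid).shape) # per-word heat map over the patch grid
```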
no code implementations • 26 Jan 2021 • Sihan Chen, Xinxin Zhu, Wei Liu, Xingjian He, Jing Liu
Depth information matters in the RGB-D semantic segmentation task, as it provides additional geometric cues that complement color images.