no code implementations • 22 Dec 2023 • Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma
In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
no code implementations • 10 Jul 2023 • Gangwoo Kim, Hajung Kim, Lei Ji, Seongsu Bae, Chanhwi Kim, Mujeen Sung, Hyunjae Kim, Kun Yan, Eric Chang, Jaewoo Kang
In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain.
1 code implementation • 27 Jun 2023 • Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou
Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.
1 code implementation • CVPR 2023 • Kun Yan, Xiao Li, Fangyun Wei, Jinglu Wang, Chenbin Zhang, Ping Wang, Yan Lu
The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data.
no code implementations • 16 Nov 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This technical report describes the CONE approach for Ego4D Natural Language Queries (NLQ) Challenge in ECCV 2022.
no code implementations • 10 Oct 2022 • Kun Yan, Lei Ji, Chenfei Wu, Jian Liang, Ming Zhou, Nan Duan, Shuai Ma
Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds.
1 code implementation • 22 Sep 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This paper tackles an emerging and challenging problem of long video temporal grounding~(VTG) that localizes video moments related to a natural language (NL) query.
no code implementations • 2 Dec 2021 • Kun Yan, Chenbin Zhang, Jun Hou, Ping Wang, Zied Bouraoui, Shoaib Jameel, Steven Schockaert
A key feature of the multi-label setting is that images often have multiple labels, which typically refer to different regions of the image.
no code implementations • ACL 2021 • Kun Yan, Lei Ji, Huaishao Luo, Ming Zhou, Nan Duan, Shuai Ma
Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.
Ranked #1 on Image Captioning on Localized Narratives
no code implementations • 19 Jul 2021 • Zhenyu Guo, Shuai Zheng, Zhizhe Liu, Kun Yan, Zhenfeng Zhu
Treatment effect estimation, which refers to the estimation of causal effects and aims to measure the strength of the causal relationship, is of great importance in many fields but is a challenging problem in practice.
no code implementations • 21 May 2021 • Kun Yan, Zied Bouraoui, Ping Wang, Shoaib Jameel, Steven Schockaert
While the use of class names has already been explored in previous work, our approach differs in two key aspects.
no code implementations • 1 Feb 2021 • Kun Yan, Zied Bouraoui, Ping Wang, Shoaib Jameel, Steven Schockaert
The aim of few-shot learning (FSL) is to learn how to recognize image categories from a small number of training examples.
2 code implementations • NeurIPS 2019 • Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, Ping Wang
When the scaling factor is set to zero, it is equivalent to removing the corresponding filter.