1 code implementation • 3 Apr 2025 • Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li
Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset.
no code implementations • 17 Mar 2025 • Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong
Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements.
1 code implementation • 3 Mar 2025 • Lu Dai, Yijie Xu, Jinhui Ye, Hao Liu, Hui Xiong
Large Language Models (LLMs) have demonstrated improved generation performance by incorporating externally retrieved knowledge, a process known as retrieval-augmented generation (RAG).
1 code implementation • 23 May 2024 • Jinhui Ye, Xing Wang, Wenxiang Jiao, Junwei Liang, Hui Xiong
In this paper, we identify a representation density problem that could be a bottleneck in restricting the performance of gloss-free SLT.
Contrastive Learning
Gloss-free Sign Language Translation
no code implementations • 29 Nov 2023 • Jinhui Ye, Jiaming Zhou, Hui Xiong, Junwei Liang
Specifically, at the core of GeoDeformer is the Geometric Deformation Predictor, a module designed to identify and quantify potential spatial and temporal geometric deformations within the given video.
no code implementations • 19 Aug 2023 • Jinhui Ye, Junwei Liang
This paper studies how to introduce viewpoint-invariant feature representations into existing action recognition architectures.
1 code implementation • 18 May 2023 • Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Hui Xiong
To tackle these challenges, we propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation (i.e., video-to-text) by exploiting pseudo gloss-text pairs from the sign gloss translation model.
Ranked #4 on Sign Language Translation on CSL-Daily
1 code implementation • 13 Oct 2022 • Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu
In this paper, to overcome this limitation, we propose a Prompt-based domain text Generation (PGEN) approach to produce large-scale in-domain spoken language text data.