no code implementations • 23 Jan 2024 • Ee Yeo Keat, Zhang Hao, Alexander Matyasko, Basura Fernando
We introduce VidTFS, a Training-free, open-vocabulary video goal and action inference framework that combines the frozen vision foundational model (VFM) and large language model (LLM) with a novel dynamic Frame Selection module.