no code implementations • 27 Nov 2023 • Elahe Vahdani, YingLi Tian
This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set.
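A minimal sketch of what point-level supervision looks like, assuming a hypothetical annotation format in which each training action instance provides only a single labeled frame and its class, never the start/end boundaries; the fixed-radius pseudo-segment heuristic below is purely illustrative and not the paper's method.

    # Point-supervised annotation: one labeled frame per action instance.
    from dataclasses import dataclass

    @dataclass
    class PointAnnotation:
        video_id: str
        frame_index: int    # the single annotated frame for this instance
        action_class: int

    def expand_point_to_pseudo_segment(point, radius=8, num_frames=None):
        """Turn a point label into a rough pseudo segment for training.
        Illustrative heuristic only (fixed radius around the annotated frame)."""
        start = max(0, point.frame_index - radius)
        end = point.frame_index + radius
        if num_frames is not None:
            end = min(num_frames - 1, end)
        return start, end, point.action_class

    if __name__ == "__main__":
        p = PointAnnotation("video_001", frame_index=120, action_class=3)
        print(expand_point_to_pseudo_segment(p, radius=8, num_frames=300))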
no code implementations • 20 Oct 2023 • Elahe Vahdani, YingLi Tian
This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set.
no code implementations • 30 Sep 2021 • Elahe Vahdani, YingLi Tian
The task of temporal activity detection in untrimmed videos is to localize the temporal boundaries of actions and classify the action categories.
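A minimal sketch of turning per-frame class scores into detected action segments, assuming a (num_frames, num_classes) score array produced by some backbone; the threshold-and-group heuristic here is illustrative, not the detection strategy used in the paper.

    import numpy as np

    def scores_to_segments(scores, threshold=0.5):
        """Group consecutive above-threshold frames into (start, end, class, score)."""
        fg = scores.max(axis=1) > threshold      # foreground mask per frame
        segments, start = [], None
        for t, active in enumerate(fg):
            if active and start is None:
                start = t
            elif not active and start is not None:
                seg = scores[start:t]
                cls = int(seg.mean(axis=0).argmax())
                segments.append((start, t - 1, cls, float(seg[:, cls].mean())))
                start = None
        if start is not None:
            seg = scores[start:]
            cls = int(seg.mean(axis=0).argmax())
            segments.append((start, len(scores) - 1, cls, float(seg[:, cls].mean())))
        return segments

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        print(scores_to_segments(rng.random((50, 4))))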
no code implementations • CVPR 2021 • Longlong Jing, Elahe Vahdani, Jiaxing Tan, YingLi Tian
Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities.
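A minimal sketch of cross-modal retrieval at inference time, assuming both modalities have already been embedded into a shared space by separate encoders; the cosine-similarity ranking and the example modalities below are placeholders, not the paper's models or loss.

    import numpy as np

    def l2_normalize(x, axis=-1, eps=1e-8):
        return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

    def cross_modal_retrieve(query_emb, gallery_emb, top_k=5):
        """Rank gallery items (one modality) for each query (another modality)
        by cosine similarity in the shared embedding space."""
        q = l2_normalize(query_emb)
        g = l2_normalize(gallery_emb)
        sims = q @ g.T                      # (num_queries, num_gallery)
        return np.argsort(-sims, axis=1)[:, :top_k]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        image_emb = rng.normal(size=(4, 128))    # e.g., image-branch embeddings
        shape_emb = rng.normal(size=(100, 128))  # e.g., 3D-shape-branch embeddings
        print(cross_modal_retrieve(image_emb, shape_emb, top_k=3))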
no code implementations • LREC 2020 • Saad Hassan, Larwan Berke, Elahe Vahdani, Longlong Jing, YingLi Tian, Matt Huenerfauth
We have collected a new dataset of color and depth videos, recorded with a Kinect v2 sensor, of fluent American Sign Language (ASL) signers performing sequences of 100 ASL signs.
no code implementations • 1 May 2020 • Elahe Vahdani, Longlong Jing, YingLi Tian, Matt Huenerfauth
Our system is able to recognize grammatical elements on the ASL-HW-RGBD dataset from manual gestures, facial expressions, and head movements, and to successfully detect 8 ASL grammatical mistakes.
no code implementations • 7 Jun 2019 • Longlong Jing, Elahe Vahdani, Matt Huenerfauth, YingLi Tian
In this paper, we propose a 3D Convolutional Neural Network (3DCNN) based multi-stream framework to recognize American Sign Language (ASL) manual signs (consisting of hand movements and, in some cases, non-manual face movements) in real time from RGB-D videos, by fusing multimodal features including hand gestures, facial expressions, and body poses from multiple channels (RGB, depth, motion, and skeleton joints).
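A minimal PyTorch sketch of a multi-stream 3D-CNN with late feature fusion, assuming one stream per input channel (only RGB and depth shown here as examples); the layer sizes, stream count, and fusion scheme are illustrative placeholders, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class Stream3D(nn.Module):
        """One 3D-CNN stream for a single input channel (e.g., RGB or depth)."""
        def __init__(self, in_channels, feat_dim=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(2),
                nn.Conv3d(32, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),   # global spatio-temporal pooling
            )

        def forward(self, x):              # x: (B, C, T, H, W)
            return self.features(x).flatten(1)

    class MultiStreamSignClassifier(nn.Module):
        """Fuses per-stream features by concatenation before classification."""
        def __init__(self, num_classes=100, feat_dim=64):
            super().__init__()
            self.rgb_stream = Stream3D(in_channels=3, feat_dim=feat_dim)
            self.depth_stream = Stream3D(in_channels=1, feat_dim=feat_dim)
            self.classifier = nn.Linear(2 * feat_dim, num_classes)

        def forward(self, rgb, depth):
            fused = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
            return self.classifier(fused)

    if __name__ == "__main__":
        model = MultiStreamSignClassifier(num_classes=100)
        rgb = torch.randn(2, 3, 16, 64, 64)
        depth = torch.randn(2, 1, 16, 64, 64)
        print(model(rgb, depth).shape)     # torch.Size([2, 100])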