30 papers with code • 0 benchmarks • 9 datasets
Detecting activities in extended videos.
We obtain strong results on the new fine-grained task and state-of-the-art on the 4-way task: our best model obtains frame-level error rates of 6.2%, 7.7% and 28.0% when generalizing to unseen instructors for the 4-way, 5-way, and 9-way classification tasks, respectively (relative reductions of 35.4%, 48.3% and 21.6% over a strong baseline).
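The relative reductions quoted above can be read as (baseline error − model error) / baseline error. The sketch below is illustrative only: the function names are hypothetical, and the baseline error rates it derives are back-calculated from the rounded figures in the snippet, not values reported by the authors.

```python
def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


def implied_baseline(model_error: float, reduction: float) -> float:
    """Baseline error implied by a model error and a relative reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return model_error / (1.0 - reduction)


# (frame-level error in %, relative reduction) as quoted in the snippet.
reported = {
    "4-way": (6.2, 0.354),
    "5-way": (7.7, 0.483),
    "9-way": (28.0, 0.216),
}

# Assumption: these are approximate, since the inputs are already rounded.
implied = {
    task: implied_baseline(error, reduction)
    for task, (error, reduction) in reported.items()
}

if __name__ == "__main__":
    for task, baseline in implied.items():
        print(f"{task}: implied baseline error ~ {baseline:.1f}%")
```

Because a relative reduction is normalized by the baseline, the 9-way task can show the smallest relative gain while still having the largest absolute error.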
In the speaker extraction problem, additional information about the target speaker, including voiceprint, lip movement, facial expression, and spatial information, has been found to help track and extract that speaker.
Experiments on multiple speaker diarization datasets conclude that our model can be used with great success on both voice activity detection and overlapped speech detection.
In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
Ranked #3 on Action Detection on Charades
This work aims to build a large-scale dataset of daily-living activities performed in a natural manner.