The ViTT dataset consists of human-produced, segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest have been annotated twice or more. In total, 12,461 sets of annotations are released. The videos come from the YouTube-8M dataset.
An annotation has the following format:
{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}
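To illustrate how a record in this format might be consumed, here is a minimal Python sketch that parses the example above and pairs each tag with a start and an end marker. It rests on two assumptions that the description does not confirm: that each `timestamp` marks the start of its segment, and that a segment ends where the next one begins. The unit of `timestamp` is not stated here, so the values are left unconverted. The helper name `to_segments` is hypothetical and is not part of any released tooling.

```python
import json

# Example record copied from the format shown above.
RAW = """
{
  "id": "FmTp",
  "annotations": [
    {"timestamp": 260, "tag": "Opening"},
    {"timestamp": 16000, "tag": "Displaying technique"},
    {"timestamp": 23990, "tag": "Showing foot positioning"},
    {"timestamp": 55530, "tag": "Demonstrating crossover"},
    {"timestamp": 114100, "tag": "Closing"}
  ]
}
"""


def to_segments(record):
    """Return (tag, start, end) tuples for one annotation record.

    Assumption: a timestamp is the start of its segment and the next
    timestamp is its end. The last segment has no known end, so its
    end is None.
    """
    marks = sorted(record["annotations"], key=lambda a: a["timestamp"])
    segments = []
    for current, following in zip(marks, marks[1:] + [None]):
        end = following["timestamp"] if following is not None else None
        segments.append((current["tag"], current["timestamp"], end))
    return segments


record = json.loads(RAW)
segments = to_segments(record)

for tag, start, end in segments:
    print(f"{record['id']}: {tag!r} from {start} to {end}")
```

The end of the final segment is left as `None` because the record carries no video duration; a caller who knows the duration could substitute it.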