…This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data.
55 PAPERS • 5 BENCHMARKS