…This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data.
54 PAPERS • 5 BENCHMARKS