2.1. Visual Grounding

Visual grounding (VG) is the task of predicting the location within an image that corresponds to a natural language expression. Benchmark datasets for VG have predominantly centered on small-scale scenes. For example, the...